{\rtf1\ansi \deff4\deflang1033{\fonttbl{\f1\froman\fcharset2\fprq2 Symbol;}{\f3\fmodern\fcharset0\fprq1 Courier;}{\f4\froman\fcharset0\fprq2 Times New Roman;}{\f8\froman\fcharset0\fprq2 Times;} {\f68\froman\fcharset0\fprq2 Times New Roman ISO 8859_2;}}{\colortbl;\red0\green0\blue0;\red0\green0\blue255;\red0\green255\blue255;\red0\green255\blue0;\red255\green0\blue255;\red255\green0\blue0;\red255\green255\blue0;\red255\green255\blue255; \red0\green0\blue128;\red0\green128\blue128;\red0\green128\blue0;\red128\green0\blue128;\red128\green0\blue0;\red128\green128\blue0;\red128\green128\blue128;\red192\green192\blue192;}{\stylesheet{\qj\sl360\slmult1\widctlpar \f4\lang2057 \snext0 Normal;}{ \s1\qj\sa168\sl360\slmult1\widctlpar \b\f4\fs32\lang2057 \sbasedon0\snext0 heading 1;}{\s2\qj\sa168\sl360\slmult1\widctlpar \b\f4\fs26\lang2057 \sbasedon0\snext0 heading 2;}{\s3\qj\li360\sl360\slmult1\widctlpar \b\f8\lang2057 \sbasedon0\snext0 heading 3;} {\*\cs10 \additive Default Paragraph Font;}{\s15\qj\sl360\slmult1\widctlpar\tqc\tx4153\tqr\tx8306 \f4\lang2057 \sbasedon0\snext15 footer;}{\s16\qj\li720\sl360\slmult1\widctlpar \f4\lang2057 \sbasedon0\snext0 Normal Indent;}{\s17\fi-820\li820\sl360\slmult1 \widctlpar \f8\cf1 \sbasedon0\snext17 reference;}{\*\cs18 \additive\fs16 \sbasedon10 annotation reference;}{\s19\qj\sl360\slmult1\widctlpar \f4\fs20\lang2057 \sbasedon0\snext19 annotation text;}{\s20\qj\sl360\slmult1\widctlpar\tqc\tx4320\tqr\tx8640 \f4\lang2057 \sbasedon0\snext20 header;}{\*\cs21 \additive\sbasedon10 page number;}{\s22\qj\li302\ri259\sb20\sa20\keep\nowidctlpar \f4\fs22\lang2057 \sbasedon0\snext22 Abstract;}{\s23\qj\sl360\slmult1\widctlpar \f4\fs20\lang2057 \sbasedon0\snext23 footnote text;}{\*\cs24 \additive\super \sbasedon10 footnote reference;}}{\info{\title D27: Create Semantic Lexicon}{\author Dr. Richard F. E. 
Sutcliffe}{\operator Dept of Computer Science}{\creatim\yr2000\mo8\dy2\hr17\min26} {\revtim\yr2000\mo8\dy4\hr20\min47}{\printim\yr2000\mo8\dy4\hr18\min49}{\version36}{\edmins197}{\nofpages14}{\nofwords3057}{\nofchars17425}{\*\company Information Technology Department, UL.}{\vern57431}} \paperw11909\paperh16834\margl1134\margr1134\margt1134\margb1134 \widowctrl\ftnbj\aenddoc\hyphhotz0\noextrasprl\prcolbl\cvmme\sprsspbf\brkfrm\swpbdr\hyphcaps0\fracwidth \fet0\sectd \psz9\pgnrestart\linex0\headery706\footery706\colsx709\endnhere {\footer \pard\plain \s15\qj\sl360\slmult1\widctlpar\tqc\tx4153\tqr\tx8306\pvpara\phmrg\posxc\posy0 \f4\lang2057 {\field{\*\fldinst {\cs21 PAGE }}{\fldrslt {\cs21\lang1024 14}}}{\cs21 \par }\pard \s15\qc\sl360\slmult1\widctlpar\tqc\tx4153\tqr\tx8306 \par }{\*\pnseclvl1\pnucrm\pnstart1\pnindent720\pnhang{\pntxta .}}{\*\pnseclvl2\pnucltr\pnstart1\pnindent720\pnhang{\pntxta .}}{\*\pnseclvl3\pndec\pnstart1\pnindent720\pnhang{\pntxta .}}{\*\pnseclvl4\pnlcltr\pnstart1\pnindent720\pnhang{\pntxta )}}{\*\pnseclvl5 \pndec\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}{\*\pnseclvl6\pnlcltr\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}{\*\pnseclvl7\pnlcrm\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}{\*\pnseclvl8\pnlcltr\pnstart1\pnindent720\pnhang {\pntxtb (}{\pntxta )}}{\*\pnseclvl9\pnlcrm\pnstart1\pnindent720\pnhang{\pntxtb (}{\pntxta )}}\pard\plain \qc\widctlpar \f4\lang2057 {\b\fs36 An Experiment in the Semi-Automatic Identification \par of False-Cognates between English and Polish} \par \pard \qc\widctlpar \par \pard \qc\sa120\widctlpar {\b\fs28 Gosia Barker and Richard F. E. 
Sutcliffe}{\fs28 \'86} \par \pard \qc\widctlpar \par Department of Languages and Cultural Studies \par Department of Computer Science\'86 \par and Information Systems \par University of Limerick \par Limerick, Ireland \par \par \pard \qc\widctlpar {\fs22 +353 61 202039 (Direct) \par +353 61 202706 (Direct\'86) \par +353 61 330876 (Fax) \par \par Gosia.Barker@ul.ie (Email) \par Richard.Sutcliffe@ul.ie (Email) \par www.csis.ul.ie/staff/richard.sutcliffe (URL) \par \par \par }\pard\plain \s1\qc\sa120\widctlpar \b\f4\fs32\lang2057 Abstract{\fs18 \par }\pard\plain \s22\qj\li720\ri720\sb20\sa20\keep\nowidctlpar \f4\fs22\lang2057 {\fs20 False cognates are morphologically similar words which occur in two languages with different meanings. We present a simple algorithm for assisting in the detection of false cognates between English and Polish. It uses a set of morphological transformation rules to convert each English word into a number of candidate Polish \lquote words\rquote . Each such \lquote word\rquote  is accepted as a candidate if it occurs in a Polish word list. The list of candidates was then sorted manually into four categories: false cognate, true cognate, unrelated and not a Polish word. Prior to this study, only 96 false cognates had been documented between English and Polish. Working with an English word list of 26,871 entries and a Polish word list of 109,862 entries, we discovered 294 previously undocumented false cognates.} \par \pard\plain \qj\sl360\slmult1\widctlpar \f4\lang2057 {\fs22 \par }\pard\plain \s1\qj\sa168\sl360\slmult1\widctlpar \b\f4\fs32\lang2057 1. Introduction \par \pard\plain \qj\sl360\slmult1\widctlpar \f4\lang2057 A false cognate{\cs24\super \chftn {\footnote \pard\plain \s23\qj\sl360\slmult1\widctlpar \f4\fs20\lang2057 {\cs24\super \chftn } There are many different definitions of {\i false cognate}. Some (e.g. Moss, 1992; Topalova, 1996) insist that false cognates must be etymologically related. Others (e.g. 
Carroll, 1992; Stella Martinez, 1994; Frantzen, 1998) argue that morphological similarity is sufficient. None that we have so far found take into account the precise polysemy of each word.}} may be defined as a word which exists in two languages, {\i A} and {\i B}, where the most frequently occurring semantic sense of the word in {\i A} is not the same as the most frequently occurring semantic sense in {\i B}. For example, the most common sense of gymnasium in English (henceforth gymnasium.eng) is \lquote a room for sports\rquote , whereas the most common sense of gimnazjum in Polish (i.e. gimnazjum.pol) is \lquote grammar school\rquote . False cognates are of interest because they present significant difficulties to native speakers of {\i A} who wish to learn {\i B} and vice versa. This is because such learners will tend to assume that the meaning of the word in {\i B} is the one with which they are familiar in {\i A}. For this reason, dictionaries of false cognates exist for certain language pairs for the benefit of second language learners. In order to produce such a dictionary, it is necessary first of all to enumerate all the false cognates which exist between a pair of languages. Traditionally, this is accomplished by speakers of {\i A} consulting dictionaries of {\i B} looking for words which they recognise. The task is made more complicated by the fact that the spelling-to-sound rules of languages differ, as do their systems of inflectional morphology. For example, cravat.eng and krawat.pol are similar phonetically even though their spelling is different. As the most common meanings of these words are not the same, they can be considered false cognates. Similarly, project.eng and projektowa{\f68 \'e6}.pol have different endings because verbs inflect differently in English and Polish. However, they are still likely to be confused. 
\par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar There are essentially two stages in the production of a list of false cognates. Firstly, a list of candidate word pairs {\i X.A} and {\i Y.B} must be produced where {\i X} appears similar to {\i Y}. Secondly, for each such pair, it must be determined whether {\i X} and {\i Y} are false cognates or true cognates. Various automated methods have been proposed for carrying out the first stage, as discussed in the next section. The approach adopted here is to create a series of morphological transformation rules which can be applied to a word in language {\i A}. These are then applied in different combinations to each word {\i X.A}. Each member of the resulting set of words is then accepted as a candidate false cognate if it occurs in a word list for language {\i B}. \par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar The identification of an effective method for carrying out the second stage (separation of true and false cognates) remains a research topic. \par \pard \qj\sl360\slmult1\widctlpar \par \pard\plain \s1\qj\sa168\sl360\slmult1\widctlpar \b\f4\fs32\lang2057 2. Previous Work \par \pard\plain \qj\sl360\slmult1\widctlpar \f4\lang2057 Over the years there have been a number of different approaches to the identification of morphologically similar words for purposes such as cognate recognition. Adamson and Boreham (1974) use Dice's Coefficient (Sokal and Sneath, 1963) working with character bigrams to identify semantically similar pairs of words for applications in information retrieval. Guy (1984) discusses a method by which correspondences between words can be detected between languages which share 40% or more cognates. The method is statistically based and involves three main stages. Firstly, sound correspondences are identified based on the frequency of co-occurrence of pairs of letters. 
This data is then processed further to identify possible null correspondences where a letter in one language corresponds to a null string in the other. Finally, a relative measure of cognation is computed for each word pair, based on the correspondence and non-correspondence data at the letter level. The problem with this approach is that it only works for a language pair where the proportion of cognates is very high. On the other hand, it still works even if the spelling-to-sound rules for the languages are quite different. \par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar Brew and McKelvie (1996) present an approach for detecting candidate false cognates. They first carried out sentence alignment on a parallel English-French corpus to produce a set of sentence pairs. From each one they then extracted all possible pairs of words whose part of speech was verbal. Each member of the resulting set of candidate word pairs was then compared using six different methods. The first five were variants of Dice's Coefficient (Sokal and Sneath, 1963) working with character bigrams or extended bigrams (trigrams with the centre character removed). The last method was based on the longest common character sequence found between the two words. \par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar Our method is different from the above in that we produce a set of candidate target language words from a source language word by carrying out a set of morphological transformations on it. We then accept a candidate for manual inspection only if it matches an actual target language word exactly. \par \pard \qj\sl360\slmult1\widctlpar \par \pard\plain \s1\qj\sa168\sl360\slmult1\widctlpar \b\f4\fs32\lang2057 3. 
Method \par \pard\plain \s2\qj\sa168\sl360\slmult1\widctlpar \b\f4\fs26\lang2057 3.1 Outline \par \pard\plain \qj\sa120\sl360\slmult1\widctlpar \f4\lang2057 The main stages of the work were: \par {\pntext\pard\plain\f1 \'b7\tab}\pard \qj\fi-360\li360\sa120\sl360\slmult1\widctlpar{\*\pn \pnlvlblt\pnf1\pnstart1\pnindent360\pnhang{\pntxtb \'b7}}the selection of English and Polish word lists, \par {\pntext\pard\plain\f1 \'b7\tab}the development of a set of morphological transformation rules, \par {\pntext\pard\plain\f1 \'b7\tab}the application of the rules to each English word to produce candidate Polish words, \par {\pntext\pard\plain\f1 \'b7\tab}\pard \qj\fi-360\li360\sl360\slmult1\widctlpar{\*\pn \pnlvlblt\pnf1\pnstart1\pnindent360\pnhang{\pntxtb \'b7}}the analysis of the resulting candidates. \par \pard \qj\sl360\slmult1\widctlpar \par These stages are discussed in turn. \par \par \pard\plain \s2\qj\sa168\sl360\slmult1\widctlpar \b\f4\fs26\lang2057 3.2 Selection of English and Polish Word Lists \par \pard\plain \qj\sl360\slmult1\widctlpar \f4\lang2057 The English word list used in this work was obtained by ftp from the Oxford Text Archive (English Word List, 1999) and contains 26,871 entries. Proper names and acronyms are included, as are the singular forms of nouns. Several (but not all) inflections of each verb are present. Thus, for example, \lquote go\rquote , \lquote goes\rquote  and \lquote gone\rquote  are all present in the list, while \lquote going\rquote  is not present. \par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar The Polish word list used contains 109,862 entries. It is attributed to Rafal Maszkowski and was also obtained from the Oxford Text Archive (Polish Word List, 1999). 
\par \pard \qj\sl360\slmult1\widctlpar \par \pard\plain \s2\qj\sa168\sl360\slmult1\widctlpar \b\f4\fs26\lang2057 3.3 Development of a Set of Morphological Transformation Rules \par \pard\plain \qj\sl360\slmult1\widctlpar \f4\lang2057 The starting point for this work was the observation that simple rules can be used to transform an English word into a Polish word. For example, both \lquote c\rquote  and \lquote k\rquote  can represent the sound /k/ in English words, while only \lquote k\rquote  does so in Polish words. Thus in situations where an English word has been adopted in Polish, a single \lquote c\rquote  is often converted into a single \lquote k\rquote . This can be expressed by the transformation rule {\f3 c -> k}. It is one of the rules involved in the conversion of cocktail.eng into koktajl.pol. \par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sa120\sl360\slmult1\widctlpar Each transformation rule takes the following form: \par {\pntext\pard\plain\f1 \'b7\tab}\pard \qj\fi-360\li360\sa120\sl360\slmult1\widctlpar{\*\pn \pnlvlblt\pnf1\pnstart1\pnindent360\pnhang{\pntxtb \'b7}}The left hand side is a sequence of one or more letters occurring in an English word; \par {\pntext\pard\plain\f1 \'b7\tab}The right hand side is a sequence of one or more letters occurring in a Polish word; \par {\pntext\pard\plain\f1 \'b7\tab}A caret {\f3 ^} is used to mark the start of a word; \par {\pntext\pard\plain\f1 \'b7\tab}\pard \qj\fi-360\li360\sl360\slmult1\widctlpar{\*\pn \pnlvlblt\pnf1\pnstart1\pnindent360\pnhang{\pntxtb \'b7}}A dollar sign {\f3 $} is used to mark the end of a word. 
\par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sa120\sl360\slmult1\widctlpar A rule is applied to an English word in the following manner: \par {\pntext\pard\plain\f1 \'b7\tab}\pard \qj\fi-360\li360\sa120\sl360\slmult1\widctlpar{\*\pn \pnlvlblt\pnf1\pnstart1\pnindent360\pnhang{\pntxtb \'b7}}The left hand side of the rule is inspected and the sequence of letters found there is searched for in the word, starting from the left hand end of the word; \par {\pntext\pard\plain\f1 \'b7\tab}As soon as a match is found, the sequence of letters is replaced by those found on the right hand side of the rule; \par {\pntext\pard\plain\f1 \'b7\tab}In the case where the sequence of letters starts with {\f3 ^}, it only matches if that sequence is at the very start of the word; \par {\pntext\pard\plain\f1 \'b7\tab}In the case where the sequence of letters ends with a {\f3 $}, it only matches if that sequence is at the very end of the word; \par {\pntext\pard\plain\f1 \'b7\tab}\pard \qj\fi-360\li360\sl360\slmult1\widctlpar{\*\pn \pnlvlblt\pnf1\pnstart1\pnindent360\pnhang{\pntxtb \'b7}}One application of the rule transforms only the first matching sequence found in the word; any further occurrences are left unchanged. 
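\par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar The procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the program used in the experiment: the function name {\f3 apply_rule} and the passing of a rule as two strings are hypothetical. The steps above describe replacing the first match only, whereas the worked examples later in this section apply {\f3 c -> k} at more than one position in \lquote architect\rquote , so the sketch returns one result per matching position, leftmost first. \par {\f3
```python
def apply_rule(word, lhs, rhs):
    """Return every word obtainable by one application of the rule lhs -> rhs.

    A '^' at the start of lhs anchors the match to the start of the word and
    a '$' at the end of lhs anchors it to the end. Each result replaces exactly
    one occurrence of lhs; results are ordered by match position, leftmost first.
    """
    at_start = lhs.startswith("^")
    at_end = lhs.endswith("$")
    # Strip the anchor markers; they are not letters of the word itself.
    src = lhs.strip("^$")
    dst = rhs.strip("^$")
    results = []
    for pos in range(len(word) - len(src) + 1):
        end = pos + len(src)
        if word[pos:end] != src:
            continue
        if at_start and pos != 0:
            continue
        if at_end and end != len(word):
            continue
        results.append(word[:pos] + dst + word[end:])
    return results


print(apply_rule("architect", "c", "k"))         # ['arkhitect', 'architekt']
print(apply_rule("katastrophe", "phe$", "fa$"))  # ['katastrofa']
print(apply_rule("chartism", "^ch", "^cz"))      # ['czartism']
```
}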
\par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sa120\sl360\slmult1\widctlpar Consider the following rules: \par \pard \qj\sl360\slmult1\widctlpar {\f3 c -> k \par ^ch -> ^cz \par i -> y \par phe$ -> fa$ \par s -> z \par }\pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sa120\sl360\slmult1\widctlpar Here are three examples of transformations carried out using these rules: \par \pard \qj\sl360\slmult1\widctlpar {\f3 architect -> architekt\tab \tab c -> k \par }\pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar {\f3 catastrophe -> katastrophe\tab c -> k \par katastrophe -> katastrofa\tab phe$ -> fa$ \par }\pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar {\f3 chartism -> czartism\tab \tab ^ch -> ^cz \par czartism -> czartysm\tab \tab i -> y \par czartysm -> czartyzm\tab \tab s -> z \par }\pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar In the first example, a single rule {\f3 c -> k} is used to transform \lquote architect\rquote  into \lquote architekt\rquote . Note that there is another application of this rule to \lquote architect\rquote , at the first \lquote c\rquote , giving \lquote arkhitect\rquote . \par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar In the second example, two rules are used. \lquote catastrophe\rquote  is transformed into \lquote katastrophe\rquote  using {\f3 c -> k}, and then \lquote katastrophe\rquote  is transformed into \lquote katastrofa\rquote  using {\f3 phe$ -> fa$}. The dollar sign in the latter rule indicates that it will only match the sequence \lquote phe\rquote  if it occurs at the very end of a word. \par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar The final example involves three rules. \lquote chartism\rquote  first becomes \lquote czartism\rquote  using {\f3 ^ch -> ^cz}. The caret in this rule indicates that it only applies to a word which starts with \lquote ch\rquote . 
\lquote czartism\rquote  is now transformed into \lquote czartysm\rquote  using {\f3 i -> y}. Finally, \lquote czartysm\rquote  becomes \lquote czartyzm\rquote  by means of the rule {\f3 s -> z}. \par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar The initial set of rules was created by the first author, who observed types of transformation between English and Polish words over a number of years. These were then augmented by further rules as additional types of transformation came to light. The final set used in this experiment contains 43 rules. \par \pard \qj\sl360\slmult1\widctlpar \par \pard\plain \s2\qj\sa168\sl360\slmult1\widctlpar \b\f4\fs26\lang2057 3.4 Application of the Rules \par \pard\plain \qj\sa120\sl360\slmult1\widctlpar \f4\lang2057 The rules were then applied to each word in the English word list as follows: \par {\pntext\pard\plain\f1 \'b7\tab}\pard \qj\fi-360\li360\sa120\sl360\slmult1\widctlpar{\*\pn \pnlvlblt\pnf1\pnstart1\pnindent360\pnhang{\pntxtb \'b7}}The list of rules was scanned and a rule whose left hand side matched part of the English word was chosen and applied to it, giving a word {\i X}. \par {\pntext\pard\plain\f1 \'b7\tab}The previous step was carried out using all possible rules to produce a list of {\i X}s, each one created using a different rule. The resulting list is denoted L{\sub 1}. \par {\pntext\pard\plain\f1 \'b7\tab}For each word {\i X} in L{\sub 1} the list of rules was scanned again and a rule whose left hand side matched part of {\i X} was chosen and applied to it. \par {\pntext\pard\plain\f1 \'b7\tab}The previous step was carried out using all possible rules to produce a list of {\i Y}s, each created by applying a different rule to an {\i X}. The list of all such {\i Y}s is denoted L{\sub 2}. 
\par {\pntext\pard\plain\f1 \'b7\tab}\pard \qj\fi-360\li360\sa120\sl360\slmult1\widctlpar{\*\pn \pnlvlblt\pnf1\pnstart1\pnindent360\pnhang{\pntxtb \'b7}}A list L{\sub 3} was created by applying all possible single rules to members of L{\sub 2}. \par {\pntext\pard\plain\f1 \'b7\tab}\pard \qj\fi-360\li360\sl360\slmult1\widctlpar{\*\pn \pnlvlblt\pnf1\pnstart1\pnindent360\pnhang{\pntxtb \'b7}}Similarly L{\sub 4}, L{\sub 5} up to L{\sub n} were created, where {\i n} is a pre-defined constant. \par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar The result of the above steps when carried out on an English word {\i W} is a set of lists L{\sub 1}, L{\sub 2}, ... L{\sub n}. L{\sub 1} is a set of lexemes created by applying one rule to {\i W}. L{\sub 2} is a set of lexemes created by applying two rules to {\i W}, and so on. The final stage in producing a list of candidate false cognates was to search for each member of each of the lists L{\sub 1} ... L{\sub n} in the Polish word list, accepting it as a candidate if present and rejecting it otherwise. \par \pard \qj\sl360\slmult1\widctlpar \par \pard\plain \s1\qj\sa168\sl360\slmult1\widctlpar \b\f4\fs32\lang2057 4. Results and Analysis \par \pard\plain \qj\sl360\slmult1\widctlpar \f4\lang2057 The result of carrying out the above procedure on the list of 26,871 English words and limiting the number of rules used (i.e. {\i n} above) to 2 was a list of 5,745 possible false cognates. This list only included the first candidate found for each English word. Thus, for example, if a word {\i X}.eng could be transformed into {\i Y}.pol using one rule and into {\i Z}.pol using two rules, then only {\i Y} occurs in the list. 
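\par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar The level-by-level generation of L{\sub 1} ... L{\sub n} and the first-candidate restriction can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the program used in the experiment: the names {\f3 apply_rule}, {\f3 first_candidate}, {\f3 RULES} and {\f3 POLISH} are hypothetical, the five rules and four Polish words are a small sample taken from the examples in this paper, and the order in which candidates at the same level are tried (rule list order) is an assumption, since it is not specified in the text. \par {\f3
```python
def apply_rule(word, lhs, rhs):
    """All words reachable by replacing one occurrence of lhs with rhs."""
    src, dst = lhs.strip("^$"), rhs.strip("^$")
    out = []
    for pos in range(len(word) - len(src) + 1):
        end = pos + len(src)
        if word[pos:end] != src:
            continue
        if lhs.startswith("^") and pos != 0:
            continue
        if lhs.endswith("$") and end != len(word):
            continue
        out.append(word[:pos] + dst + word[end:])
    return out


def first_candidate(word, rules, polish_words, n=2):
    """Build L1 .. Ln level by level; return the first form found in the list.

    Returns a pair (candidate, number_of_rules_used), or None if no form
    produced with at most n rule applications occurs in polish_words.
    """
    level = [word]
    for depth in range(1, n + 1):
        next_level = []
        for form in level:
            for lhs, rhs in rules:
                for new_form in apply_rule(form, lhs, rhs):
                    if new_form not in next_level:
                        next_level.append(new_form)
        # Forms needing fewer rules are checked first, so a one-rule
        # candidate always wins over a two-rule candidate.
        for form in next_level:
            if form in polish_words:
                return form, depth
        level = next_level
    return None


# Illustrative sample only: a few rules and Polish words from this paper.
RULES = [("c", "k"), ("^ch", "^cz"), ("i", "y"), ("phe$", "fa$"), ("s", "z")]
POLISH = set(["architekt", "katastrofa", "czartyzm", "kod"])

print(first_candidate("cod", RULES, POLISH))            # ('kod', 1)
print(first_candidate("catastrophe", RULES, POLISH))    # ('katastrofa', 2)
print(first_candidate("chartism", RULES, POLISH))       # None, as three rules are needed
print(first_candidate("chartism", RULES, POLISH, n=3))  # ('czartyzm', 3)
```
}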
\par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sa120\sl360\slmult1\widctlpar Each of the 5,745 candidates was then inspected manually and labelled using one of the following codes: \par {\pntext\pard\plain\f1 \'b7\tab}\pard \qj\fi-360\li360\sa120\sl360\slmult1\widctlpar{\*\pn \pnlvlblt\pnf1\pnstart1\pnindent360\pnhang{\pntxtb \'b7}}\lquote f\rquote  if the candidate word was a genuine false cognate relative to the original English word,
\par {\pntext\pard\plain\f1 \'b7\tab}\lquote t\rquote  if the candidate word was a true cognate relative to the original English word, \par {\pntext\pard\plain\f1 \'b7\tab}\lquote e\rquote  if the candidate word exists in Polish but is not related to the original English word, \par {\pntext\pard\plain\f1 \'b7\tab}\pard \qj\fi-360\li360\sl360\slmult1\widctlpar{\*\pn \pnlvlblt\pnf1\pnstart1\pnindent360\pnhang{\pntxtb \'b7}}\lquote x\rquote  if the candidate word does not exist in Polish. \par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar Of the 5,745 candidates, 305 were placed in category \lquote f\rquote , 715 in category \lquote t\rquote  and 308 in category \lquote e\rquote . The remaining 4,417 were in category \lquote x\rquote . \par \pard \qj\sl360\slmult1\widctlpar \par \pard\plain \s1\qj\sa168\sl360\slmult1\widctlpar \b\f4\fs32\lang2057 5. Conclusions \par \pard\plain \qj\sl360\slmult1\widctlpar \f4\lang2057 Before this study the longest list of false cognates between English and Polish was that in Proctor (1995, p. 1073), which contained 96 word pairs, of which 11 were also discovered in the present work. As 305 false cognates were identified here, this preliminary experiment in word mapping has therefore led to the discovery of a further 294 false cognates. However, a number of important questions have also come to light which warrant further investigation. We will now outline these in turn. \par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar Firstly, the definition of a false cognate needs to be more precise in future experiments so that particular candidates can be accepted or rejected according to objective criteria. 
Issues which need to be clarified include the notion of orthographic or phonetic similarity, the specification of word senses and their frequencies, the importance or otherwise of etymology, the treatment of non-divisible phrases such as {\i in toto} and {\i tabula rasa}, the matching of words in different inflections or parts of speech, and the type of language use being considered (e.g. general vs. topic specific). \par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar Secondly, it is necessary to decide how many conversion rules can be used at a time to produce a candidate. In the present study a maximum of 2 rules was used. In addition, only the first candidate derivable from a particular English word was considered, rather than all possible candidates. These restrictions may have resulted in the loss of genuine false cognates. \par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar Thirdly, imperfections in our word lists have come to light. The English word list is quite short, which limits the candidates which can be generated. The Polish word list is also relatively incomplete and moreover contains many English words which do not occur in Polish. The latter resulted in a large number of false-positive matches in the present experiment which it would be desirable to eliminate prior to manual inspection of the results. \par \pard \qj\sl360\slmult1\widctlpar \par \pard \qj\sl360\slmult1\widctlpar Finally, the set of 43 conversion rules used to convert English words into candidate Polish words is clearly incomplete. A more complete set therefore needs to be generated before further experiments are carried out.
\par \pard \qj\sl360\slmult1\widctlpar \par \pard\plain \s1\qj\sa168\sl360\slmult1\keepn\widctlpar \b\f4\fs32\lang2057 6. References \par \pard\plain \qj\fi-432\li432\sl360\slmult1\widctlpar \f4\lang2057 Adamson, G. W., & Boreham, J. (1974). The use of an association measure based on character structure to identify semantically related pairs of words and document titles. {\i Information Storage and Retrieval}, {\b 10}, 253-60. \par \pard \qj\fi-432\li432\sl360\slmult1\widctlpar Brew, C., & McKelvie, D. (1996). Word-pair extraction for lexicography. {\i Proceedings of the Second International Conference on New Methods in Language Processing, NeMLaP 96, Bilkent University} , 45-55. http://www.ltg.ed.ac.uk/~chrisbr/papers/nemlap96/ \par \pard \qj\fi-432\li432\sl360\slmult1\widctlpar Carroll, S. E. (1992). On Cognates. {\i Second Language Research}, {\b 8}(2), 93-119. \par \pard \qj\fi-432\li432\sl360\slmult1\widctlpar English Word List (1999). Formerly available at ftp://ota.ox.ac.uk/pub/wordlists/dictionaries/ words-english.gz . Originally this word list came from ftp.uu.net:/doc/dictionaries/English/ words.English.Z .
\par \pard \qj\fi-432\li432\sl360\slmult1\widctlpar Frantzen, D. (1998). Intrinsic and Extrinsic Factors that Contribute to the Difficulty of Learning False Cognates. {\i Foreign Language Annals}, {\b 31}(2), 243-254. \par \pard \qj\fi-432\li432\sl360\slmult1\widctlpar Guy, J. B. M. (1984). An Algorithm for Identifying Cognates between Related Languages. {\i Proceedings of the 10th International Conference on Computational Linguistics, Coling'84, 2-6 July 1984, Stanford University, California}, 448-451. \par \pard \qj\fi-432\li432\sl360\slmult1\widctlpar Moss, G. (1992). Cognate Recognition: Its Importance in the Teaching of ESP Reading Courses to Spanish Speakers. {\i English for Specific Purposes}, {\b 11}, 141-158. \par \pard \qj\fi-432\li432\sl360\slmult1\widctlpar Polish Word List (1999). Formerly available at ftp://ota.ox.ac.uk/pub/wordlists/polish/ . \par \pard \qj\fi-432\li432\sl360\slmult1\widctlpar Proctor, P. (1995). {\i Cambridge International Dictionary of English}. Cambridge, UK: Cambridge University Press. \par Sokal, R. R., & Sneath, P. H. A. (1963). {\i Principles of Numerical Taxonomy}. San Francisco, CA: Freeman. \par \pard \qj\fi-432\li432\sl360\slmult1\widctlpar Stella Martinez, M. (1994). Spanish-English Cognates in the Subtechnical Vocabulary Found in Engineering Magazine Texts. {\i English for Specific Purposes}, {\b 13}(1), 81-91. \par Topalova, A. (1996). \lquote False Friends\rquote in Translation Work: An Empirical Study. {\i Perspectives: Studies in Translatology}, {\b 4}(2), 215-222. \par \pard\plain \s1\qj\sa168\sl360\slmult1\widctlpar \b\f4\fs32\lang2057 \page Appendix 1: Conversion Rules \par \pard\plain \qj\sl360\slmult1\widctlpar \f4\lang2057 {\f3 \sect }\sectd \psz9\sbknone\linex0\headery706\footery706\colsx709\endnhere \pard\plain \qj\widctlpar \f4\lang2057 {\f3\fs22 c -> k \par e.g. column -> kolumna, architect -> architekt, Africa -> Afryka \par \par ck$ -> k$ \par e.g. frock -> frak \par \par c -> s \par e.g. 
race -> rasa \par \par ^ch -> ^cz \par e.g. chartism -> czartyzm \par \par d$ -> t$ \par e.g. trend -> trent \par \par g -> z. \par e.g. budget -> budz.et \par \par ^g -> ^dz. \par e.g. gin -> dz.in \par \par z -> z. \par e.g. bulldozer -> buldoz.er \par \par gh$ -> j$ \par e.g. bobsleigh -> bobslej \par \par ^j -> ^dz. \par e.g. jet -> dz.et \par \par ^qu -> ^kw \par e.g. quaker -> kwakier \par \par i -> y \par e.g. civil -> cywil \par \par ph -> f \par e.g. pamphlet -> pamflet \par \par phe$ -> fa$ \par e.g. catastrophe -> katastrofa \par \par ^rh -> ^r \par e.g. rhumb -> rumb \par \par s -> z \par e.g. sclerosis -> skleroza \par \par ^sh -> ^sz \par e.g. shock -> szok \par \par tch -> cz \par e.g. fletcher -> fleczer \par \par th -> t \par e.g. jonathan -> jonatan \par \par th$ -> s$ \par e.g. tamworth -> tamwors \par \par ^v -> ^w \par e.g. velvet -> welwet \par \par ^v -> ^f \par e.g. vauxhall -> foksal \par \par w -> u \par e.g. clown -> klaun \par \par wl$ -> l~$ \par e.g. trawl -> tral~ \par \par ^wh -> ^w \par e.g. whig -> wig \par \par x$ -> ks$ \par e.g. orthodox -> ortodoks \par \par bb -> b \par e.g. bubble -> buble \par \par dd -> d \par e.g. paddock -> padok \par \par ff -> f \par e.g. offside -> ofsajd \par \par gg -> g \par e.g. bootlegger -> butleger \par \par ss -> s \par e.g. ambassador -> ambasador \par \par a -> o \par e.g. hall -> hol \par \par a -> ej \par e.g. interface -> interfejs \par \par a -> e \par e.g. backhand -> bekhend \par \par ai -> e \par e.g. trainer -> trener \par \par ea -> e \par e.g. sealskin -> selskiny ?? \par \par ai -> aj \par e.g. cocktail -> koktajl \par \par ai -> ej \par e.g. trailer -> trejler \par \par ai -> e \par e.g. drain -> dren \par \par e -> ie \par e.g. quaker -> kwakier \par \par e$ -> a$ \par e.g. 
atmosphere -> atmosfera \par }\pard\plain \s1\qj\sa168\sl360\slmult1\widctlpar \b\f4\fs32\lang2057 {\f3 \page }Appendix 2: False Cognates Discovered in this Study \par \pard\plain \qj\sl360\slmult1\widctlpar \f4\lang2057 {\f3 \sect }\sectd \psz9\sbknone\linex0\headery706\footery706\cols2\endnhere \pard\plain \qj\sl360\slmult1\widctlpar \f4\lang2057 {\f3\fs22 adept adept \par arbiter arbiter \par argument argument \par aura aura \par back bak \par bandy bandy \par blank blank \par blat blat \par block blok \par bock bok \par bona bona \par bony bony \par buck buk \par bunt bunt \par bury bury \par cabal kabel \par calumny kolumny \par camber comber \par camera kamera \par cant kant \par canto konto \par cantor kantor \par cap cap \par caper kaper \par capital kapital \par car car \par card kard \par cargo sorgo \par cart kart \par carton karton \par cat cat \par cathedra katedra \par cell cell \par censure cenzure \par chalet szalet \par char czar \par character character \par chard czart \par check czek \par chef szef \par china china \par chord czort \par chore chore \par chub czub \par chum szum \par city syty \par closet klozet \par clutch klucz \par cock kok \par cock sok \par cod kod \par codex kodeks \par cold sold \par colon kolon \par colt kolt \par coma koma \par combinator kombinator \par compleat komplet \par complement komplement \par complex kompleks \par concern koncern \par confidant konfident \par confident konfident \par cop kop \par copy kopy \par cord kord \par cordial kordial \par cornet kornet \par corona korona \par corpus corpus \par corral korral \par cot kot \par cox koks \par cram kram \par crate krate \par cream krem \par credit kredyt \par crock krok \par crotch krocz \par cry kry \par crypto krypto \par cud cud \par cuddly kudly \par cup cup \par cur kur \par cymbal cymbal \par czar czar \par dal dal \par dam dam \par dame dame \par data data \par debit debit \par decant desant \par director director \par dole dole \par drab drab \par 
dragon dragon \par drink drink \par element element \par facet facet \par fain fen \par fair fair \par fest fest \par festival festival \par final final \par fix fix \par flak flek \par fleck flek \par gad gad \par gag gag \par gape gape \par gar gar \par garb garb \par gecko dziecko \par gem gem \par glans glans \par glob glob \par glut glut \par gnat gnat \par go go \par grad grad \par graph graph \par gripe grype \par grubby gruby \par grunt grunt \par gust gust \par habit habit \par hack hack \par hale hale \par hall hall \par hangar hangar \par hardy hordy \par hart hart \par he he \par hell hell \par helm helm \par hem hem \par hen hen \par herb herb \par hey hey \par his his \par hit hit \par hop hop \par huck huk \par hurt hurt \par idea idea \par index index \par intendant intendent \par island island \par jar jar \par jest jest \par jot jot \par karma karma \par kill kill \par kiosk kiosk \par kit kit \par kite kite \par kulak kulak \par lack lack \par lick lik \par local local \par loch loch \par lock lock \par lot lot \par luck luk \par lump lump \par mall moll \par manna manna \par market market \par mat mat \par material material \par mayor mayor \par medium medium \par metro metro \par most most \par my my \par net net \par nit nit \par note note \par oh oh \par on on \par order order \par owe owe \par pa pa \par pack pak \par pan pan \par panama panama \par papa papa \par paragon paragon \par parapet parapet \par pardon pardon \par partner partner \par pasty pasty \par pat pat \par pause pauze \par pedal pedal \par persona persona \par pilot pilot \par pion pion \par plaster plaster \par pluton pluton \par pod pod \par pole pole \par polka polka \par post post \par pot pot \par premier premier \par present present \par prim prim \par prima prima \par process process \par professor professor \par prom prom \par prominent prominent \par psi psi \par puck puk \par puff puf \par pulpit pulpit \par pupa pupa \par pupil pupil \par quota kwota \par ?rabat 
rabat \par rack rak \par raid rajd \par raj raj \par rant rant \par rap rap \par ?rasa rasa \par rata rata \par record record \par regal regal \par reserve reserve \par resort resort \par ret ret \par rim rym \par rondo rondo \par rope rope \par ropy ropy \par row row \par rubble rubble \par ruddy rudy \par rude rude \par rug rug \par sale sale \par salon salon \par same same \par scum szum \par sima zima \par sin syn \par sine sine \par smack smak \par sock sok \par song song \par spec spec \par stadia stadia \par strata strata \par strop strop \par stuck stuk \par student student \par support support \par supreme supreme \par sure sure \par swat swat \par swine swinie \par talon talon \par tam tam \par tan tan \par tank tank \par tariff tariff \par tart tort \par tax tax \par ten ten \par termini terminy \par testament testament \par testy testy \par ton ton \par tony tony \par tort tort \par tory tory \par ? toto toto \par trans trans \par transparent transparent \par trap trep \par trap trop \par tuba tuba \par uniform uniform \par vagary wagary \par vase waze \par vet wet \par vie wie \par villa willa \par vole wole \par von won \par wadi wady \par waffle wafle \par warty warty \par was was \par we we \par weak wek \par whir wir \par wig wig \par wino wino \par winy winy \par won won \par wrack wrak \par }}
