December 14, 2015

The first steps in developing machine translation of patents // World Patent Information, 2013, No. 3.


Leonid G. Kravets, candidate of philological sciences (applied and mathematical linguistics), assistant professor at the patent information chair, former head of a number of scientific departments at the CNIIPI (VNIIPI), now editor-in-chief of the "Patent Information Today" magazine, published by the INIC-Patent.

1. Introduction
The dramatically increased flow of patent documentation coming recently from Asia, especially from Japan, China and Korea, and the building of a single patent system in the multilingual European Union have concentrated the patent world's attention on the problems of overcoming language barriers with the use of machine translation (MT). Today, this attention is focused basically on the Patent Translate system, which is the result of cooperation between the European Patent Office and Google. Under the agreement, the EPO will use Google's machine translation technology to translate patents into the languages of the 38 countries that it serves. In return, it will provide Google with access to its translated patents, enabling Google to optimize its machine translation technology.

Google Translate is based on a method called statistical machine translation, developed by F.J. Och, who won the DARPA contest for speed machine translation in 2003 [1]. It takes a statistical approach, comparing the source document sentence by sentence to millions of patent documents previously translated by humans. These are used to train the translation engine to handle technical subject matter and the specific style and format used for patent documents. The service is certainly useful for getting the gist of a patent written in a foreign language and is helpful for companies attempting to get an informal feel for the competitive patent landscape.

Patent Translate, used at the EPO, is said to be a machine translation service specifically "trained" to handle elaborate patent vocabulary and grammar. However, as with Google's general translation tool, the results are said to be still far from perfect.

Although machines can automate certain tasks very well, none seems yet to have fully mastered the subtle differences in sentence structure and the potential multiple uses of a word to have different meanings in different contexts. Because Google Translate uses statistical matching to translate rather than a dictionary/grammar rules approach, translated text can sometimes include apparently nonsensical and obvious errors, such as swapping common terms for similar but nonequivalent common terms in the other language, as well as inverting sentence meaning [2,3].

By their very nature patents are concerned with new inventions. They will therefore contain new terms, used by inventors to describe their innovations. Consistency of terminology is crucial when creating a patent specification. And there remains a very complicated problem of translating patent claims. They use formalistic language with an unusually long sentence structure, required for clear display of technical and legal aspects of the invention, subject to the broadest possible legal claims. For a machine this is a major problem to overcome [4].

Meanwhile, attempts to solve some of these problems began half a century ago in Moscow, at the Central Research and Development Institute of Patent Information (TSNIIPI), which was entrusted with the processing of foreign patent documents. It was decided to translate into Russian the claims or abstracts published in official bulletins of leading patent offices. Therefore, in parallel with the traditional processing of current patent documentation, the TSNIIPI scientists developed in 1963-1966 an experimental system to automatically translate publications from the USPTO "Official Gazette".

MT development at the TSNIIPI covered a period when, after thorough theoretical research and the emergence of more efficient computers, several groups around the world had begun their attempts to create practically operational MT systems.

Confrontation between the two opposing social systems had led to the situation that, by the time MT projects were implemented, they were mostly aimed at providing translation from Russian into English and vice versa. One of them was the first MT system specialized for processing patent texts [5,6].

Subsequent sections of this paper are devoted to the consideration of the linguistic specificity of patent claims, manifested in the predominance of nominative word groups. This places special demands on the MT algorithm, called on to carry out the segmentation of the claims text, the identification and analysis of nominal word groups in the English text and the formation of the equivalent word combinations in the Russian language. The paper ends with a summary of the MT system structure as a whole.

2. Special features of patent claims

Initial attempts to solve the problem facing TSNIIPI by automating the word-for-word translation of patent claims confirmed the unsuitability of such an approach [7]. Therefore it was decided to develop an MT system with the ability to navigate in the original patent documents [8].

Thorough linguistic analysis of patent claims in the "Official Gazette" showed that the overwhelming majority of the notions and concepts used to describe the basic idea of an invention are expressed by terms which are nominal word combinations with prepositive attributes. The number of such word combinations is practically unlimited, and therefore no automatic vocabulary was able to cover even an essential part of such word groups. This problem becomes still more complicated when translating patent texts in which, due to their specific character (first communication about new inventions), new and derived terms are bound to occur. Careful analysis of nominal word combinations was a prerequisite for improving the quality of translating patent claims [9].

The determining role of different nominal groups in a patent claim influenced the choice of the fundamental principle and construction of a specialized MT algorithm. It was called the algorithm of segment analysis. The name reflects the main idea of the algorithm, which divides the claims text into segments, identifies patterns of these segments, finds equivalent models of the Russian language, then develops the information on the grammatical form of Russian words and synthesizes the Russian text in accordance with this information.

The role of segment separators was performed by a number of words: indisputable (e.g., prepositions) and questionable (such as determinatives, conjunctions, participles). If the separator was controversial, analysis of its environment was performed. Thus, the conjunction and the article were not separators if they stood between similar attributes.
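The separator logic just described can be illustrated with a minimal Python sketch. It is not the TSNIIPI algorithm: the word lists, the attribute lexicon and the names `is_separator` and `split_segments` are hypothetical, and the extra rule that keeps a preposition and a following article in one segment is an assumption made to keep the output readable.

```python
# Hypothetical word lists; the paper does not give the original ones.
PREPOSITIONS = {"of", "in", "on", "for", "with", "to", "by"}   # indisputable separators
QUESTIONABLE = {"and", "or", "the", "a", "an"}                  # conjunctions, articles
ATTRIBUTES = {"upper", "lower", "inner", "outer", "first", "second"}  # toy attribute lexicon


def _neighbour(words, i, step):
    """Nearest word to the left/right of position i that is not questionable."""
    j = i + step
    while 0 <= j < len(words) and words[j] in QUESTIONABLE:
        j += step
    return words[j] if 0 <= j < len(words) else None


def is_separator(words, i):
    """Decide whether words[i] may open a new segment."""
    w = words[i]
    if w in PREPOSITIONS:
        return True
    if w in QUESTIONABLE:
        # Questionable separator: inspect its environment. It is not a
        # separator when it stands between two similar attributes.
        left = _neighbour(words, i, -1)
        right = _neighbour(words, i, +1)
        return not (left in ATTRIBUTES and right in ATTRIBUTES)
    return False


def split_segments(text):
    """Split a claim text into segments, each opened by a separator word."""
    words = text.lower().split()
    segments, current = [], []
    for i, w in enumerate(words):
        # Assumption: do not cut right after another separator,
        # so that "in" + "the" stay in the same segment.
        has_content = any(c not in PREPOSITIONS | QUESTIONABLE for c in current)
        if is_separator(words, i) and has_content:
            segments.append(current)
            current = []
        current.append(w)
    if current:
        segments.append(current)
    return segments
```

For the phrase "a piston with upper and lower rings in the cylinder" the sketch keeps "upper and lower rings" together, because the conjunction stands between two attributes, and opens new segments at "with" and "in".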

The text of patent claims in the "Official Gazette", with up to a few hundred words, is designed in the form of a single sentence, which complicates the understanding of the invention. Therefore, an attempt was made to develop formalized rules for dividing continuous text into segments and presenting them as separate sentences. Here sentence separators were used too, followed by the analysis of their environment in case of controversy. When separate parts were presented in the form of independent phrases, the participles of absolute participle constructions were converted to finite verbal forms. A noun or nominal group denoting a part of the invention (at itemization) was considered to be the subject, and before it the predicate «imeetsya» ("there is") was inserted.

The analysis of segments was intended primarily to establish the relationships between the words of the English text. If the relationships between the words within the segment are known, it becomes possible to determine the character of relationships between the units of the equivalent Russian segment [9].

3. Identification and analysis of nominal word groups

High quality work with multicomponent noun phrases is largely determined by objective criteria of identifying their structure, otherwise correct clarification of the lexical meaning of complex entities is impossible. Therefore it was decided to use in the MT system the probability analysis of the phrases' structure based on some statistical data quite regularly identifying the types of structural and the corresponding semantic relationships. Admissibility of probability estimates in identifying structural models of multicomponent combinations was tested on a sample of technical texts, which contained about 20,000 two-component and about 5000 multicomponent phrases. Based on this analysis, the diversity of nominal groups reduces to a finite set of models that reflect a summary of their structure and composition. These structural models helped to identify some objective signs that quite regularly point to the relative degree of stability of relationships between the components of the word combination.

The automatic analysis of nominal groups in machine translation was preceded by their identification in the text. Usually, the left boundary of the group was indicated by an article or any other word that acts as a determinative. The right boundary was defined by the core noun itself. The role of prepositive elements of nominal groups may be played by the words in the following classes: defining words, i.e. adjectives, participles, pronouns, ordinal numbers (M), nouns (N), adverbs (D) and cardinal numbers (Nu).

In order to automatically analyze the claims, all recorded nominal groups were combined into a finite set of structural models. A structural model is a category representing, first of all, two related concepts: a) a distributive model, the sequence of the above indices of classes/subclasses of words, which include components of the nominal group; b) a constructive model, the type of syntactic connections between components of the group. These were later supplemented by a semantic model representing the type of generalized semantic relations between the components of the word group [10].
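Identification of a nominal group and derivation of its distributive model might look roughly like the following Python sketch. The lexicon, the list of determinatives and the function name `find_nominal_groups` are hypothetical; the simplification that a group is an uninterrupted run of known words after a determinative, trimmed to its last noun, is an assumption and not a rule taken from the paper.

```python
# Toy lexicon (hypothetical): word -> class index as used in the text.
# M = defining word (adjective, participle, pronoun, ordinal number),
# N = noun, D = adverb, Nu = cardinal number.
LEXICON = {
    "internal": "M", "additional": "M", "rotary": "M",
    "combustion": "N", "engine": "N", "fuel": "N", "pump": "N",
    "highly": "D", "two": "Nu",
}
DETERMINATIVES = {"a", "an", "the", "said", "each"}


def find_nominal_groups(words):
    """Return a list of (group_words, distributive_model) pairs."""
    groups, i = [], 0
    while i < len(words):
        if words[i] in DETERMINATIVES:
            # The left boundary is the determinative itself.
            j = i + 1
            # Collect the uninterrupted run of words with a known class.
            while j < len(words) and words[j] in LEXICON:
                j += 1
            run = words[i + 1:j]
            # The right boundary is the core noun: drop anything after the last N.
            while run and LEXICON[run[-1]] != "N":
                run.pop()
            if run:
                groups.append((run, "".join(LEXICON[w] for w in run)))
            i = j
        else:
            i += 1
    return groups
```

Applied to "the internal combustion engine drives an additional fuel pump", the sketch finds two groups that share the same sequence of class indices, MNN, even though their internal syntactic connections differ.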

Two-component word combinations had one of the following three distributive models: MN, NuN and NN. The analysis of three-component word combinations proved a great deal more complicated because of the increased number of required versions for analysis. Thus on the level of word classes the following 7 distributive models have been established: MMN, DMN, MNN, NMN, NuMN, NuNN and NNN.

In the case of four-component word combinations the number of distributive models amounted to 15, and so on. Word combinations with more than 8 components were not analyzed and were translated word for word. Since they occurred very rarely (less than 1% of the total number of word combinations), this did not essentially impair the quality of the translation.

The increase in the number of components significantly raises the complexity of the automated analysis of a nominative group. This was caused by the increasing diversity of syntactic relations. As a result, a three-component nominative group may have different constructive models. When analyzing a three-component nominative group with the distributive model MNN it is necessary, above all, to choose the noun to which the determiner M relates. For example, the distributive model MNN can match the constructive model ((xy)z), as in internal combustion engine, or (x(yz)), as in additional fuel pump.
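One way to picture the choice between the two constructive models is to compare how stable each adjacent pair of words is. The Python sketch below does this with made-up numbers: the `PAIR_STABILITY` scores and the function `constructive_model` are hypothetical illustrations of a probability-based decision, not the statistical data or the rules of the original system.

```python
# Hypothetical stability scores for adjacent word pairs, e.g. how regularly
# the pair occurs on its own as a two-component term in technical texts.
PAIR_STABILITY = {
    ("internal", "combustion"): 0.9,
    ("combustion", "engine"): 0.4,
    ("additional", "fuel"): 0.1,
    ("fuel", "pump"): 0.8,
}


def constructive_model(x, y, z):
    """Pick a bracketing for the three-component group x y z."""
    left = PAIR_STABILITY.get((x, y), 0.0)
    right = PAIR_STABILITY.get((y, z), 0.0)
    if left > right:
        # The first two components form the more stable unit.
        return "((xy)z)", ((x, y), z)
    # Otherwise the last two components are grouped first.
    return "(x(yz))", (x, (y, z))
```

With these assumed scores, internal combustion engine is bracketed as ((xy)z) and additional fuel pump as (x(yz)), in line with the two examples in the text.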

Prior to the operation of the basic blocks of text analysis, grammatical homonymy of words was eliminated by analyzing the grammatical characteristics of the surrounding words. For example, a verb cannot be directly preceded by an article.
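The article rule mentioned above is easy to state as code. The following Python sketch applies only that single rule; the dictionary `CANDIDATES`, the class labels and the function `resolve_homonyms` are hypothetical, and a real system would need many more context rules.

```python
ARTICLES = {"a", "an", "the"}

# Hypothetical dictionary entries: each word form lists its possible classes.
CANDIDATES = {
    "pump": {"N", "V"},
    "drive": {"N", "V"},
    "fuel": {"N", "V"},
    "the": {"ART"},
}


def resolve_homonyms(words):
    """Return, for each word, the set of classes left after the context rule."""
    resolved = []
    for i, w in enumerate(words):
        classes = set(CANDIDATES.get(w, {"UNKNOWN"}))
        # Rule from the text: a verb cannot be directly preceded by an article.
        if i > 0 and words[i - 1] in ARTICLES and len(classes) > 1:
            classes.discard("V")
        resolved.append(classes)
    return resolved
```

Here "pump" after "the" is reduced to a noun, while "pump" on its own keeps both readings and would have to be resolved by other rules.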

4. The synthesis of the Russian text

In accordance with the adopted structure of the algorithm, the basic information required to obtain correct grammatical forms of Russian equivalents was worked out in the process of analyzing the English text. Each word form in the English part of the vocabulary was accompanied by certain grammatical and lexical information which enabled the algorithm to operate with the words without having recourse to their particular lexical meaning. The Russian language part of the vocabulary was represented by stems of the Russian equivalents and by tables of inflections helping to construct the corresponding word forms in the process of morphological synthesis. The total volume of the vocabulary used as the basis of accomplishing the experimental translation amounted to approximately 5000 entries covering the subject matter of internal combustion engines.

After the establishment of the number of components of a given nominal word combination, its model was compared with the corresponding list of models having the same number of components. Each model in the list was furnished with the information on the positions and forms of the Russian words in the equivalent Russian word combination. Simultaneously some rearrangements ensuring a more correct construction of Russian phrases were made.

The information worked out in the blocks analyzing the English text was constituted by the following data:

- computer address of the beginning of the Russian vocabulary item containing the stem of the equivalent to be retrieved,

- class (part of speech) of the Russian equivalent,

- number, case (for nouns),

- gender/number, case, index showing that a shortened form exists (for adjectives),

- indication of the transitivity and one of the three forms of conjugation (for verbs),

- indication of transitivity for active participles and present participles (verbal adverbs).

In the process of the algorithm operation these data were complemented by the information derived from the Russian language part of the vocabulary. For example, the phrase internal combustion engine, having the structural model ((MN)N) with a more stable relationship between the components MN, will get the Russian equivalent "dvigatel vnutrennego sgoraniya" and not "vnutrenniy sgoranie dvigatel". Depending on the size of the patent claims involved (150-300 words) the translation time varied between 2 and 5 min (on a computer whose speed was about [?]000 operations per second). Machine translation samples of the first paragraph of patent claims (U.S. Patent # 3,076,446 "Rotary internal combustion engine") are attached to the referenced article [11].
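The internal combustion engine example can be reproduced with a small Python sketch that combines stems with inflection tables and reorders the components according to the model ((MN)N). Everything here is an illustrative assumption: the transliterated stems, the two declension classes, the adjective endings and the function names are hypothetical and cover only the forms needed for this one phrase.

```python
# Hypothetical mini-vocabulary in transliteration: stem plus grammatical data.
VOCAB = {
    "internal":   {"stem": "vnutrenn", "cls": "M"},
    "combustion": {"stem": "sgorani",  "cls": "N", "gender": "n", "decl": "neut_ie"},
    "engine":     {"stem": "dvigatel", "cls": "N", "gender": "m", "decl": "masc_soft"},
}

# Toy inflection tables: only nominative and genitive singular are listed.
NOUN_ENDINGS = {
    "neut_ie":   {"nom": "e", "gen": "ya"},
    "masc_soft": {"nom": "",  "gen": "ya"},
}
ADJ_ENDINGS = {
    ("m", "nom"): "iy", ("m", "gen"): "ego",
    ("n", "nom"): "ee", ("n", "gen"): "ego",
}


def noun_form(word, case):
    entry = VOCAB[word]
    return entry["stem"] + NOUN_ENDINGS[entry["decl"]][case]


def adj_form(word, gender, case):
    return VOCAB[word]["stem"] + ADJ_ENDINGS[(gender, case)]


def synthesize_mn_n(m, n1, n2):
    """English ((M N1) N2) -> Russian: N2 in the nominative, then M and N1 in the genitive."""
    gender = VOCAB[n1]["gender"]          # the adjective agrees with N1
    return " ".join([noun_form(n2, "nom"),
                     adj_form(m, gender, "gen"),
                     noun_form(n1, "gen")])


def word_for_word(words):
    """Baseline: dictionary forms in the English word order."""
    out = []
    for w in words:
        if VOCAB[w]["cls"] == "M":
            out.append(adj_form(w, "m", "nom"))
        else:
            out.append(noun_form(w, "nom"))
    return " ".join(out)
```

Calling `synthesize_mn_n("internal", "combustion", "engine")` yields "dvigatel vnutrennego sgoraniya", while `word_for_word(["internal", "combustion", "engine"])` yields the ungrammatical "vnutrenniy sgoranie dvigatel" quoted in the text.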

5. Conclusion

The TSNIIPI MT experimental system contained the linguistic part and the program of its implementation on the computer. The linguistic part of the system comprised an algorithm, an automatic vocabulary, and lists and tables which were used in the process of the algorithm operation. The basis and the most complicated component of the proposed MT system was the algorithm, which may be regarded as the totality of the rules of processing the information contained in the vocabulary, word lists and tables.

The system contained a binary specialized English-Russian algorithm, focused on the translation of publications of the US weekly "Official Gazette", presented by the first items of patent claims. They are characterized by an abundance of difficult-to-grasp multicomponent terminological combinations and by a specific syntactic structure of unusually long sentences containing up to several hundred words. The algorithm comprised two large blocks which accomplished in succession the search of the text words in the automatic vocabulary, assignment of grammatical information to the words not found in the vocabulary, analysis of idioms, elimination of grammatical homonyms, segmentation of the text, division of long sentences into phrases, finding the antecedents of pronominal words, working out of the case information, analysis of predicative elements and nominal word combinations, and synthesis of the Russian text.

The set of programs based on the algorithm of automatic translation comprised approximately 20,000 commands. This set included the following groups of programs:

- preliminary text processing,

- syntactical analysis of segments,

- synthesis of the Russian text,

- auxiliary programs.

The TSNIIPI machine translation system was in the main developed in 1963-1966. Numerous experiments have confirmed the efficiency of the system and its ability to ensure the quality of translating patent claims within the given parameters, substantially exceeding that of word-for-word translation. Later the system was praised in the Annual Report of the Chief Scientific Secretary of the Academy of Sciences of the USSR as one of the most important achievements in the natural and social sciences in 1966 [12].

But the described project could not form the basis for a full-scale patent MT system in the absence of representative arrays of machine-readable patent documentation and faster computers with much more memory. A number of complicated linguistic problems also still waited to be solved. Investigations revealed that the global state of information theory and practice and the existing information technology allowed practical realization of only those automatic systems that did not propose complex semantic analysis of documents and processing of large information bodies. Therefore, the group of TSNIIPI researchers switched to the solution of less ambitious but more relevant and practical tasks [13].

Similar trends became evident at that time in some other countries. In 1964 the U.S. Academy of Science set up a committee to investigate the feasibility of computer translation. In 1966 the Committee published its report as the Automatic Language Processing Advisory Committee (ALPAC) Report. After studying the programs in America and Europe and making comparisons of computer translations with translations done by humans, it concluded that computer translation was inferior to human translation not only in terms of quality but also cost.

The Committee recommended expenditures in computational linguistics: semantics, statistics, quantitative linguistic matters, including experiments in translation, with machine aids or without [5].

Finally the TSNIIPI researchers prepared a detailed description of the improved patent MT system that was published in The TSNIIPI Papers [14]. Moreover, the accumulated experience in developing the operational MT system and the results of in-depth analysis of MT theory and practice all over the world were also presented [15]. There is reason to believe that these works would be useful to the next generation of MT developers. In particular, the experience gained from the development and subsequent operational testing of the TSNIIPI patent claims MT system gives grounds to conclude that currently popular methods of statistical MT should be used in combination with the traditional lexico-grammatical rules, which make it possible to penetrate into the essence of the compared languages.

References

[1] Och FJ. Statistical machine translation: foundations and recent advances. The tenth machine translation summit, Phuket, Thailand. http://www.mt-archive.info/MTS-2005.

[2] Google Translate tangles with computer learning. Lee Gomes, Forbes Magazine; 9/8/2010.

[3] Google translates "Ivan the Terrible" as "Abraham Lincoln". www.googleblognewschannel.com.

[4] List J. Review of machine translation in patents. World Patent Information 2012;34:193-5.

[5] Roberts AH, Zarechnak M. Mechanical translation (see "Operational systems"). From: Current trends in linguistics, vol. 12. Mouton: The Hague-Paris; 1974. www.mt-archive.info/Roberts-1974.pdf.

[6] Hutchins J. Historical survey of machine translation in Eastern and Central Europe (see "Operational and commercial systems"). In: The conference on crosslingual language technology in service of an integrated multilingual Europe, 4-5 May 2012, Hamburg, Germany. www.hutchinsweb.me.uk/Hamburg-2012.pdf.

[7] Shvarts AM, Trakhtenberg EA, Bruk BM, Purto VA, Fishkina VL. The experience of word for word translation of patent literature from English into Russian using Strela computer. Scientific and Technical Information 1963;2:42-9. [Russian].

[8] Kravets LG. Machine translation in patent information system. Information on Inventions, vol. 12. Moscow: The State Committee on Inventions and Discoveries; 1964. p. 15-18. [Russian].

[9] Kravets LG. Structural analysis of phrases in English scientific and technical texts. Scientific and Technical Information 1963;10:39-41. [Russian].

[10] Kravets LG, Emdina VM. Automatic analysis of English nominative groups. In: Proceedings of the 3rd All-Union conference on information retrieval systems and automated processing of scientific and technical information, v. 2, Moscow, 1967, p. 441-9. [Russian].

[11] Kravets LG, Vasilevsky AL. A system for automatic translation of publications from the patent weekly "Official Gazette". Information retrieval among patent offices. The 6th Annual meeting of the ICIREPAT, The Hague, October, 1966, p. 365-79.

[12] The most important achievements in the natural and social sciences in 1966 (see "Cybernetics"). Report of the Chief Scientific Secretary of the Academy of Sciences of the USSR, 1966. www.ras.ru/FStorage/download.aspx?Id=2cccc61c-2122-430d-9d2f [Russian].

[13] Kravets LG. Fifty years of patent information centres in Russia. World Patent Information 2012;34(3):282-5.

[14] The experimental system of English-Russian automatic translation of patent documents. Collection of articles by Kravets LG, Vasilevsky AL, Dubitskaya AM, Emdina VM, Povolotskaya SK, Fishkina, et al. // The TSNIIPI papers. Moscow 1970. p. 132. [Russian].

[15] Vasilevsky AL, Kravets LG, Moskovich VA, Povolotskaya SK, Samoilovich MV, Tarasova GA, et al. English-Russian automatic translation // The TSNIIPI papers. Moscow 1967. p. 220. [Russian].