Closed Bug 983250 Opened 10 years ago Closed 6 years ago

[GSoC2014] [Week 8] Evaluate terminology replacement methods

Categories

(Intellego Graveyard :: General, defect)

Production
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: GPHemsley, Unassigned)


Details

Evaluate each terminology replacement method (all-at-once [bug 983146] and on-the-fly [bug 983148]) for efficiency, based on the criteria established in bug 983149, and analyze whether it would be beneficial to use one method over the other, or whether it would be better to offer a choice of either.
Over these weeks I experimented with various Python libraries for parsing HTML and XML, such as BeautifulSoup and lxml. Using a parser worked well, but it missed some elements. I found that using regular expressions was more reliable for replacing text inside elements, as the approach was more generic.

This week I backtracked to one of the project's earlier objectives and focused on extracting terminology from the TMX files in Transvision. After researching NLTK and reading the Natural Language Processing with Python book (http://www.nltk.org/book/), I decided to tackle this problem again.

I started working on a Python script to do bilingual term extraction with the help of NLTK. NLTK has a module called align, which uses statistical methods to predict translated pairs. NLTK provides a couple of algorithms for extracting terms from aligned sentences (https://github.com/nltk/nltk/wiki/Machine-Translation); I decided to use the IBMModel for now. The script parses the TMX file and builds a list of tuples of aligned translated sentence pairs. Iterating through these pairs, they can be cleaned up with NLTK before being added to a corpus of AlignedSent objects (http://www.nltk.org/howto/align.html). This corpus of aligned sentences is then passed to the IBMModel, which trains on it and works out which words most often occur aligned with their translated counterparts. Each match is given a score indicating how confident the model is that the pair is correct.
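The TMX-parsing step can be sketched with only Python's standard library; the element layout below assumes Transvision's TMX structure (`<tu>`/`<tuv>`/`<seg>`), and the NLTK AlignedSent/IBMModel training that follows this step in the real script is omitted here:

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xml: prefix to the full XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_pairs(tmx_text, src_lang="en-US", trg_lang="es-ES"):
    """Yield (source, target) sentence pairs from a TMX document."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[tuv.get(XML_LANG)] = seg.text.strip()
        if src_lang in segs and trg_lang in segs:
            yield (segs[src_lang], segs[trg_lang])

sample = """<tmx version="1.4"><body>
  <tu tuid="example">
    <tuv xml:lang="en-US"><seg>previously downloaded articles</seg></tuv>
    <tuv xml:lang="es-ES"><seg>artículos descargados previamente</seg></tuv>
  </tu>
</body></tmx>"""

pairs = list(extract_pairs(sample))
# pairs → [("previously downloaded articles", "artículos descargados previamente")]
```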

I continued optimising the script: lowercasing all tokens, eliminating tokens shorter than two characters, eliminating known stop words in each respective language, and keeping only tokens made up entirely of letters.
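These filters can be sketched as a single cleanup function; the stop-word set here is a tiny illustrative stand-in, not NLTK's full per-language list:

```python
def clean_tokens(tokens, stopwords):
    """Lowercase, drop tokens shorter than two characters, drop stop
    words, and keep only purely alphabetic tokens."""
    cleaned = []
    for tok in tokens:
        tok = tok.lower()
        if len(tok) < 2:
            continue  # too short to be a useful term
        if tok in stopwords:
            continue  # known stop word for this language
        if not tok.isalpha():
            continue  # numbers, punctuation, mixed tokens
        cleaned.append(tok)
    return cleaned

EN_STOP = {"the", "or", "for", "a", "will", "not"}  # illustrative subset
result = clean_tokens("Removing or changing the folder for a feed".split(), EN_STOP)
# result → ["removing", "changing", "folder", "feed"]
```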

Throughout these changes I saved the output of the model to CSV files to track the script's improvements.
I imported the CSV of extracted bilingual terms into the database before showing my mentor the translation results. On the surface it did seem that we were translating more words than before. It also meant that the early goal of using the Transvision TMX files to create a corpus of terminology pairs was near completion.
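The per-run CSV snapshots can be written with the stdlib csv module; the three-column layout below mirrors the source,target,score rows quoted later in this report:

```python
import csv
import io

def write_pairs_csv(rows, fh):
    """Write (source, target, score) tuples so successive runs can be diffed."""
    writer = csv.writer(fh)
    writer.writerow(["source", "target", "score"])
    for src, trg, score in rows:
        writer.writerow([src, trg, "%.12g" % score])

buf = io.StringIO()
write_pairs_csv([("articles", "artículos", 0.752740413515)], buf)
# buf now holds a header row plus "articles,artículos,0.752740413515"
```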
My mentor pointed out that words needed to be converted between singular and plural based on their source. For example, if a source word in English is plural but the translation for that term exists only in singular, we need to detect this and convert the target word to a plural before replacement.

I received a couple of links to research:
http://cldr.unicode.org/index/cldr-spec/plural-rules
https://developer.mozilla.org/en-US/docs/Mozilla/Localization/Localization_and_Plurals#Usage

While researching how to use these plural rules in the existing Python term extraction script, I came across the pattern library (http://www.clips.ua.ac.be/pages/pattern). It serves a similar function to NLTK but supports more features and additional languages. The library is written in Python and includes methods to convert words between their singular and plural forms.

Utilising the parsetree module, I analysed the part-of-speech tree of a segment, looking for tags (http://www.clips.ua.ac.be/pages/mbsp-tags) that indicate whether a word is singular or plural. A tag of NN (noun, singular or mass) or NNP (noun, proper singular) indicates that the source word is singular; in that case we convert the word to its singular form in the target language, and vice versa for plurals.
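A sketch of this number-matching rule follows. The toy pluralize/singularize helpers below are stand-ins for the pattern library's conversion functions (an assumption, to keep the sketch self-contained); the tag sets follow the MBSP tag list linked above:

```python
def pluralize(word):
    """Toy pluralizer: stand-in for the pattern library's real rules."""
    if word.endswith(("s", "x", "ch", "sh")):
        return word + "es"
    return word + "s"

def singularize(word):
    """Toy singularizer: stand-in for the pattern library's real rules."""
    if word.endswith("es") and word[:-2].endswith(("s", "x", "ch", "sh")):
        return word[:-2]
    if word.endswith("s"):
        return word[:-1]
    return word

SINGULAR_TAGS = {"NN", "NNP"}    # noun singular/mass, proper noun singular
PLURAL_TAGS = {"NNS", "NNPS"}    # noun plural, proper noun plural

def match_number(source_tag, target_word):
    """Convert the target word to the grammatical number of the source word."""
    if source_tag in SINGULAR_TAGS:
        return singularize(target_word)
    if source_tag in PLURAL_TAGS:
        return pluralize(target_word)
    return target_word

# e.g. a plural English source ("articles", tagged NNS) forces the
# Spanish target into its plural form:
converted = match_number("NNS", "artículo")
# converted → "artículos"
```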
Continuing on from last week's small improvement of accommodating singular and plural words, we kept thinking of ways to improve the translation engine. I pointed out that certain words were still incorrect because of how they are aligned in the Transvision file.

	<tu tuid="mail/chrome/messenger-newsblog/feed-subscriptions.dtd:subscriptionDesc.label" srclang="en-US">
		<tuv xml:lang="en-US"><seg>Note: Removing or changing the folder for a feed will not affect previously downloaded articles.</seg></tuv>
		<tuv xml:lang="es-ES"><seg>Nota: eliminar o cambiar la carpeta de un canal no afectará a los artículos descargados previamente.</seg></tuv>
	</tu>

> articles,caducados,0.665205148335

In the above example the words "articles" and "artículos" should be paired together, as they are the correct translated pair. However, you can see that the positions of the two words within their sentences are not exactly aligned, so the algorithm returns an incorrect pair. My mentor suggested that I develop an algorithm to align the words based on the output of a POS tagger.

I developed a function that takes both sentences and runs them through a POS tagger. It then compares the tag of each source word and matches it to a target word carrying the same tag, continuing this way while iterating through the sentence. The resulting pairs are passed back and appended to the corpus. I appended these results instead of replacing the originals because doing so increases the accuracy of the results without reducing the number of translated pairs.
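The matching procedure can be sketched as follows. The tagged input here is hand-written for illustration; in a real run the (word, tag) lists would come from a POS tagger such as pattern's parsetree:

```python
def align_by_pos(src_tagged, trg_tagged):
    """Pair each source word with the first not-yet-used target word
    that carries the same POS tag, walking both sentences in order."""
    used = set()
    pairs = []
    for s_word, s_tag in src_tagged:
        for i, (t_word, t_tag) in enumerate(trg_tagged):
            if i not in used and t_tag == s_tag:
                pairs.append((s_word, t_word))
                used.add(i)
                break
    return pairs

# Hand-tagged fragment of the feed-subscriptions example above;
# note the word order differs between the two languages.
src = [("downloaded", "VBN"), ("articles", "NNS")]
trg = [("artículos", "NNS"), ("descargados", "VBN")]

aligned = align_by_pos(src, trg)
# aligned → [("downloaded", "descargados"), ("articles", "artículos")]
```

Even though "artículos" comes first in the Spanish sentence, the shared NNS tag pairs it with "articles" rather than with whatever happens to sit in the same position.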

After this change I noticed that the previously misdetected pairs were fixed.

	articles,artículos,0.752740413515
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED