Closed Bug 983140 Opened 10 years ago Closed 6 years ago

[GSoC2014] [Week 1] Bilingual termbase creation

Categories

(Intellego Graveyard :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gueroJeff, Unassigned)


Details

Create a bilingual termbase consisting of Mozilla-specific terminology from Mozilla l10n resources.

This can be done by exporting TMX files from Transvision, performing terminology extraction, and inserting the resulting terminology units into TBX format.
Keywords: meta
Summary: [meta] Bilingual termbase creation → [GSoC2014] [Week 1] Bilingual termbase creation
Comparison of tools here:
http://michellez1231.blogspot.co.uk/2010/07/research-on-terminology-extraction.html

I've come across the Bilingual Extraction Tool from Heartsome Europe before - it seems to take TMX as input.

Most of these tools, however, are not free, and if they do offer a free version there is usually a limit on usage or time.
It seems tmxeditor is doing a 1:1 conversion of the TMX file to TBX; no terms appear to be extracted.
This looks interesting. However, it seems to be Windows-only - https://code.google.com/p/extract-tmx-corpus/. I'm putting together a VM so I can test it out.
(In reply to Tharshan from comment #1)
> Comparison of tools here:
> http://michellez1231.blogspot.co.uk/2010/07/research-on-terminology-
> extraction.html
> 
> I've come across the Bilingual Extraction Tool from Heartsome Europe before -
> it seems to take TMX as input.
> 
> Most of these tools, however, are not free, and if they do offer a free
> version there is usually a limit on usage or time.

Limited usage is fine for the purposes of this project; however, a 1:1 conversion between TMX and TBX seems counterintuitive.
(In reply to Tharshan from comment #4)
> This looks interesting. However it seems to be windows only -
> https://code.google.com/p/extract-tmx-corpus/. Putting together a VM so I
> can test it out.

Let me know if you need any help :-)
(In reply to Jeff Beatty [:gueroJeff] from comment #6)
> (In reply to Tharshan from comment #4)
> > This looks interesting. However it seems to be windows only -
> > https://code.google.com/p/extract-tmx-corpus/. Putting together a VM so I
> > can test it out.
> 
> Let me know if you need any help :-)

Testing it out on a Windows 7 VM. The program can be launched, but I have been unsuccessful in creating a corpus; it seems to be having trouble reading the TMX file. The source is available, and since it's based on Python, I should be able to debug it and run it from source.

We might have to circle back to this if we do not find another working alternative.
From the article linked above, this looked like what we wanted - http://www.heartsome.de/en/termextraction.php. However, I cannot seem to open the program in my VM. I have the dependencies installed (> Java 6). It was a very limited trial though (20 pairs).
Yikes! That is very limited.
(In reply to Jeff Beatty [:gueroJeff] from comment #10)
> Here are a bunch of others to look at:
> https://code.google.com/p/maui-indexer/
> https://pypi.python.org/pypi/topia.termextract/
> http://okapi.sourceforge.net/Release/Utilities/Help/termextraction.htm
> http://ngram.sourceforge.net/
> http://texlexan.sourceforge.net/

Thanks! I have been taking a look at these.

I got some promising results with the Okapi term extraction. I was able to get the program to extract terms in order of frequency. It generates a text file like so:
EN:
430	message
360	You
349	file
332	server
305	page
304	messages
291	brandShortName
277	want
ES:
614	de
347	en
290	mensaje
289	que
234	página
232	web
220	para
201	conexión
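
For reference, here is a rough Python sketch (not the Okapi tool itself) of how a frequency-ranked candidate list like the one above could be produced straight from a Transvision TMX export. The tokenisation is deliberately naive, and the file name is the memoire_en-US_es-ES.tmx export mentioned later in this bug.
```
# Rough sketch: frequency-ranked term candidates per language from a TMX file.
# This is NOT what Okapi does internally; it simply tokenises each <seg> and
# counts word frequencies, giving output similar in spirit to the list above.
import re
import xml.etree.ElementTree as ET
from collections import Counter

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # key for xml:lang

def frequency_candidates(tmx_path, lang):
    counts = Counter()
    for tuv in ET.parse(tmx_path).iter("tuv"):
        if tuv.get(XML_LANG) != lang:
            continue
        seg = tuv.find("seg")
        if seg is not None and seg.text:
            counts.update(re.findall(r"\w+", seg.text))
    return counts

if __name__ == "__main__":
    # File name assumed from the Transvision export discussed in this bug.
    candidates = frequency_candidates("memoire_en-US_es-ES.tmx", "en-US")
    for token, freq in candidates.most_common(10):
        print(freq, token)
```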
(In reply to Tharshan from comment #11)
> I got some promising results with the okapi term extraction. I was able to
> get the program to extract the terms in order of frequency. It generates a
> text file like so:
> EN:
> 430	message
> 360	You
> 349	file
> 332	server
> 305	page
> 304	messages
> 291	brandShortName
> 277	want
> ES:
> 614	de
> 347	en
> 290	mensaje
> 289	que
> 234	página
> 232	web
> 220	para
> 201	conexión

Better check that it's doing entity extraction/resolution correctly. AIUI, "brandShortName" is actually part of "&brandShortName;", which resolves to something like "Firefox" in the final product.
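
A hedged sketch of what such entity handling could look like in Python follows; the entity map and the example sentence are invented for illustration and are not Mozilla's actual brand DTD.
```
# Sketch of pre-processing segments so DTD entity references such as
# "&brandShortName;" are resolved (or dropped) before counting terms.
# The entity map below is a made-up example, not Mozilla's real brand DTD.
import re

ENTITY_MAP = {
    "brandShortName": "Firefox",          # assumed resolution, per the comment above
    "brandFullName": "Mozilla Firefox",   # assumed
}

ENTITY_RE = re.compile(r"&(\w+);")

def resolve_entities(segment, entities=ENTITY_MAP, drop_unknown=True):
    """Replace known entity references; strip (or keep) unknown ones."""
    def _sub(match):
        name = match.group(1)
        if name in entities:
            return entities[name]
        return "" if drop_unknown else match.group(0)
    return ENTITY_RE.sub(_sub, segment)

# Invented example string, just to show the behaviour:
print(resolve_entities("Do you want &brandShortName; to remember this password?"))
# -> "Do you want Firefox to remember this password?"
```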
-- Blog Entry Week 1

This summer I will be working on a GSoC project with the Intellego team at Mozilla to lay the foundation for an automatic terminology translation tool for websites. During the first few weeks I will be researching the basics of how terminology extraction works.

The first step is to create a bilingual termbase consisting of Mozilla-specific terminology from Mozilla l10n resources (http://transvision.mozfr.org/downloads/). The Transvision site hosts TMX files used in the Mozilla browser and Firefox OS, which contain translations of text between many pairs of languages.

Here is a snippet from the memoire_en-US_es-ES.tmx file:
```
<tu tuid="browser/chrome/browser/browser.dtd:bookmarkThisPageCmd.label" srclang="en-US">
	<tuv xml:lang="en-US"><seg>Bookmark This Page</seg></tuv>
	<tuv xml:lang="es-ES"><seg>Añadir esta página a marcadores</seg></tuv>
</tu>
```
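As a rough illustration (not project code yet), the aligned segments in a file like this can be read with nothing more than Python's standard library; the read_pairs name and its defaults are assumptions made for this sketch.
```
# Minimal sketch: read aligned en-US/es-ES segments from a TMX file like the
# one shown above, using only Python's standard library.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # key for xml:lang

def read_pairs(tmx_path, source="en-US", target="es-ES"):
    pairs = []
    for tu in ET.parse(tmx_path).iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[tuv.get(XML_LANG)] = seg.text
        if source in segs and target in segs:
            pairs.append((tu.get("tuid"), segs[source], segs[target]))
    return pairs

# Example usage:
# for tuid, en, es in read_pairs("memoire_en-US_es-ES.tmx"):
#     print(tuid, en, "->", es)
```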
The goal over the coming weeks is to extract the key terms in these phrases and statistically analyse them to build up a corpus of terms that we know are direct translations between the languages. Since this involves many different techniques, we decided it was best to use third-party software, preferably open source, to carry out this task for us.

Using this corpus, a web interface will be created to dynamically replace the DOM contents of a web page with the one-to-one translation mappings collected.

Ideally, we want to convert the TMX files into TBX files containing translations of key terms that we can use to automatically translate websites.
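
To make that target concrete, here is a minimal Python sketch of what the TMX-to-TBX step could look like, assuming a heavily simplified TBX skeleton (a real TBX/TBX-Basic file needs a martifHeader and richer metadata); the term pair and output file name are hypothetical.
```
# Very small sketch of writing extracted term pairs as a simplified TBX
# skeleton. Real TBX/TBX-Basic needs a martifHeader and richer metadata;
# this only illustrates the termEntry/langSet/tig/term nesting.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialised as xml:lang

def write_tbx(term_pairs, out_path, source="en-US", target="es-ES"):
    martif = ET.Element("martif", {"type": "TBX"})
    body = ET.SubElement(ET.SubElement(martif, "text"), "body")
    for src_term, tgt_term in term_pairs:
        entry = ET.SubElement(body, "termEntry")
        for lang, term in ((source, src_term), (target, tgt_term)):
            lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})
            ET.SubElement(ET.SubElement(lang_set, "tig"), "term").text = term
    ET.ElementTree(martif).write(out_path, encoding="utf-8", xml_declaration=True)

# Hypothetical term pair and output name, just to show the call:
write_tbx([("bookmark", "marcador")], "mozilla-terms_en-US_es-ES.tbx")
```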
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED