Closed
Bug 983140
Opened 10 years ago
Closed 6 years ago
[GSoC2014] [Week 1] Bilingual termbase creation
Categories
(Intellego Graveyard :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: gueroJeff, Unassigned)
References
Details
Create a bilingual termbase of Mozilla-specific terminology from Mozilla l10n resources. This can be done by exporting TMX files from Transvision, performing terminology extraction, and inserting the resulting terminology units into TBX format.
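For context, a single TBX term entry pairing an English term with its Spanish equivalent could be built as sketched below. This is a minimal illustration using TBX-style element names (termEntry/langSet/tig/term), not the output of any tool discussed in this bug; the example terms are made up.

```python
import xml.etree.ElementTree as ET

def term_entry(entry_id, en_term, es_term):
    """Build one TBX-style termEntry pairing an English term with its Spanish equivalent."""
    entry = ET.Element("termEntry", id=entry_id)
    for lang, term in (("en-US", en_term), ("es-ES", es_term)):
        lang_set = ET.SubElement(entry, "langSet", {"xml:lang": lang})
        tig = ET.SubElement(lang_set, "tig")  # term information group
        ET.SubElement(tig, "term").text = term
    return entry

print(ET.tostring(term_entry("t1", "bookmark", "marcador"), encoding="unicode"))
```

A real termbase would wrap many such entries in the TBX header/body skeleton; this only shows the per-term unit.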
Updated•10 years ago
Summary: [meta] Bilingual termbase creation → [GSoC2014] [Week 1] Bilingual termbase creation
Comparison of tools here: http://michellez1231.blogspot.co.uk/2010/07/research-on-terminology-extraction.html

I've come across the Bilingual Extraction Tool from Heartsome Europe before; it seems to take TMX as input. Most of these tools, however, are not free, and those that do offer a free version usually limit usage or time.
This looks promising: http://www.heartsome.net/tmxeditor8/index.html
It seems TMXEditor is doing a 1:1 conversion of the TMX file to TBX; there do not appear to be any terms extracted.
This looks interesting; however, it seems to be Windows-only - https://code.google.com/p/extract-tmx-corpus/. Putting together a VM so I can test it out.
Reporter
Comment 5 • 10 years ago
(In reply to Tharshan from comment #1)
> I've come across the Bilingual Extraction Tool from Heartsome Europe before;
> it seems to take TMX as input.
>
> Most of these tools, however, are not free, and those that do offer a free
> version usually limit usage or time.

Limited usage is fine for the purposes of this project; however, a 1:1 conversion between TMX and TBX seems counterintuitive.
Reporter
Comment 6 • 10 years ago
(In reply to Tharshan from comment #4)
> This looks interesting. However it seems to be Windows-only -
> https://code.google.com/p/extract-tmx-corpus/. Putting together a VM so I
> can test it out.

Let me know if you need any help :-)
(In reply to Jeff Beatty [:gueroJeff] from comment #6)
> Let me know if you need any help :-)

Testing it out on a Windows 7 VM. The program can be launched, but I have been unsuccessful in creating a corpus; it seems to be having trouble reading the TMX file. The source is available, and since it's based on Python, I should be able to debug it and run it from source. We might have to circle back to this if we do not find another working alternative.
From the article linked above, this looked like what we wanted - http://www.heartsome.de/en/termextraction.php. However, I cannot seem to open the program in my VM, even though I have the dependencies installed (Java 6 or later). It was a very limited trial anyway (20 pairs).
Reporter
Comment 9 • 10 years ago
Yikes! That is very limited.
Reporter
Comment 10 • 10 years ago
Here are a bunch of others to look at:
https://code.google.com/p/maui-indexer/
https://pypi.python.org/pypi/topia.termextract/
http://okapi.sourceforge.net/Release/Utilities/Help/termextraction.htm
http://ngram.sourceforge.net/
http://texlexan.sourceforge.net/
Comment 11 • 10 years ago
(In reply to Jeff Beatty [:gueroJeff] from comment #10)
> Here are a bunch of others to look at:
> https://code.google.com/p/maui-indexer/
> https://pypi.python.org/pypi/topia.termextract/
> http://okapi.sourceforge.net/Release/Utilities/Help/termextraction.htm
> http://ngram.sourceforge.net/
> http://texlexan.sourceforge.net/

Thanks! I have been taking a look at these. I got some promising results with the Okapi term extraction: I was able to get the program to extract the terms in order of frequency. It generates a text file like so:

EN:
430 message
360 You
349 file
332 server
305 page
304 messages
291 brandShortName
277 want

ES:
614 de
347 en
290 mensaje
289 que
234 página
232 web
220 para
201 conexión
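The Okapi output above is a frequency-ordered term list. Purely as an illustration of that idea (not Okapi's actual algorithm, which also handles tokenization rules and multi-word terms), a minimal Python sketch producing the same "count term" format:

```python
import re
from collections import Counter

def term_frequencies(segments):
    """Count word frequencies across segment strings,
    mimicking the 'count term' lines in the Okapi output."""
    counts = Counter()
    for seg in segments:
        counts.update(re.findall(r"\w+", seg))
    return counts

segs = ["Bookmark This Page", "Print This Page", "Save Page As"]
for term, count in term_frequencies(segs).most_common():
    print(count, term)
```

A real extractor would also filter stopwords; note how "de", "en", and "que" dominate the Spanish list above for exactly that reason.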
Comment 12 • 10 years ago
(In reply to Tharshan from comment #11)
> I got some promising results with the Okapi term extraction: I was able to
> get the program to extract the terms in order of frequency.

Better check that it's doing entity extraction/resolution correctly. AIUI, "brandShortName" is actually part of "&brandShortName;", which resolves to something like "Firefox" in the final product.
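One way to handle this would be to expand DTD-style entity references before counting terms. A minimal sketch of that pre-processing step; the ENTITIES map here is hypothetical (the real expansions live in the product's .dtd files and vary per locale):

```python
import re

# Hypothetical entity map; real values come from the product's .dtd files.
ENTITIES = {"brandShortName": "Firefox", "brandFullName": "Mozilla Firefox"}

def resolve_entities(text):
    """Replace &name; references with their expansion so the extractor
    counts 'Firefox' instead of the raw entity name. Unknown entities
    are left untouched."""
    return re.sub(r"&(\w+);", lambda m: ENTITIES.get(m.group(1), m.group(0)), text)

print(resolve_entities("Welcome to &brandShortName;"))  # → Welcome to Firefox
```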
Comment 13 • 10 years ago
-- Blog Entry Week 1

This summer I will be working on a GSoC project with the Intellego team at Mozilla to lay the foundation for an automatic terminology translation tool for websites. During the first weeks I will be researching the basics of how terminology extraction works.

The first step is to create a bilingual termbase consisting of Mozilla-specific terminology from Mozilla l10n resources (http://transvision.mozfr.org/downloads/). The Transvision site holds TMX files for the Mozilla browser and Firefox OS, which contain translations of text between many pairs of languages. Here is a snippet from the memoire_en-US_es-ES.tmx file:

```
<tu tuid="browser/chrome/browser/browser.dtd:bookmarkThisPageCmd.label" srclang="en-US">
  <tuv xml:lang="en-US"><seg>Bookmark This Page</seg></tuv>
  <tuv xml:lang="es-ES"><seg>Añadir esta página a marcadores</seg></tuv>
</tu>
```

The goal over the coming weeks is to extract the key terms in these phrases and statistically analyse them, building up a corpus of terms that we know are direct translations between the languages. Since this involves many different techniques, we decided it was best to use third-party software, preferably open source, to carry out this task for us.

Using this corpus, a web interface would be created to dynamically replace the DOM contents of a web page with the one-to-one translation mappings collected. Ideally we want to convert the TMX files to TBX files containing translations of key terms that we can use to automatically translate websites.
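Reading segment pairs out of a TMX file like the snippet above is straightforward with Python's standard library. A minimal sketch, assuming the flat tu/tuv/seg structure shown (real Transvision exports also carry a header and more metadata):

```python
import xml.etree.ElementTree as ET

TMX = """<tmx><body>
<tu tuid="browser/chrome/browser/browser.dtd:bookmarkThisPageCmd.label" srclang="en-US">
  <tuv xml:lang="en-US"><seg>Bookmark This Page</seg></tuv>
  <tuv xml:lang="es-ES"><seg>Añadir esta página a marcadores</seg></tuv>
</tu>
</body></tmx>"""

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # expanded form of xml:lang

def segment_pairs(root, src="en-US", tgt="es-ES"):
    """Yield (source, target) segment pairs from TMX translation units."""
    for tu in root.iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        if src in segs and tgt in segs:
            yield segs[src], segs[tgt]

pairs = list(segment_pairs(ET.fromstring(TMX)))
print(pairs)  # → [('Bookmark This Page', 'Añadir esta página a marcadores')]
```

These aligned phrase pairs are the raw input the term-extraction tools discussed in this bug operate on.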
Reporter
Updated • 6 years ago
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED