Closed Bug 983140 Opened 10 years ago Closed 6 years ago

[GSoC2014] [Week 1] Bilingual termbase creation

Categories

(Intellego Graveyard :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gueroJeff, Unassigned)


Details

Create a bilingual termbase consisting of Mozilla-specific terminology from Mozilla l10n resources.

This can be done by exporting TMX files from Transvision, performing terminology extraction, and inserting the resulting terminology units into TBX format.
Keywords: meta
Summary: [meta] Bilingual termbase creation → [GSoC2014] [Week 1] Bilingual termbase creation
Comparison of tools here:
http://michellez1231.blogspot.co.uk/2010/07/research-on-terminology-extraction.html

I've come across the Bilingual Extraction Tool from Heartsome Europe before - it seems to take TMX as input.

Most of these tools, however, are not free, and if they do offer a free version there is usually a limit on usage or time.
It seems tmxeditor is doing a 1:1 conversion of the TMX file to TBX; no terms appear to be extracted.
This looks interesting. However, it seems to be Windows-only - https://code.google.com/p/extract-tmx-corpus/. I'm putting together a VM so I can test it out.
(In reply to Tharshan from comment #1)
> Comparison of tools here:
> http://michellez1231.blogspot.co.uk/2010/07/research-on-terminology-
> extraction.html
> 
> I've come across the Bilingual Extraction Tool from Heartsome Europe before -
> it seems to take TMX as input.
> 
> Most of these tools, however, are not free, and if they do offer a free
> version there is usually a limit on usage or time.

Limited usage is fine for the purposes of this project; however, a 1:1 conversion between TMX and TBX seems counterintuitive.
(In reply to Tharshan from comment #4)
> This looks interesting. However it seems to be windows only -
> https://code.google.com/p/extract-tmx-corpus/. Putting together a VM so I
> can test it out.

Let me know if you need any help :-)
(In reply to Jeff Beatty [:gueroJeff] from comment #6)
> (In reply to Tharshan from comment #4)
> > This looks interesting. However it seems to be windows only -
> > https://code.google.com/p/extract-tmx-corpus/. Putting together a VM so I
> > can test it out.
> 
> Let me know if you need any help :-)

Testing it out on a Windows 7 VM. The program can be launched, but I have been unsuccessful in creating a corpus; it seems to be having trouble reading the TMX file. The source is available, and since it's based on Python, I should be able to debug it and run it from source.

We might have to circle back to this if we do not find another working alternative.
From the article linked above, this looked like what we wanted - http://www.heartsome.de/en/termextraction.php. However, I cannot seem to open the program in my VM. I have the dependencies installed (> Java 6). It was a very limited trial though (20 pairs).
Yikes! That is very limited.
(In reply to Jeff Beatty [:gueroJeff] from comment #10)
> Here are a bunch of others to look at:
> https://code.google.com/p/maui-indexer/
> https://pypi.python.org/pypi/topia.termextract/
> http://okapi.sourceforge.net/Release/Utilities/Help/termextraction.htm
> http://ngram.sourceforge.net/
> http://texlexan.sourceforge.net/

Thanks! I have been taking a look at these.

I got some promising results with the Okapi term extraction. I was able to get the program to extract terms in order of frequency. It generates a text file like so:
EN:
430	message
360	You
349	file
332	server
305	page
304	messages
291	brandShortName
277	want
ES:
614	de
347	en
290	mensaje
289	que
234	página
232	web
220	para
201	conexión
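
For reference, here is a rough Python sketch (not the Okapi tool itself) of how a frequency-ranked candidate list like the one above could be produced straight from a Transvision TMX export. The tokenisation is deliberately naive, and the file name is the memoire_en-US_es-ES.tmx export mentioned later in this bug.
```
# Rough sketch: frequency-ranked term candidates per language from a TMX file.
# This is NOT what Okapi does internally; it simply tokenises each <seg> and
# counts word frequencies, giving output similar in spirit to the list above.
import re
import xml.etree.ElementTree as ET
from collections import Counter

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # key for xml:lang

def frequency_candidates(tmx_path, lang):
    counts = Counter()
    for tuv in ET.parse(tmx_path).iter("tuv"):
        if tuv.get(XML_LANG) != lang:
            continue
        seg = tuv.find("seg")
        if seg is not None and seg.text:
            counts.update(re.findall(r"\w+", seg.text))
    return counts

if __name__ == "__main__":
    # File name assumed from the Transvision export discussed in this bug.
    candidates = frequency_candidates("memoire_en-US_es-ES.tmx", "en-US")
    for token, freq in candidates.most_common(10):
        print(freq, token)
```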
(In reply to Tharshan from comment #11)
> I got some promising results with the okapi term extraction. I was able to
> get the program to extract the terms in order of frequency. It generates a
> text file like so:
> EN:
> 430	message
> 360	You
> 349	file
> 332	server
> 305	page
> 304	messages
> 291	brandShortName
> 277	want
> ES:
> 614	de
> 347	en
> 290	mensaje
> 289	que
> 234	página
> 232	web
> 220	para
> 201	conexión

Better check that it's doing entity extraction/resolution correctly. AIUI, "brandShortName" is actually part of "&brandShortName;", which resolves to something like "Firefox" in the final product.
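
A hedged sketch of what such entity handling could look like in Python follows; the entity map and the example sentence are invented for illustration and are not Mozilla's actual brand DTD.
```
# Sketch of pre-processing segments so DTD entity references such as
# "&brandShortName;" are resolved (or dropped) before counting terms.
# The entity map below is a made-up example, not Mozilla's real brand DTD.
import re

ENTITY_MAP = {
    "brandShortName": "Firefox",          # assumed resolution, per the comment above
    "brandFullName": "Mozilla Firefox",   # assumed
}

ENTITY_RE = re.compile(r"&(\w+);")

def resolve_entities(segment, entities=ENTITY_MAP, drop_unknown=True):
    """Replace known entity references; strip (or keep) unknown ones."""
    def _sub(match):
        name = match.group(1)
        if name in entities:
            return entities[name]
        return "" if drop_unknown else match.group(0)
    return ENTITY_RE.sub(_sub, segment)

# Invented example string, just to show the behaviour:
print(resolve_entities("Do you want &brandShortName; to remember this password?"))
# -> "Do you want Firefox to remember this password?"
```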
-- Blog Entry Week 1

This summer I will be working on a GSoC project with the Intellego team at Mozilla to lay the foundation for an automatic terminology translation tool for websites. During the first few weeks I will be researching the basics of how terminology extraction works.

The first step is to create a bilingual termbase consisting of Mozilla-specific terminology from Mozilla l10n resources (http://transvision.mozfr.org/downloads/). The Transvision site hosts TMX files used in the Mozilla browser and Firefox OS, which contain translations of text between many pairs of languages.

Here is a snippet from the memoire_en-US_es-ES.tmx file:
```
<tu tuid="browser/chrome/browser/browser.dtd:bookmarkThisPageCmd.label" srclang="en-US">
	<tuv xml:lang="en-US"><seg>Bookmark This Page</seg></tuv>
	<tuv xml:lang="es-ES"><seg>Añadir esta página a marcadores</seg></tuv>
</tu>
```
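As a rough illustration (not project code yet), the aligned segments in a file like this can be read with nothing more than Python's standard library; the read_pairs name and its defaults are assumptions made for this sketch.
```
# Minimal sketch: read aligned en-US/es-ES segments from a TMX file like the
# one shown above, using only Python's standard library.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # key for xml:lang

def read_pairs(tmx_path, source="en-US", target="es-ES"):
    pairs = []
    for tu in ET.parse(tmx_path).iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[tuv.get(XML_LANG)] = seg.text
        if source in segs and target in segs:
            pairs.append((tu.get("tuid"), segs[source], segs[target]))
    return pairs

# Example usage:
# for tuid, en, es in read_pairs("memoire_en-US_es-ES.tmx"):
#     print(tuid, en, "->", es)
```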
The goal over the coming weeks is to extract the key terms in these phrases and statistically analyse them to build up a corpus of terms that we know are direct translations between the languages. Since this involves many different techniques, we decided it was best to use third-party software, preferably open source, to carry out this task for us.

Using this corpus, a web interface will be created to dynamically replace the DOM contents of a web page with the one-to-one translation mappings collected.

Ideally, we want to convert the TMX files into TBX files containing translations of key terms that we can use to automatically translate websites.
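
To make that target concrete, here is a minimal Python sketch of what the TMX-to-TBX step could look like, assuming a heavily simplified TBX skeleton (a real TBX/TBX-Basic file needs a martifHeader and richer metadata); the term pair and output file name are hypothetical.
```
# Very small sketch of writing extracted term pairs as a simplified TBX
# skeleton. Real TBX/TBX-Basic needs a martifHeader and richer metadata;
# this only illustrates the termEntry/langSet/tig/term nesting.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # serialised as xml:lang

def write_tbx(term_pairs, out_path, source="en-US", target="es-ES"):
    martif = ET.Element("martif", {"type": "TBX"})
    body = ET.SubElement(ET.SubElement(martif, "text"), "body")
    for src_term, tgt_term in term_pairs:
        entry = ET.SubElement(body, "termEntry")
        for lang, term in ((source, src_term), (target, tgt_term)):
            lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})
            ET.SubElement(ET.SubElement(lang_set, "tig"), "term").text = term
    ET.ElementTree(martif).write(out_path, encoding="utf-8", xml_declaration=True)

# Hypothetical term pair and output name, just to show the call:
write_tbx([("bookmark", "marcador")], "mozilla-terms_en-US_es-ES.tbx")
```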
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED