Closed Bug 983143 Opened 10 years ago Closed 6 years ago

[GSoC2014] [Week 2] Extract text from websites

Categories

(Intellego Graveyard :: General, defect)

RESOLVED FIXED

People

(Reporter: gueroJeff, Unassigned)

Details

Back-end piece for the web interface (bug 983142) for automatic terminology-based translation.

Given a specific URL, the translatable source text on that web page will be identified, extracted, and temporarily stored so that it can be processed for source-target terminology matches.
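
As a rough illustration of the extraction step described above, here is a minimal sketch using only the Python standard library. It is not the project's implementation. The names `TextExtractor`, `extract_segments`, `fetch_html`, and `SKIPPED_TAGS` are hypothetical, and treating every visible text node as "translatable" is an assumption.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Assumption: text inside these tags is never shown to the reader,
# so it is not treated as translatable.
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class TextExtractor(HTMLParser):
    """Collects whitespace-normalized text nodes outside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []      # temporary store for extracted text
        self._skip_depth = 0    # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.segments.append(text)


def extract_segments(html):
    """Return the visible text segments of an HTML string, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


def fetch_html(url, timeout=10):
    """Download a page and decode it, falling back to UTF-8 (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

A caller would combine the two as `extract_segments(fetch_html(url))` and hand the resulting list to the terminology matching step.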
Depends on: 983146
Depends on: 983148
Keywords: meta
Summary: [meta] Extract text from websites → [GSoC2014] [Week 2] Extract text from websites
Keywords: meta
No longer depends on: 983146, 983148
-- Blog Entry Week 3

We were still on the hunt for a good terminology extraction tool. My mentor had sent me a list of resources to look into for extracting terminology.

    https://code.google.com/p/maui-indexer/
    https://pypi.python.org/pypi/topia.termextract/
    http://okapi.sourceforge.net/Release/Utilities/Help/termextraction.htm
    http://ngram.sourceforge.net/
    http://texlexan.sourceforge.net/

The most promising tool I came across for our purpose was the Okapi term extraction tool. With it I was able to extract terms in order of frequency. However, it produced two separate files, one for the source language and one for the target language, and the extracted terms were not aligned: each list is ranked by its own frequency counts, so a given row in one file does not correspond to the same row in the other.

    EN:
    430	message
    360	You
    349	file
    332	server
    305	page
    304	messages
    291	brandShortName
    277	want
    
    ES:
    614	de
    347	en
    290	mensaje
    289	que
    234	página
    232	web
    220	para
    201	conexión

I was unable to get aligned terminology out of the extraction, so I discussed the issues I was facing with my mentor.
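
The alignment problem in the two lists above can be reproduced with a few lines of Python. This is only a sketch of plain frequency ranking, not Okapi's algorithm; `ranked_terms` and the two sample sentences are made up for illustration.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\w+")


def ranked_terms(text, top=None):
    """Return (term, count) pairs, most frequent first, ties broken alphabetically."""
    counts = Counter(TOKEN.findall(text.lower()))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top]


# Made-up parallel sentences (assumption, not project data).
en_text = "The message file is on the mail server. Test message sent."
es_text = "El archivo de mensaje está en el servidor de correo. Mensaje de prueba enviado."

en_ranked = ranked_terms(en_text, top=3)
es_ranked = ranked_terms(es_text, top=3)

# Pairing the lists row by row is NOT a translation mapping:
# each side is sorted by its own counts.
for (en_term, en_count), (es_term, es_count) in zip(en_ranked, es_ranked):
    print(en_count, en_term, "|", es_count, es_term)
```

Because function words such as "de" dominate the Spanish counts, the top rows of the two lists pair unrelated words, which is the same effect visible in the Okapi output.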

My mentor mentioned that he had a CSV file of extracted terminology (https://www.transifex.com/projects/p/gaia-l10n/glossary/l/es/) that we could use as a starting point, which would let us skip the extraction step for now. We decided to continue the project with the CSV while I kept researching terminology extraction methods.
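
To give an idea of how such a CSV could feed the source-target match step, here is a minimal sketch. I have not confirmed the layout of the glossary export, so the column names `term` and `translation_es`, the sample rows, and the functions `load_glossary` and `match_terms` are all assumptions.

```python
import csv
import io
import re


def load_glossary(csv_file, source_col="term", target_col="translation_es"):
    """Read a glossary CSV into a {source term: target term} dict.

    The column names are hypothetical; adjust them to the real export.
    """
    glossary = {}
    for row in csv.DictReader(csv_file):
        source = (row.get(source_col) or "").strip()
        target = (row.get(target_col) or "").strip()
        if source and target:
            glossary[source.lower()] = target
    return glossary


def match_terms(segment, glossary):
    """Return sorted (source, target) pairs for glossary terms found in segment."""
    matches = []
    for source, target in glossary.items():
        # Whole-word, case-insensitive match; inflected forms are not handled.
        if re.search(r"\b" + re.escape(source) + r"\b", segment, re.IGNORECASE):
            matches.append((source, target))
    return sorted(matches)


# Illustrative rows only (assumption), standing in for the real glossary file.
SAMPLE_CSV = "term,translation_es\nmessage,mensaje\nserver,servidor\npage,página\n"
glossary = load_glossary(io.StringIO(SAMPLE_CSV))
```

With a real file, the call would be `load_glossary(open(path, newline="", encoding="utf-8"))`. The whole-word match is deliberately simple: it would miss plurals such as "messages", which is one reason to keep looking into proper terminology extraction.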
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED