Closed
Bug 983143
Opened 10 years ago
Closed 6 years ago
[GSoC2014] [Week 2] Extract text from websites
Categories
(Intellego Graveyard :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: gueroJeff, Unassigned)
Details
Back-end piece of the web interface (bug 983142) for automatic terminology-based translation. Given a specific URL, the translatable source text in that web page will be identified, extracted, and temporarily stored so that it can be processed for source-target terminology matches.
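A minimal sketch of the extraction step described above, assuming Python and only its standard library. The names TextExtractor, extract_text, and fetch_and_extract are hypothetical and not taken from the Intellego code; treating every text node outside script/style/noscript as "translatable" is also an assumption, and a real implementation would likely need finer rules.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text nodes, skipping non-translatable elements."""

    # Assumption: content of these elements is never translatable.
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.segments = []  # temporary in-memory store of extracted text

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and self._skip_depth == 0:
            self.segments.append(text)


def extract_text(html):
    """Return the list of text segments found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.segments


def fetch_and_extract(url, timeout=10):
    """Download a page and extract its text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    return extract_text(html)
```

The segments are kept in a plain list here, which stands in for the temporary storage mentioned in the description; each segment could then be passed to the terminology matching step.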
Updated 10 years ago
Summary: [meta] Extract text from websites → [GSoC2014] [Week 2] Extract text from websites
-- Blog Entry Week 3

We were still on the hunt for a good terminology extraction tool. My mentor had sent me a list of resources to look into for extracting terminology:

https://code.google.com/p/maui-indexer/
https://pypi.python.org/pypi/topia.termextract/
http://okapi.sourceforge.net/Release/Utilities/Help/termextraction.htm
http://ngram.sourceforge.net/
http://texlexan.sourceforge.net/

The most promising tool I came across for our purpose was the Okapi term extraction tool. With it I was able to extract terms in order of frequency. However, it produced two separate files, one for the source language and one for the target language, and the extracted terms in them were not aligned:

EN:
430 message
360 You
349 file
332 server
305 page
304 messages
291 brandShortName
277 want

ES:
614 de
347 en
290 mensaje
289 que
234 página
232 web
220 para
201 conexión

Each list is ranked by frequency within its own language, so a term's position in one list says nothing about its translation in the other. I was unable to produce an aligned terminology extraction.

I discussed the issues I was facing with my mentor. He mentioned that he had a CSV file of already extracted terminology (https://www.transifex.com/projects/p/gaia-l10n/glossary/l/es/) that we could use as a starting point, allowing us to skip the current step. We decided to use the CSV to continue with the project, while I keep researching methods of terminology extraction.
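To make the two approaches in this entry concrete, here is a rough Python sketch: a per-language frequency count (which yields unaligned lists like the ones above) and a lookup against a source-target glossary loaded from CSV. This is not the Okapi tool's algorithm. The function names and the CSV column names "term" and "translation_es" are assumptions for illustration; the actual columns of the Transifex glossary export may differ. Lowercasing the text is also an assumption (the Okapi output above is case-sensitive).

```python
import csv
import io
import re
from collections import Counter


def term_frequencies(text, top=5):
    """Return the `top` most frequent words as (word, count) pairs."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(top)


def load_glossary(csv_text, source_col="term", target_col="translation_es"):
    """Build a source -> target mapping from glossary CSV text.

    Column names are hypothetical; adjust them to the real export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[source_col].lower(): row[target_col] for row in reader}


def match_terms(text, glossary):
    """Return glossary entries whose source term occurs in the text."""
    matches = {}
    for word in re.findall(r"\w+", text.lower()):
        if word in glossary and word not in matches:
            matches[word] = glossary[word]
    return matches


glossary = load_glossary("term,translation_es\nmessage,mensaje\npage,página\n")
print(match_terms("Open the page to read the message", glossary))
```

Because the glossary already pairs each source term with its translation, the alignment problem from the frequency lists does not arise. The sketch only handles single-word terms; multi-word glossary entries would need a different matching strategy.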
Reporter | ||
Updated 6 years ago
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED