Closed Bug 983143 Opened 10 years ago Closed 6 years ago

[GSoC2014] [Week 2] Extract text from websites

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: gueroJeff, Unassigned)

References

(
URL
)

Details

Jeff Beatty [:gueroJeff]

Reporter

Description

•

10 years ago

Back-end piece to web interface (983142) for automatic terminology-based translation.

Given a specific URL, translatable source text for analysis in the given web page will be identified, extracted, and temporarily stored for source-target terminology match processing.
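A minimal sketch of the extraction step, assuming the page's HTML has already been fetched from the URL and that "translatable text" simply means visible text nodes outside script and style elements. The names `TextExtractor`, `extract_text`, and `SKIP_TAGS` are hypothetical and are not part of any Intellego code.

```python
from html.parser import HTMLParser

# Assumption: contents of these tags are never shown to the user,
# so they are not treated as translatable.
SKIP_TAGS = {"script", "style", "noscript"}


class TextExtractor(HTMLParser):
    """Collects visible text nodes from an HTML document."""

    def __init__(self):
        super().__init__()
        self.segments = []    # temporary store for the extracted strings
        self._skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self._skip_depth == 0:
            self.segments.append(text)


def extract_text(html):
    """Return the list of visible text segments found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.segments
```

The returned list is the temporary store in this sketch; a real implementation might segment by sentence or by block element instead of by text node.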

Jeff Beatty [:gueroJeff]

Reporter

Updated

•

10 years ago

Depends on: 983146

Jeff Beatty [:gueroJeff]

Reporter

Updated

•

10 years ago

Depends on: 983148

Gordon P. Hemsley [:GPHemsley]

Updated

•

10 years ago

Keywords: meta

Gordon P. Hemsley [:GPHemsley]

Updated

•

10 years ago

Summary: [meta] Extract text from websites → [GSoC2014] [Week 2] Extract text from websites

Gordon P. Hemsley [:GPHemsley]

Updated

•

10 years ago

Keywords: meta

Gordon P. Hemsley [:GPHemsley]

Updated

•

10 years ago

Blocks: 983257

Gordon P. Hemsley [:GPHemsley]

Updated

•

10 years ago

No longer depends on: 983146, 983148

Gordon P. Hemsley [:GPHemsley]

Updated

•

10 years ago

URL: https://wiki.mozilla.org/Intellego/GS...

Tharshan

Comment 1

•

10 years ago

-- Blog Entry Week 3

We were still on the hunt for a good terminology extraction tool. My mentor had sent me a list of resources to look into for extracting terminology.

    https://code.google.com/p/maui-indexer/
    https://pypi.python.org/pypi/topia.termextract/
    http://okapi.sourceforge.net/Release/Utilities/Help/termextraction.htm
    http://ngram.sourceforge.net/
    http://texlexan.sourceforge.net/

The most promising tool I came across for our purpose is the Okapi term extraction tool. I was able to extract the terms in order of frequency. However, it produced two separate files: one for the source language and one for the target language. The problem was that the extracted terms in the two files were not aligned with each other.

    EN:
    430	message
    360	You
    349	file
    332	server
    305	page
    304	messages
    291	brandShortName
    277	want
    
    ES:
    614	de
    347	en
    290	mensaje
    289	que
    234	página
    232	web
    220	para
    201	conexión

I was unable to produce an aligned terminology extraction with any of these tools, so I discussed the issues I was facing with my mentor.
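The alignment problem with the two frequency lists above can be reproduced with a plain frequency count. This is only a sketch under stated assumptions: the sample sentences are made up, `term_frequencies` is a hypothetical helper, and it does not reflect how the Okapi tool works internally.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Return (count, token) pairs sorted by descending count, then by token."""
    tokens = re.findall(r"\w+", text)  # case-sensitive word tokens
    counts = Counter(tokens)
    return sorted(((n, t) for t, n in counts.items()),
                  key=lambda pair: (-pair[0], pair[1]))


# Hypothetical sample strings, not taken from the real data.
en = "The message is on the server. Delete the message."
es = "El mensaje está en el servidor. Eliminar el mensaje."

en_terms = term_frequencies(en)
es_terms = term_frequencies(es)

# Pairing the two lists by rank does not give translations of each other:
# each list is ordered by frequency within its own language only.
pairs_by_rank = [(e[1], s[1]) for e, s in zip(en_terms, es_terms)]
```

Here the top-ranked English token is "message" while the top-ranked Spanish token is "el", which shows why ranking each language separately gives no source-target correspondence.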

My mentor mentioned that he has a CSV file of extracted terminology (https://www.transifex.com/projects/p/gaia-l10n/glossary/l/es/) that we could use as a starting point, allowing us to skip the current step. We decided to use the CSV to continue with the project, and that I would keep researching methods of terminology extraction.
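A minimal sketch of how a glossary CSV could drive the source-target terminology match. The column names `term` and `translation`, the sample rows, and the helpers `load_glossary` and `match_terms` are all assumptions for illustration; the real glossary export may use a different layout.

```python
import csv
import io
import re

# Hypothetical glossary content; the real export's columns may differ.
SAMPLE_CSV = "term,translation\nmessage,mensaje\nserver,servidor\npage,página\n"


def load_glossary(csv_file):
    """Map lowercased source terms to their target-language translations."""
    reader = csv.DictReader(csv_file)
    return {row["term"].lower(): row["translation"] for row in reader}


def match_terms(text, glossary):
    """Return (source, target) pairs for glossary terms found in text,
    in order of first appearance and without repeats."""
    matches = []
    seen = set()
    for token in re.findall(r"\w+", text.lower()):
        if token in glossary and token not in seen:
            seen.add(token)
            matches.append((token, glossary[token]))
    return matches


glossary = load_glossary(io.StringIO(SAMPLE_CSV))
found = match_terms("The server sent a message to the server.", glossary)
```

This sketch only matches single-word terms; multi-word glossary entries would need phrase matching instead of a per-token lookup.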

Jeff Beatty [:gueroJeff]

Reporter

Updated

•

6 years ago

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

Categories

(Intellego Graveyard :: General, defect)
