Closed Bug 437342 Opened 16 years ago Closed 16 years ago

Create .html<->.po conversion script

Categories

(Webtools Graveyard :: Verbatim, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: clouserw, Assigned: dschafer)

Attachments

(5 files, 4 obsolete files)

We should think this one through because we have to figure out how to split up a .html file into chunks that are useful to localize without losing context.  We'll also need to pull out img alt attributes, ignore PHP but remember to pull in titles, etc.  This script may be pretty specific to Mozilla.

It looks like there is an html2po script in the translate toolkit - we might look at that for ideas or maybe we could use it.

We'll need this for files on mozilla.com (example: http://svn.mozilla.org/projects/mozilla.com/branches/firefox3/pt-BR/index.html) and the full pages on AMO (example: http://svn.mozilla.org/addons/trunk/site/app/locale/de/pages/policy.thtml)
Can't you use l20n instead? :)

FWIW, Bugzilla uses TT to do localizations; IMO it's a disaster.

I've seen people try to use and maintain po things, and the results are not pretty.

I've also seen groups try to use out of line po files for applications, and the result on average is that so much context is lost that the localizers introduce inconsistencies which ruin the product...
Translating marked up text isn't the immediate target for l20n. I guess that we'll learn a bit about what it takes to represent marked up text in a reasonable manner.

Note, as for exposure in Pootle, converting HTML to XLIFF might enable better UI paradigms. That might depend on the XLIFF version, though. Placeables might be a better way to represent markup than HTML, if we find some UI paradigm for editing them. Keeping that information is likely going to matter for the inline markup.
That output is pretty close to what we're looking for and due to pootle working with .po I think we should stick with it (rather than xliff - I think Pootle's support for xliff is mainly a po2xliff script).

Let's look into forking/copying/modifying html2po to fix those broken tags in the index.html attachment.

Also, we should try it on a more complicated page to give a better real world test: http://svn.mozilla.org/projects/mozilla.com/trunk/en-US/index.html
Assignee: nobody → dschafer
Status: NEW → ASSIGNED
It appears the major issue in the HTML conversions involves this regex, from line 113 of the attached file from translate-toolkit:
  ^<[^>]*>(.*)</.*>$

This regex is part of function strip_html, which tries to remove any HTML tag that encloses the entire string.  The issue occurs when PHP code is embedded in the HTML.  As an example, consider the HTML code:
  <a href="<?=$home_dir?>/bar/">Foo</a>

The expected output of strip_html called on this HTML code is "Foo".  However, because the first part of the above regex matches
  <a href="<?=$home_dir?>
the following occurs:

>>> import html
>>> x = html.htmlfile()
>>> x.strip_html(r'<a href="<?=$home_dir?>/bar/">Foo</a>')
'/bar/">Foo'
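The same behaviour can be reproduced with the regex alone, outside the toolkit. This is a minimal sketch using only Python's re module; the names NAIVE and snippet are made up for illustration and are not taken from html.py.

```python
import re

# The pattern quoted above from strip_html.
NAIVE = re.compile(r'^<[^>]*>(.*)</.*>$')

snippet = '<a href="<?=$home_dir?>/bar/">Foo</a>'
match = NAIVE.match(snippet)

# [^>]* stops at the first ">", which here is the one closing the PHP
# block, so the "opening tag" is cut short and the rest of the real tag
# leaks into the captured group.
leaked = match.group(1)
print(leaked)  # /bar/">Foo
```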
This is a proposed patch for the file in attachment 324876 [details].

Everything seems to work fine if the regex is modified to ensure the "<" and ">" it matches aren't "<?" or "?>", which would make them PHP tags instead.
Attachment #324895 - Flags: review?(clouserw)
Blocks: 437341
There was a problem with the regex in the previous patch in that no question marks could appear in tags; this patch fixes that issue.
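For illustration, one way to write such a pattern is to let the opening tag contain either whole PHP blocks or any character other than ">". This is only a sketch of the idea, not the regex from the attached patch; PHP_AWARE and strip_outer_tag are hypothetical names.

```python
import re

# Assumption: PHP blocks look like <? ... ?> and do not nest.
# Inside the opening tag, consume a whole PHP block when one starts,
# otherwise any single character except ">". A lone "?" is still allowed.
PHP_AWARE = re.compile(r'^<(?!\?)(?:<\?.*?\?>|[^>])*>(.*)</.*>$', re.DOTALL)

def strip_outer_tag(text):
    """Return the content of a tag enclosing the whole string, else the text."""
    match = PHP_AWARE.match(text)
    return match.group(1) if match else text

with_php = strip_outer_tag('<a href="<?=$home_dir?>/bar/">Foo</a>')
with_query = strip_outer_tag('<a href="page.php?id=1">Foo</a>')
print(with_php, with_query)
```

The second call covers the case mentioned above: a question mark inside a tag that is not part of a PHP block.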
Attachment #324895 - Attachment is obsolete: true
Attachment #325443 - Flags: review?(clouserw)
Attachment #324895 - Flags: review?(clouserw)
(In reply to comment #5)
> That output is pretty close to what we're looking for and due to pootle working
> with .po I think we should stick with it (rather than xliff - I think Pootle's
> support for xliff is mainly a po2xliff script).

Pootle can handle XLIFF natively.  You need to set the project's file type to xlf instead of po.

> Let's look into forking/copying/modifying html2po to fix those broken tags in
> the index.html attachment.
> 
> Also, we should try it on a more complicated page to give a better real world
> test: http://svn.mozilla.org/projects/mozilla.com/trunk/en-US/index.html

Please upstream, it looks like these changes will be widely useful.  I stopped using them for Firefox localisation because of some issues, if those are sorted then it helps to feed into that process.

Dan -> Please add some tests to test_html so that we can ensure that this continues working.  There are some failing tests there already which might highlight some other issues.
Attachment #325443 - Flags: review?(clouserw)
This is an update to translate toolkit; it adds support for PHP in the HTML passed to html2po.  The updates in this patch are all to html.py, which defines how translate-toolkit parses HTML files.

Tests for php support have been added to test_html2po.  Currently, a test involving newlines does fail; however, it fails in a relatively benign fashion (the tabs and newlines seem to be replaced with whitespace).  I plan on examining this test-case further to get it working as well.
Attachment #325443 - Attachment is obsolete: true
Attachment #325840 - Flags: review?(clouserw)
Looks like you're using hashlib which makes python 2.5 a requirement.  Up until now I think the toolkit has only needed >2.3.  I'll need to figure out if upgrading python on these boxes will mess with other apps before I can review.  I'm not going to r+ though if a test that wasn't failing before is failing now.
This update moves the tests around (as suggested in http://bugs.locamotion.org/show_bug.cgi?id=134#c3), including marking the multi-line one as wtest.  This also replaces the sha hash with md5, so older versions of Python should support it.
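As an aside, a common idiom for computing an MD5 digest on both sides of the Python 2.5 boundary is to prefer hashlib and fall back to the old md5 module. This is a general sketch and not necessarily what the patch does; the digest_of helper is hypothetical.

```python
try:
    # Python 2.5 and later
    from hashlib import md5
except ImportError:
    # Older interpreters ship a standalone md5 module instead
    from md5 import new as md5

def digest_of(data):
    """Return the hexadecimal MD5 digest of a byte string."""
    return md5(data).hexdigest()

empty_digest = digest_of(b'')
print(empty_digest)  # d41d8cd98f00b204e9800998ecf8427e
```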
Attachment #325840 - Attachment is obsolete: true
Attachment #328420 - Flags: review?(clouserw)
Attachment #325840 - Flags: review?(clouserw)
Priority: -- → P1
Attachment #328420 - Flags: review?(clouserw) → review+
Leaving this open while we pursue merging upstream to translate toolkit; other than that, I think the script is pretty much done.
Comment on attachment 328420 [details] [diff] [review]
Improved testing and older-Python support

Requesting review from Dwayne.  If r+ do we have access to /trunk/translate/  to commit it?
Attachment #328420 - Flags: review?(dwayne)
Probably best for one of the other Translate.org.za guys to review this.  I'll make needed changes to the review request.  If approved I don't see any problems with committing to trunk.
Attachment #328420 - Flags: review?(dwayne) → review?(friedel)
I will not be able to review fully now, so just a few questions/comments.

Firstly, thanks for this, and thanks for considering older Python versions :-)

I'm not seeing any mention of testing the resulting PO in back conversion to translated HTML with po2html. Did this receive any attention yet? I think it should probably be considered together, so that we know we can perform the entire roundtrip.

Please add a short comment to the wtest to explain what the issue is. Hopefully it saves whoever is looking at this later a bit of time.

I notice spacing that is not our usual style, such as
  self.check_single( '<html>
(space after the bracket)  but this is really not important, just mentioning since I see it now. (I also see it is already wrong in that file on trunk anyway :-)

Thanks again.
The only other suggestion I have is to make testsnippet a full class-level function like check_php_snippet (along with check_single, check_null, etc.) so that it is optimally reusable, instead of a nested function. (But even that is not so important.)  If all these are addressed you can commit to trunk for these files.
This patch adds better testing; I believe the previous wtest was testing incorrect behavior (specifically, it wanted the newlines in the HTML source to be preserved in the PO file, while test_tag_p_with_linebreak implies that they should not be).  The new multi-line test creates a .po file, then sets the translation for the string and converts back to HTML, comparing the converted file with the expected result.  Additionally, formatting problems have been fixed and check_phpsnippet is now a full class-level function.

html.py has not been changed from the previous patch; only the testing functions have.
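The shape of such a roundtrip check can be sketched in plain Python. The two helpers below are hypothetical stand-ins for the html2po and po2html steps, written with regexes for brevity; they are not the toolkit's API and handle only simple <p> elements.

```python
import re

PARAGRAPH = re.compile(r'<p>(.*?)</p>', re.DOTALL)

def extract_units(html_source):
    """Stand-in for html2po: paragraph texts with whitespace runs collapsed."""
    return [' '.join(body.split()) for body in PARAGRAPH.findall(html_source)]

def apply_translations(html_source, translations):
    """Stand-in for po2html: swap each paragraph for its translation."""
    def replace(match):
        source = ' '.join(match.group(1).split())
        return '<p>%s</p>' % translations.get(source, source)
    return PARAGRAPH.sub(replace, html_source)

template = '<body>\n<p>First line\n\tsecond line <?=$name?></p>\n</body>'
units = extract_units(template)

# Newlines and tabs in the source are not preserved in the extracted unit.
translations = {units[0]: 'TRANSLATED <?=$name?>'}
result = apply_translations(template, translations)
expected = '<body>\n<p>TRANSLATED <?=$name?></p>\n</body>'
print(result == expected)
```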
Attachment #328420 - Attachment is obsolete: true
Attachment #329089 - Flags: review?
Attachment #328420 - Flags: review?(friedel)
Attachment #329089 - Flags: review? → review?(friedel)
Comment on attachment 329089 [details] [diff] [review]
Better testing for new html storage file

Thanks for this. Please go ahead and commit to svn trunk!
Attachment #329089 - Flags: review?(friedel) → review+
In revision 7742 of translate toolkit.  Marking this FIXED.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Product: Webtools → Webtools Graveyard