Convert AMO to use normal .po files

VERIFIED FIXED in 5.0.9

Status

VERIFIED FIXED
10 years ago
3 years ago

People

(Reporter: clouserw, Assigned: gandalf)

Tracking

unspecified
5.0.9
Dependency tree / graph

Details

Attachments

(8 attachments, 1 obsolete attachment)

(Reporter)

Description

10 years ago
Our modified .po files have served us well for a long time, but the .po format has been improved and our methods are no longer necessary.

The current en-US .po file: http://viewvc.svn.mozilla.org/vc/addons/trunk/site/app/locale/en_US/LC_MESSAGES/messages.po?view=markup

We need a magical script that will:

1) Open the en-US .po file and each locale's .po file.  In each locale's .po file replace the msgid with the corresponding msgstr from the en-US .po file.  Look out for plural forms!

2) Open the AMO tree (/trunk/site/app/) and recursively open each file looking for the en-US msgids.  Replace the msgids with their English counterparts (doing appropriate quote substitution if needed).  Again, look out for plural forms.

2b) Bonus points:  AMO is using ___() and n___()* for custom fallback which should no longer be necessary.  For the bonus points replace ___() with _() and n___() with ngettext().  Note the AMO functions have a special last parameter (which is optional though!) which can be discarded.

3) Do step 1 for the en-US .po file - replace the msgid with the msgstr.  After this our .po files will be normal (woo).

4) Refresh the home page and see if I forgot any steps. :)

* The main difference is the last parameter is an optional fallback translation.  They are defined at the bottom of http://viewvc.svn.mozilla.org/vc/addons/trunk/site/app/config/core.php?view=markup
(Assignee)

Updated

10 years ago
Assignee: nobody → gandalf
(Assignee)

Updated

10 years ago
Status: NEW → ASSIGNED
I'm not sure if this will be helpful, but I wrote a Silme script to do that when working with ozten on mozillaservice.org's l10n. It's really simple and AFAIR it didn't handle plurals well, but that was OK, as there wasn't many plural string in the moservice's po file.

Maybe with some tweaks it could serve as a base for a script for this bug.
Attachment #386633 - Attachment description: Silme script used to do the same think for mozillaservice.org → Silme script used to do the same thing for mozillaservice.org
Just a heads-up, after this change, we'll need to properly hook this into Verbatim/Pootle. The current scripts we have there expect the old AMO .po format.
(Reporter)

Comment 3

10 years ago
(In reply to comment #2)
> Just a heads-up, after this change, we'll need to properly hook this into
> Verbatim/Pootle. The current scripts we have there expect the old AMO .po
> format.

Should just be a matter of turning off the scripts and not having to deal with crazy conversions and ridiculous requirements (like the en-US file always up to date).  This is a good thing for verbatim.
(In reply to comment #3)
> (In reply to comment #2)
> > Just a heads-up, after this change, we'll need to properly hook this into
> > Verbatim/Pootle. The current scripts we have there expect the old AMO .po
> > format.
> 
> Should just be a matter of turning off the scripts and not having to deal with
> crazy conversions and ridiculous requirements (like the en-US file always up to
> date).  This is a good thing for verbatim.

Nice! I'll also file a bug shortly about this for verbatim.
Blocks: 502222
(Reporter)

Comment 5

10 years ago
Alright, I thought about this some more.  We can't use straight up _() or we'll lose our context and we can't use pgettext() (the standard for context) because PHP doesn't support it.

We can work around that by specifying keywords when we extract the strings so that's not a big deal, but I think we should continue to use ___() for all strings and have it be in the format:

___('This is my message', 'this_is_my_context')
n___('This is singular', 'This is plural', $num, 'this_is_context')

So, instead of converting everything to _() and ngettext() it should follow this format ^^.  Sorry for the change in specs, but this should work.  Happy to have feedback.

For the record, to extract this format pass this to xgettext:    --keyword=___:1,2c --keyword=n___:1,2,4c
(Reporter)

Comment 6

10 years ago
Aww, the pgettext format has a different order of parameters and I guess it's best to stick as close as we can to that.  so:

___('this_is_my_context', 'this is my string)
n___('this_is_my_context', 'this is singular', 'this is plural', $num)

And the keywords are:
--keyword=___:1c,2 --keyword=n___:1c,2,3
(Assignee)

Comment 7

10 years ago
An update to the story.

1) I have a script that turns AMO .po file into normal gettext either with or without msgctxt.
2) I have a modification to ___() to play nice with normal gettext.

Now:

3) PHP does not support msgctxt at all (if there's a context you have to query for _('context_msg\004eng_msg') which sux.
4) I can remove the context overall but it generates 155 duplicated entities (cases when we have more than one english entity with the same name).

In all cases the message in polish locale is the same for all translations, so if we rely solely on PL we can remove the duplicates and go with clean .po file without context and then go with clean _() in PHP.
The only other way would be to keep the context and keep using ___() but it requires us to always have context in ___().

Will, Fred, what is your opinion?

Comment 8

10 years ago
Regarding msgctxt support in PHP, there's a simple workaround listed here: http://us2.php.net/manual/en/book.gettext.php#89975 that is supposed to work.
(Assignee)

Comment 9

10 years ago
Alexandru: I know. That's why I wrote that I can keep ___() but then I need to have context for each string.
(In reply to comment #7)
> 3) PHP does not support msgctxt at all (if there's a context you have to query
> for _('context_msg\004eng_msg') which sux.

I don't think it's a problem to keep using ___() everywhere in AMO. Can you give an example of what such a call would look like? Right now, it contains the pseudo-tag and often also an English fallback -- it wouldn't become any more than that would it?
(Assignee)

Comment 11

10 years ago
Yea, so in the case of keeping ___() we will have to provide english string + pseudo tag in all places.

The only reason I'd prefer to get rid of context+pseudo tags is that we, in fact don't need them, and it is a good moment now to clean it up.
What would that look like? Can we infer the context from somewhere automatically? Are you saying most of the time we do not need context at all, only to disambiguate multiple identical English strings? Or do we add context everywhere?
(Assignee)

Comment 13

10 years ago
We use context (in the normal gettext mode) only to disambiguate between multiple identical English strings. Example: "What's next?" is placed twice and the only difference is the context thing, but all locales translate both in the same way, so keeping it separated is not needed. Etc.
(Reporter)

Comment 14

10 years ago
I think we need to have context be an option, but if we can keep it optional without putting it on every string I think that would be fine.  I'm not sure how we would determine which strings it's needed on right now (I'm pretty sure it is needed right now though).

Any localizers following this are invited to comment. :)

Comment 15

10 years ago
For me, I think there would be few cases where I would want to translate depending on the context. Writing contexts would require also maintaining them.

I didn't find any big project among the translations I do or import in Narro that uses msgctxt. Although I support msgctxt in Narro.

But I have to admit it's nice when you translate to see something like this:

#: views/promote/press.php:2
msgctxt "Used on the main page, on the top right"
msgid "What's the good word?"
msgstr "" 

So if you decide to use msgctxt, I suggest doing it for every text, not spending time judging which text needs context. Event if you put the same context like "On the main page" for all the texts displayed on the main page it's still helpful.
(In reply to comment #15)
> For me, I think there would be few cases where I would want to translate
> depending on the context. Writing contexts would require also maintaining them.

But for those cases it would be definitely necessary to disambiguate the word based on the context.

For example, the word "View" (pretty common on UIs) has at least two categories in English: it can act as a name or as a verb. If your msgid contains only that string, you can't expect other languages to have the same translation for both categories.
In Basque, if "View" is a verb you would translate to "Ikusi", whereas it would be "Ikuspegi" if "View" acts as a name.

> I didn't find any big project among the translations I do or import in Narro
> that uses msgctxt. Although I support msgctxt in Narro.

WordPress uses msgctxt, as well as KDE and GNOME.

> But I have to admit it's nice when you translate to see something like this:
> 
> #: views/promote/press.php:2
> msgctxt "Used on the main page, on the top right"
> msgid "What's the good word?"
> msgstr "" 
> 
> So if you decide to use msgctxt, I suggest doing it for every text, not
> spending time judging which text needs context. Event if you put the same
> context like "On the main page" for all the texts displayed on the main page
> it's still helpful.

I wouldn't do it for every text unless necessary, because in that case you would be repeating strings with the same meaning (and in the same context) over and over again. So translators would have more (unnecessary) work to do, and you wouldn't be taking advantage of the power gettext gives you.


For its closeness on the implementation side, it's worth pointing at: http://core.trac.wordpress.org/ticket/9112

Comment 17

10 years ago
(In reply to comment #16)
> (In reply to comment #15)
> 
> > So if you decide to use msgctxt, I suggest doing it for every text, not
> > spending time judging which text needs context. Event if you put the same
> > context like "On the main page" for all the texts displayed on the main page
> > it's still helpful.
> 
> I wouldn't do it for every text unless necessary, because in that case you
> would be repeating strings with the same meaning (and in the same context) over
> and over again. So translators would have more (unnecessary) work to do, and
> you wouldn't be taking advantage of the power gettext gives you.
> 

Isn't the uniqueness based on msgctxt and msgid ? If it's the same msgctxt you wouldn't have to translate more than once, right ?
(In reply to comment #17)
> (In reply to comment #16)
> > (In reply to comment #15)
> > 
> > > So if you decide to use msgctxt, I suggest doing it for every text, not
> > > spending time judging which text needs context. Event if you put the same
> > > context like "On the main page" for all the texts displayed on the main page
> > > it's still helpful.
> > 
> > I wouldn't do it for every text unless necessary, because in that case you
> > would be repeating strings with the same meaning (and in the same context) over
> > and over again. So translators would have more (unnecessary) work to do, and
> > you wouldn't be taking advantage of the power gettext gives you.
> > 
> 
> Isn't the uniqueness based on msgctxt and msgid ? If it's the same msgctxt you
> wouldn't have to translate more than once, right ?

Yes but if you suggest to add context for every text, then you have to keep track of every context and take care of adding every identical text with the same context. I don't think this is very practical for developers.

Thus I suggest adding context only for those strings that disambiguation is required.
(Assignee)

Comment 19

10 years ago
Ok, all pieces are in place.

I'm going to attach all patches in the next comments.
That's the procedure:

1) We switch config/core.php ___() and n___() to the new model
2) We convert all .po files in all locales to normal gettext using translatetoolkit mono2po.py with patches
3) We patch all ./config ./controller ./views to the new model using a script
4) We patch Pootle to handle AMO like a normal gettext project
5) We upgrade scripts to dump .po files from AMO project

Notes:
2) on the way we create a map required to execute step 3)
3) patching will require human touch in 3 cases and manual conversion of n___ functions (there are 25 of them)
5) is not ready

patches are coming
(Assignee)

Comment 20

10 years ago
patch for step 1
(Assignee)

Comment 22

10 years ago
patch for mono2po (step 2) to handle the switch and create a map pickle
(Assignee)

Comment 23

10 years ago
script for step 3
(Assignee)

Comment 24

10 years ago
example result patch from step 3
(Assignee)

Comment 25

10 years ago
patch for Pootle AMO support, step 4
(In reply to comment #19)
> Ok, all pieces are in place.

Nice, Gandalf!
(Reporter)

Comment 27

10 years ago
Thanks Gandalf!  I just ran these scripts and they appear to work very well.  One thing I had to add was:
      if os.path.splitext(name)[1].lower() in ['.php','.thtml']:
right above find_strings() in main() in amo_php_conv.py to keep it from crashing on binary files and things.  

Aside from that, can you alter amo_php_conv.py to print out the new strings with single quotes instead of double?  There shouldn't be any variables in them anyway and it will let us just use $'s without having to escape them first.

To localizers:  I'll try to land this asap to give it some bake time before our next release.
(Assignee)

Comment 28

10 years ago
updated according to will's feedback.

Two notes:
1) n___() will have to be solved manually
2) my local remora keeps crashing on some screens so I may not be able to find issues with the script, but first impression is that it works :)

I'd appreciate feedback and testing. stage server would help too probably at some point, but it's up to you.
Attachment #392946 - Attachment is obsolete: true
(Assignee)

Updated

10 years ago
Attachment #393781 - Attachment is patch: false
(Reporter)

Updated

10 years ago
Target Milestone: --- → 5.0.9
(Reporter)

Comment 29

10 years ago
All scripts are run and changes committed.  The script in comment 28 didn't escape single quotes but I didn't notice until it was too late.  I went through and changed most (I hope :) of them manually.  There is a good chance I missed some so let me know if you find anything - they'll surface as syntax errors so will be easy to spot/fix.
(Reporter)

Comment 30

10 years ago
Alright, in addition to escaping single quotes:

1) Any __() that spanned multiple lines wasn't picked up.  Pretty sure I got all these manually.

2) .po headers were lost in the conversion (author info, plural forms, etc.).  I restored these from the old files.

3) context was saved in .po files but not in the actual files.  If context exists in the .po, it should remain in ___() and n___() too.

I adjusted our extraction script to look for ___:1, ___:1,2c, n___:1,2, and n___:1,2,4c.  However, due to #3 above, it extracts all strings with contexts now and combines them in the .po file.  This makes them get marked as fuzzy automatically.

In my opinion, I've done too much tonight to revert and fix #3 in the conversion script (sorry I didn't catch it earlier).  I've been working on this since about 5pm though and I don't have enough brain power left to write another script to do this for me.  So, where I'm at:

- the .po's are in the freshly converted state, including English.  That means that context still exists in them.

- /site/app/locale/extract-po.sh will scan all our files for strings and automatically merge them into the en-US messages.po.  If you want to see where we are at, run that and witness our missing context. :(  It also has about 700 lines of deprecated strings - that's expected I think and once we fix the context problem I'll just delete them

Stuff that still needs to happen:

1) context pulled from English .po and added to all the files as parameters in the ___() functions

2) extract-po.sh run, make sure it doesn't completely break the English .po

3) (Note to self: extraction script is deleting dev comments in the .po file.  It doesn't delete localizer comments so we should switch back to those but verify merging from en-US->$lang preserves localizer comments too)

4) Merge en-US -> all locales.  Then notify localizers that new files are in place and ready to look at.

I'm going to try to read through my bugmail tonight, so ping me on IRC when you get up and I might still be around.  Otherwise, I'll talk to you later on.  Thanks.
(Assignee)

Comment 31

10 years ago
(In reply to comment #30)
> 1) Any __() that spanned multiple lines wasn't picked up.  Pretty sure I got
> all these manually.

I just verified that it did work for me. Can it be your python vs. my python version?
 
> 2) .po headers were lost in the conversion (author info, plural forms, etc.). 
> I restored these from the old files.

Same here. As you can see in the example patch to .po attached here it did not remove headers.

> 3) context was saved in .po files but not in the actual files.  If context
> exists in the .po, it should remain in ___() and n___() too.

Same here. If you take a look at the patch example attached here the context has been added where it stayed in .po (via map.pkl)

> I adjusted our extraction script to look for ___:1, ___:1,2c, n___:1,2, and
> n___:1,2,4c.  However, due to #3 above, it extracts all strings with contexts
> now and combines them in the .po file.  This makes them get marked as fuzzy
> automatically.

Ouch :( I didn't play with extraction script

> 1) context pulled from English .po and added to all the files as parameters in
> the ___() functions

I'll take this.

Thank Will for support!
(Assignee)

Comment 32

10 years ago
Attaching first version of a script to fix AMO context strings which are not included in thtml/php files in some cases.

The script takes two arguments: en_US messages.po and directory and iterates over file list to add missing context strings.

In my tip test env it skipped a few entities due to changes in thtml files (vs. po file).

Wil, let me know if that works for you!
(Assignee)

Comment 33

10 years ago
Oh, forgot to add. The fix will not make anyone happy without the patch from comment 20. :)
(Reporter)

Comment 34

10 years ago
Everything has been run and converted.  I'm marking this as resolved.  Thank you so much for writing these scripts, Gandalf!

I sent mail to dev-l10n-web, waiting for input from localizers but if there are problems we'll deal with them on a case by case basis.
Status: ASSIGNED → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED
Blocks: 511296
No longer blocks: 511296
Blocks: 511364
Blocks: 511367
Blocks: 511371
Attachment #392949 - Attachment is patch: true
Attachment #392949 - Attachment mime type: application/octet-stream → text/plain
Comment on attachment 392949 [details] [diff] [review]
patch for Pootle

I moved the AMO hook scripts out of the way in Pootle's repository, revision 12241.
Attachment #392949 - Flags: review+
(Reporter)

Comment 36

10 years ago
We are definitely using normal .po's now
Status: RESOLVED → VERIFIED
Product: addons.mozilla.org → addons.mozilla.org Graveyard
You need to log in before you can comment on or make changes to this bug.