Closed Bug 501988 Opened 16 years ago Closed 16 years ago

Convert AMO to use normal .po files

Tracking

(Not tracked)

Status:

VERIFIED FIXED

Milestone:

5.0.9

People

(Reporter: clouserw, Assigned: zbraniecki)

References

Details

Attachments

(8 files, 1 obsolete file)

Silme script used to do the same thing for mozillaservice.org 16 years ago Staś Małolepszy :stas 1.25 KB, text/plain		Details
patch for config/core.php 16 years ago Zibi Braniecki [:zbraniecki][:gandalf] 3.56 KB, patch		Details \| Diff \| Splinter Review
example of a po/LC_MESSAGES/messages.po conversion patch 16 years ago Zibi Braniecki [:zbraniecki][:gandalf] 350.31 KB, patch		Details \| Diff \| Splinter Review
patch to mono2po script 16 years ago Zibi Braniecki [:zbraniecki][:gandalf] 2.22 KB, patch		Details \| Diff \| Splinter Review
a script to convert .php and .thtml files 16 years ago Zibi Braniecki [:zbraniecki][:gandalf] 2.41 KB, text/plain		Details
example result patch from step 3 16 years ago Zibi Braniecki [:zbraniecki][:gandalf] 755.28 KB, patch		Details \| Diff \| Splinter Review
patch for Pootle 16 years ago Zibi Braniecki [:zbraniecki][:gandalf] 8.55 KB, patch	wenzel : review+	Details \| Diff \| Splinter Review
amo2po convet script, take 2 16 years ago Zibi Braniecki [:zbraniecki][:gandalf] 2.92 KB, text/plain		Details
fix amo context script 16 years ago Zibi Braniecki [:zbraniecki][:gandalf] 3.43 KB, text/plain		Details

Wil Clouser [:clouserw]

Reporter

Description

•

16 years ago

Our modified .po files have served us well for a long time, but the .po format has been improved and our methods are no longer necessary. The current en-US .po file: http://viewvc.svn.mozilla.org/vc/addons/trunk/site/app/locale/en_US/LC_MESSAGES/messages.po?view=markup We need a magical script that will: 1) Open the en-US .po file and each locale's .po file. In each locale's .po file replace the msgid with the corresponding msgstr from the en-US .po file. Look out for plural forms! 2) Open the AMO tree (/trunk/site/app/) and recursively open each file looking for the en-US msgids. Replace the msgids with their English counterparts (doing appropriate quote substitution if needed). Again, look out for plural forms. 2b) Bonus points: AMO is using ___() and n___()* for custom fallback which should no longer be necessary. For the bonus points replace ___() with _() and n___() with ngettext(). Note the AMO functions have a special last parameter (which is optional though!) which can be discarded. 3) Do step 1 for the en-US .po file - replace the msgid with the msgstr. After this our .po files will be normal (woo). 4) Refresh the home page and see if I forgot any steps. :) * The main difference is the last parameter is an optional fallback translation. They are defined at the bottom of http://viewvc.svn.mozilla.org/vc/addons/trunk/site/app/config/core.php?view=markup

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Updated

•

16 years ago

Assignee: nobody → gandalf

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Updated

•

16 years ago

Status: NEW → ASSIGNED

Staś Małolepszy :stas

Comment 1

•

16 years ago

Attached file Silme script used to do the same thing for mozillaservice.org — Details

I'm not sure if this will be helpful, but I wrote a Silme script to do that when working with ozten on mozillaservice.org's l10n. It's really simple and AFAIR it didn't handle plurals well, but that was OK, as there wasn't many plural string in the moservice's po file. Maybe with some tweaks it could serve as a base for a script for this bug.

Staś Małolepszy :stas

Updated

•

16 years ago

Attachment #386633 - Attachment description: Silme script used to do the same think for mozillaservice.org → Silme script used to do the same thing for mozillaservice.org

Fred Wenzel [:wenzel]

Comment 2

•

16 years ago

Just a heads-up, after this change, we'll need to properly hook this into Verbatim/Pootle. The current scripts we have there expect the old AMO .po format.

Wil Clouser [:clouserw]

Reporter

Comment 3

•

16 years ago

(In reply to comment #2) > Just a heads-up, after this change, we'll need to properly hook this into > Verbatim/Pootle. The current scripts we have there expect the old AMO .po > format. Should just be a matter of turning off the scripts and not having to deal with crazy conversions and ridiculous requirements (like the en-US file always up to date). This is a good thing for verbatim.

Fred Wenzel [:wenzel]

Comment 4

•

16 years ago

(In reply to comment #3) > (In reply to comment #2) > > Just a heads-up, after this change, we'll need to properly hook this into > > Verbatim/Pootle. The current scripts we have there expect the old AMO .po > > format. > > Should just be a matter of turning off the scripts and not having to deal with > crazy conversions and ridiculous requirements (like the en-US file always up to > date). This is a good thing for verbatim. Nice! I'll also file a bug shortly about this for verbatim.

Fred Wenzel [:wenzel]

Updated

•

16 years ago

Blocks: 502222

Wil Clouser [:clouserw]

Reporter

Comment 5

•

16 years ago

Alright, I thought about this some more. We can't use straight up _() or we'll lose our context and we can't use pgettext() (the standard for context) because PHP doesn't support it. We can work around that by specifying keywords when we extract the strings so that's not a big deal, but I think we should continue to use ___() for all strings and have it be in the format: ___('This is my message', 'this_is_my_context') n___('This is singular', 'This is plural', $num, 'this_is_context') So, instead of converting everything to _() and ngettext() it should follow this format ^^. Sorry for the change in specs, but this should work. Happy to have feedback. For the record, to extract this format pass this to xgettext: --keyword=___:1,2c --keyword=n___:1,2,4c

Wil Clouser [:clouserw]

Reporter

Comment 6

•

16 years ago

Aww, the pgettext format has a different order of parameters and I guess it's best to stick as close as we can to that. so: ___('this_is_my_context', 'this is my string) n___('this_is_my_context', 'this is singular', 'this is plural', $num) And the keywords are: --keyword=___:1c,2 --keyword=n___:1c,2,3

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 7

•

16 years ago

An update to the story. 1) I have a script that turns AMO .po file into normal gettext either with or without msgctxt. 2) I have a modification to ___() to play nice with normal gettext. Now: 3) PHP does not support msgctxt at all (if there's a context you have to query for _('context_msg\004eng_msg') which sux. 4) I can remove the context overall but it generates 155 duplicated entities (cases when we have more than one english entity with the same name). In all cases the message in polish locale is the same for all translations, so if we rely solely on PL we can remove the duplicates and go with clean .po file without context and then go with clean _() in PHP. The only other way would be to keep the context and keep using ___() but it requires us to always have context in ___(). Will, Fred, what is your opinion?

Alexandru Szasz

Comment 8

•

16 years ago

Regarding msgctxt support in PHP, there's a simple workaround listed here: http://us2.php.net/manual/en/book.gettext.php#89975 that is supposed to work.

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 9

•

16 years ago

Alexandru: I know. That's why I wrote that I can keep ___() but then I need to have context for each string.

Fred Wenzel [:wenzel]

Comment 10

•

16 years ago

(In reply to comment #7) > 3) PHP does not support msgctxt at all (if there's a context you have to query > for _('context_msg\004eng_msg') which sux. I don't think it's a problem to keep using ___() everywhere in AMO. Can you give an example of what such a call would look like? Right now, it contains the pseudo-tag and often also an English fallback -- it wouldn't become any more than that would it?

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 11

•

16 years ago

Yea, so in the case of keeping ___() we will have to provide english string + pseudo tag in all places. The only reason I'd prefer to get rid of context+pseudo tags is that we, in fact don't need them, and it is a good moment now to clean it up.

Fred Wenzel [:wenzel]

Comment 12

•

16 years ago

What would that look like? Can we infer the context from somewhere automatically? Are you saying most of the time we do not need context at all, only to disambiguate multiple identical English strings? Or do we add context everywhere?

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 13

•

16 years ago

We use context (in the normal gettext mode) only to disambiguate between multiple identical English strings. Example: "What's next?" is placed twice and the only difference is the context thing, but all locales translate both in the same way, so keeping it separated is not needed. Etc.

Wil Clouser [:clouserw]

Reporter

Comment 14

•

16 years ago

I think we need to have context be an option, but if we can keep it optional without putting it on every string I think that would be fine. I'm not sure how we would determine which strings it's needed on right now (I'm pretty sure it is needed right now though). Any localizers following this are invited to comment. :)

Alexandru Szasz

Comment 15

•

16 years ago

For me, I think there would be few cases where I would want to translate depending on the context. Writing contexts would require also maintaining them. I didn't find any big project among the translations I do or import in Narro that uses msgctxt. Although I support msgctxt in Narro. But I have to admit it's nice when you translate to see something like this: #: views/promote/press.php:2 msgctxt "Used on the main page, on the top right" msgid "What's the good word?" msgstr "" So if you decide to use msgctxt, I suggest doing it for every text, not spending time judging which text needs context. Event if you put the same context like "On the main page" for all the texts displayed on the main page it's still helpful.

Julen Ruiz Aizpuru

Comment 16

•

16 years ago

(In reply to comment #15) > For me, I think there would be few cases where I would want to translate > depending on the context. Writing contexts would require also maintaining them. But for those cases it would be definitely necessary to disambiguate the word based on the context. For example, the word "View" (pretty common on UIs) has at least two categories in English: it can act as a name or as a verb. If your msgid contains only that string, you can't expect other languages to have the same translation for both categories. In Basque, if "View" is a verb you would translate to "Ikusi", whereas it would be "Ikuspegi" if "View" acts as a name. > I didn't find any big project among the translations I do or import in Narro > that uses msgctxt. Although I support msgctxt in Narro. WordPress uses msgctxt, as well as KDE and GNOME. > But I have to admit it's nice when you translate to see something like this: > > #: views/promote/press.php:2 > msgctxt "Used on the main page, on the top right" > msgid "What's the good word?" > msgstr "" > > So if you decide to use msgctxt, I suggest doing it for every text, not > spending time judging which text needs context. Event if you put the same > context like "On the main page" for all the texts displayed on the main page > it's still helpful. I wouldn't do it for every text unless necessary, because in that case you would be repeating strings with the same meaning (and in the same context) over and over again. So translators would have more (unnecessary) work to do, and you wouldn't be taking advantage of the power gettext gives you. For its closeness on the implementation side, it's worth pointing at: http://core.trac.wordpress.org/ticket/9112

Alexandru Szasz

Comment 17

•

16 years ago

(In reply to comment #16) > (In reply to comment #15) > > > So if you decide to use msgctxt, I suggest doing it for every text, not > > spending time judging which text needs context. Event if you put the same > > context like "On the main page" for all the texts displayed on the main page > > it's still helpful. > > I wouldn't do it for every text unless necessary, because in that case you > would be repeating strings with the same meaning (and in the same context) over > and over again. So translators would have more (unnecessary) work to do, and > you wouldn't be taking advantage of the power gettext gives you. > Isn't the uniqueness based on msgctxt and msgid ? If it's the same msgctxt you wouldn't have to translate more than once, right ?

Julen Ruiz Aizpuru

Comment 18

•

16 years ago

(In reply to comment #17) > (In reply to comment #16) > > (In reply to comment #15) > > > > > So if you decide to use msgctxt, I suggest doing it for every text, not > > > spending time judging which text needs context. Event if you put the same > > > context like "On the main page" for all the texts displayed on the main page > > > it's still helpful. > > > > I wouldn't do it for every text unless necessary, because in that case you > > would be repeating strings with the same meaning (and in the same context) over > > and over again. So translators would have more (unnecessary) work to do, and > > you wouldn't be taking advantage of the power gettext gives you. > > > > Isn't the uniqueness based on msgctxt and msgid ? If it's the same msgctxt you > wouldn't have to translate more than once, right ? Yes but if you suggest to add context for every text, then you have to keep track of every context and take care of adding every identical text with the same context. I don't think this is very practical for developers. Thus I suggest adding context only for those strings that disambiguation is required.

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 19

•

16 years ago

Ok, all pieces are in place. I'm going to attach all patches in the next comments. That's the procedure: 1) We switch config/core.php ___() and n___() to the new model 2) We convert all .po files in all locales to normal gettext using translatetoolkit mono2po.py with patches 3) We patch all ./config ./controller ./views to the new model using a script 4) We patch Pootle to handle AMO like a normal gettext project 5) We upgrade scripts to dump .po files from AMO project Notes: 2) on the way we create a map required to execute step 3) 3) patching will require human touch in 3 cases and manual conversion of n___ functions (there are 25 of them) 5) is not ready patches are coming

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 20

•

16 years ago

Attached patch patch for config/core.php — Details — Splinter Review

patch for step 1

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 21

•

16 years ago

Attached patch example of a po/LC_MESSAGES/messages.po conversion patch — Details — Splinter Review

example result of step 2

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 22

•

16 years ago

Attached patch patch to mono2po script — Details — Splinter Review

patch for mono2po (step 2) to handle the switch and create a map pickle

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 23

•

16 years ago

Attached file a script to convert *.php and *.thtml files (obsolete) — Details

script for step 3

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 24

•

16 years ago

Attached patch example result patch from step 3 — Details — Splinter Review

example result patch from step 3

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 25

•

16 years ago

Attached patch patch for Pootle — Details — Splinter Review

patch for Pootle AMO support, step 4

Fred Wenzel [:wenzel]

Comment 26

•

16 years ago

(In reply to comment #19) > Ok, all pieces are in place. Nice, Gandalf!

Wil Clouser [:clouserw]

Reporter

Comment 27

•

16 years ago

Thanks Gandalf! I just ran these scripts and they appear to work very well. One thing I had to add was: if os.path.splitext(name)[1].lower() in ['.php','.thtml']: right above find_strings() in main() in amo_php_conv.py to keep it from crashing on binary files and things. Aside from that, can you alter amo_php_conv.py to print out the new strings with single quotes instead of double? There shouldn't be any variables in them anyway and it will let us just use $'s without having to escape them first. To localizers: I'll try to land this asap to give it some bake time before our next release.

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 28

•

16 years ago

Attached file amo2po convet script, take 2 — Details

updated according to will's feedback. Two notes: 1) n___() will have to be solved manually 2) my local remora keeps crashing on some screens so I may not be able to find issues with the script, but first impression is that it works :) I'd appreciate feedback and testing. stage server would help too probably at some point, but it's up to you.

Attachment #392946 - Attachment is obsolete: true

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Updated

•

16 years ago

Attachment #393781 - Attachment is patch: false

Wil Clouser [:clouserw]

Reporter

Updated

•

16 years ago

Target Milestone: --- → 5.0.9

Wil Clouser [:clouserw]

Reporter

Comment 29

•

16 years ago

All scripts are run and changes committed. The script in comment 28 didn't escape single quotes but I didn't notice until it was too late. I went through and changed most (I hope :) of them manually. There is a good chance I missed some so let me know if you find anything - they'll surface as syntax errors so will be easy to spot/fix.

Wil Clouser [:clouserw]

Reporter

Comment 30

•

16 years ago

Alright, in addition to escaping single quotes: 1) Any __() that spanned multiple lines wasn't picked up. Pretty sure I got all these manually. 2) .po headers were lost in the conversion (author info, plural forms, etc.). I restored these from the old files. 3) context was saved in .po files but not in the actual files. If context exists in the .po, it should remain in ___() and n___() too. I adjusted our extraction script to look for ___:1, ___:1,2c, n___:1,2, and n___:1,2,4c. However, due to #3 above, it extracts all strings with contexts now and combines them in the .po file. This makes them get marked as fuzzy automatically. In my opinion, I've done too much tonight to revert and fix #3 in the conversion script (sorry I didn't catch it earlier). I've been working on this since about 5pm though and I don't have enough brain power left to write another script to do this for me. So, where I'm at: - the .po's are in the freshly converted state, including English. That means that context still exists in them. - /site/app/locale/extract-po.sh will scan all our files for strings and automatically merge them into the en-US messages.po. If you want to see where we are at, run that and witness our missing context. :( It also has about 700 lines of deprecated strings - that's expected I think and once we fix the context problem I'll just delete them Stuff that still needs to happen: 1) context pulled from English .po and added to all the files as parameters in the ___() functions 2) extract-po.sh run, make sure it doesn't completely break the English .po 3) (Note to self: extraction script is deleting dev comments in the .po file. It doesn't delete localizer comments so we should switch back to those but verify merging from en-US->$lang preserves localizer comments too) 4) Merge en-US -> all locales. Then notify localizers that new files are in place and ready to look at. I'm going to try to read through my bugmail tonight, so ping me on IRC when you get up and I might still be around. Otherwise, I'll talk to you later on. Thanks.

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 31

•

16 years ago

(In reply to comment #30) > 1) Any __() that spanned multiple lines wasn't picked up. Pretty sure I got > all these manually. I just verified that it did work for me. Can it be your python vs. my python version? > 2) .po headers were lost in the conversion (author info, plural forms, etc.). > I restored these from the old files. Same here. As you can see in the example patch to .po attached here it did not remove headers. > 3) context was saved in .po files but not in the actual files. If context > exists in the .po, it should remain in ___() and n___() too. Same here. If you take a look at the patch example attached here the context has been added where it stayed in .po (via map.pkl) > I adjusted our extraction script to look for ___:1, ___:1,2c, n___:1,2, and > n___:1,2,4c. However, due to #3 above, it extracts all strings with contexts > now and combines them in the .po file. This makes them get marked as fuzzy > automatically. Ouch :( I didn't play with extraction script > 1) context pulled from English .po and added to all the files as parameters in > the ___() functions I'll take this. Thank Will for support!

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 32

•

16 years ago

Attached file fix amo context script — Details

Attaching first version of a script to fix AMO context strings which are not included in thtml/php files in some cases. The script takes two arguments: en_US messages.po and directory and iterates over file list to add missing context strings. In my tip test env it skipped a few entities due to changes in thtml files (vs. po file). Wil, let me know if that works for you!

Zibi Braniecki [:zbraniecki][:gandalf]

Assignee

Comment 33

•

16 years ago

Oh, forgot to add. The fix will not make anyone happy without the patch from comment 20. :)

Wil Clouser [:clouserw]

Reporter

Comment 34

•

16 years ago

Everything has been run and converted. I'm marking this as resolved. Thank you so much for writing these scripts, Gandalf! I sent mail to dev-l10n-web, waiting for input from localizers but if there are problems we'll deal with them on a case by case basis.

Status: ASSIGNED → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Fred Wenzel [:wenzel]

Updated

•

16 years ago

Blocks: 511296

Fred Wenzel [:wenzel]

Updated

•

16 years ago

No longer blocks: 511296

Fred Wenzel [:wenzel]

Updated

•

16 years ago

Blocks: 511364

Fred Wenzel [:wenzel]

Updated

•

16 years ago

Blocks: 511367

Fred Wenzel [:wenzel]

Updated

•

16 years ago

Blocks: 511371

Fred Wenzel [:wenzel]

Updated

•

16 years ago

Attachment #392949 - Attachment is patch: true

Attachment #392949 - Attachment mime type: application/octet-stream → text/plain

Fred Wenzel [:wenzel]

Comment 35

•

16 years ago

Comment on attachment 392949 [details] [diff] [review] patch for Pootle I moved the AMO hook scripts out of the way in Pootle's repository, revision 12241.

Attachment #392949 - Flags: review+

Wil Clouser [:clouserw]

Reporter

Comment 36

•

16 years ago

We are definitely using normal .po's now

Status: RESOLVED → VERIFIED

Nobody; OK to take it and work on it

Updated

•

10 years ago

Product: addons.mozilla.org → addons.mozilla.org Graveyard

You need to log in before you can comment on or make changes to this bug.