Improve the heuristics used to make Accented English strings longer

NEW
Unassigned

Status

Firefox OS
Gaia::L10n
P4
normal
3 years ago
2 years ago

People

(Reporter: stas, Unassigned)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [good first bug])

Attachments

(1 attachment)

(Reporter)

Description

3 years ago
Bug 900182 will add a fake Accented English locale.  It currently uses a very simple method of making its strings longer than regular English: every vowel is doubled.
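
For illustration, the vowel-doubling rule described above can be sketched as follows. This is a minimal sketch, not the code from l10n.js: the function name `doubleVowels` is hypothetical, and it assumes only the ASCII vowels a, e, i, o, u count.

```javascript
// Hypothetical helper (not the actual l10n.js function):
// repeats every ASCII vowel once, leaving all other characters alone.
function doubleVowels(text) {
  return text.replace(/[aeiouAEIOU]/g, (vowel) => vowel + vowel);
}

console.log(doubleVowels('Settings')); // "Seettiings"
```

Because growth depends only on the number of vowels, the increase is roughly proportional to the length of the string.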

This could be improved, with the goal of making short strings be affected to a greater extent than long strings.

See flod's http://l10n.mozilla-community.org/~flod/compare_length/ for some length statistics.  French might be a good example:  strings are 33% longer than English on average, but the average increase differs depending on the length of the original string:

  short   (0, 5]   +58% (+2.3 chars)
  medium  (5, 10]  +31% (+2.5 chars)
  long    (10, 20] +38% (+5.6 chars)
  phrase  (20, ∞)  +25% (+13 chars)

The current vowel strategy might be too linear to be useful.
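
A rough sketch of how per-bucket statistics like the ones above could be computed for any lengthening strategy. The names `bucketFor` and `lengthStats` are hypothetical, and the way the percentage is derived (added characters divided by original characters within a bucket) is an assumption; the thread does not say how the linked statistics were calculated.

```javascript
// Buckets follow the thread: (0, 5], (5, 10], (10, 20], (20, ∞).
const BUCKETS = [
  { name: 'short', max: 5 },
  { name: 'medium', max: 10 },
  { name: 'long', max: 20 },
  { name: 'phrase', max: Infinity },
];

function bucketFor(length) {
  return BUCKETS.find((bucket) => length <= bucket.max).name;
}

// pairs: array of [original, transformed]; empty originals are skipped.
// Assumption: percent = added characters / original characters per bucket.
function lengthStats(pairs) {
  const totals = {};
  for (const [original, transformed] of pairs) {
    if (original.length === 0) continue;
    const name = bucketFor(original.length);
    if (!totals[name]) {
      totals[name] = { count: 0, origChars: 0, addedChars: 0 };
    }
    totals[name].count += 1;
    totals[name].origChars += original.length;
    totals[name].addedChars += transformed.length - original.length;
  }
  const stats = {};
  for (const [name, t] of Object.entries(totals)) {
    stats[name] = {
      percent: (100 * t.addedChars) / t.origChars,
      chars: t.addedChars / t.count, // average characters added per string
    };
  }
  return stats;
}
```

Feeding it pairs of English strings and their pseudolocalized versions gives one row per bucket, which makes it easy to compare a candidate heuristic against the French figures.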
Priority: -- → P4
(Reporter)

Updated

3 years ago
Whiteboard: [good first bug]
If I didn't mess up my tests, the current strategy (double vowels) gives us these stats.

  global           +32.2%  (+7.1 chars)
  short   (0, 5]   +34.4%  (+1.4 chars)
  medium  (5, 10]  +35.0%  (+2.7 chars)
  long    (10, 20] +30.8%  (+4.7 chars)
  phrase  (20, ∞)  +30.6%  (+15.8 chars)

So, the only big difference is with short strings.
Maybe it would be enough to double one random character if the string is 2 to 5 characters long.
(In reply to Francesco Lodolo [:flod] from comment #1)
> If I didn't mess up my tests, the current strategy (double vowels) gives us
> these stats.

Actually, this is not reliable: I didn't account for variable names in strings.
I'm using a pseudo-locale created with ugly Python code, so it might not be completely reliable (e.g. I'm trying not to double vowels in time/date formats and variable names). Anyhow, here is some data.

Current strategy (double vowels)

  global           +30.12%  (+ 6.62 chars)
  short   (0, 5]   +34.27%  (+ 1.35 chars)
  medium  (5, 10]  +34.36%  (+ 2.65 chars)
  long    (10, 20] +27.70%  (+ 4.15 chars)
  phrase  (20, ∞)  +27.93%  (+14.92 chars)

Current strategy + double last character if length of the original string is <= 4

  global           +33.32%  (+ 6.76 chars)
  short   (0, 5]   +56.69%  (+ 2.04 chars)
  medium  (5, 10]  +34.57%  (+ 2.67 chars)
  long    (10, 20] +28.32%  (+ 4.25 chars)
  phrase  (20, ∞)  +27.99%  (+14.93 chars)

Current strategy + double last character if length of the new string (with doubled vowels) is <= 5

  global           +34.14%  (+ 6.80 chars)
  short   (0, 5]   +62.98%  (+ 2.35 chars)
  medium  (5, 10]  +34.57%  (+ 2.67 chars)
  long    (10, 20] +28.32%  (+ 4.25 chars)
  phrase  (20, ∞)  +27.99%  (+14.93 chars)
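
The two variants measured above can be sketched like this. It is a simplified illustration with hypothetical function names; it ignores variables and date formats, so it will not reproduce the exact figures in the tables.

```javascript
// Hypothetical helpers, not taken from l10n.js or the Python script.
function doubleVowels(text) {
  return text.replace(/[aeiouAEIOU]/g, (vowel) => vowel + vowel);
}

function doubleLastChar(text) {
  return text.length > 0 ? text + text[text.length - 1] : text;
}

// Variant A: extra character if the ORIGINAL string has <= 4 characters.
function makeLongerByOriginalLength(text) {
  const longer = doubleVowels(text);
  return text.length <= 4 ? doubleLastChar(longer) : longer;
}

// Variant B: extra character if the string AFTER vowel doubling
// has <= 5 characters.
function makeLongerByNewLength(text) {
  const longer = doubleVowels(text);
  return longer.length <= 5 ? doubleLastChar(longer) : longer;
}

console.log(makeLongerByOriginalLength('Done')); // "Dooneee"
console.log(makeLongerByNewLength('Done'));      // "Doonee"
```

In this sketch the two thresholds select different strings: variant A always hits originals of up to 4 characters, while variant B also reaches 5-character strings without vowels but skips short strings that vowel doubling has already pushed past 5 characters.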
Some more thoughts: while I believe that l10n.js has access to a "compiled" string (with variables replaced), I only have access to the original strings. 

So, in my case "SIM {n}" is 7 characters long, while for l10n.js it should be just 5 (e.g. "SIM 1").

Besides that, I think the data could still be useful. I fixed some more lousy variable replacements and added gaia-l10n locales to the picture:
http://l10n.mozilla-community.org/~flod/compare_length_gaia/
(Reporter)

Comment 5

3 years ago
Actually, l10n.js pseudolocalizes the strings in their raw form, right after the l10n resources are downloaded:

  https://github.com/mozilla-b2g/gaia/blob/ea93363a8c424d65a9ad91438ce6961377a20f98/shared/js/l10n.js#L1086-L1101

The raw form for "foo bar {n} baz" is just "foo bar {n} baz". However, the value passed to the makeLonger function here:

  https://github.com/mozilla-b2g/gaia/blob/ea93363a8c424d65a9ad91438ce6961377a20f98/shared/js/l10n.js#L941-L945

is "foo bar" and then "baz", because I tried to exclude tokens and syntax which should not be pseudolocalized:

  https://github.com/mozilla-b2g/gaia/blob/ea93363a8c424d65a9ad91438ce6961377a20f98/shared/js/l10n.js#L966-L981
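
A simplified sketch of the idea described in this comment: split the raw string around excluded tokens and transform only the text in between. The regular expression and function names below are assumptions for illustration; the pattern only recognizes `{n}`-style variables, and the linked l10n.js code should be consulted for the real exclusion rules.

```javascript
// Hypothetical helper used as the transformation.
function doubleVowels(text) {
  return text.replace(/[aeiouAEIOU]/g, (vowel) => vowel + vowel);
}

// Splitting on a pattern with one capture group keeps the matched
// placeholders in the result, at the odd indices of the array.
function transformOutsidePlaceholders(transform, text) {
  const placeholder = /(\{\s*\w+\s*\})/; // assumed pattern, {n}-style only
  return text
    .split(placeholder)
    .map((part, index) => (index % 2 === 1 ? part : transform(part)))
    .join('');
}

console.log(transformOutsidePlaceholders(doubleVowels, 'foo bar {n} baz'));
// "foooo baar {n} baaz"
```

With this approach the transformation never sees the string as a whole, so it cannot base decisions on the total length.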
Created attachment 8479668 [details] [diff] [review]
23349.patch

Not sure if I should ask for feedback or review, let's start with f?.

This is the most intrusive version of the patch:
* (not really necessary) makeAccented is renamed as remapAlphaCharacters, since it doesn't do just Accented English and the name is confusing.
* Instead of passing pieces of strings to makeLonger and makeRTL, I'm replacing variables with placeholders, applying the transformation to the entire string, and then restoring the variables.
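
The placeholder approach in the second point could look roughly like the sketch below. It is not the code from the patch: the function names are hypothetical, and it assumes that the source strings contain no Unicode private-use characters and that the transformation leaves those characters untouched.

```javascript
// Hypothetical helper used as the transformation.
function doubleVowels(text) {
  return text.replace(/[aeiouAEIOU]/g, (vowel) => vowel + vowel);
}

function withVariablesProtected(transform, text) {
  const saved = [];
  // Swap each {variable} for one private-use character (U+E000 + index),
  // so the transformation sees the whole string but cannot alter variables.
  const masked = text.replace(/\{\s*\w+\s*\}/g, (match) => {
    saved.push(match);
    return String.fromCharCode(0xe000 + saved.length - 1);
  });
  // Map every private-use character back to the variable it stands for.
  return transform(masked).replace(/[\uE000-\uF8FF]/g, (ch) => {
    const original = saved[ch.charCodeAt(0) - 0xe000];
    return original === undefined ? ch : original;
  });
}

console.log(withVariablesProtected(doubleVowels, 'SIM {n}')); // "SIIM {n}"
```

A side effect of this design is that each variable counts as a single character while the transformation runs, so "SIM {n}" is seen as 5 characters long rather than 7.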

The alternative patch changes only makeLonger, but I need more checks on the last transformation:
* exclude val.length<=1
* exclude val = ''

Github PR
https://github.com/mozilla-b2g/gaia/pull/23349
Attachment #8479668 - Flags: feedback?(stas)
(Reporter)

Comment 7

3 years ago
Thanks for the patch, flod!  I'll review it next week;  this week has been a little bit crazy because of the FL deadline.
(Reporter)

Comment 8

3 years ago
Comment on attachment 8479668 [details] [diff] [review]
23349.patch

Review of attachment 8479668 [details] [diff] [review]:
-----------------------------------------------------------------

Hey Flod, thanks for the patch.  I agree with the name changes, but I think I'd like to try out an alternative approach to this one.  In your patch, the string is taken as a whole and, based on its total length, the last word might become longer in certain cases.  This leads to non-determinism:  one word can have more than one 'translation' into pseudolocales with this method.  Instead, I wonder if it would be possible to look at each word separately and come up with rules that would make certain short words longer consistently.  You could also make longer words grow only by a little, to compensate for the lengthened short words in full sentences.

What do you think?
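
One possible reading of the per-word suggestion, as a minimal sketch. The rules here (words of up to 4 letters get every vowel doubled plus a repeated final letter, longer words get only their first vowel doubled) and the function names are invented for illustration; the review does not specify any concrete rules. Variables such as {n} are not handled, for brevity.

```javascript
// Hypothetical per-word rule: the output depends only on the word itself,
// so a given word is always rendered the same way.
function lengthenWord(word) {
  if (word.length <= 4) {
    // Short words: double every vowel, then repeat the final letter.
    const doubled = word.replace(/[aeiou]/gi, (vowel) => vowel + vowel);
    return doubled + doubled[doubled.length - 1];
  }
  // Longer words: grow only a little by doubling the first vowel.
  return word.replace(/[aeiou]/i, (vowel) => vowel + vowel);
}

function lengthenByWord(text) {
  return text.replace(/[A-Za-z]+/g, (word) => lengthenWord(word));
}

console.log(lengthenByWord('Done'));         // "Dooneee"
console.log(lengthenByWord('Done editing')); // "Dooneee eediting"
```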
Attachment #8479668 - Flags: feedback?(stas) → feedback-
I'm not sure I see value in being deterministic: a pseudolocale needs to be understandable, and qps-ploc gives us more realistic coverage by increasing the original en-US length. The numbers say that we're already doing a good job in every bucket except very short words (under 5 characters).

Also, while this becomes deterministic for single words (i.e. 'Done' is always rendered in the same way), the inflation for longer strings becomes a lot harder to measure: a string made of short words could become extremely long (e.g. "You can not make calls, send messages or go online because emergency callback mode is enabled. Would you like to turn it off?"), unless you want to consider the entire sentence.
(Reporter)

Updated

2 years ago
Blocks: 1143275