Closed Bug 1037392 Opened 10 years ago Closed 10 years ago

[compare-locales][properties] unicode escapes should only trigger if they're followed by a hex

Categories

(Mozilla Localizations :: Infrastructure, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Pike, Assigned: Pike)

References

Details

Attachments

(1 file)

In Esperanto, we've triggered a compare-locales bug, by

some\unicode string.

The \u triggers the 0, which is bad.
Blocks: 1040019
This patch got slightly larger than I expected, and also more involved.

I rewrote the complete \ handling, and it turns out that my stats change.

Things like

sep = ,\\u20

are now not equal to 

sep = ,\u20

which, well, is good, I think.

The main culprit was that I did one kind of escape after the other, so the above ended up as \u20 first, and then '\ ', passing that one through.

I also had to refactor a bit, so that I can actually issue a warning, with proper location information.
Assignee: nobody → l10n
Status: NEW → ASSIGNED
Attachment #8459642 - Flags: review?(stas)
Comment on attachment 8459642 [details] [diff] [review]
rewrite escape handling, to be more sane, and correct

Review of attachment 8459642 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, with the following comment:

I'd suggest to only accept as valid Unicode escape sequences which consist of four hex digits.  Many language specs use the same requirements;  I checked Python, Java, C# and JavaScript.  See http://www.ecma-international.org/ecma-262/5.1/#sec-7.8.4 for instance:

UnicodeEscapeSequence ::
    u HexDigit HexDigit HexDigit HexDigit

::: lib/Mozilla/Parser.py
@@ +234,4 @@
>                    m.span('post'))
>  
>  class PropertiesParser(Parser):
> +  escape = re.compile('\\\\((?P<uni>u[0-9a-fA-F]{1,4})|'\

I'd be in favor of just recognizing u[0-9a-fA-F]{4} here, as I stated above.

Also, maybe use the raw string notation on the first line to avoid the backslash plague?

Lastly, IIRC, there's no need for the backslash at the end of the line; Python implicitly concatenates string literals inside parens.

@@ +241,5 @@
>      self.reKey = re.compile('^(\s*)((?:[#!].*?\n\s*)*)([^#!\s\n][^=:\n]*?)\s*[:=][ \t]*',re.M)
>      self.reHeader = re.compile('^\s*([#!].*\s*)+')
>      self.reFooter = re.compile('\s*([#!].*\s*)*$')
>      self._escapedEnd = re.compile(r'\\+$')
>      self._trailingWS = re.compile(r'[ \t]*$')

Any reason for those regexps to be instance variables instead of class variables like escape and known_escapes?

@@ +298,5 @@
> +      if found['uni']:
> +        return unichr(int(found['uni'][1:], 16))
> +      if found['nl']:
> +        return ''
> +      return self.known_escapes.get(found['single'], found['single'])

Nice.
Attachment #8459642 - Flags: review?(stas) → review+
(In reply to Staś Małolepszy :stas from comment #2)
> I'd suggest to only accept as valid Unicode escape sequences which consist
> of four hex digits.

Considering we have original strings in en-US for Gaia using \u20 (not sure about Gecko products), I wouldn't do that.
It looks like there's only 4 places \u20 is used:

apps/sms/sms.properties:104:thread-separator = \u20|\u20
apps/sms/sms.properties:109:carrier-separator = ,\u20

apps/system/system.properties:342:crash-reports-description-3-start=We use crash reports to try to fix problems and improve our products. We handle your information as we describe in our\u20

apps/settings/settings.properties:1196:crash-reports-description-3-start=We use crash reports to try to fix problems and improve our products. We handle your information as we describe in our\u20

(I used | ag "\\\\u[0-9a-fA-F]{1,2}[^0-9a-fA-F]" | to find them.)

Can we fix those strings on master?
(In reply to Staś Małolepszy :stas from comment #4)
> Can we fix those strings on master?

We could, but we'd also be broken all over the place. Just one string as an example
http://transvision.mozfr.org/string/?entity=apps/sms/sms.properties:carrier-separator&repo=gaia
I don't see a value in going for 4 byte escape sequences, that's just tedious to type.

One could argue about 1 byte ones, as they're practically of little use, but then again, what to do with them? We'd fall back to properly unescaping them anyway, I guess.
Looking at http://mxr.mozilla.org/l10n-mozilla-aurora/search?string=\\u&find=\.properties%24&findi=&filter=^[^\0]*%24&hitlimit=&tree=l10n-mozilla-aurora, though, I see how non-4-byte escapes can easily be errors, so it might make sense to actually add a warning for a shortened escape sequence.

For that, we should fix en-US, yes.

http://mxr.mozilla.org/l10n-gaia/search?string=\\u&find=\.properties%24&findi=&filter=^[^\0]*%24&hitlimit=&tree=l10n-gaia is a pretty mixed beast, too.
I filed bug 1042606 to discourage short unicode escapes.

This bug is FIXED, via http://hg.mozilla.org/l10n/compare-locales/rev/4939ef2a8086.

I'll cut a release later today.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: