Closed
Bug 1037392
Opened 10 years ago
Closed 10 years ago
[compare-locales][properties] unicode escapes should only trigger if they're followed by a hex
Categories
(Mozilla Localizations :: Infrastructure, defect)
Mozilla Localizations
Infrastructure
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Pike, Assigned: Pike)
References
Details
Attachments
(1 file)
5.12 KB,
patch
|
stas
:
review+
|
Details | Diff | Splinter Review |
In Esperanto, we've triggered a compare-locales bug, by some\unicode string. The \u triggers the 0, which is bad.
Assignee | ||
Comment 1•10 years ago
|
||
This patch got slightly larger than I expected, and also more involved. I rewrote the complete \ handling, and it turns out that my stats change. Things like sep = ,\\u20 are now not equal to sep = ,\u20 which, well, is good, I think. The main culprit was that I did one kind of escape after the other, so the above ended up as \u20 first, and then '\ ', passing that one through. I also had to refactor a bit, so that I can actually issue a warning, with proper location information.
Comment 2•10 years ago
|
||
Comment on attachment 8459642 [details] [diff] [review] rewrite escape handling, to be more sane, and correct Review of attachment 8459642 [details] [diff] [review]: ----------------------------------------------------------------- r=me, with the following comment: I'd suggest to only accept as valid Unicode escape sequences which consist of four hex digits. Many language specs use the same requirements; I checked Python, Java, C# and JavaScript. See http://www.ecma-international.org/ecma-262/5.1/#sec-7.8.4 for instance: UnicodeEscapeSequence :: u HexDigit HexDigit HexDigit HexDigit ::: lib/Mozilla/Parser.py @@ +234,4 @@ > m.span('post')) > > class PropertiesParser(Parser): > + escape = re.compile('\\\\((?P<uni>u[0-9a-fA-F]{1,4})|'\ I'd be in favor of just recognizing u[0-9a-fA-F]{4} here, as I stated above. Also, maybe use the raw string notation on the first line to avoid the backslash plague? Lastly, IIRC, there's no need for the backslash at the end of the line; Python implicitly concatenates string literals inside parens. @@ +241,5 @@ > self.reKey = re.compile('^(\s*)((?:[#!].*?\n\s*)*)([^#!\s\n][^=:\n]*?)\s*[:=][ \t]*',re.M) > self.reHeader = re.compile('^\s*([#!].*\s*)+') > self.reFooter = re.compile('\s*([#!].*\s*)*$') > self._escapedEnd = re.compile(r'\\+$') > self._trailingWS = re.compile(r'[ \t]*$') Any reason for those regexps to be instance variables instead of class variables like escape and known_escapes? @@ +298,5 @@ > + if found['uni']: > + return unichr(int(found['uni'][1:], 16)) > + if found['nl']: > + return '' > + return self.known_escapes.get(found['single'], found['single']) Nice.
Attachment #8459642 -
Flags: review?(stas) → review+
Comment 3•10 years ago
|
||
(In reply to Staś Małolepszy :stas from comment #2) > I'd suggest to only accept as valid Unicode escape sequences which consist > of four hex digits. Considering we have original strings in en-US for Gaia using \u20 (not sure about Gecko products), I wouldn't do that.
Comment 4•10 years ago
|
||
It looks like there's only 4 places \u20 is used: apps/sms/sms.properties:104:thread-separator = \u20|\u20 apps/sms/sms.properties:109:carrier-separator = ,\u20 apps/system/system.properties:342:crash-reports-description-3-start=We use crash reports to try to fix problems and improve our products. We handle your information as we describe in our\u20 apps/settings/settings.properties:1196:crash-reports-description-3-start=We use crash reports to try to fix problems and improve our products. We handle your information as we describe in our\u20 (I used | ag "\\\\u[0-9a-fA-F]{1,2}[^0-9a-fA-F]" | to find them.) Can we fix those strings on master?
Comment 5•10 years ago
|
||
(In reply to Staś Małolepszy :stas from comment #4) > Can we fix those strings on master? We could, but we'd also be broken all over the place. Just one string as an example http://transvision.mozfr.org/string/?entity=apps/sms/sms.properties:carrier-separator&repo=gaia
Assignee | ||
Comment 6•10 years ago
|
||
I don't see a value in going for 4 byte escape sequences, that's just tedious to type. One could argue about 1 byte ones, as they're practically of little use, but then again, what to do with them? We'd fall back to properly unescaping them anyway, I guess.
Assignee | ||
Comment 7•10 years ago
|
||
Looking at http://mxr.mozilla.org/l10n-mozilla-aurora/search?string=\\u&find=\.properties%24&findi=&filter=^[^\0]*%24&hitlimit=&tree=l10n-mozilla-aurora, though, I see how non-4-byte escapes can easily be errors, so it might make sense to actually add a warning for a shortened escape sequence. For that, we should fix en-US, yes. http://mxr.mozilla.org/l10n-gaia/search?string=\\u&find=\.properties%24&findi=&filter=^[^\0]*%24&hitlimit=&tree=l10n-gaia is a pretty mixed beast, too.
Assignee | ||
Comment 8•10 years ago
|
||
I filed bug 1042606 to discourage short unicode escapes. This bug is FIXED, via http://hg.mozilla.org/l10n/compare-locales/rev/4939ef2a8086. I'll cut a release later today.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•