Closed Bug 1037392 Opened 10 years ago Closed 10 years ago

[compare-locales][properties] unicode escapes should only trigger if they're followed by a hex

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: Pike, Assigned: Pike)

References

Details

Attachments

(1 file)

rewrite escape handling, to be more sane, and correct 10 years ago Axel Hecht [:Pike] 5.12 KB, patch	stas : review+	Details \| Diff \| Splinter Review

Axel Hecht [:Pike]

Assignee

Description

•

10 years ago

In Esperanto, we've triggered a compare-locales bug, by

some\unicode string.

The \u triggers the 0, which is bad.

Axel Hecht [:Pike]

Assignee

Updated

•

10 years ago

Blocks: 1040019

Axel Hecht [:Pike]

Assignee

Comment 1

•

10 years ago

Attached patch rewrite escape handling, to be more sane, and correct — Details — Splinter Review

This patch got slightly larger than I expected, and also more involved.

I rewrote the complete \ handling, and it turns out that my stats change.

Things like

sep = ,\\u20

are now not equal to 

sep = ,\u20

which, well, is good, I think.

The main culprit was that I did one kind of escape after the other, so the above ended up as \u20 first, and then '\ ', passing that one through.

I also had to refactor a bit, so that I can actually issue a warning, with proper location information.

Assignee: nobody → l10n

Status: NEW → ASSIGNED

Attachment #8459642 - Flags: review?(stas)

Staś Małolepszy :stas

Comment 2

•

10 years ago

Comment on attachment 8459642 [details] [diff] [review]
rewrite escape handling, to be more sane, and correct

Review of attachment 8459642 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, with the following comment:

I'd suggest to only accept as valid Unicode escape sequences which consist of four hex digits.  Many language specs use the same requirements;  I checked Python, Java, C# and JavaScript.  See http://www.ecma-international.org/ecma-262/5.1/#sec-7.8.4 for instance:

UnicodeEscapeSequence ::
    u HexDigit HexDigit HexDigit HexDigit

::: lib/Mozilla/Parser.py
@@ +234,4 @@
>                    m.span('post'))
>  
>  class PropertiesParser(Parser):
> +  escape = re.compile('\\\\((?P<uni>u[0-9a-fA-F]{1,4})|'\

I'd be in favor of just recognizing u[0-9a-fA-F]{4} here, as I stated above.

Also, maybe use the raw string notation on the first line to avoid the backslash plague?

Lastly, IIRC, there's no need for the backslash at the end of the line; Python implicitly concatenates string literals inside parens.

@@ +241,5 @@
>      self.reKey = re.compile('^(\s*)((?:[#!].*?\n\s*)*)([^#!\s\n][^=:\n]*?)\s*[:=][ \t]*',re.M)
>      self.reHeader = re.compile('^\s*([#!].*\s*)+')
>      self.reFooter = re.compile('\s*([#!].*\s*)*$')
>      self._escapedEnd = re.compile(r'\\+$')
>      self._trailingWS = re.compile(r'[ \t]*$')

Any reason for those regexps to be instance variables instead of class variables like escape and known_escapes?

@@ +298,5 @@
> +      if found['uni']:
> +        return unichr(int(found['uni'][1:], 16))
> +      if found['nl']:
> +        return ''
> +      return self.known_escapes.get(found['single'], found['single'])

Nice.

Attachment #8459642 - Flags: review?(stas) → review+

Francesco Lodolo [:flod]

Comment 3

•

10 years ago

(In reply to Staś Małolepszy :stas from comment #2)
> I'd suggest to only accept as valid Unicode escape sequences which consist
> of four hex digits.

Considering we have original strings in en-US for Gaia using \u20 (not sure about Gecko products), I wouldn't do that.

Staś Małolepszy :stas

Comment 4

•

10 years ago

It looks like there's only 4 places \u20 is used:

apps/sms/sms.properties:104:thread-separator = \u20|\u20
apps/sms/sms.properties:109:carrier-separator = ,\u20

apps/system/system.properties:342:crash-reports-description-3-start=We use crash reports to try to fix problems and improve our products. We handle your information as we describe in our\u20

apps/settings/settings.properties:1196:crash-reports-description-3-start=We use crash reports to try to fix problems and improve our products. We handle your information as we describe in our\u20

(I used | ag "\\\\u[0-9a-fA-F]{1,2}[^0-9a-fA-F]" | to find them.)

Can we fix those strings on master?

Francesco Lodolo [:flod]

Comment 5

•

10 years ago

(In reply to Staś Małolepszy :stas from comment #4)
> Can we fix those strings on master?

We could, but we'd also be broken all over the place. Just one string as an example
http://transvision.mozfr.org/string/?entity=apps/sms/sms.properties:carrier-separator&repo=gaia

Axel Hecht [:Pike]

Assignee

Comment 6

•

10 years ago

I don't see a value in going for 4 byte escape sequences, that's just tedious to type.

One could argue about 1 byte ones, as they're practically of little use, but then again, what to do with them? We'd fall back to properly unescaping them anyway, I guess.

Axel Hecht [:Pike]

Assignee

Comment 7

•

10 years ago

Looking at http://mxr.mozilla.org/l10n-mozilla-aurora/search?string=\\u&find=\.properties%24&findi=&filter=^[^\0]*%24&hitlimit=&tree=l10n-mozilla-aurora, though, I see how non-4-byte escapes can easily be errors, so it might make sense to actually add a warning for a shortened escape sequence.

For that, we should fix en-US, yes.

http://mxr.mozilla.org/l10n-gaia/search?string=\\u&find=\.properties%24&findi=&filter=^[^\0]*%24&hitlimit=&tree=l10n-gaia is a pretty mixed beast, too.

Axel Hecht [:Pike]

Assignee

Comment 8

•

10 years ago

I filed bug 1042606 to discourage short unicode escapes.

This bug is FIXED, via http://hg.mozilla.org/l10n/compare-locales/rev/4939ef2a8086.

I'll cut a release later today.

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Quick Search

[compare-locales][properties] unicode escapes should only trigger if they're followed by a hex

Categories

(Mozilla Localizations :: Infrastructure, defect)

Tracking

(Not tracked)

People

(Reporter: Pike, Assigned: Pike)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Updated

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Attachment

General

Description

File Name

Content Type