1280046 - RegExp: case-insensitive matching not compatible with V8/JSC

Reporter

Description

•

9 years ago

SpiderMonkey returns `false` for each of these tests, while V8 and JavaScriptCore return `true` for all of them: /\xB5/i.test('\u03BC') /\u0345/i.test('\u03B9') /\u037F/i.test('\u03F3') /\u03C2/i.test('\u03C3') /\u03D0/i.test('\u03B2') /\u03D1/i.test('\u03B8') /\u03D5/i.test('\u03C6') /\u03D6/i.test('\u03C0') /\u03F0/i.test('\u03BA') /\u03F1/i.test('\u03C1') /\u03F5/i.test('\u03B5') /\u0528/i.test('\u0529') /\u052A/i.test('\u052B') /\u052C/i.test('\u052D') /\u052E/i.test('\u052F') /\u1E9B/i.test('\u1E61') /\u1FBE/i.test('\u03B9') /\uA698/i.test('\uA699') /\uA69A/i.test('\uA69B') /\uA796/i.test('\uA797') /\uA798/i.test('\uA799') /\uA79A/i.test('\uA79B') /\uA79C/i.test('\uA79D') /\uA79E/i.test('\uA79F') /\uA7AB/i.test('\u025C') /\uA7AC/i.test('\u0261') /\uA7AD/i.test('\u026C') /\uA7B0/i.test('\u029E') /\uA7B1/i.test('\u0287') https://mathiasbynens.be/demo/regex-i

Mathias Bynens

Reporter

Updated

•

9 years ago

URL: https://mathiasbynens.be/demo/regex-i

Mathias Bynens

Reporter

Updated

•

9 years ago

See Also: → https://github.com/Microsoft/ChakraCore/issues/1136

Till Schneidereit [:till]

Comment 1

•

9 years ago

needinfo for arai, who worked on this.

Flags: needinfo?(arai.unmht)

Tooru Fujisawa [:arai]

Assignee

Comment 2

•

9 years ago

This should be the issue on how we implement Canonicalize [1]. Basically, we should Canonicalize (toUpperCase) both input and pattern, but we emulate it by applying toUpperCase/toLowerCase [2] on pattern, and not-modifying input. With Canonicalize operation, both U+00B5 and U+03BC become U+039C, and matches: U+00B5 toUpperCase => U+039C U+03BC toUpperCase => U+039C But with toUpperCase/toLowerCase operations, U+00B5 doesn't become U+03BC, and it doesn't match: U+00B5 toUpperCase => U+039C U+00B5 toLowerCase => U+00B5 So, like unicode matching case [3], we should check reverse-conversion(s) [4], instead of toLowerCase. [1] https://tc39.github.io/ecma262/#sec-runtime-semantics-canonicalize-ch [2] https://dxr.mozilla.org/mozilla-central/rev/3ce53bd1e25b93140484d3933c9339a829e0c1eb/js/src/irregexp/RegExpEngine.cpp#337 [3] https://dxr.mozilla.org/mozilla-central/rev/3ce53bd1e25b93140484d3933c9339a829e0c1eb/js/src/irregexp/RegExpEngine.cpp#327 [4] https://dxr.mozilla.org/mozilla-central/rev/3ce53bd1e25b93140484d3933c9339a829e0c1eb/js/src/vm/Unicode.h#266

Flags: needinfo?(arai.unmht)

Tooru Fujisawa [:arai]

Assignee

Comment 3

•

9 years ago

Also, some characters of those testcase are defined in Unicode 7.0, but SpiderMonkey is still using Unicode 6.2.

Comment 4

•

9 years ago

Will fix for Unicode 6.2 characters here. Others will be fixed by bug 1230490.

Assignee: nobody → arai.unmht

Status: NEW → ASSIGNED

Tooru Fujisawa [:arai]

Assignee

Comment 5

•

9 years ago

Attached patch Search characters with reverse-ToUpperCase operation on ignoreCase match with non-unicode RegExp. (obsolete) — Details — Splinter Review

Like unicode RegExp's case, added reverse conversions, where ToUpperCase(code) == ToUpperCase(reverse_n) for each code. there are at most 3 characters for single characters. those information is stored in separated table, js_reverse_upperinfo, reverse_upper_index1, reverse_upper_index2. Also, moved 21.2.2.8.2 [1] step 3.g before creating choices array. > If ch's code unit value ≥ 128 and cu's code unit value < 128, return ch. because, the condition is different for reverse_n cases. In testcase, several characters are commented out, because those are added in Unicode 7.0, and not stored in current UnicodeData.txt. [1] https://tc39.github.io/ecma262/#sec-runtime-semantics-canonicalize-ch

Attachment #8764489 - Flags: review?(till)

Till Schneidereit [:till]

Comment 6

•

9 years ago

Comment on attachment 8764489 [details] [diff] [review] Search characters with reverse-ToUpperCase operation on ignoreCase match with non-unicode RegExp. Review of attachment 8764489 [details] [diff] [review]: ----------------------------------------------------------------- I'm afraid I'll have to punt on this review. I just don't understand the subject matter well enough to meaningfully review the patch. Forwarding to someone who should be much better suited.

Attachment #8764489 - Flags: review?(till) → review?(jwalden+bmo)

Jeff Walden [:Waldo]

Comment 7

•

9 years ago

Comment on attachment 8764489 [details] [diff] [review] Search characters with reverse-ToUpperCase operation on ignoreCase match with non-unicode RegExp. Review of attachment 8764489 [details] [diff] [review]: ----------------------------------------------------------------- This isn't the most horribly confident review in the world, but then again I doubt anyone else who might review this as a SpiderMonkey reviewer would be much better. :-( *mutters about "someone who should be much better suited", eyes till beadily* The naming concerns here aren't just a matter of personal preference or nitpicking: they actively and significantly interfere with my ability to understand this as I read it. So I'd like us to discuss and reach some consensus on naming before we run with this exactly. But algorithm-wise, it looks to be in the right place, as far as I can tell. ::: js/src/tests/ecma_6/RegExp/ignoreCase-multiple.js @@ +2,5 @@ > +var summary = "ignoreCase match should perform Canonicalize both to input and pattern."; > + > +print(BUGNUMBER + ": " + summary); > + > +assertEq(/\xB5/i.test('\u03BC'), true); I sort of wish there were comments like // U+00B5 <Unicode name>, U+03BC <Unicode name> by these, but eh. Should all of these be symmetric? Given Canonicalize is applied to both sides, it seems like it. It seems like you should then do, var pairs = [ ["\u0345", "\u03B9"], ... ]; for (var pair of pairs) { assertEq(new RegExp(pair[0], "i").test(pair[1]), true); assertEq(new RegExp(pair[1], "i").test(pair[0]), true); } so we get coverage both directions. @@ +15,5 @@ > +assertEq(/\u03F5/i.test('\u03B5'), true); > +assertEq(/\u1E9B/i.test('\u1E61'), true); > +assertEq(/\u1FBE/i.test('\u03B9'), true); > + > +/* TODO: Comment out after bug 1230490 (Unicode 7.0 characters). Ugh, commenting out. Can we make this into a feature-test somehow, so it starts doing...something...when the guard should be removed? Like maybe, if (/\u037F/i.test("\u03F3") === true) { ...rest of tests... } ::: js/src/vm/Unicode.h @@ +242,5 @@ > + * > + * ToUpperCase ToUpperCase > + * code --------------> upperCase <-------------- reverse1 > + * <-------------- reverse2 > + * <-------------- reverse3 This graphic isn't immediately understandable to me. I suggest instead, """ For a codepoint C, ReverseUpperInfo stores three offsets from C to up to three lowercased forms of C (no codepoint in UnicodeData.txt has more than three lowercased forms). If there are fewer than three lowercased forms, some fields are zero. To illustrate, consider the codepoint U+0399 GREEK CAPITAL LETTER IOTA, the uppercased form of these three codepoints: U+03B9 GREEK SMALL LETTER IOTA U+1FBE GREEK PROSGEGRAMMENI U+0345 COMBINING GREEK YPOGEGRAMMENI For the ReverseUpperInfo corresponding to this codepoint, reverse{1,2,3} are 16-bit modular deltas from 0x0399 to each respective codepoint: uint16_t(0x03B9 - 0x0399), uint16_t(0x1FBE - 0x0399), uint16_t(0x0345 - 0x0399) in an unimportant order. Because multiple codepoints map to a single ReverseUpperInfo, a ReverseUpperInfo and its reverse{1,2,3} have no meaning standing alone: they have meaning only with respect to a codepoint mapping to that ReverseUpperInfo. """ as a class-comment or something. @@ +246,5 @@ > + * <-------------- reverse3 > + */ > + uint16_t reverse1; > + uint16_t reverse2; > + uint16_t reverse3; "reverse" is really confusing to me about exactly which direction (lower->upper, upper->lower) is meant. Instead of "reverse" and "upper" as names, I'd probably go for ToLowercaseDeltas, delta{1,2,3}, and other names along those lines. Well, maybe not, those names aren't very good either. Maybe we should do some bikeshedding in #jsapi, ideally with a couple other people as well? The current names just are really hard to think about. :-( @@ +254,5 @@ > +extern const uint8_t reverse_upper_index2[]; > +extern const ReverseUpperInfo js_reverse_upperinfo[]; > + > +inline const ReverseUpperInfo& > +ReverseUpperCaseInfo(char16_t code) Feels like, rather than returning an Info&, we should have a separate struct something like class CodepointToLower { const ReverseUpperInfo& info_; const char16_t code_; static const ReverseUpperInfo& computeInfo(char16_t code) { const size_t shift = 6; size_t index = reverse_upper_index1[code >> shift]; index = reverse_upper_index2[(index << shift) + (code & ((1 << shift) - 1))]; return js_reverse_upperinfo[index]; } public: explicit ReverseUpperCase(char16_t code) : info_(computeInfo(code)), code_(code) {} char16_t lowercase1() const { return uint16_t(code_) + info_.reverse1; } char16_t lowercase2() const { return uint16_t(code_) + info_.reverse2; } char16_t lowercase3() const { return uint16_t(code_) + info_.reverse3; } }; so that |const RUI&| computation is a one-shot deal. The compiler might optimize what you have now, but why not eliminate all doubt? And it's clearer to boot, IMO. ::: js/src/vm/make_unicode.py @@ +178,5 @@ > + > + if not entry: > + continue > + > + upper, lower, name, alias = entry It feels slightly more readable to me to parenthesize the LHS here -- makes clearer where the RHS starts.

Attachment #8764489 - Flags: review?(jwalden+bmo) → review+

Tooru Fujisawa [:arai]

Assignee

Comment 8

•

9 years ago

Attached patch Search characters with reverse-ToUpperCase operation on ignoreCase match with non-unicode RegExp. r=jwalden (obsolete) — Details — Splinter Review

Thank you for reviewing :) Posting the updated patch for bikeshedding. About Unicode 7.0 characters, I noticed that they're simple characters that `code == ToLowerCase(ToUpperCase(code))`. So it was not necessary to put them into a testcase here. removed.

Attachment #8764489 - Attachment is obsolete: true

Attachment #8770721 - Flags: review+

Tooru Fujisawa [:arai]

Assignee

Comment 9

•

9 years ago

Quick Summary. We need a nice name for C1, C2, C3, F1 in the following equation. ToUpperCase(code) == ToUpperCase(C1) ToUpperCase(code) == ToUpperCase(C2) ToUpperCase(code) == ToUpperCase(C3) F1(code) == [C1, C2, C3]

Tooru Fujisawa [:arai]

Assignee

Comment 10

•

9 years ago

maybe, we don't need C1, C2, C3, as we can use `delta_n` name. ToUpperCase(code) == ToUpperCase(code + delta1) ToUpperCase(code) == ToUpperCase(code + delta2) ToUpperCase(code) == ToUpperCase(code + delta3) F1(code) == [code + delta1, code + delta2, code + delta3] So, the important part is F1 there.

Steve Fink [:sfink] [:s:]

Comment 11

•

9 years ago

The two things that come to mind are F1=LowerCaseVariants, or to think of these in terms of equivalence classes of code points under a ToUpperCase mapping, and borrow terminology from there somehow. CaseMappedEquivalents? Expanding C1 out to something like lowerCaseVariantDelta1 would probably be better than a plain delta1, too. I can't say I love those names either, but it's all I came up with. ToLowercaseDelta1 doesn't strike me as any better or worse than lowerCaseVariantDelta1.

Tooru Fujisawa [:arai]

Assignee

Comment 12

•

9 years ago

Thank you :D CaseMappedEquivalents sounds better, as ToLowerCase/LowerCase is not used while calculating them. Will prepare a patch with it, to see how it looks like in actual code.

Tooru Fujisawa [:arai]

Assignee

Comment 13

•

9 years ago

I noticed that "reverse" was wrong word for what I was thinking :P I meant "inverse function", for ToUpperCase^{-1}(x). https://en.wikipedia.org/wiki/Inverse_function

Tooru Fujisawa [:arai]

Assignee

Comment 14

•

9 years ago

Attached patch Search characters with ToUpperCase^{-1} operation on ignoreCase match with non-unicode RegExp. r=jwalden (obsolete) — Details — Splinter Review

Here's a patch with "CaseMappedEquivalents" name.

Attachment #8773141 - Flags: review+

Tooru Fujisawa [:arai]

Assignee

Comment 15

•

9 years ago

Attached patch Search codepoints with same uppercase on ignoreCase match with non-unicode RegExp. — Details — Splinter Review

Based on the discussion in IRC, changed the name to CodepointsWithSameUpperCase.

Attachment #8770721 - Attachment is obsolete: true

Attachment #8773141 - Attachment is obsolete: true

Attachment #8775411 - Flags: review?(jwalden+bmo)

Jeff Walden [:Waldo]

Comment 16

•

9 years ago

Comment on attachment 8775411 [details] [diff] [review] Search codepoints with same uppercase on ignoreCase match with non-unicode RegExp. Review of attachment 8775411 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/tests/ecma_6/RegExp/ignoreCase-multiple.js @@ +1,2 @@ > +var BUGNUMBER = 1280046; > +var summary = "ignoreCase match should perform Canonicalize both to input and pattern."; s/both to/both on/ ::: js/src/vm/Unicode.h @@ +311,5 @@ > + * > + * Because multiple codepoints map to a single CodepointsWithSameUpperCaseInfo, > + * a CodepointsWithSameUpperCaseInfo and its delta{1,2,3} have no meaning > + * standing alone: > + * they have meaning only with respect to a codepoint mapping to that Remove the stray linebreak.

Attachment #8775411 - Flags: review?(jwalden+bmo) → review+

Tooru Fujisawa [:arai]

Assignee

Comment 17

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/18642055568111041c5b099b66209824b330b74e Bug 1280046 - Search codepoints with same uppercase on ignoreCase match with non-unicode RegExp. r=jwalden

Carsten Book [:Tomcat]

Comment 18

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/186420555681

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

status-firefox50: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla50

Search characters with reverse-ToUpperCase operation on ignoreCase match with non-unicode RegExp. 9 years ago Tooru Fujisawa [:arai] 34.54 KB, patch	Waldo : review+	Details \| Diff \| Splinter Review
Search characters with reverse-ToUpperCase operation on ignoreCase match with non-unicode RegExp. r=jwalden 9 years ago Tooru Fujisawa [:arai] 37.22 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Search characters with ToUpperCase^{-1} operation on ignoreCase match with non-unicode RegExp. r=jwalden 9 years ago Tooru Fujisawa [:arai] 38.17 KB, patch	arai : review+	Details \| Diff \| Splinter Review
Search codepoints with same uppercase on ignoreCase match with non-unicode RegExp. 9 years ago Tooru Fujisawa [:arai] 37.89 KB, patch	Waldo : review+	Details \| Diff \| Splinter Review