Closed
Bug 1280046
Opened 9 years ago
Closed 9 years ago
RegExp: case-insensitive matching not compatible with V8/JSC
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
FIXED
mozilla50
| Tracking | Status | |
|---|---|---|
| firefox50 | --- | fixed |
People
(Reporter: mathias, Assigned: arai)
References
()
Details
Attachments
(1 file, 3 obsolete files)
|
37.89 KB,
patch
|
Waldo
:
review+
|
Details | Diff | Splinter Review |
SpiderMonkey returns `false` for each of these tests, while V8 and JavaScriptCore return `true` for all of them:
/\xB5/i.test('\u03BC')
/\u0345/i.test('\u03B9')
/\u037F/i.test('\u03F3')
/\u03C2/i.test('\u03C3')
/\u03D0/i.test('\u03B2')
/\u03D1/i.test('\u03B8')
/\u03D5/i.test('\u03C6')
/\u03D6/i.test('\u03C0')
/\u03F0/i.test('\u03BA')
/\u03F1/i.test('\u03C1')
/\u03F5/i.test('\u03B5')
/\u0528/i.test('\u0529')
/\u052A/i.test('\u052B')
/\u052C/i.test('\u052D')
/\u052E/i.test('\u052F')
/\u1E9B/i.test('\u1E61')
/\u1FBE/i.test('\u03B9')
/\uA698/i.test('\uA699')
/\uA69A/i.test('\uA69B')
/\uA796/i.test('\uA797')
/\uA798/i.test('\uA799')
/\uA79A/i.test('\uA79B')
/\uA79C/i.test('\uA79D')
/\uA79E/i.test('\uA79F')
/\uA7AB/i.test('\u025C')
/\uA7AC/i.test('\u0261')
/\uA7AD/i.test('\u026C')
/\uA7B0/i.test('\u029E')
/\uA7B1/i.test('\u0287')
https://mathiasbynens.be/demo/regex-i
| Reporter | ||
Updated•9 years ago
|
| Reporter | ||
Updated•9 years ago
|
| Assignee | ||
Comment 2•9 years ago
|
||
This should be the issue on how we implement Canonicalize [1].
Basically, we should Canonicalize (toUpperCase) both input and pattern, but we emulate it by applying toUpperCase/toLowerCase [2] on pattern, and not-modifying input.
With Canonicalize operation, both U+00B5 and U+03BC become U+039C, and matches:
U+00B5 toUpperCase => U+039C
U+03BC toUpperCase => U+039C
But with toUpperCase/toLowerCase operations, U+00B5 doesn't become U+03BC, and it doesn't match:
U+00B5 toUpperCase => U+039C
U+00B5 toLowerCase => U+00B5
So, like unicode matching case [3], we should check reverse-conversion(s) [4], instead of toLowerCase.
[1] https://tc39.github.io/ecma262/#sec-runtime-semantics-canonicalize-ch
[2] https://dxr.mozilla.org/mozilla-central/rev/3ce53bd1e25b93140484d3933c9339a829e0c1eb/js/src/irregexp/RegExpEngine.cpp#337
[3] https://dxr.mozilla.org/mozilla-central/rev/3ce53bd1e25b93140484d3933c9339a829e0c1eb/js/src/irregexp/RegExpEngine.cpp#327
[4] https://dxr.mozilla.org/mozilla-central/rev/3ce53bd1e25b93140484d3933c9339a829e0c1eb/js/src/vm/Unicode.h#266
Flags: needinfo?(arai.unmht)
| Assignee | ||
Comment 3•9 years ago
|
||
Also, some characters of those testcase are defined in Unicode 7.0, but SpiderMonkey is still using Unicode 6.2.
See Also: → 1230490
| Assignee | ||
Comment 4•9 years ago
|
||
Will fix for Unicode 6.2 characters here.
Others will be fixed by bug 1230490.
Assignee: nobody → arai.unmht
Status: NEW → ASSIGNED
| Assignee | ||
Comment 5•9 years ago
|
||
Like unicode RegExp's case, added reverse conversions, where
ToUpperCase(code) == ToUpperCase(reverse_n)
for each code.
there are at most 3 characters for single characters.
those information is stored in separated table, js_reverse_upperinfo, reverse_upper_index1, reverse_upper_index2.
Also, moved 21.2.2.8.2 [1] step 3.g before creating choices array.
> If ch's code unit value ≥ 128 and cu's code unit value < 128, return ch.
because, the condition is different for reverse_n cases.
In testcase, several characters are commented out, because those are added in Unicode 7.0, and not stored in current UnicodeData.txt.
[1] https://tc39.github.io/ecma262/#sec-runtime-semantics-canonicalize-ch
Attachment #8764489 -
Flags: review?(till)
Comment 6•9 years ago
|
||
Comment on attachment 8764489 [details] [diff] [review]
Search characters with reverse-ToUpperCase operation on ignoreCase match with non-unicode RegExp.
Review of attachment 8764489 [details] [diff] [review]:
-----------------------------------------------------------------
I'm afraid I'll have to punt on this review. I just don't understand the subject matter well enough to meaningfully review the patch. Forwarding to someone who should be much better suited.
Attachment #8764489 -
Flags: review?(till) → review?(jwalden+bmo)
Comment 7•9 years ago
|
||
Comment on attachment 8764489 [details] [diff] [review]
Search characters with reverse-ToUpperCase operation on ignoreCase match with non-unicode RegExp.
Review of attachment 8764489 [details] [diff] [review]:
-----------------------------------------------------------------
This isn't the most horribly confident review in the world, but then again I doubt anyone else who might review this as a SpiderMonkey reviewer would be much better. :-( *mutters about "someone who should be much better suited", eyes till beadily*
The naming concerns here aren't just a matter of personal preference or nitpicking: they actively and significantly interfere with my ability to understand this as I read it. So I'd like us to discuss and reach some consensus on naming before we run with this exactly. But algorithm-wise, it looks to be in the right place, as far as I can tell.
::: js/src/tests/ecma_6/RegExp/ignoreCase-multiple.js
@@ +2,5 @@
> +var summary = "ignoreCase match should perform Canonicalize both to input and pattern.";
> +
> +print(BUGNUMBER + ": " + summary);
> +
> +assertEq(/\xB5/i.test('\u03BC'), true);
I sort of wish there were comments like
// U+00B5 <Unicode name>, U+03BC <Unicode name>
by these, but eh.
Should all of these be symmetric? Given Canonicalize is applied to both sides, it seems like it. It seems like you should then do,
var pairs =
[
["\u0345", "\u03B9"],
...
];
for (var pair of pairs) {
assertEq(new RegExp(pair[0], "i").test(pair[1]), true);
assertEq(new RegExp(pair[1], "i").test(pair[0]), true);
}
so we get coverage both directions.
@@ +15,5 @@
> +assertEq(/\u03F5/i.test('\u03B5'), true);
> +assertEq(/\u1E9B/i.test('\u1E61'), true);
> +assertEq(/\u1FBE/i.test('\u03B9'), true);
> +
> +/* TODO: Comment out after bug 1230490 (Unicode 7.0 characters).
Ugh, commenting out. Can we make this into a feature-test somehow, so it starts doing...something...when the guard should be removed? Like maybe,
if (/\u037F/i.test("\u03F3") === true) {
...rest of tests...
}
::: js/src/vm/Unicode.h
@@ +242,5 @@
> + *
> + * ToUpperCase ToUpperCase
> + * code --------------> upperCase <-------------- reverse1
> + * <-------------- reverse2
> + * <-------------- reverse3
This graphic isn't immediately understandable to me. I suggest instead,
"""
For a codepoint C, ReverseUpperInfo stores three offsets from C to up to three lowercased forms of C (no codepoint in UnicodeData.txt has more than three lowercased forms). If there are fewer than three lowercased forms, some fields are zero.
To illustrate, consider the codepoint U+0399 GREEK CAPITAL LETTER IOTA, the uppercased form of these three codepoints:
U+03B9 GREEK SMALL LETTER IOTA
U+1FBE GREEK PROSGEGRAMMENI
U+0345 COMBINING GREEK YPOGEGRAMMENI
For the ReverseUpperInfo corresponding to this codepoint, reverse{1,2,3} are 16-bit modular deltas from 0x0399 to each respective codepoint: uint16_t(0x03B9 - 0x0399), uint16_t(0x1FBE - 0x0399), uint16_t(0x0345 - 0x0399) in an unimportant order.
Because multiple codepoints map to a single ReverseUpperInfo, a ReverseUpperInfo and its reverse{1,2,3} have no meaning standing alone: they have meaning only with respect to a codepoint mapping to that ReverseUpperInfo.
"""
as a class-comment or something.
@@ +246,5 @@
> + * <-------------- reverse3
> + */
> + uint16_t reverse1;
> + uint16_t reverse2;
> + uint16_t reverse3;
"reverse" is really confusing to me about exactly which direction (lower->upper, upper->lower) is meant.
Instead of "reverse" and "upper" as names, I'd probably go for ToLowercaseDeltas, delta{1,2,3}, and other names along those lines. Well, maybe not, those names aren't very good either. Maybe we should do some bikeshedding in #jsapi, ideally with a couple other people as well? The current names just are really hard to think about. :-(
@@ +254,5 @@
> +extern const uint8_t reverse_upper_index2[];
> +extern const ReverseUpperInfo js_reverse_upperinfo[];
> +
> +inline const ReverseUpperInfo&
> +ReverseUpperCaseInfo(char16_t code)
Feels like, rather than returning an Info&, we should have a separate struct something like
class CodepointToLower
{
const ReverseUpperInfo& info_;
const char16_t code_;
static const ReverseUpperInfo& computeInfo(char16_t code) {
const size_t shift = 6;
size_t index = reverse_upper_index1[code >> shift];
index = reverse_upper_index2[(index << shift) + (code & ((1 << shift) - 1))];
return js_reverse_upperinfo[index];
}
public:
explicit ReverseUpperCase(char16_t code)
: info_(computeInfo(code)),
code_(code)
{}
char16_t lowercase1() const { return uint16_t(code_) + info_.reverse1; }
char16_t lowercase2() const { return uint16_t(code_) + info_.reverse2; }
char16_t lowercase3() const { return uint16_t(code_) + info_.reverse3; }
};
so that |const RUI&| computation is a one-shot deal. The compiler might optimize what you have now, but why not eliminate all doubt? And it's clearer to boot, IMO.
::: js/src/vm/make_unicode.py
@@ +178,5 @@
> +
> + if not entry:
> + continue
> +
> + upper, lower, name, alias = entry
It feels slightly more readable to me to parenthesize the LHS here -- makes clearer where the RHS starts.
Attachment #8764489 -
Flags: review?(jwalden+bmo) → review+
| Assignee | ||
Comment 8•9 years ago
|
||
Thank you for reviewing :)
Posting the updated patch for bikeshedding.
About Unicode 7.0 characters, I noticed that they're simple characters that `code == ToLowerCase(ToUpperCase(code))`.
So it was not necessary to put them into a testcase here.
removed.
Attachment #8764489 -
Attachment is obsolete: true
Attachment #8770721 -
Flags: review+
| Assignee | ||
Comment 9•9 years ago
|
||
Quick Summary.
We need a nice name for C1, C2, C3, F1 in the following equation.
ToUpperCase(code) == ToUpperCase(C1)
ToUpperCase(code) == ToUpperCase(C2)
ToUpperCase(code) == ToUpperCase(C3)
F1(code) == [C1, C2, C3]
| Assignee | ||
Comment 10•9 years ago
|
||
maybe, we don't need C1, C2, C3, as we can use `delta_n` name.
ToUpperCase(code) == ToUpperCase(code + delta1)
ToUpperCase(code) == ToUpperCase(code + delta2)
ToUpperCase(code) == ToUpperCase(code + delta3)
F1(code) == [code + delta1, code + delta2, code + delta3]
So, the important part is F1 there.
Comment 11•9 years ago
|
||
The two things that come to mind are F1=LowerCaseVariants, or to think of these in terms of equivalence classes of code points under a ToUpperCase mapping, and borrow terminology from there somehow. CaseMappedEquivalents?
Expanding C1 out to something like lowerCaseVariantDelta1 would probably be better than a plain delta1, too.
I can't say I love those names either, but it's all I came up with. ToLowercaseDelta1 doesn't strike me as any better or worse than lowerCaseVariantDelta1.
| Assignee | ||
Comment 12•9 years ago
|
||
Thank you :D
CaseMappedEquivalents sounds better, as ToLowerCase/LowerCase is not used while calculating them.
Will prepare a patch with it, to see how it looks like in actual code.
| Assignee | ||
Comment 13•9 years ago
|
||
I noticed that "reverse" was wrong word for what I was thinking :P
I meant "inverse function", for ToUpperCase^{-1}(x).
https://en.wikipedia.org/wiki/Inverse_function
| Assignee | ||
Comment 14•9 years ago
|
||
Here's a patch with "CaseMappedEquivalents" name.
Attachment #8773141 -
Flags: review+
| Assignee | ||
Comment 15•9 years ago
|
||
Based on the discussion in IRC, changed the name to CodepointsWithSameUpperCase.
Attachment #8770721 -
Attachment is obsolete: true
Attachment #8773141 -
Attachment is obsolete: true
Attachment #8775411 -
Flags: review?(jwalden+bmo)
Comment 16•9 years ago
|
||
Comment on attachment 8775411 [details] [diff] [review]
Search codepoints with same uppercase on ignoreCase match with non-unicode RegExp.
Review of attachment 8775411 [details] [diff] [review]:
-----------------------------------------------------------------
::: js/src/tests/ecma_6/RegExp/ignoreCase-multiple.js
@@ +1,2 @@
> +var BUGNUMBER = 1280046;
> +var summary = "ignoreCase match should perform Canonicalize both to input and pattern.";
s/both to/both on/
::: js/src/vm/Unicode.h
@@ +311,5 @@
> + *
> + * Because multiple codepoints map to a single CodepointsWithSameUpperCaseInfo,
> + * a CodepointsWithSameUpperCaseInfo and its delta{1,2,3} have no meaning
> + * standing alone:
> + * they have meaning only with respect to a codepoint mapping to that
Remove the stray linebreak.
Attachment #8775411 -
Flags: review?(jwalden+bmo) → review+
| Assignee | ||
Comment 17•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/18642055568111041c5b099b66209824b330b74e
Bug 1280046 - Search codepoints with same uppercase on ignoreCase match with non-unicode RegExp. r=jwalden
Comment 18•9 years ago
|
||
| bugherder | ||
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
status-firefox50:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
You need to log in
before you can comment on or make changes to this bug.
Description
•