Bugzilla

Comment 1

•

10 years ago

Hah, this seems to be a bug in our irregexp port somewhere that only affects TwoByte strings:

I get the following output for:

    print((/(?!(?!(?!6)[\Wc]))/i).test());

d8:                : false
SM before irregexp : false
SM --latin1-strings: false
SM                 : true

Andrew McCreight [:mccr8]

Comment 2

•

10 years ago

Not sure but I think this is a bug in CharacterRange::AddCaseEquivalents.

Flags: needinfo?(jdemooij) → needinfo?(bhackett1024)

Updated

•

10 years ago

status-firefox31: --- → unaffected

status-firefox32: --- → affected

status-firefox33: --- → affected

Assignee

Comment 3

•

10 years ago

Attached patch patch (obsolete) — Details — Splinter Review

This is a problem with the same \u0130 'capital I with a dot over it' character that also caused problems in one of the existing tests in bug 976446.  That bug just changed the test, though I guess this fix is better (which also restores the original test).

Assignee: nobody → bhackett1024

Attachment #8456580 - Flags: review?(jdemooij)

Flags: needinfo?(bhackett1024)

Comment 4

•

10 years ago

Comment on attachment 8456580 [details] [diff] [review]
patch

Review of attachment 8456580 [details] [diff] [review]:
-----------------------------------------------------------------

I don't think this is correct, with the patch the following test fails:

assertEq("foobar\xff5baz\u1200".search(/bar\u0178\d/i), 3);

Attachment #8456580 - Flags: review?(jdemooij)

Assignee

Comment 5

•

10 years ago

(In reply to Jan de Mooij [:jandem] from comment #4)
> Comment on attachment 8456580 [details] [diff] [review]
> patch
> 
> Review of attachment 8456580 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I don't think this is correct, with the patch the following test fails:
> 
> assertEq("foobar\xff5baz\u1200".search(/bar\u0178\d/i), 3);

OK, I have no idea what the correct fix is here then.

We have this comment:

    // The standard requires that non-ASCII characters cannot have ASCII
    // character codes in their equivalence class.

But this is contradicted by our unicode tables (which we presumably pull from some standard source), in which the lowercase of \u0130 is a lowercase 'i'.

We could just watch for cases where the unicode tables have a pattern like this, and filter it out, like this patch does, but doing that on Latin1 characters is apparently not correct.  We could do it on Ascii characters (<= 127) but the irregexp source already seems to conflate the concepts of Ascii strings and Latin1 strings all over the place, including when invoking this mysterious standard.

Should I just do a spotfix for this stupid \u0130 character?

Assignee: bhackett1024 → nobody

Comment 6

•

10 years ago

(In reply to Brian Hackett (:bhackett) from comment #5)
> OK, I have no idea what the correct fix is here then.

I think the problem is in CharacterRange::AddCaseEquivalents. IIRC we call GetCaseIndependentLetters, but V8 doesn't. We should look at their code and see what they do exactly, and in which cases it's different from what we do.

Assignee

Comment 7

•

10 years ago

(In reply to Jan de Mooij [:jandem] from comment #6)
> (In reply to Brian Hackett (:bhackett) from comment #5)
> > OK, I have no idea what the correct fix is here then.
> 
> I think the problem is in CharacterRange::AddCaseEquivalents. IIRC we call
> GetCaseIndependentLetters, but V8 doesn't. We should look at their code and
> see what they do exactly, and in which cases it's different from what we do.

v8's unicode library's interface is totally different from ours, so to avoid needing to port all of that stuff (and any additional dependencies) I implemented it less efficiently using GetCaseIndependentLetters and our library's ToUpperCase and ToLowerCase.  Barring spec insanity (to wit: this bug) these should be equivalent.

Comment 8

•

10 years ago

I know it sucks but we should still fix it somehow :(

Assignee

Comment 9

•

10 years ago

Attached patch special case \u0130 (obsolete) — Details — Splinter Review

Attachment #8456580 - Attachment is obsolete: true

Attachment #8458125 - Flags: review?(jdemooij)

Comment 10

•

10 years ago

Hm there's also a problem with "k" if you disable Latin1 strings.. (EnableLatin1Strings in String.cpp

assertEq((/(?!(?!(?!6)[\Wc]))/i).test("k"), false);

Comment 11

•

10 years ago

The spec says this:

The abstract operation Canonicalize takes a character parameter ch and performs the following steps:

1. If IgnoreCase is false, return ch.
2. Let u be ch converted to upper case as if by calling the standard built-in method String.prototype.toUpperCase on the one-character String ch.
3. If u does not consist of a single character, return ch.
4. Let cu be u's character.
5. If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch.
6. Return cu.

Does that help?

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Updated

•

10 years ago

Attachment #8458125 - Flags: review?(jdemooij)

Reporter

Updated

•

10 years ago

Flags: needinfo?(bhackett1024)

Assignee

Comment 12

•

10 years ago

Attached patch patch — Details — Splinter Review

Attachment #8458125 - Attachment is obsolete: true

Attachment #8466291 - Flags: review?(jdemooij)

Flags: needinfo?(bhackett1024)

Gary Kwong [:gkw] [:nth10sd] (NOT official MoCo now)

Comment 13

•

10 years ago

Comment on attachment 8466291 [details] [diff] [review]
patch

Review of attachment 8466291 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/irregexp/RegExpEngine.cpp
@@ +215,5 @@
> +        // Watch for duplicates.
> +        bool found = false;
> +        for (size_t j = 0; j < count; j++) {
> +            if (letters[j] == c)
> +                found = true;

Nit: can |break;| here.

Attachment #8466291 - Flags: review?(jdemooij) → review+

Reporter

Comment 14

•

10 years ago

Thanks for fixing! Perhaps this is ready for landing soon?

Flags: needinfo?(bhackett1024)

Liz Henry (:lizzard) (relman/hg->git project)

Assignee

Comment 15

•

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/293f3a5eabdb

Flags: needinfo?(bhackett1024)

Ed Morley [:emorley]

Comment 16

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/293f3a5eabdb

Assignee: nobody → bhackett1024

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla34

Updated

•

10 years ago

Flags: qe-verify-

Updated

•

10 years ago

Blocks: 976446
No longer blocks: latin1strings, 1028867

Comment 18

•

10 years ago

Comment on attachment 8466291 [details] [diff] [review]
patch

The patch was not backported but apparently people are running into this (see bug 1076728). I know it's late in the cycle but requesting approval just in case.

Approval Request Comment
[Feature/regressing bug #]: Bug 976446.
[User impact if declined]: Broken websites.
[Describe test coverage new/current, TBPL]: Tested on TBPL, patch is in 34/35.
[Risks and why]: Low risk.
[String/UUID change made/needed]: None.

Attachment #8466291 - Flags: approval-mozilla-beta?

Comment 19

•

10 years ago

Jan, that is unfortunate. However, having it on m-c for a month demonstrates that the change should be safe. You confirm that it didn't cause any side effect?

Flags: needinfo?(jdemooij)

Comment 20

•

10 years ago

(In reply to Sylvestre Ledru [:sylvestre] from comment #19)
> You confirm that it didn't cause any side effect?

I'm not aware of any side effects, no.

Flags: needinfo?(jdemooij)

Updated

•

10 years ago

status-firefox32: affected → wontfix

status-firefox33: affected → wontfix

status-firefox34: --- → affected

status-firefox35: --- → fixed

Updated

•

10 years ago

status-firefox34: affected → fixed