Closed Bug 1033946 Opened 10 years ago Closed 10 years ago

Differential Testing: Different output message involving RegExp, exec, --latin1-strings

Categories

(Core :: JavaScript Engine: JIT, defect)

x86_64
All
defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla34
Tracking Status
firefox31 --- unaffected
firefox32 --- wontfix
firefox33 --- wontfix
firefox34 --- fixed
firefox35 --- fixed

People

(Reporter: gkw, Assigned: bhackett1024)

References

Details

(Keywords: regression, testcase)

Attachments

(1 file, 2 obsolete files)

print((/(?!(?!(?!6)[\Wc]))/i).exec())

$ ./js-dbgDisabled-opt-64-prof-dm-ts-darwin-5d9af625f42e --ion-offthread-compile=off --ion-eager 370.js

$ ./js-dbgDisabled-opt-64-prof-dm-ts-darwin-5d9af625f42e --ion-offthread-compile=off --ion-eager --latin1-strings 370.js
null

(Tested this on 64-bit Mac js opt threadsafe deterministic shell off m-c rev 5d9af625f42e)

My configure flags are:

CC="clang -Qunused-arguments" CXX="clang++ -Qunused-arguments" AR=ar sh /Users/skywalker/trees/mozilla-central/js/src/configure --target=x86_64-apple-darwin12.5.0 --disable-debug --enable-optimize --enable-profiling --enable-gczeal --enable-debug-symbols --disable-tests --enable-more-deterministic --with-ccache --enable-threadsafe <other NSPR options>

Guessing this is related to bug 998392 as well, so setting needinfo? from Jan.
Flags: needinfo?(jdemooij)
Hah, this seems to be a bug in our irregexp port somewhere that only affects TwoByte strings:

I get the following output for:

    print((/(?!(?!(?!6)[\Wc]))/i).test());

d8:                : false
SM before irregexp : false
SM --latin1-strings: false
SM                 : true
Not sure but I think this is a bug in CharacterRange::AddCaseEquivalents.
Flags: needinfo?(jdemooij) → needinfo?(bhackett1024)
Attached patch patch (obsolete) — Splinter Review
This is a problem with the same \u0130 'capital I with a dot over it' character that also caused problems in one of the existing tests in bug 976446.  That bug just changed the test, though I guess this fix is better (which also restores the original test).
Assignee: nobody → bhackett1024
Attachment #8456580 - Flags: review?(jdemooij)
Flags: needinfo?(bhackett1024)
Comment on attachment 8456580 [details] [diff] [review]
patch

Review of attachment 8456580 [details] [diff] [review]:
-----------------------------------------------------------------

I don't think this is correct, with the patch the following test fails:

assertEq("foobar\xff5baz\u1200".search(/bar\u0178\d/i), 3);
Attachment #8456580 - Flags: review?(jdemooij)
(In reply to Jan de Mooij [:jandem] from comment #4)
> Comment on attachment 8456580 [details] [diff] [review]
> patch
> 
> Review of attachment 8456580 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I don't think this is correct, with the patch the following test fails:
> 
> assertEq("foobar\xff5baz\u1200".search(/bar\u0178\d/i), 3);

OK, I have no idea what the correct fix is here then.

We have this comment:

    // The standard requires that non-ASCII characters cannot have ASCII
    // character codes in their equivalence class.

But this is contradicted by our unicode tables (which we presumably pull from some standard source), in which the lowercase of \u0130 is a lowercase 'i'.

We could just watch for cases where the unicode tables have a pattern like this, and filter it out, like this patch does, but doing that on Latin1 characters is apparently not correct.  We could do it on Ascii characters (<= 127) but the irregexp source already seems to conflate the concepts of Ascii strings and Latin1 strings all over the place, including when invoking this mysterious standard.

Should I just do a spotfix for this stupid \u0130 character?
Assignee: bhackett1024 → nobody
(In reply to Brian Hackett (:bhackett) from comment #5)
> OK, I have no idea what the correct fix is here then.

I think the problem is in CharacterRange::AddCaseEquivalents. IIRC we call GetCaseIndependentLetters, but V8 doesn't. We should look at their code and see what they do exactly, and in which cases it's different from what we do.
(In reply to Jan de Mooij [:jandem] from comment #6)
> (In reply to Brian Hackett (:bhackett) from comment #5)
> > OK, I have no idea what the correct fix is here then.
> 
> I think the problem is in CharacterRange::AddCaseEquivalents. IIRC we call
> GetCaseIndependentLetters, but V8 doesn't. We should look at their code and
> see what they do exactly, and in which cases it's different from what we do.

v8's unicode library's interface is totally different from ours, so to avoid needing to port all of that stuff (and any additional dependencies) I implemented it less efficiently using GetCaseIndependentLetters and our library's ToUpperCase and ToLowerCase.  Barring spec insanity (to wit: this bug) these should be equivalent.
I know it sucks but we should still fix it somehow :(
Attached patch special case \u0130 (obsolete) — Splinter Review
Attachment #8456580 - Attachment is obsolete: true
Attachment #8458125 - Flags: review?(jdemooij)
Hm there's also a problem with "k" if you disable Latin1 strings.. (EnableLatin1Strings in String.cpp

assertEq((/(?!(?!(?!6)[\Wc]))/i).test("k"), false);
The spec says this:

The abstract operation Canonicalize takes a character parameter ch and performs the following steps:

1. If IgnoreCase is false, return ch.
2. Let u be ch converted to upper case as if by calling the standard built-in method String.prototype.toUpperCase on the one-character String ch.
3. If u does not consist of a single character, return ch.
4. Let cu be u's character.
5. If ch's code unit value is greater than or equal to decimal 128 and cu's code unit value is less than decimal 128, then return ch.
6. Return cu.

Does that help?
Attachment #8458125 - Flags: review?(jdemooij)
Flags: needinfo?(bhackett1024)
Attached patch patchSplinter Review
Attachment #8458125 - Attachment is obsolete: true
Attachment #8466291 - Flags: review?(jdemooij)
Flags: needinfo?(bhackett1024)
Comment on attachment 8466291 [details] [diff] [review]
patch

Review of attachment 8466291 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/irregexp/RegExpEngine.cpp
@@ +215,5 @@
> +        // Watch for duplicates.
> +        bool found = false;
> +        for (size_t j = 0; j < count; j++) {
> +            if (letters[j] == c)
> +                found = true;

Nit: can |break;| here.
Attachment #8466291 - Flags: review?(jdemooij) → review+
Thanks for fixing! Perhaps this is ready for landing soon?
Flags: needinfo?(bhackett1024)
https://hg.mozilla.org/mozilla-central/rev/293f3a5eabdb
Assignee: nobody → bhackett1024
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
Flags: qe-verify-
Blocks: 976446
No longer blocks: latin1strings, 1028867
Comment on attachment 8466291 [details] [diff] [review]
patch

The patch was not backported but apparently people are running into this (see bug 1076728). I know it's late in the cycle but requesting approval just in case.

Approval Request Comment
[Feature/regressing bug #]: Bug 976446.
[User impact if declined]: Broken websites.
[Describe test coverage new/current, TBPL]: Tested on TBPL, patch is in 34/35.
[Risks and why]: Low risk.
[String/UUID change made/needed]: None.
Attachment #8466291 - Flags: approval-mozilla-beta?
Jan, that is unfortunate. However, having it on m-c for a month demonstrates that the change should be safe. You confirm that it didn't cause any side effect?
Flags: needinfo?(jdemooij)
(In reply to Sylvestre Ledru [:sylvestre] from comment #19)
> You confirm that it didn't cause any side effect?

I'm not aware of any side effects, no.
Flags: needinfo?(jdemooij)
Comment on attachment 8466291 [details] [diff] [review]
patch

We are too late in the cycle to take it. Sorry, this will be fixed with the release of 34...
Attachment #8466291 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: