Closed Bug 1338779 Opened 7 years ago Closed 7 years ago

Incorrect RegExp matching with word boundary and Kleene plus operator in IgnoreCase+Unicode mode

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox54 --- fixed

People

(Reporter: anba, Assigned: arai)

References

Details

Attachments

(1 file)

Test case:
---
print(/(i\B\u017f)/ui.test("is"));
print(/(i\B\u017f)+/ui.test("is"));
---

Expected: Prints "true" two times
Actual: The second print call outputs "false"


Reproducible with and without the patch from bug 1338373 applied.
Flags: needinfo?(arai.unmht)
looks like we should avoid some optimization for ASCII vs non-ASCII match,
or add exception for those characters.

will try fixing.
Assignee: nobody → arai.unmht
Status: NEW → ASSIGNED
actually, Latin1 vs non-Latin1 match with ignoreCase, regardless of unicode or not.

/(\u039C)/i.test("\xB5") // true
/(\u039C)+/i.test("\xB5") // false <= should be true
This is the issue that TextNode doesn't match when it contains non-Latin1 character that matches to Latin1 character with ignore_case, and input string is Latin1, in certain situation depending on enclosing context.

here's the details.

In TextNode::TextEmitPass, NON_ASCII_MATCH pass tries to skip matching when the input string is Latin1 and the current character is non-Latin1.
It should not skip if the current character maps to Latin1 by ignore_case matching.
So changed it to use ConvertNonLatin1ToLatin1 result for ignore_case.

Then, if above code avoid skipping, that character should be handled in later pass.
Before this patch, ignore_case match was using another 2 passes, NON_LETTER_CHARACTER_MATCH (EmitAtomNonLetter) and CASE_CHARACTER_MATCH (EmitAtomLetter).
Those categories are not describing well when a letter can map across non-Latin1 to Latin1, and non-Latin1 case is filtered out in ConvertNonLatin1ToLatin1 because of Latin1 match, that results in a single character has only one `chars` for GetCaseIndependentLetters().

So, changed them to CASE_SINGLE_CHARACTER_MATCH and CASE_MUTLI_CHARACTER_MATCH:
  * CASE_SINGLE_CHARACTER_MATCH : GetCaseIndependentLetters() == 1
    calls EmitAtomSingle
  * CASE_MULTI_CHARACTER_MATCH : GetCaseIndependentLetters() > 1
    calls EmitAtomMulti
Flags: needinfo?(arai.unmht)
Attachment #8836431 - Flags: review?(till)
also, filed bug 1338841 to rename some "ascii" to "latin1" in irregexp.
See Also: → 1338841
Comment on attachment 8836431 [details] [diff] [review]
Handle characters map across Latin1 and non-Latin1 properly in TextNode.

Review of attachment 8836431 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great, thank you! Also thank you for the excellent explanation of what's going on, very helpful.
Attachment #8836431 - Flags: review?(till) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/639dea0e1a9cd6a91ed98767bb80325b4c9774ff
Bug 1338779 - Handle characters map across Latin1 and non-Latin1 properly in TextNode. r=till
https://hg.mozilla.org/mozilla-central/rev/639dea0e1a9c
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: