CharacterRange::AddCaseEquivalents is horribly inefficient

RESOLVED FIXED in Firefox 56

Status

()

RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: jandem, Assigned: jandem)

Tracking

(Blocks: 1 bug)

unspecified
mozilla56
Points:
---

Firefox Tracking Flags

(firefox56 fixed)

Details

Attachments

(1 attachment, 1 obsolete attachment)

(Assignee)

Description

2 years ago
It (1) checks the whole range for case equivalent characters and (2) adds a singleton range for each of these.

We often get here with a range containing > 50000 chars. We should skip ranges of "uninteresting" chars and we should try to add a single range instead of many singleton ones.

The V8 code this was forked from is a lot smarter. Maybe we can import that.
(Assignee)

Comment 1

2 years ago
(In reply to Jan de Mooij [:jandem] from comment #0)
> (2) adds a singleton range for each of these.

Actually it does try to combine with existing ranges, so this part seems okay.
(Assignee)

Comment 2

2 years ago
I added some logging and it seems there are a few large ranges that have no case equivalent chars:

[11566, 42559] (30994 chars)
[43968, 65312] (21345 chars)

It might make sense to skip these.
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1374014
(Assignee)

Comment 4

2 years ago
Btw I think we're also calling this multiple times for the same ranges so we may be doing something stupid higher up the stack. I'll investigate more this week.
Flags: needinfo?(jdemooij)

Comment 5

2 years ago
OK so for whatever reason previously in bug 1374014 I wasn't seeing a backtrace of where these are coming from!  And now I do:

setTimeout_timer
_matchStringToFieldName
getInfo
getFormInfo
collectFormFields
identifyAutofillFields
doIdentifyAutofillFields/<
setTimeout_timer

Here are the regexes in question: https://searchfox.org/mozilla-central/source/browser/extensions/formautofill/content/heuristicsRegexp.js

Comment 6

2 years ago
(Also it seems like we run this code for each field name we find, which looks to be the name of each form control we find on the page: https://searchfox.org/mozilla-central/rev/7cc377ce3f0a569c5c9b362d589705cd4fecc0ac/browser/extensions/formautofill/FormAutofillHeuristics.jsm#101.  This may be extremely expensive...  If we can't do something about this in SpiderMonkey we should probably file a front-end bug about it.)
(Assignee)

Comment 7

2 years ago
Posted patch Patch (obsolete) — Splinter Review
OK so the problem here is that some of these chrome regexps use the "u" and "i" flags, and the ranges we add for '.' is where we spend a lot of (unnecessary) time.

The patch optimizes this as follows:

(1) In MakeCaseIndependent we can skip character classes that contain everything except whitespace and surrogates, as AddCaseEquivalents is a no-op in this case. This is similar to the cc->is_standard check there.

(2) In AddCaseEquivalents we can skip surrogate character ranges. The V8 code upstream has a similar check.

This improves the micro-benchmark below from >= 10 ms to 0 ms.

  var re = /f...................................../ui;
  var t = new Date;
  print("foo".match(re));
  print(new Date - t);
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Flags: needinfo?(jdemooij)
Attachment #8881328 - Flags: review?(arai.unmht)
Comment on attachment 8881328 [details] [diff] [review]
Patch

Review of attachment 8881328 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, the change looks good, except that js/src/vm/make_unicode.py needs corresponding modification.

::: js/src/irregexp/RegExpCharacters.cpp
@@ +140,5 @@
> +    0x2028, 0x2029 + 1, // LINE SEPARATOR..PARAGRAPH SEPARATOR
> +    0xD800, 0xDFFF + 1, // <Lead Surrogate Min>..<Trail Surrogate Max>
> +    0xFFFF + 1
> +};
> +const int js::irregexp::kLineTerminatorAndSurrogateRangeCount = 9;

this file is automatically generated by make_unicode.py,
and this needs to be handled there, next to the following line:

https://dxr.mozilla.org/mozilla-central/rev/53477d584130945864c4491632f88da437353356/js/src/vm/make_unicode.py#1427
>         character_range('LineTerminator', line_terminator)
Attachment #8881328 - Flags: review?(arai.unmht) → feedback+
(Assignee)

Comment 9

2 years ago
Posted patch Patch v2Splinter Review
Good catch, arai.

I undid my RegExpCharacters.cpp changes and ran make_unicode.py (which generated exactly the same file) :)
Attachment #8881328 - Attachment is obsolete: true
Attachment #8881464 - Flags: review?(arai.unmht)
Comment on attachment 8881464 [details] [diff] [review]
Patch v2

Review of attachment 8881464 [details] [diff] [review]:
-----------------------------------------------------------------

Great!
Attachment #8881464 - Flags: review?(arai.unmht) → review+

Comment 11

2 years ago
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d084672cc9fc
Speed up TextNode::MakeCaseIndependent for the unicode '.' character class ranges. r=arai

Comment 12

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/d084672cc9fc
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
status-firefox56: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
You need to log in before you can comment on or make changes to this bug.