Closed Bug 673188 Opened 13 years ago Closed 13 years ago

Compile regexps lazily

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla10

People

(Reporter: bzbarsky, Assigned: cdleary)

References

Details

(Whiteboard: [MemShrink:P1])

Attachments

(4 files, 6 obsolete files)

Eager compilation ends up using more time (see bug 672064) and probably more memory (see bug 670689).

This is not quite the same as bug 463545, which was using the nanojit-based regexp jit.  Might still be wontfix, of course.
Whiteboard: [MemShrink]
Could we have an optimization for regexps like we do for the mjit, where the regexp is interpreted a few times before it's compiled? Is this possible with the Yarr interpreter?
(In reply to comment #1)
> Could we have an optimization for regexps like we do for the mjit, where the
> regexp is interpreted a few times before it's compiled? Is this possible
> with the Yarr interpreter?

That's an interesting question. I think the concept doesn't apply to our current architecture for a number of reasons:

- We actually JIT compile directly from the pattern (AST) instead of requiring bytecode first. We only use bytecode for those regexps that are too complex for the JIT compiler to handle (or too difficult to optimize, however you want to look at it).
- We can't easily cross from the interpreter to JIT code at "safe points" -- the looping concept isn't as well defined, so large text inputs on initial runs would suffer. The YARR-JIT also uses the C stack to hold backtracking information last I checked, so you'd have to do a transform there.

So I think hypothetically it's possible and a cool idea, but practically it's not feasible.
The second problem seems like something we could overcome. If the input is too large, we could compile immediately. Or, if we're spending too much time in the interpreter, we could just start over in the JIT (assuming there are no side effects; perhaps I'm being naive...).

I don't really understand the first problem. When does the bytecode actually get generated? Would it be difficult to generate it first instead of only as a fallback?

I guess this isn't really a clear win, since the bytecode might actually be bigger than the generated machine code. We'd have to measure.
Whiteboard: [MemShrink] → [MemShrink][mentor=cdleary]
(In reply to comment #3)
> Or, if we're spending too much time
> in the interpreter, we could just start over in the JIT (assuming there are
> no side effects; perhaps I'm being naive...).

No, that's correct. I think that's a viable approach. I'll think it over some more, but the current regexp setup is that a pattern either gets compiled directly to native (if it's simple enough) or to directly to bytecode if it's too complex. I'm thinking that flushing memory for the unused regexps (bug 673189) plus lazy compilation may have the best usefulness/complexity ratio -- we can flush bytecode as well since it can be regenerated from the regexp source text.
dukeleto from #jsapi said that he'd be interested in taking this as a first bug. Some background information: regexps are currently compiled eagerly, on creation. See |js::RegExp::create*| in jsregexpinlines.h. The interface for these C++ objects is quite limited -- the primary way outside code interacts with them is through |js::RegExp::execute|.

|js::RegExp|s contain the code required to execute a regular expression -- it's either native code xor bytecode (i.e. when the fallback bytecode value is set, there is no native code).

Let me know if you have any questions -- IMO the |js::RegExp| interface is pretty clean and it generally just calls into the appropriate part of the YARR regexp engine component.
Whiteboard: [MemShrink][mentor=cdleary] → [MemShrink:P1][mentor=cdleary]
Assignee: general → jonathan
From what I gather from previous comments, js::RegExp's should be only be compiled if used at least N times, where N is some compile-time constant. This could be implemented by moving code which is currently in |js::RegExp::create*| to |js::RegExp::execute| which is only triggered on the Nth call to|js::RegExp::execute| . This would not require any API changes.

Does that sound like a viable plan?
Status: NEW → ASSIGNED
In #jsapi:

<cdleary-lappy> dukeleto: the bug in its current form is just to lazily compile, not to switch from bytecode over to JITted code

IOW, instead of compiling a regex when it's created, compile it just before its first use.
(In reply to comment #7)
> IOW, instead of compiling a regex when it's created, compile it just before
> its first use.

Yup, and when this is implemented we can put bug 673189 on top of this to clear out unused-but-compiled regexps at each GC.
This change may cause some substantial memory savings, so it's good to see you taking it on, Duke!
Just wanted to give an update on this, since jlebar asked me about it on IRC:

I spent a few days familiarizing myself with the RegExp code and I identified the actual line of code that compiles RegExp's on creation, which gave me a small sense of accomplishment.

Currently I am working on a test so that I can verify my patch works correctly. At first, I was going to write the tests in C++, because there is no way to tell if a RegExp is compiled from Javascript. But I had the pleasure of talking to jimb about this during a Mozilla PDX lunch, and he suggested that I create a private API for the shell which exposes a single function, something like isCompiled(), which will allow Javascript tests to detect if a RegExp is compiled or not.

That is my current plan of attack. If anyone has comments, suggestions or examples for how to implement a shell-only private API, that would be most appreciated.
> If anyone has comments, suggestions or
> examples for how to implement a shell-only private API, that would be most
> appreciated.

I'm no expert, but I think you should take a look at |shell_functions| in js/src/shell/js.cpp.
This patch compiles RegExp lazily, but because the TokenStream is not available at execution time, it passes in NULL currently. Many RegExp tests still pass, but tests which deal with multiple matched items fail, because only a single match is returned.
This is the right idea. I think the match result count is off because parenCount is initialized by |RegExp::compileHelper| which isn't being invoked until after you pull that value out in |RegExp::executeInternal|.

However, there are also a few trickier things lurking.

- Not all |RegExp|s are backed by the bytecode pattern -- looking at |RegExp::compileHelper|, we "try" to JIT compile the pattern, but if we can't, then we fall back on bytecode compilation. For the patterns that are successfully JIT compiled, |byteCode| will always be null.
- Because regular expressions can create "early errors" (errors which must be reported at parse-time, per the ECMAScript specification), we have to at least invoke the YARR parser with a nop-like delegate (see YarrParser.h for information on the parser delegate) in order to check for error codes.
- The token stream is only necessary for early error reporting, so we should be able to create a separate method that just checks for early errors, independent of lazy compilation. (By checking up front, we ensure that no early errors exist at lazy compilation time.)

Duke, let me know what your bandwidth is like for iterating on this patch -- unfortunately, I may have to steal it away to get an interdependent cluster of these regexp enhancements in for the next Firefox release train. I lacked that foresight when I said this was a good first bug, and I'm sorry!
cdleary: I don't give up easily, so I would like to continue iterating on this patch armed with the knowledge of comment #14. I feel confident that I can fix bullet #1 your comment, because I remember reading useful comments about how to check if JIT'ed code is available. I think I know what I need to do to get parentCount correct as well.

I grasp the "early errors" bit, and I will read the relevant parts of the spec to make sure I fully understand how they should work, but that is the part where I feel like I still need to feel around the code a bit more to see what actually needs to be done.

I have tuits to work with you this week to get this bug fixed, but I understand if you need to steal it away. Do you have an estimate for when you will be forced to steal this interesting bug away from me?
By hoisting the call to compile to before we calculate matchItemCount in |RegExp::executeInternal|, I reduced the number of failing test files from 77 down to just these 6:

    ecma_3/RegExp/regress-188206.js
    ecma_3/RegExp/regress-223273.js
    ecma_3/RegExp/regress-57631.js
    ecma_5/Global/cross-global-implicit-this.js
    js1_5/Exceptions/regress-257751.js
    js1_5/Exceptions/regress-332472.js
The list of failing test files for this patch is:
    ecma_3/RegExp/regress-188206.js
    ecma_3/RegExp/regress-223273.js
    ecma_3/RegExp/regress-57631.js
    ecma_5/Global/cross-global-implicit-this.js
    js1_5/Exceptions/regress-257751.js
    js1_5/Exceptions/regress-332472.js

which mostly seem to be tests for "early errors" not being thrown for invalid RegExps.
Attachment #551289 - Attachment is obsolete: true
Attachment #551291 - Attachment is obsolete: true
jorendorff gave me these instructions for finishing this up:

1) Remove Tokenstream param from compile() and compileHelper()
2) Change compileHelper() to assert yarrError is 0 instead of checking for it
3) Learn the secrets of Yarr::YarrPattern to detect early errors
4) WIN
Duke, I think if we get it done in the next couple of weeks it will be fine. Changes like these (lazifying stuff, changing GC) generally need time to bake in the nightly channel, so we're not likely to get it into the current train given that there's only a week or so left.

Some advice: be sure you're running the jit-tests as well as the jstests -- because it's so much easier to add tests there, regexp tests end up in both suites.

I would also recommend, as you gain understanding from the spec, adding tests specifically to test for early errors and such. Increasing our test suite and learning from the spec at the same time is a double win! For jit-tests you effectively just need to drop a file into the directory and use the |assertEq| shell builtin.
In addition to the above currently failing normal tests, these JIT tests are failing (due to the lack of early syntax errors being thrown):

FAILURES:
    -m -j -p /home/leto/hg/mozilla-central/js/src/jit-test/tests/basic/bug576837-regexp.js
    -m -j -p /home/leto/hg/mozilla-central/js/src/jit-test/tests/basic/bug593663-regexp.js
This patch implements the API changes recommended, and starts down the path of reporting early errors in RegExps from the parser.
Attachment #551674 - Attachment is obsolete: true
The newest patch cheats a bit by making reportYarrError a public method so that I can call it from the parser. I plan to fix this by moving reportYarrError to the parser, since it is not used anywhere else.

The bigger problem is, not all the early error failing tests have been resolved. Here is an example of some of the passing/failing tests from ecma_3/RegExp/regress-57631.js :

 FAILED! [reported from test()] Testing for error creating illegal RegExp object on pattern '(?)'and flag 'i' : Expected value 'SyntaxError', Actual value 'not a SyntaxError' 
 FAILED! [reported from test()] Testing for error creating illegal RegExp object on pattern '(?)'and flag 'g' : Expected value 'SyntaxError', Actual value 'not a SyntaxError' 
 FAILED! [reported from test()] Testing for error creating illegal RegExp object on pattern '(?)'and flag 'm' : Expected value 'SyntaxError', Actual value 'not a SyntaxError' 
 FAILED! [reported from test()] Testing for error creating illegal RegExp object on pattern '(?)'and flag 'undefined' : Expected value 'SyntaxError', Actual value 'not a SyntaxError' 
 FAILED! [reported from test()] Testing for error creating illegal RegExp object on pattern '(a'and flag 'i' : Expected value 'SyntaxError', Actual value 'not a SyntaxError' 
 FAILED! [reported from test()] Testing for error creating illegal RegExp object on pattern '(a'and flag 'g' : Expected value 'SyntaxError', Actual value 'not a SyntaxError' 
 FAILED! [reported from test()] Testing for error creating illegal RegExp object on pattern '(a'and flag 'm' : Expected value 'SyntaxError', Actual value 'not a SyntaxError' 
 FAILED! [reported from test()] Testing for error creating illegal RegExp object on pattern '(a'and flag 'undefined' : Expected value 'SyntaxError', Actual value 'not a SyntaxError' 
 FAILED! [reported from test()] Testing for error creating illegal RegExp object on pattern '( ]'and flag 'i' : Expected value 'SyntaxError', Actual value 'not a SyntaxError' 
 FAILED! [reported from test()] Testing for error creating illegal RegExp object on pattern '( ]'and flag 'g' : Expected value 'SyntaxError', Actual value 'not a SyntaxError' 
 FAILED! [reported from test()] Testing for error creating illegal RegExp object on pattern '( ]'and flag 'm' : Expected value 'SyntaxError', Actual value 'not a SyntaxError' 
 FAILED! [reported from test()] Testing for error creating illegal RegExp object on pattern '( ]'and flag 'undefined' : Expected value 'SyntaxError', Actual value 'not a SyntaxError' 
 PASSED! Testing for error creating illegal RegExp object on pattern '(?)'and flag 'a'
 PASSED! Testing for error creating illegal RegExp object on pattern '(?)'and flag '123'
 PASSED! Testing for error creating illegal RegExp object on pattern '(?)'and flag '/(?:)/'
 PASSED! Testing for error creating illegal RegExp object on pattern '(a'and flag 'a'
 PASSED! Testing for error creating illegal RegExp object on pattern '(a'and flag '123'
 PASSED! Testing for error creating illegal RegExp object on pattern '(a'and flag '/(?:)/'
 PASSED! Testing for error creating illegal RegExp object on pattern '( ]'and flag 'a'
 PASSED! Testing for error creating illegal RegExp object on pattern '( ]'and flag '123'
 PASSED! Testing for error creating illegal RegExp object on pattern '( ]'and flag '/(?:)/'

Anybody have any ideas why some invalid regexp early errors are detected, but not others?
My hunch about the remaining failing RegExp tests is that JSC::Yarr::checkSyntax() does not correctly detect some invalid RegExps (such as the ones in bug576837-regexp) because they are valid to parse, but are invalid for other reasons, i.e. that builtin character classes within ranges produce syntax errors.
Attachment #555661 - Attachment is obsolete: true
Attached image Kind of crappy plot.
Just wanted to get some preliminary numbers on how many destructed |RegExpPrivate|s were never executed.

Highest number-of-times-executed counts:

[(0, 2468),
 (1, 753),
 (13, 45),
 (2, 42),
 (35, 40),
 (3, 24),
 (4, 24),
 (9, 23),
 (5, 15),
 (85, 14)]

So laziness seems very useful, and I'll grab memory overhead size for those unused guys.
Assignee: jonathan → cdleary
Doing some pretty lame browsing around GMail, GReader, and Twitter I got the following usage numbers for zero-times-executed |RegExpPrivate|s:

Total:   739108 bytes
Max:       6904
Mode:       363
Mean:       525.68
Stddev:   27533.82

So we generate under a meg of stuff for no real reason.
And we throw these regexps away when we destroy the compartment?

If so, I don't think this is MemShrink:P1, unless we can find a site which generates 10 times as much memory from regexps...
(In reply to Justin Lebar [:jlebar] from comment #27)

It's also a speed concern for compilation overhead of huge regexps (note the measurements in bug 672064).
(In reply to Justin Lebar [:jlebar] from comment #27)

Oh, and also making the |RegExpObject| lazy allows us to purge them all on GC (i.e. from idle compartments), which is probably the more MemShrink:P1 issue.
Duke, just wanted to reiterate that I'm sorry to have to steal this one away! (Per comment 14.)
Whiteboard: [MemShrink:P1][mentor=cdleary] → [MemShrink:P1]
I'm definitely not saying you shouldn't fix this bug.  :)  It just seems that the expected memory implications aren't particularly large.
(In reply to Justin Lebar [:jlebar] from comment #31)

Oh yeah, I was just pointing out that the prolonged usage results are different from my one-off browsing session results. bz posted a troubling piece of long-term about:memory data WRT RegExp code that got me jumpstarted back on finishing this stuff.
> And we throw these regexps away when we destroy the compartment?

In my case I had several hundred megs of regexp code in the system compartment.  So no destroying of the compartment going on.
Attached patch WIP: lazify compilation. (obsolete) — Splinter Review
Depends on patches in bug 691695 and bug 692069. This looks like it may be passing all my shell tests. It lazifies the RegExpObject creating a RegExpPrivate. (The other option is to lazify the compilation but eagerly create the RegExpPrivate, but this way makes it easier to install a RegExpPrivate cache and probably reduces refcount bumping.)
Unless I'm doing something wrong, it looks like the twitter-text.js from bug 672064 comment 2 is an effective microbenchmark: the execution time is reduced from 1600ms to 10ms.

Looks green on try. (Should also be easy to add a RegExpPrivate cache in front of the lazy creation. Huzzah for well factored code.)
Attachment #565477 - Attachment is obsolete: true
Attachment #565677 - Flags: review?(mrbkap)
Comment on attachment 565677 [details] [diff] [review]
Lazify RegExpPrivate creation.

This looks really good. I complained to cleary on IRC about parenthesizing the conditionals on ?: expressions and he said he'd fix this.
Attachment #565677 - Flags: review?(mrbkap) → review+
were Dromaeo regressions expected? tree-management is showing CSS and jslib regressions on linux and linux64
https://hg.mozilla.org/mozilla-central/rev/61dd23c012ee
https://hg.mozilla.org/mozilla-central/rev/d50bd6d5b097
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
(In reply to Marco Bonardo [:mak] from comment #38)
> were Dromaeo regressions expected? tree-management is showing CSS and jslib
> regressions on linux and linux64

Kind of. Bug 694527 *should* quickly restore normalcy, which is why I don't think it's worth a back out -- what page / mailing list should I be watching to verify that we bounce back to our previous value when that lands today?
In graphs-new you have to select Custom Chart to see inbound graphs.
(In reply to Chris Leary [:cdleary] from comment #40)
> (In reply to Marco Bonardo [:mak] from comment #38)
> > were Dromaeo regressions expected? tree-management is showing CSS and jslib
> > regressions on linux and linux64
> 
> Kind of. Bug 694527 *should* quickly restore normalcy, which is why I don't
> think it's worth a back out -- what page / mailing list should I be watching
> to verify that we bounce back to our previous value when that lands today?

Can I please suggest that we should back this out now and reland it with bug 694527 at the same time?  That's the only way to be sure that nothing is regressed, by not seeing any regression emails on dev.tree-management...
(In reply to Ehsan Akhgari [:ehsan] from comment #43)
> (In reply to Chris Leary [:cdleary] from comment #40)
> > (In reply to Marco Bonardo [:mak] from comment #38)
> > > were Dromaeo regressions expected? tree-management is showing CSS and jslib
> > > regressions on linux and linux64
> > 
> > Kind of. Bug 694527 *should* quickly restore normalcy, which is why I don't
> > think it's worth a back out -- what page / mailing list should I be watching
> > to verify that we bounce back to our previous value when that lands today?
> 
> Can I please suggest that we should back this out now and reland it with bug
> 694527 at the same time?  That's the only way to be sure that nothing is
> regressed, by not seeing any regression emails on dev.tree-management...

Except that I just saw that bug 694527 landed on inbound.  :((
Depends on: 694762
Depends on: 694752
Believed to have caused bug 694752.

I have a backout based on m-c tip going to try (see https://bugzilla.mozilla.org/show_bug.cgi?id=694752#c8), and if that helps, I'll be backing this out.

This would also resolve the problems in comment 43, since everything could be re-landed at once, making the dev.tree-management results clearer.
Nick mentioned in his blog post that the ultimate state of this bug was unclear. The regressions were resolved in a followup fix -- there was confusion because I didn't land that on mozilla-central post-merge. Everything is cool, and we should get some nice thread-data-level consolidation wins out of bug 634654.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: