Closed Bug 573504 Opened 14 years ago Closed 14 years ago

Regular expression hoisting or caching is desirable

Categories

(Tamarin Graveyard :: Library, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED
Q3 11 - Serrano

People

(Reporter: lhansen, Assigned: lhansen)

Details

(Whiteboard: PACMAN, has-patch)

Attachments

(6 files, 7 obsolete files)

Two simple benchmarks that run a tight loop of the form

  for ( ... )
    re.exec(s)

where re is either a literal regular expression or a variable holding a regular expression constructed outside the loop, show that the latter runs at 2.9 times the speed of the former (score 110 vs score 38).

It is nearly trivial to cache the internal compiled form of the regular expression in the AvmCore, so we should try that first (typically a single-element cache will do a lot).  The alternative is to sniff the regex construction and hoist it into the method prologue; however, caching can work well across invocations, and hoisting will not.
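As a rough illustration of the caching idea, here is a minimal sketch of a single-element cache keyed on the (pattern, flags) pair. All identifiers (RegexCache1, CompiledRegex) are invented for illustration, not Tamarin's actual names, and a plain struct stands in for the compiled PCRE program:

```cpp
#include <cassert>
#include <string>

// Stand-in for the compiled regex; in Tamarin the compiled PCRE program
// would live here.
struct CompiledRegex {
    std::string pattern;
    std::string flags;
};

// Hypothetical single-element cache, as proposed in the comment above.
class RegexCache1 {
    CompiledRegex entry;
    bool valid = false;
public:
    // Returns the cached compilation on a hit; compiles and caches on a miss.
    // 'misses' counts compilations so the effect is observable.
    const CompiledRegex& lookup(const std::string& pattern,
                                const std::string& flags, int* misses) {
        if (!(valid && entry.pattern == pattern && entry.flags == flags)) {
            entry = CompiledRegex{pattern, flags};  // stand-in for compilation
            valid = true;
            ++*misses;
        }
        return entry;
    }
    // The real cache is flushed on every GC to avoid pinning large regexes.
    void flush() { valid = false; }
};
```

A hot loop compiling the same literal repeatedly hits the cache on every iteration after the first, which is the scenario the benchmarks above exercise.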
Attached file Benchmark 1: literal regex inside loop (obsolete) —
Attached patch Tentative patch (obsolete) — Splinter Review
We have a single-element cache in AvmCore; it's flushed during GC to avoid holding onto long patterns or large compiled regexes.  The cache is used to bypass both regex compilation and redundant re-scanning of the regex string during pre-compilation (to build up the regex flags).  The latter also avoids further allocation, as the zero-terminated strings allocate temporary storage.

For the benchmark in this test the cache raises the score from 38 to 60 (MacPro), which is not bad (58% speedup) but that's still only half the speed of manually hoisting the regular expression out of the loop.  The remaining time must be spent in allocating and constructing the RegExp object itself; I'm a little puzzled at the cost and will need to investigate further.

Not asking for a review yet, as there are some rough spots here and even a FIXME, but comments are welcome all the same.
Attached file Benchmark 1: literal regex inside loop (obsolete) —
Attachment #452751 - Attachment is obsolete: true
There was a bug in the old benchmark: The combination of /g and RegExp.prototype.exec is problematic if the regular expression is hoisted, since exec both reads and writes the regex's lastIndex field.  In the hoisted benchmark, matching would start at the end of the string from the second iteration on and always fail immediately, which gave an unfair advantage to the hoisted version.  I should have seen that sooner - it's why the benchmark prints out the last result - but did not.

With this new program, we have new scores:

 hoisted   79
 cached    62

That is, hoisted/cached = 1.27.  That is probably defensible, since the 'cached' version performs more work per iteration:

 - look up RegExp
 - call the RegExp constructor
 - allocate and initialize a RegExp object
 - destruct the RegExp object

(The new benchmarks run for 2 seconds but still report iterations/sec.)
Attachment #452752 - Attachment is obsolete: true
Attached patch Patch (obsolete) — Splinter Review
Cleaned-up code.

Here we have a single-element global (AvmCore-relative) cache only, which solves the hot-regex-in-a-loop use case (probably a somewhat important one) but will work less well on more complex workloads.

Alternative structures include:

 - multi-element global caches
 - local caches, eg, relative to a MethodInfo
 - JIT optimization, eg, precompiling regex strings

Precompilation is actually a little dicey because compilation can raise errors that must not be visible unless control flow reaches the point where the regex would have been compiled.

Local caches are also a little tricky because we want to be sure not to retain storage indefinitely; the global cache is flushed on every GC to avoid this.

In contrast, a multi-element global cache would be relatively straightforward.
Attachment #453050 - Flags: feedback?(stejohns)
Attachment #452820 - Attachment is obsolete: true
Comment on attachment 453050 [details] [diff] [review]
Patch

-- Nit: RegexCache should restructure its members to avoid compiler-inserted padding. (bool-int-bool-ptr should be ptr-int-bool-bool). Ditto RegExpObject (bool-int-int-bool-ptr -> ptr-int-int-bool-bool).
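For illustration, a minimal sketch of the padding effect behind this nit, using hypothetical structs (not the actual RegexCache/RegExpObject layouts). On a typical 64-bit ABI the bool-int-bool-ptr ordering forces alignment padding that the ptr-int-bool-bool ordering avoids:

```cpp
#include <cassert>

// bool-int-bool-ptr: the compiler pads after each bool to align the
// following member (int needs 4-byte alignment, void* typically 8).
struct Padded {
    bool  a;   // 1 byte + padding
    int   b;   // 4 bytes
    bool  c;   // 1 byte + padding before the pointer
    void* d;   // pointer-sized
};

// ptr-int-bool-bool: members ordered from most- to least-aligned, so
// padding can only appear at the tail.
struct Packed {
    void* d;
    int   b;
    bool  a;
    bool  c;   // + small tail padding at most
};

// Reordering members never increases the size.
static_assert(sizeof(Packed) <= sizeof(Padded),
              "descending-alignment ordering should not be larger");
```

Exact sizes are ABI-dependent (commonly 24 vs 16 bytes on LP64), so the assertion only checks the relative effect.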

-- Uber-nit: "an trailing flags string" -> "a trailing flags string"
Attachment #453050 - Flags: feedback?(stejohns) → feedback+
Suggested reorg of existing class (only).
Attachment #453355 - Flags: review?(stejohns)
Comment on attachment 453355 [details] [diff] [review]
Reorganize the fields of RegExpObject

That's a silly patch.
Attachment #453355 - Attachment is obsolete: true
Attachment #453355 - Flags: review?(stejohns)
Attached patch Patch — Splinter Review
I've addressed the feedback.

In addition the regex cache has been split out of AvmCore as a separate data structure, and now encapsulates a multi-entry cache.  This seems pretty clean.  The cache size is set to 4, on the principle that 1 is too brittle in the face of reasonable programs but 4 is likely better, and anything larger would seem unreasonable for this approach.  Cache replacement is LRU using a simple time stamping mechanism.
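A sketch of the timestamp-based LRU scheme described above, with invented names; the real RegexCache presumably also keys on the flags and holds RC'd compiled-regex proxies rather than bare strings:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

struct Entry {
    std::string pattern;    // key (flags omitted for brevity)
    uint64_t    stamp = 0;  // last-use time; 0 means the slot is empty
};

class RegexCache {
    static const int kCacheSize = 4;  // the named constant the review asked for
    Entry    m_entries[kCacheSize];
    uint64_t m_clock = 0;
public:
    // Returns true on a hit; on a miss, evicts the least recently used slot.
    bool lookup(const std::string& pattern) {
        ++m_clock;
        int lru = 0;
        for (int i = 0; i < kCacheSize; i++) {
            if (m_entries[i].stamp != 0 && m_entries[i].pattern == pattern) {
                m_entries[i].stamp = m_clock;   // refresh the stamp on a hit
                return true;
            }
            if (m_entries[i].stamp < m_entries[lru].stamp)
                lru = i;                        // track oldest (or empty) slot
        }
        m_entries[lru].pattern = pattern;       // miss: replace the LRU slot
        m_entries[lru].stamp = m_clock;
        return false;
    }
};
```

With four hot regexes everything fits and every lookup after warm-up hits; a fifth hot regex forces an eviction on every iteration, which is exactly the RE3/RE4 contrast measured below.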

I also corrected a bug in the caching code: in the case where the AS3 code is "new RegExp(re)", where re is a regular expression, we should just copy that regular expression.  (There was also an obscure correctness problem, but that's gone now.)

Asking for a review now, though I still need to verify that benchmarks with up to four hot regular expressions see a speedup as expected and that once we get five hot expressions we're back to the old behavior, more or less.  (A significant slowdown would be bad even if the case is atypical.)
Attachment #453050 - Attachment is obsolete: true
Attachment #453377 - Flags: review?(stejohns)
Whiteboard: PACMAN → PACMAN, has-patch
(In reply to comment #9)
> (From update of attachment 453355 [details] [diff] [review])
> That's a silly patch.

From the Ministry of Silly Patches?
Comment on attachment 453377 [details] [diff] [review]
Patch

Agreed that we should benchmark before landing.

-- nit: RegexCache and RegexCacheEntry should probably have comments pointing out that they expect to live in a GCRoot, due to DRC (vs DRCWB) usage.

-- nit: named constant for RegexCache size, rather than "4"....

-- style nit: most existing code in tamarin tends to do

	for (init; test; incr)

rather than 

	for ( init; test; incr )

your call, though.
Attachment #453377 - Flags: review?(stejohns) → review+
(In reply to comment #12)
> -- nit: named constant for RegexCache size, rather than "4"....

I can fix that, but I hope you observed that that's the only occurrence of that '4'.  Everywhere else uses ARRAY_SIZE(m_entries).

Agreed on the other issues.
I've honed the code some more to disable the cache if it's thrashing; a new patch will follow, as will benchmark code.  Below are results for four benchmarks:

RE1 - one regex literal inside the loop
RE2 - one regex literal outside the loop
RE3 - five regex literals inside the loop
RE4 - four regex literals inside the loop

The cache size is four, so RE3 should thrash and RE4 should work very well indeed.

NOTE: benchmarks are compiled without -AS3 for now, so calls to exec() are prototype calls, which will tend to hide the effects of the regex cache.

NOTE: scores for RE3 and RE4 use a different scale than those for RE1 and RE2, in order to get precision.

Runs are release-debugger -Dnodebugger, Mac Pro

Old code:

RE1 40, 40
RE2 69, 69
RE3 128, 127, 128
RE4 156, 156, 156

New code:

RE1 58, 59, 59
RE2 72, 72, 71
RE3 122, 122, 121
RE4 265, 266, 265

The numbers show:

RE1 48% speedup - as expected, the regex always hits the cache
RE2  4% speedup - strange, should not really see any speedup here, as regex
                  construction is not hot at all
RE3  5% slowdown - as expected, no benefit from the cache, and some overhead
                   to discover that
RE4 70% speedup - all regexes fit in the cache, excellent utilization

The main issue is the 5% slowdown when the cache is disabled.  The code to track utilization and disable the cache if it's thrashing came about as a result of observing a 9% slowdown due to the thrashing (117, 118, 117 on RE3), so 5% is an improvement.

It seems that disabling the cache more aggressively improves RE3, but it's unclear why doing so would be the right thing - real programs probably benefit from the cache being on, and suffer little from it being on uselessly.
Again release-debugger -Dnodebugger, but test programs compiled with -AS3

Old code:

RE1 43 43 44
RE2 75 75 75
RE3 145 145 145
RE4 174 175 174

New code:

RE1 65 66 65
RE2 76 76 76
RE3 135 136 136
RE4 316 315 315

The numbers show:

RE1 53% speedup
RE2  1% speedup
RE3  6% slowdown
RE4 81% speedup

The trends and figures are pretty much the same.

The main issue for the patch is whether we believe there's temporal locality in how regular expressions are used (really in how they are compiled from literals), or whether programs that use regular expressions use a large number of them and won't benefit from a small cache.

The latter class of program can only be helped by a much larger - perhaps perfect - "cache", or indeed by caching locally in each method, which would amount to the same thing - remembering all compiled expressions, if only for a while.  A method-local cache is probably straightforward if the RegExp constructor is early bound by the JIT, but otherwise ... maybe the RegExp constructor can sniff its call stack and find its caller, and attach information to it.
Attachment #453040 - Attachment is obsolete: true
Attachment #453041 - Attachment is obsolete: true
... for cache size 'n', i.e., this will cause thrashing in the cache.
... also for cache size 'n', i.e., this has a 100% hit rate and 100% utilization and represents a best case.
Attached patch Alternate patch — Splinter Review
This is like the first patch, with the nits tidied up, and with a cache-disabling mechanism: if utilization falls below some percentage (here set to 10%) after a while (here set to 1000 compilations) then the cache is disabled.  It is re-enabled by the garbage collector when the GC clears the regex cache in presweep.
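A sketch of that disabling heuristic in isolation (invented class name; the 1000-compilation window and 10% threshold mirror the description above but are, as noted later in this bug, ad-hoc tuning parameters):

```cpp
#include <cassert>

class CacheThrottle {
    static const int kWindow = 1000;       // compilations before we judge
    static const int kMinHitPercent = 10;  // below this hit rate, disable
    int  m_lookups = 0;
    int  m_hits = 0;
    bool m_enabled = true;
public:
    bool enabled() const { return m_enabled; }

    // Called once per regex compilation request.
    void recordLookup(bool hit) {
        if (!m_enabled) return;            // disabled: stop paying overhead
        ++m_lookups;
        if (hit) ++m_hits;
        // After the trial window, disable if utilization is too low
        // (integer form of: hits / lookups < kMinHitPercent / 100).
        if (m_lookups >= kWindow &&
            m_hits * 100 < m_lookups * kMinHitPercent)
            m_enabled = false;
    }

    // The GC re-enables the cache when it clears it in presweep.
    void resetOnPresweep() { m_lookups = m_hits = 0; m_enabled = true; }
};
```

A thrashing workload (all misses) trips the disable after the window, while a well-behaved workload keeps the cache on indefinitely.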

I'd like opinions on (a) whether the cache-disabling complexity, which is pretty localized, pays for itself and (b) whether we should land this or the other patch, given that we're seeing a small slowdown when the cache is thrashing.
Attachment #453727 - Flags: review?(stejohns)
(In reply to comment #20)
> I'd like opinions on (a) whether the cache-disabling complexity, which is
> pretty localized, pays for itself and (b) whether we should land this or the
> other patch, given that we're seeing a small slowdown when the cache is
> thrashing.

Perhaps re-running the benches (RE3 in particular) would provide useful data...?
Comment on attachment 453727 [details] [diff] [review]
Alternate patch

Looks fine -- I'd be curious to see how the numbers compare from re-running the benchmarks (especially RE3). I'm also a little puzzled as to why the slowdown in the thrash case should be that large; it doesn't appear that the cache lookup overhead should be very high, relative to the cost of compiling the regexp in the first place. Is it possible the slowdown is not in the cache operation per se but in some of the rewritten code in RegExpObject?

On that note, a minor optimization there: instead of

	c == '(' && 
	pos+3 < length &&
	pattern->charAt(pos+1) == '?' &&
	pattern->charAt(pos+2) == 'P' &&
	pattern->charAt(pos+3) == '<')

you could just do

	pattern->indexOfLatin1("(?P<", 4, pos) == pos
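The equivalence of the two forms can be sketched with std::string standing in for Tamarin's String class (std::string::find is the closest stdlib analogue of indexOfLatin1; function names here are invented):

```cpp
#include <cassert>
#include <string>

// Char-by-char form, as in the patch: checks for "(?P<" at 'pos'.
static bool isNamedGroupVerbose(const std::string& pattern, size_t pos) {
    return pos + 3 < pattern.length() &&
           pattern[pos]     == '(' &&
           pattern[pos + 1] == '?' &&
           pattern[pos + 2] == 'P' &&
           pattern[pos + 3] == '<';
}

// Single-search form suggested in the review: the needle is found at 'pos'
// exactly when the four characters match there.
static bool isNamedGroupCompact(const std::string& pattern, size_t pos) {
    return pattern.find("(?P<", pos) == pos;
}
```

Note one behavioral difference worth keeping in mind: on a mismatch the search form may scan ahead looking for the needle before failing, so it is simpler but not necessarily cheaper in the miss case.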
Attachment #453727 - Flags: review?(stejohns) → review+
The numbers in comment #15 are for the code posted in comment #20 ("Alternate patch").
(In reply to comment #22)
> I'm also a little puzzled as to why the slowdown
> in the thrash case should be that large; it doesn't appear that the cache
> lookup overhead should be very high, relative to the cost of compiling the
> regexp in the first place. Is it possible the slowdown is not in the cache
> operation per se but in some of the rewritten code in RegExpObject?

I'm also puzzled.  I'd be surprised if the precompilation is the culprit, but there could be many contributory factors.  There's an RCObject that's allocated to serve as a proxy for the compiled regex; it is pure overhead if the cache thrashes.  There are write barriers on the cache elements, for three RC objects; there's the cache search; and of course the somewhat slower precompilation.
Attachment #453727 - Flags: superreview?(edwsmith)
Comment on attachment 453727 [details] [diff] [review]
Alternate patch

Ed, maybe you can weigh in on whether we should land one of these (and if so which one), or if we should hold off until we're not regressing in the cache-thrashing case.
(In reply to comment #24)
> (In reply to comment #22)
> 
> ...there could be many contributory factors.

> There's an RCObject that's allocated to serve as a proxy for the
> compiled regex, it is pure overhead if the cache thrashes

Could possibly be fixed, at some increase in complexity, by using custom reference counting on the underlying pcre object.

> there are write barriers on the cache elements, for three RC objects;

Faster write barriers will be here someday, but for now we'd remove the barrier for the new RC object by using custom reference counting.  It's possible using WB() would be a hair faster.

> there's the cache search; 

We could abandon the O(n) search and instead use a larger direct-mapped table indexed by string / options value; the problem there is that we can get a different kind of thrashing when two inputs map to the same slot.
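The direct-mapped alternative could look roughly like this (invented names; std::hash stands in for whatever string hash Tamarin would use). Each key hashes straight to one slot, making lookup O(1), at the cost of the conflict-thrashing noted above when two hot keys share a slot:

```cpp
#include <cassert>
#include <functional>
#include <string>

class DirectMappedCache {
    static const size_t kSlots = 16;
    std::string m_keys[kSlots];
    int         m_opts[kSlots] = {};
    bool        m_used[kSlots] = {};

    // Map (pattern, options) to a slot; collisions overwrite each other.
    static size_t slotFor(const std::string& pattern, int options) {
        return (std::hash<std::string>()(pattern) ^ size_t(options)) % kSlots;
    }
public:
    bool lookup(const std::string& pattern, int options) {
        size_t s = slotFor(pattern, options);
        if (m_used[s] && m_keys[s] == pattern && m_opts[s] == options)
            return true;         // hit
        m_keys[s] = pattern;     // miss: evict whatever occupied the slot
        m_opts[s] = options;
        m_used[s] = true;
        return false;
    }
};
```

Unlike the LRU scan, two distinct hot regexes that happen to collide will evict each other on every iteration, so a larger table trades one kind of thrashing for another.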

> and of course the somewhat slower precompilation.

That can be optimized by providing more access to the string representation.

Even then it is likely that the cache-thrashing case will incur measurable overheads.
(In reply to comment #15)
>  but otherwise ... maybe the RegExp
> constructor can sniff its call stack and find its caller, and attach
> information to it.

This strikes me as crazy-talk in the short term, and in the medium/long term I'm hopeful that inlining might expose enough context to specialize the RE and enable inline caching to help.
Can user code extend RegExp and populate the cache with instance objects from their code origin (pool)?  If so, then the cache could prevent unloading by creating a reference chain from AvmCore, to the cache, to the user PoolObject.  Clearing the cache on GC sweep should be good enough, right?  Could it cause the user PoolObject to live one GC cycle longer than expected?

I find myself wondering if the regexp cache code really belongs in RegExp[Class|Object].h instead, but it's a minor point.  I also wondered whether a 1-entry inline cache would be more effective.  One would certainly think so for /re-literals/, right?  You talked about a per-method cache, but didn't mention inline caching, so I'm not sure if you equate the two.

It's worth noting that with inline caching, individual instances using an inline cache could enable/disable themselves independently, and it would not be hard to switch between the monomorphic/polymorphic/megamorphic cases.  (The last case would just disable the cache.)
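A minimal sketch of that per-site state progression (illustrative names only, not Tamarin code): each call site caches the shapes it has seen, and once it sees too many it marks itself megamorphic and stops caching:

```cpp
#include <cassert>
#include <string>

class InlineCacheSite {
    static const int kMaxPoly = 4;     // shapes tolerated before giving up
    std::string m_seen[kMaxPoly];
    int  m_count = 0;
    bool m_megamorphic = false;
public:
    bool megamorphic() const { return m_megamorphic; }

    // Returns true if this pattern was already cached at this site.
    bool lookup(const std::string& pattern) {
        if (m_megamorphic) return false;       // site disabled itself
        for (int i = 0; i < m_count; i++)
            if (m_seen[i] == pattern) return true;
        if (m_count < kMaxPoly)
            m_seen[m_count++] = pattern;       // still mono-/polymorphic
        else
            m_megamorphic = true;              // too many shapes: give up
        return false;
    }
};
```

The appeal over a global cache is exactly the independence described above: one thrashing site going megamorphic doesn't cost the well-behaved sites anything.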

To answer the question, I think the disabling code does pay for itself, and we should go with that version of the patch; but man, I wish we had better data to go on.
(In reply to comment #28)
> Can user code extend RegExp and populate the cache with instance objects from
> their code origin (pool)?

Arguably, though I'm not sure how the pool figures into it.  Are you thinking that the RegExpSubtype instance references the class/traits for RegExpSubtype which in turn references the pool?

> if so then the cache could prevent unloading by
> creating a reference chain from avmcore, to the cache, to the user 
> poolObject. Clearing the cache on gc sweep should be good enough, right?
> could it cause the user PoolObject to live one gc cycle longer than
> expected?

It doesn't prevent unloading, but it may delay it.  If we have unloading hooks (callbacks called when an unloading event occurs) then perhaps they should flush the regex caches, in order to prevent the delay.

> I find myself wondering if the regexp cache code really belongs in
> RegExp[Class|Object].h instead, but its a minor point.   I also wondered about
> whether a 1-entry inline cache would be more effective.  One would certainly
> think so for /re-literals/, right?  You talked about a per-method cache, but
> didn't mention inline caching so i'm not sure if you equate the two.

You can only have an inline cache if you can early-bind to the RegExp constructor.  Arguably this is a bug (actually several bugs) in Tamarin and/or ASC, but a literal regex /abc/g is compiled roughly as

  findpropstrict RegExp
  pushstring abc
  pushstring g
  constructprop RegExp, 2

and the jit does not seem to early-bind to RegExp.  This is a bug not because we can't early-bind but because it means RegExp can be spoofed, which is wrong for literals - they should be inviolable.  (The other bug is that compiling regexes this way is against ES3, which demands that the literal is compiled once - although it's been argued that that's a bug in ES3.)

> to answer the question, i think the disabling code does pay for itself, 
> and we should go with that version of the patch, but man, i wish we had
> better data to go on.

Well, yes.

The reason I brought up the stack sniffing earlier is because it's one way of dealing with late binding.  There are others, possibly - we'd have to have an inline cache that handles a late-bound RegExp constructor and when that hits, handles a previously compiled regex.
There are two aspects to early binding constructors.  One is statically deriving the types (knowing the class is T$ and thus knowing the result is T); the other is early binding the allocation and constructor invocation.

The first aspect is disabled for builtin classes with constructors that don't follow the standard "newInstance(); invoke constructor" pattern (traits->hasCustomConstruct == true).  This in turn disables the specialized allocation and constructor invocation.

There's no good reason we couldn't specialize that path for a few important classes, like RegExp (or Array or anything else for that matter).  We end up early binding to the result type (public::RegExp) even though we don't generate an early bound constructor call.

  2:findpropstrict RegExp
                       [Object~[A]] {} (global~[O] global~[O])
  4:getproperty RegExp
                       [Object~[A]] {} (global~[O] RegExp$[O])
  6:pushstring "abc"
                       [Object~[A]] {} (global~[O] RegExp$[O] String~[S])
  8:pushstring "g"
                       [Object~[A]] {} (global~[O] RegExp$[O] String~[S] String~[S])
  10:construct 2
                       [Object~[A]] {} (global~[O] RegExp~[O])

Seems like possible low-hanging fruit; the builtin classes are a known set (Array, RegExp, Date, etc.).  Out of scope for this bug, but probably something good for PACMAN.
After talking to Ed, we've decided as follows:

- I'll land the patch with the fallback, though we agree it pushes up against
the limit of allowable complexity.

- I'll annotate the patch with more prose about the ad-hocness of the cache tuning parameters and their possible bogosity.

- Ed will work on some early binding to the RegExp constructor (as part of a broader agenda of binding to important built-in classes, like Array); once that lands we can reconsider whether we might not want to use an in-line cache where it's possible to do so.
Attachment #453727 - Flags: superreview?(edwsmith) → superreview+
(In reply to comment #31)
> - Ed will work on some early binding to the RegExp constructor (as part of a
> broader agenda of binding to important built-in classes, like Array)

Bug #575848.
tamarin-redux changeset:   4893:8993c851682d
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #28)
> Can user code extend RegExp 

Although RegExp is not marked final, in practice, subclasses of RegExp have never worked properly, so this is a nonissue. (In a version-enabled future we should probably explicitly make RegExp final).
Flags: flashplayer-bug+