593195 - TM: Blacklist based on iteration count/weight

Reporter

Description

•

15 years ago

* Background In order to get good perf on certain workloads, we need to avoid running loops on trace that run for very few iterations. The cost of entering and leaving trace outweighs the benefit. In particular, there are 5 loops that we currently blacklist in earley-boyer; if we don't blacklist them, we run 2x slower. 2 of these loops are over arguments. The other 3 walk Scheme linked lists. The current solution to this is to blacklist a tree if we enter that tree 300 times. This saves earley-boyer, but it hurts a lot on Dromaeo. In particular, Dromaeo runs 3bit inside a timing loop. The workload loop is fine to run on trace: we run it 5-10x faster that way, which is a big deal when scores are aggregated with geometric mean (bleh!!). But the workload loop is called 6000 times (when we don't blacklist), so it gets blacklisted, and our score goes way down. It appears that what really makes a loop not worth running on trace is if it runs too quickly. One idea that has been proposed in the past is to measure the amount of time spent on trace, and blacklist if that is too small. That proved somewhat hard to get working. It also adds nondeterminism, which is terrible for debugging. * Proposed Solution We can instrument each trace with an iteration counter. The counter is initialized to 0 before entering the trace. When the counter reaches a threshold value |high| (e.g., high=300), we strip out the instrumentation immediately. When we take a loop exit, we make a decision based on the counter value N and a threshold value |low|, (e.g., low=10): if N < low, blacklist the loop otherwise, remove the instrumentation All the usual variations can be tried, such as: - Instead of just using the count, use a weight value W = N * L, where L is the number of JSOPs in the trace, or the length of the native code in the trace. - Average multiple counter results. Apparently blacklisting after 300 entries does the trick on earley-boyer, so we could, for example, do a moving average and decide after several entries. * Implementation Details The other important bit is to make the instrumentation cheap when we are not using it. We can do it like this: 1. After we generate a trace T1, generate a second tiny trace Tc that just increments the counter and calls a strip-instrumentation function if counter > |high|. Note that so far, this trace is a loop. 2. Patch the jump at the end of T1 to Tc. 3. Patch the jump at the end of Tc to the loop header of T1. To strip the instrumentation, we simply patch the jump at the end of T1 to the loop header of T1 again. There seems to be one problem remaining: how to make sure Tc doesn't clobber the register holding 'state'.

Nicholas Nethercote [inactive]

Comment 1

•

15 years ago

(In reply to comment #0) > > * Implementation Details > > The other important bit is to make the instrumentation cheap when we are not > using it. We can do it like this: > > 1. After we generate a trace T1, generate a second tiny trace Tc that just > increments the counter and calls a strip-instrumentation function if counter > > |high|. Note that so far, this trace is a loop. > 2. Patch the jump at the end of T1 to Tc. > 3. Patch the jump at the end of Tc to the loop header of T1. > > To strip the instrumentation, we simply patch the jump at the end of T1 to the > loop header of T1 again. > > There seems to be one problem remaining: how to make sure Tc doesn't clobber > the register holding 'state'. That sounds complicated. For prototyping purposes I'd just put this in at the top of the loop: if (count < low) { count++; } On loops that run more than 'low' times you'll end up having an extra (failing) test for every iteration after the 'low'th one, but I suspect that won't be so bad. Every loop already has a callbackFlag check and that doesn't cost much (ie. I've tried removing it and the effect was negligible). Then if you find that this approach works well, you could try to jump through the hoops to avoid the instrumentation cost :)

David Mandelin [:dmandelin]

Reporter

Comment 2

•

15 years ago

(In reply to comment #1) > That sounds complicated. For prototyping purposes I'd just put this in at the > top of the loop: > > if (count < low) { > count++; > } Yay!! It looks like that version has no cost, at least in shell SunSpider. ++njn for sure on making this idea a lot easier to hack up.

David Mandelin [:dmandelin]

Reporter

Comment 3

•

15 years ago

Attached patch WIP (obsolete) — Details — Splinter Review

This does the basic strategy, and makes earley-boyer work OK, but it regresses SunSpider with -m -j by about 20 ms. I'm not sure why--it just makes most benchmarks 5% slower. I'm thinking it's probably too aggressive about blacklisting. I'll just have to play with different parameters and blacklisting patterns to see what happens.

Boris Zbarsky [:bzbarsky]

Comment 4

•

15 years ago

Could we end up running into issues with this setup if a short-running traceable loop is inside the body of a long-running traceable loop? Or will the blacklist keep us from entering the short-running loop from JM but allow it to work from interp/tracing?

Nicholas Nethercote [inactive]

Comment 5

•

15 years ago

You're increasing the counter up to 300, but there's no need once you hit 5 since that's the value you're checking for on exit. And those two values shouldn't be magic numbers :)

Nicholas Nethercote [inactive]

Comment 6

•

15 years ago

I don't think the counter load needs to be marked as volatile. I measured with Cachegrind, AFAICT the SS slowdown is entirely due to the instrumentation. The worst case is bitwise-and which executes 1.08x more instructions. And I only saw a 1% instruction count improvement for earley-boyer.

Nicholas Nethercote [inactive]

Comment 7

•

15 years ago

(In reply to comment #5) > And those two values shouldn't be magic numbers :) If you counted down from 5 to 0 you wouldn't need to have the numbers in two places -- the loop exit would just test for zero.

David Mandelin [:dmandelin]

Reporter

Comment 8

•

15 years ago

(In reply to comment #5) > You're increasing the counter up to 300, but there's no need once you hit 5 > since that's the value you're checking for on exit. And those two values > shouldn't be magic numbers :) Hey, that's why it's a WIP. But thanks for your help on this. I think you are right about the volatile part. I'm not sure about the slowdown: I get pretty much the same time score when I just add the instrumentation. It only seems to slow down when I use it to blacklist. Also, if I use a threshold of 2 (blacklist only if count < 2), I seem to get the same SS score. It's confusing, though, because there are at least 3 variables in play: instrumentation, old blacklisting algorithm, and new blacklisting algorithm. Anyway, I think the WIP is pretty close to "working", so I will clean it up and post for review once it is performing correctly. I think it's possible to do better with more experimentation, but this bug blocks landing to m-c.

Boris Zbarsky [:bzbarsky]

Comment 9

•

15 years ago

fwiw, with this wip applied and the title patch applied I get about 3900runs/s on 3bits-in-byte on t-m. I get 4600runs/s on m-c.

David Mandelin [:dmandelin]

Reporter

Comment 10

•

15 years ago

Attached patch Patch — Details — Splinter Review

OK, this patch fixes the worst problems on Dromaeo. With this patch, we regress only 2% relative to beta4. It doesn't hurt shell V8 at all. It regresses shell SS by 5 ms or so. I'm sure I can improve it a bit more, but it would be nice to get us landable to m-c. Not sure what the requirement is there.

Attachment #471732 - Attachment is obsolete: true

David Mandelin [:dmandelin]

Reporter

Comment 11

•

15 years ago

Comment on attachment 471973 [details] [diff] [review] Patch OK, I think this is about good enough to make us landable, but I will keep working to improve it. Suggestions welcome. I checked browser V8: this patch gave a 70-point improvement, but that may be just noise. At any rate, it is as good as tip now, which is an almost 2x improvement over beta 4.

Attachment #471973 - Flags: review?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 12

•

15 years ago

Comment on attachment 471973 [details] [diff] [review] Patch >+ //printf("%s:%u: %d\n", f->treeFileName, f->treeLineNumber, JS_THREAD_DATA(cx)->iterationCounter); this, and >+ //printf("blacklist %s:%u %d\n", f->treeFileName, f->treeLineNumber, JS_THREAD_DATA(cx)->iterationCounter); this, looks like old debug code.

Attachment #471973 - Flags: review?(dvander) → review+

David Mandelin [:dmandelin]

Reporter

Comment 13

•

15 years ago

http://hg.mozilla.org/tracemonkey/rev/5d10c2119565 Marking this bug fixed--I'll file more bugs if I come up with new patches worth landing.

Whiteboard: fixed-in-tracemonkey

David Mandelin [:dmandelin]

Reporter

Updated

•

15 years ago

Depends on: 593497

Boris Zbarsky [:bzbarsky]

Comment 14

•

15 years ago

> With this patch, we regress only 2% relative to beta4. On Dromaeo sunspider total? Or on 3bits in particular?

David Mandelin [:dmandelin]

Reporter

Comment 15

•

15 years ago

(In reply to comment #14) > > With this patch, we regress only 2% relative to beta4. > > On Dromaeo sunspider total? Or on 3bits in particular? On Dromaeo total. Bug 593495 has more about why. fannkuch and nsieve-bits take a big hit due to the fall-into-interp problem. If we fixed that, we'd probably be improving things instead.

Boris Zbarsky [:bzbarsky]

Comment 16

•

15 years ago

Aha, thanks! With this and the title thing fixed, how do things look on the DOM dromaeo tests that had regressed?

Chris Leary [:cdleary] (not checking bugmail)

Comment 17

•

15 years ago

http://hg.mozilla.org/mozilla-central/rev/5d10c2119565

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

WIP 15 years ago David Mandelin [:dmandelin] 4.25 KB, patch		Details \| Diff \| Splinter Review
Patch 15 years ago David Mandelin [:dmandelin] 5.30 KB, patch	dvander : review+	Details \| Diff \| Splinter Review