Closed Bug 492673 Opened 15 years ago Closed 15 years ago

TM: Fix memory pressure measurement

Categories

(Core :: JavaScript Engine, defect, P2)

x86
Windows Vista
defect

Tracking

RESOLVED FIXED

People

(Reporter: Natch, Assigned: graydon)

Details

(Keywords: fixed1.9.1, regression, Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files)

Gonna try to nail down a regression range for this, but in the past I would get V8 benchmark scores in the mid-800s; now it's down to about 150.
The bug is present on 1.9.1 as well.

Regression range:
20090414: works
20090415: regressed
pushlog: http://hg.mozilla.org/tracemonkey/pushloghtml?fromchange=49643dfd3008&tochange=78a21b8efe1b

I don't have a tracemonkey tree for building, so I can't easily bisect, but I hope this is a start. Make sure you purge the browser cache before attempting to bisect, as it seems to affect the results.
Flags: blocking1.9.1?
Keywords: regression
Assignee: general → gal
This is easy to reproduce in the shell. We get a 1000+ score if Crypto is run separately, but if it runs as part of the other tests we get hammered (280).

h-243:v8 gal$ ../Darwin_OPT.OBJ/js -j run-crypto.js 
Crypto: 1054
----
Score: 1054
h-243:v8 gal$ ../Darwin_OPT.OBJ/js -j run.js 
Richards: 1758
DeltaBlue: 169
Crypto: 284
RayTrace: 244
EarleyBoyer: 390
----
Score: 381
I think we should block on this until it's known what causes it, mostly to make sure we get someone to look at this in time to identify the severity. sayrer?
(I wouldn't strictly block on this if we are just slow for some low-risk, soon-to-be-fixed-on-trunk reason, but I would like to be sure.)
Flags: blocking1.9.1? → blocking1.9.1+
Ok, this is a GC problem. Crypto hits the GC while running, causing performance to suck. This only happens when other stuff runs first, not when run solo.
(The attached patch makes the slowdown disappear by forcing a GC between benchmarks.)
I am just going to have to make a snippy comment about this benchmark program. People who manage to implement a crypto algorithm (integer data read from an array manipulated using arithmetic operations and stored into an array as integer data) in a way that causes excessive GC pressure shouldn't be allowed near computers.

I will try to find out where most of the garbage is coming from, but I suspect the OO-ish BigNumber implementation.
Flags: blocking1.9.1+ → blocking1.9.1?
still a blocker. why should it have gotten slower?
Flags: blocking1.9.1? → blocking1.9.1+
It didn't get slower. It's just as fast as before. We are GCing during the benchmark, which kills the score. Previously the sun and the stars aligned slightly differently, causing the GC to happen somewhere else.
Andreas: the same benchmark page was being used to test the speed; if there was a change in GC behavior, that should (IMHO) be investigated:

1.9.1 (2009-04-14 build): mid-500s
trunk (2009-04-14 build): mid-800s

current 1.9.1: 70-90
current trunk: mid-100s
Yes, this most definitely needs to be investigated. I am trying to hack together some annotations to better observe what happens. I think this can be fixed for 1.9.1.
(If someone wants to bisect down to the exact patch that caused this behavior first, that would be great).
Whiteboard: regression-window-wanted
hg bisect says:

The first bad revision is:
changeset:   27205:78a21b8efe1b
user:        Brendan Eich <brendan@mozilla.org>
date:        Wed Apr 15 01:57:13 2009 -0700
summary:     Bug 488015 - Crash [@ js_GetUpvar ] (also bogus JS errors, also probably Crash [@js_Interpret]) (future r=mrbkap, see bug).
Blocks: 488015
Whiteboard: regression-window-wanted
I bisected the shell slowdown. It flags the following patch:

The first bad revision is:
changeset:   28170:05ea1afb04f8
user:        Nicholas Nethercote <nnethercote@mozilla.com>
date:        Sat May 09 14:38:34 2009 -0400
summary:     Bug 488775 - TM: widen LIR instructions. r=graydon,edwsmith
Yeah, something is definitely wrong with the LIR patch:

h-243:v8 gal$ hg up -r 28169
9 files updated, 0 files merged, 0 files removed, 0 files unresolved
h-243:v8 gal$ TRACEMONKEY=stats ../Darwin_DBG.OBJ/js -j slow.js
DeltaBlue: 14
Crypto: 277
----
Score: 62
recorder: started(186), aborted(102), completed(321), different header(3), trees trashed(2), slot promoted(0), unstable loop variable(132), breaks(0), returns(2), unstableInnerCalls(30)
monitor: triggered(21250), exits(21250), type mismatch(0), global mismatch(1)
h-243:v8 gal$ hg up -r 28170
9 files updated, 0 files merged, 0 files removed, 0 files unresolved
h-243:v8 gal$ TRACEMONKEY=stats ../Darwin_DBG.OBJ/js -j slow.js
DeltaBlue: 14
Crypto: 177
----
Score: 50
recorder: started(678), aborted(91), completed(289), different header(2), trees trashed(2), slot promoted(0), unstable loop variable(128), breaks(0), returns(2), unstableInnerCalls(27)
monitor: triggered(21315), exits(21315), type mismatch(0), global mismatch(1)
I can replicate this, although the numbers are quite variable.  But post-widen-LIR, crypto performance is worse when run after DeltaBlue.

The most interesting thing I've found so far is this:  I've been working on a patch to narrow the LIR again, by compacting it more (but not as much as it was prior to the widening).  I just tried the pre/post measurements with it, and crypto's speed rating jumps from around 200 to around 800 or 900!  (That's a debug build.)  This would seem to corroborate the theory that the increased LIR size is somehow causing the problem.
The increased LIR size is definitely the trigger, but we have to figure out what exactly is happening. A 4x drop is scary.
Does the change bump it over a cache size?
What cache size?
I don't know.  What CPU is being used?  At least some Core 2's have 32 KB of L1 data cache and 32 KB of L1 instruction cache.
LIR is write-once, read-once, both in quick succession. I think the icache and dcache are irrelevant here, despite the possible increase in L2 write misses (which is probably simply due to higher write throughput).
Increasing the code cache size doesn't solve the problem. This bug isn't caused by the larger LIR patch. That merely triggers something else.

This is 28170 with 32MB code cache.

DeltaBlue: 14
Crypto: 158
----
Score: 47
recorder: started(677), aborted(90), completed(286), different header(3), trees trashed(2), slot promoted(0), unstable loop variable(128), breaks(0), returns(2), unstableInnerCalls(25)
monitor: triggered(21133), exits(21133), type mismatch(0), global mismatch(1)

This is 28170 with 64MB code cache.

DeltaBlue: 14
Crypto: 123
----
Score: 42
recorder: started(672), aborted(91), completed(270), different header(2), trees trashed(2), slot promoted(0), unstable loop variable(128), breaks(0), returns(2), unstableInnerCalls(24)
monitor: triggered(20101), exits(20101), type mismatch(0), global mismatch(1)

This is 28170 with 128MB code cache.

h-243:v8 gal$ TRACEMONKEY=stats ../Darwin_DBG.OBJ/js -j slow.js
DeltaBlue: 14
Crypto: 122
----
Score: 41
recorder: started(673), aborted(91), completed(270), different header(2), trees trashed(2), slot promoted(0), unstable loop variable(128), breaks(0), returns(2), unstableInnerCalls(24)
monitor: triggered(20101), exits(20101), type mismatch(0), global mismatch(1)
This is the slow revision with 16MB code cache (default):

h-243:v8 gal$ TRACEMONKEY=verbose ../Darwin_DBG.OBJ/js -j slow.js | grep "lush"
No global slotlist for global shape 1151, flushing cache.
Flushing cache.

We only flush once early on and never again.
The slow version allocates another 16MB slab for the code cache (one large mmap, 16MB). It probably costs something, but it doesn't explain the stats difference.
The trace output just visually looks bad:

recording starting from slow.js:729@26
recording completed at slow.js:729@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:682@26: No compatible inner tree.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
recording starting from slow.js:790@36
Abort recording of tree slow.js:790@36 at slow.js:684@43: Inner tree is trying to grow, abort outer recording.
recording starting from slow.js:682@26
recording completed at slow.js:682@26 via closeLoop
Andreas: this slowdown is visible on 1.9.1, where the LIR patch has *not* landed yet, so it cannot be the cause AFAICT. Was my bisection in comment 14 wrong? I don't see how, but I'll retest tonight...
I can't reproduce the slowdown with your revision.

We have pretty much proven at this point that the patch I bisected triggers a bug but doesn't cause it. I am working on the assumption that both your rev and mine merely trigger, rather than cause, the same issue.
On my machine I get lots of variation in the timings.  I wouldn't trust a single run of each version to give reliable numbers.
(In reply to comment #28)

Err, I fail :(. I was sure the regression range was right; turns out the original regression range I posted was totally off.

/me tries to corroborate the regression-range again...
Some updated results:

- Running run.js (i.e. Richards, then DeltaBlue, then Crypto) I see the slowdown
  caused by widen-LIR: Crypto goes from 600-700 to about 250

- Running just DeltaBlue then Crypto, I don't see a slowdown after widen-LIR:
  in both cases it's 250-ish

- Inserting the gc() between benchmarks causes the pre/post-widen-LIR differences
  to go away in both the above cases.

Conclusion?  GC weirdness that just happens to manifest most strongly with run.js with the widen-LIR patch.  But if you bisected on a different benchmark (e.g. just DeltaBlue then Crypto) I suspect you'd get a different patch to blame.
So the problem is that I can reproduce that regression range 100% with tracemonkey nightlies; however, when I build that revision myself I can't reproduce it. The reason bisect gave me that revision is that it was the last one left: I hadn't tested it until now, and it does not show the regression for me. However, local builds of trunk and 3.5 reproduce the slowdown 100% for me, as do nightlies from tracemonkey. I'm gonna build tracemonkey tip and see how it goes for me...
No longer blocks: 488015
To clarify comment 31, it seems that the order in which benchmarks are run makes a huge difference, for whatever reason.  That makes benchmarking very dubious, because the numbers we get within a run of v8, for example, are quite different from those when running a subset of v8, or v8 in a different order... it's quite worrying, really.
The shell harness doesn't GC and we might be seeing some oracle poisoning (which shouldn't happen since the oracle is supposed to be precise these days).
(In reply to comment #34)
> The shell harness doesn't GC and we might be seeing some oracle poisoning
> (which shouldn't happen since the oracle is supposed to be precise these days).

The oracle is *not* precise. You keep saying this; I don't know where you got that impression. I also don't have a completed patch to make it precise long term, only one small debugging patch that makes a precise oracle using an STL map -- entirely inappropriate for shipping. We'll need something more carefully done for the long term.

I can reproduce the GC-sensitivity of crypto in general now, definitely, on practically any rev. Something in deltablue kills it. There are a few loops in deltablue that wind up somewhat odd. For example, the one anchored at line 741 winds up having 33 side exits. It's a 3-line loop with 1 branch (I guess 2 given the short-circuit eval in it) that only calls 4 functions, each of which is a one-liner that delegates to a relatively plain JS object method (length, push, etc.). 33 seems a bit high. Otherwise I don't really get why deltablue is thrashing so badly. Will keep digging.
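(A minimal illustration of the "precise oracle via STL map" idea mentioned above, purely as an assumed sketch; the real TraceMonkey oracle and the debugging patch from bug 487337 are structured differently. The shipping oracle hashes (location, slot) pairs into a fixed-size table, so unrelated entries can collide and "poison" each other, whereas a precise oracle records the exact pairs.)

// Assumed illustration of a map-backed "precise oracle"; not the actual
// TraceMonkey oracle or the debugging patch from bug 487337.
#include <map>
#include <utility>

class PreciseOracle {
    // (bytecode location, slot index) -> "do not speculate this slot as int".
    std::map<std::pair<const void*, unsigned>, bool> undemotable_;

public:
    void markUndemotable(const void* pc, unsigned slot) {
        undemotable_[std::make_pair(pc, slot)] = true;
    }

    // Exact lookups: no hashing, so one benchmark's entries can never
    // "poison" speculation decisions for an unrelated loop.
    bool isUndemotable(const void* pc, unsigned slot) const {
        return undemotable_.count(std::make_pair(pc, slot)) != 0;
    }
};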
Not the oracle (checked with precise-oracle debugging patch from 487337, also counted oracle entries and there are only a few).

On to blacklisting!
Not related to blacklisting. Turning off compilation-backoff blacklisting makes the total blacklisting count drop to zero, but perf difference is still there.
Scratch that; it *may* be slightly related to blacklisting. Or rather, to the possibility of crypto needing to do some blacklisting, but not being able to. Or doing too much when it should only do a bit. I've spent several hours fiddling with the blacklist knob and it appears to be sensitive to it but not in an obvious fashion.

Also curious fact discovered: it's quite sensitive to the setting of HOTLOOP, as is richards in the inverse direction (richards gets faster as HOTLOOP is decreased, crypto gets faster as it's increased).

This is silly.
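(As context for the HOTLOOP observation above, here is a minimal sketch of what that knob controls; the names and the constant's value are assumptions for illustration, not the actual jstracer.cpp code. A loop header must be hit this many times in the interpreter before the tracer starts recording it, which is why lowering the threshold helps loops that run briefly, like richards, while raising it helps loops that need more warm-up before they stabilize, like crypto.)

// Illustrative sketch only; HOTLOOP's value and these names are assumptions,
// not the actual TraceMonkey sources.
#include <cstdint>
#include <unordered_map>

static const uint32_t HOTLOOP = 2;  // assumed recording threshold

class LoopHitCounters {
    // Keyed by the loop header's bytecode address.
    std::unordered_map<const void*, uint32_t> hits_;

public:
    // Called each time the interpreter crosses a loop header; returns true
    // once the loop is "hot" enough that the tracer should start recording.
    bool hitAndCheck(const void* loopHeaderPc) {
        return ++hits_[loopHeaderPc] >= HOTLOOP;
    }
};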
Priority: -- → P2
Graydon is working on this one right now.
Assignee: gal → graydon
Attached patch: Bozotic bug fix (Splinter Review)
I cannot believe this survived in this state for as long as it did. We *don't measure the right thing* for memory pressure! DeltaBlue happens to get close to the ceiling; with LIR-widen it actually uses up enough space that crypto tips over the line. Thereafter *every* recording attempt in crypto fails and attempts to flush the cache. That's why it "slows down". Guh.

Another day, another horrible bug I introduced. Sorry.
Attachment #377550 - Flags: review?(gal)
Comment on attachment 377550 [details] [diff] [review]
Bozotic bug fix

Based on grep and reading, and my well-known ability to do a dead-on Andreas Impression ;-), I can r+ this.

nanojit/Fragmento.h:			uint32_t cacheUsed() const { return (_stats.pages-_freePages.size())<<NJ_LOG2_PAGE_SIZE; }

/be
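(For readers following along, a minimal, self-contained sketch of the measurement the quoted cacheUsed() line performs and the overflow check it feeds into. The names, the page size, and the overfull() helper are assumptions for illustration, not the actual nanojit/TraceMonkey sources; the point is that pressure has to be computed from pages actually in use, otherwise a heavy run like DeltaBlue leaves the measurement pinned above the ceiling and every later recording attempt flushes instead of compiling.)

// Illustrative sketch only; names and constants are assumptions, not the
// real nanojit sources. It mirrors the quoted cacheUsed(): bytes in use =
// (allocated pages - free pages) << log2(page size).
#include <cstddef>
#include <iostream>

static const std::size_t kLog2PageSize = 12;  // assumed 4 KB code-cache pages

struct CodeCacheStats {
    std::size_t pagesAllocated;  // pages ever handed out by the allocator
    std::size_t pagesFree;       // pages currently sitting on the free list
};

// Correct pressure measurement: only pages currently in use count.
inline std::size_t cacheUsed(const CodeCacheStats& s) {
    return (s.pagesAllocated - s.pagesFree) << kLog2PageSize;
}

// The flush decision described in the bug: if this check were fed the wrong
// quantity (e.g. pages ever allocated, ignoring the free list), it could stay
// true permanently, and every later recording attempt would abort and try to
// flush the cache instead of compiling a trace.
inline bool overfull(const CodeCacheStats& s, std::size_t maxBytes) {
    return cacheUsed(s) > maxBytes;
}

int main() {
    CodeCacheStats s = { 5000, 2000 };  // 3000 pages (~12 MB) actually in use
    std::cout << cacheUsed(s) << " bytes in use; overfull at 16 MB? "
              << (overfull(s, 16u << 20) ? "yes" : "no") << "\n";
    return 0;
}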
Attachment #377550 - Flags: review+
Wanted to facilitate getting this landed on tm tip...

/be
Summary: TM: Recent regression in v8's crypto module → TM: Fix memory pressure measurement
I'm glad my widen-LIR patch has proven useful ;)  Nice hunting, Graydon.
http://hg.mozilla.org/tracemonkey/rev/a75d552b0c64
Whiteboard: fixed-in-tracemonkey
Woohoo:

Score: 367
Richards: 1506
DeltaBlue: 116
Crypto: 850
RayTrace: 231
EarleyBoyer: 302
RegExp: 236

In the browser version, every time now! Thanks for the fix!!
http://hg.mozilla.org/mozilla-central/rev/a75d552b0c64
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Attachment #377550 - Flags: review+
Attachment #377550 - Flags: review?(gal)
Blocks: 492694