
Comment 1

•

16 years ago

A couple of preliminary comments:

js/src/v8/run-richards.js isn't so useful for repeatable comparative benchmarking, because it appears to run as many iterations as it can in one second. Am using SunSpider/tests/v8-richards.js instead, as it repeatably runs for 350 iterations.

Initial results from a 32-bit optimized jsshell show that it (jsshell + generated code) runs for 98.2 million basic blocks. Of these, 83.5% are in the functions js_UnboxDouble(int), js_UnboxInt32(int), js_BoxInt32 and js_BoxDouble. The number of basic blocks in JIT-generated code is only about 1% of the total, by comparison.

Of course this doesn't say anything specific about instruction counts or where all the cycles went. But it gives some ballpark feel for what the main costs are. Continuing to investigate.

Comment 2

•

16 years ago

status update: jsshell running v8-richards runs for about 3.5 billion x86 insns, depending on build options. Of these:

246 050 915 insns in 22 368 256 calls to js_UnboxDouble
 68 058 108 insns in  7 562 012 calls to js_UnboxInt32
 70 388 157 insns in  3 351 817 calls to js_BoxInt32
 94 199 942 insns in  2 140 908 calls to js_BoxDouble

Further investigation shows almost all of the insns in these 4 calls are on the fast paths (lowest bit is set, so just >> 1 and return, etc).

I implemented a partial inline of js_UnboxInt32, so that the fast path is done directly in LIR, calling out to the fn only if the lowest bit is not set (using forward branches and labels in LIR). That works, but is slower. Profiling shows a substantial increase in icache misses and branch mispredictions. Am investigating why.

This benchmark appears to have a surprisingly high icache miss rate anyway, about 1.2%. Considering that there are only 34 active traces, this is a bit strange. (can 34 traces fill up a 64k I1 cache?)
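The fast path mentioned above ("lowest bit is set, so just >> 1 and return") can be sketched roughly as follows. This is only an illustrative model: the tag layout (low bit set means integer, payload in the remaining 31 bits) is inferred from that one remark, and the names `boxInt`, `unboxInt` and the `slowPath` callback are hypothetical, not the engine's js_BoxInt32/js_UnboxInt32.

```javascript
// Hypothetical model of a low-bit integer tag: word = (value << 1) | 1.
// The layout is inferred from the comment, not copied from SpiderMonkey.
const INT_TAG = 1;

function boxInt(value) {
  // Only 31-bit signed payloads fit next to the tag bit.
  if (!Number.isInteger(value) || value < -(2 ** 30) || value >= 2 ** 30) {
    throw new RangeError("payload does not fit in 31 bits");
  }
  return (value << 1) | INT_TAG;
}

function unboxInt(word, slowPath) {
  // Fast path: tag bit set, so an arithmetic shift recovers the payload.
  if ((word & INT_TAG) === INT_TAG) {
    return word >> 1;
  }
  // Slow path: everything else is handed to an out-of-line helper.
  return slowPath(word);
}
```

A partial inline in the sense of the comment would emit the tag test and shift directly into the trace and call out only for the `slowPath` case.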

Comment 3

•

16 years ago

Attached patch: patch which does partial inlining of js_UnboxInt32

This is very obviously WIP, but shows it's relatively easy to do. It gives a small (1%) slowdown on v8-richards, so is not recommended.

New strategy (now I realise I'm looking for roughly a factor of 2 performance loss) is to build chrome and try to reduce the size of the test case whilst preserving the performance difference.

Comment 4

•

16 years ago

Some more low level figures. These compare 32-bit TM compiled "-O2" with 32-bit V8 compiled "-O3 -fomit-frame-pointer", so TM is at a small relative disadvantage, but not much. What's evident is:

* V8 requires just under half the instructions that TM does (about 1.67 billion vs 3.58 billion I refs)

* V8 manages to do the whole thing in fixed point, no FP, which is not the case with TM. TM does almost 29 million FP loads/stores, V8 does about 34 thousand.

* Both V8 and TM comfortably keep the data in D1 (essentially zero D1 miss rate), but only V8 manages to keep the insns in I1 (268k I1 misses vs almost 42 million for TM). I'm sure this has nothing much to do with Nanojit -- it must be a higher level phenomenon to do with jstracer.

--------------
#### v8 ####
--------------
user times: 0.308 0.332 0.360

==24316== I   refs:      1,669,315,545
==24316== I1  misses:          268,223
==24316== L2i misses:            8,569
==24316== I1  miss rate:          0.01%
==24316== L2i miss rate:          0.00%
==24316==
==24316== D   refs:        713,605,023  (500,682,881 rd + 212,922,142 wr)
==24316== D1  misses:          148,545  (     88,461 rd +      60,084 wr)
==24316== L2d misses:           40,063  (      6,893 rd +      33,170 wr)
==24316== D1  miss rate:           0.0% (        0.0%   +         0.0%  )
==24316== L2d miss rate:           0.0% (        0.0%   +         0.0%  )
==24316==
==24316== L2 refs:             416,768  (    356,684 rd +      60,084 wr)
==24316== L2 misses:            48,632  (     15,462 rd +      33,170 wr)
==24316== L2 miss rate:            0.0% (        0.0%   +         0.0%  )
==24316==
==24316== Branches:        346,592,089  (343,912,720 cond + 2,679,369 ind)
==24316== Mispredicts:       9,462,357  (  8,114,948 cond + 1,347,409 ind)
==24316== Mispred rate:            2.7% (        2.3%     +      50.2%   )

==24435== IR-level counts by type:
==24435==    Type          Loads         Stores         AluOps
==24435==    -------------------------------------------
==24435==    I1                0              0    688,484,494
==24435==    I8       10,831,622        612,699    135,873,871
==24435==    I16          16,250         20,675        140,534
==24435==    I32     489,816,561    212,667,488  1,496,876,771
==24435==    I64           1,340          2,040          6,550
==24435==    I128              0              0              0
==24435==    F32             870              6              6
==24435==    F64          15,241         18,652         44,521
==24435==    V128              0              0              0

--------------
JSSHELL ORIGINAL
--------------
user times: 0.864 0.872 0.888

==21158== I   refs:      3,584,825,170
==21158== I1  misses:       41,953,492
==21158== L2i misses:            5,672
==21158== I1  miss rate:          1.17%
==21158== L2i miss rate:          0.00%
==21158==
==21158== D   refs:      1,670,310,720  (1,122,641,178 rd + 547,669,542 wr)
==21158== D1  misses:          158,251  (      132,178 rd +      26,073 wr)
==21158== L2d misses:           36,050  (       17,362 rd +      18,688 wr)
==21158== D1  miss rate:           0.0% (          0.0%   +         0.0%  )
==21158== L2d miss rate:           0.0% (          0.0%   +         0.0%  )
==21158==
==21158== L2 refs:          42,111,743  (   42,085,670 rd +      26,073 wr)
==21158== L2 misses:            41,722  (       23,034 rd +      18,688 wr)
==21158== L2 miss rate:            0.0% (          0.0%   +         0.0%  )
==21158==
==21158== Branches:        542,647,807  (  542,522,855 cond +   124,952 ind)
==21158== Mispredicts:      11,744,745  (   11,713,792 cond +    30,953 ind)
==21158== Mispred rate:            2.1% (          2.1%     +      24.7%   )

==25013== IR-level counts by type:
==25013==    Type          Loads         Stores         AluOps
==25013==    -------------------------------------------
==25013==    I1                0              0  1,234,299,003
==25013==    I8        1,630,040        591,149    131,858,369
==25013==    I16       6,766,107      4,364,883      4,385,942
==25013==    I32   1,081,810,453    491,730,952  2,599,907,232
==25013==    I64      30,245,994     25,237,135          4,287
==25013==    I128              0              0              0
==25013==    F32              14              0              0
==25013==    F64       2,176,798     26,686,976     94,042,506
==25013==    V128              0              0      2,140,908

Comment 5

•

16 years ago

(In reply to comment #1)
> Initial results from a 32-bit optimized jsshell show that it
> (jsshell + generated code) run for 98.2 million basic blocks.
> Of these, 83.5% are in the functions js_UnboxDouble(int),
> js_UnboxInt32(int), js_BoxInt32 and js_BoxDouble. The number
> of basic blocks in JIT-generated code is only about 1% of the
> total, by comparison.

Ignore these numbers, they are bogus. AFAIK all other numbers in this report are legit, though.

Comment 6

•

Comment 22

•

16 years ago

Sigh, this was a forgotten action item from early tracemonkey daze: get CSE going for shape guards. I remember talking about it with Andreas but I failed to file a bug. Sorry about that. We should be able to use ldc instructions for things like loading obj->map and then obj->map (cast to JSScope*) -> shape, for a given obj. Jason, thoughts? Comment 21 suggests a better jsval tagging scheme (bug 360324), to separate null from object tag (we already distinguish the trace types). I'll update bug 360324. /be

Comment 23

•

16 years ago

I would suggest solving this at a slightly higher level, since shape guards are not merely redundant when they are common-able. Consider the following sequence of operations:

x.a = 1
x.b = 2
x.c = 3

The shape evolves, and each time we emit a guard for a different shape. That can't be cse-d. However, the guards are still redundant because traces are a linear sequence of instructions. You can't get to x.b = 2 without going past x.a = 1, which already contains a guard. The shape evolves in a predictable (constant) fashion.

My suggested solution is a set of objects that have been shape-guarded along this trace (or a bitset in parallel to the tracker). It's not as high level and pretty as solving this in lir, but it might be easier and more effective.

Along branches it would be nice if this information could be communicated to attaching traces, but I guess that is not super important. It will lead to a couple more redundant guards at every branch. At least we don't get tons of redundant guards on the same trace though.

Let's just hack it up and see how much things improve with the simple per-trace fix. I bet this also helps sunspider here and there.
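A minimal sketch of the "set of objects that have been shape-guarded along this trace" idea might look like the following. It is a toy recorder, not TraceMonkey code: the class name, the string-based pseudo-LIR and the `recordSetProp` method are all hypothetical, and it assumes (as the comment does) that shape evolution along a linear trace is predictable once the first guard has passed.

```javascript
// Hypothetical recorder model; names and pseudo-LIR strings are illustrative only.
class TraceRecorderSketch {
  constructor() {
    this.lir = [];                  // emitted pseudo-instructions, in trace order
    this.guardedObjects = new Set(); // objects already shape-guarded on this trace
  }

  // Record `obj.prop = value`. A shape guard is emitted only the first time
  // this object is touched on the trace; later accesses rely on that guard.
  recordSetProp(objName, prop, expectedShape) {
    if (!this.guardedObjects.has(objName)) {
      this.lir.push(`guard shape(${objName}) == ${expectedShape}`);
      this.guardedObjects.add(objName);
    }
    this.lir.push(`set ${objName}.${prop}`);
  }
}

// The x.a / x.b / x.c sequence from the comment, with made-up shape numbers.
const recorder = new TraceRecorderSketch();
recorder.recordSetProp("x", "a", 100);
recorder.recordSetProp("x", "b", 101);
recorder.recordSetProp("x", "c", 102);

const guardCount = recorder.lir.filter((ins) => ins.startsWith("guard")).length;
console.log(guardCount); // one guard instead of three
```

The set is per trace, which matches the "simple per-trace fix": a branch trace attached later would start with an empty set and re-guard.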

Assignee

Comment 24

•

16 years ago

Comment 30

•

16 years ago

Attached patch remove obj-is-native tests/guards, use ldc*, and minimize shape guard to once per obj_ins per trace (obsolete) — Details — Splinter Review

Thanks, Andreas -- I should have known better, was in a hurry. Indeed still have not had time to measure this patch's effect on SS or other benchmarketing scores. This rev does fix the raytrace regression, and still wins big on richards (1744 new, 1445 old best of 3). Help wanted on perf-testing and analysis, as before. /be

Attachment #393245 - Attachment is obsolete: true

Comment 31

•

16 years ago

Did someone say perf analysis? Adding Gregor :)

Comment 32

•

•

16 years ago

Attached patch patch, v3 (obsolete) — Details — Splinter Review

Maybe I should spin this out into a separate bug. Happy to do so. I could also split it up, but I'm trying to go forward in a straight line (which carries risk, for sure). /be

Attachment #393317 - Attachment is obsolete: true

Attachment #393693 - Flags: review?(jorendorff)

Assignee

Comment 42

•

16 years ago

(In reply to comment #28)
> This one is safe not because the classword is actually constant over the
> lifetime of all objects--for arrays, it isn't--but because we never mutate
> functions into nonfunctions or vice versa.

I didn't comment on this (the context shows the array class exclusion), but I will do it now.

/be

Comment 43

•

16 years ago

(In reply to comment #41)
> Created an attachment (id=393693) [details]
> patch, v3

Cool. Runs trace-tests ok. Gives a fairly reliable 8%-9% reduction in run time for a less than 6% reduction in insn count, possibly due to 23% reduction in I1 misses. Other figures (data refs, etc) are unsurprising.

          cpu(s)   insns(M)   I1miss(M)
Before:   0.81     3581       30.52
After:    0.74     3381       23.61

Cpu times averaged over 10 runs. Further details to follow, incl overall sunspider measurements.
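As a cross-check on the percentages quoted above, the reductions can be recomputed from the Before/After figures. The snippet below is just that arithmetic; the object and function names are made up for the illustration.

```javascript
// Figures copied from the Before/After rows in the comment above.
const before = { cpuSeconds: 0.81, insnsM: 3581, i1MissM: 30.52 };
const after = { cpuSeconds: 0.74, insnsM: 3381, i1MissM: 23.61 };

// Relative reduction, in percent, going from oldValue to newValue.
function percentReduction(oldValue, newValue) {
  return ((oldValue - newValue) / oldValue) * 100;
}

const reductions = {
  cpuSeconds: percentReduction(before.cpuSeconds, after.cpuSeconds),
  insnsM: percentReduction(before.insnsM, after.insnsM),
  i1MissM: percentReduction(before.i1MissM, after.i1MissM),
};

for (const [metric, pct] of Object.entries(reductions)) {
  console.log(`${metric}: ${pct.toFixed(1)}% lower`);
}
```

This gives roughly 8.6% for run time, 5.6% for instruction count and 22.6% for I1 misses, in line with the "8%-9%", "less than 6%" and "23%" stated in the comment.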

Assignee

Comment 44

•

16 years ago

Attached patch patch, v3a (obsolete) — Details — Splinter Review

Attachment #393693 - Attachment is obsolete: true

•

•

16 years ago

Attachment #394137 - Flags: review?(jorendorff) → review-

Assignee

Comment 65

•

16 years ago

(In reply to comment #55)
> (In reply to comment #47)
> > Ahem.
>
> Brendan, what's the difference between v5 and v4? Is it just
> spaceleak avoidance fixes, or do I need to rerun all the perf
> measurements?

Belated reply: yes, just the leak fix.

(In reply to comment #56)
> ...
> This urgently calls for block shrink-wrapping vars (var to let conversion), or
> add liveness analysis that can tell that the undefined value of the var is
> never read. But that basically is the proof necessary for shrink wrapping, so I
> think we should go with that.

Bug 456588, on my list and I should get to it soon. But mrbkap or jorendorff could steal and get it done with my advice.

(In reply to comment #62)
> The holes I think I see are:
>
> 1. It seems like this
>
> >@@ -7969,6 +8018,11 @@ TraceRecorder::guardPropertyCacheHit(LIn
> >                                      JSPropCacheEntry* entry,
> >                                      jsuword& pcval)
> > {
> >+    if (alreadyGuardedShape(obj_ins, aobj)) {
> >+        pcval = entry->vword;
> >+        return JSRS_CONTINUE;
> >+    }
> >+
>
> should only skip the kshape guard, if there is one, not a
> kobj guard or any other guards this method emits.

Oops. Will fix. Jason, many thanks for reading this -- I've been preoccupied over the last week with various stuff that has meant I needed your brain (but in a good non-zombie way ;-).

> 2. (In reply to comment #40)
> > If this happens during recording, we should purge the guardedShapeTable entries
> > for the reshaped object. If it happens on-trace then we should bail off trace.
>
> I see the purge in JSScope::generateOwnShape, but I don't see where we bail if
> an unpredictable shape change happens on-trace.

I changed JSScope::generateOwnShape to call js_LeaveTrace, but then convinced myself that the recording-time defense of purging guardedShapeTable meant we did not need the runtime pessimism.

The assumption is that shape evolution other than by generateOwnShape() is deterministic from recording start to end, and then replays exactly once we've guarded (due to LIR's linear SSA nature) on trace -- and of course re-guarded after a purge.

Is this sound? It ought to be, in my opinion. Anything adding non-determinism here is a bug to fix.

(In reply to comment #63)
> 3. It looks like JSScope::extend and JSScope::replacingShapeChange change an
> object's shape without going through JSScope::generateOwnShape.

But such changes are predictable.

> (JSScope::remove could do it too, but doesn't bother.)

Yes, that's long-standing code. I'm not sure we can rewind to the previous shape when lastProp is deleted and there are no middle deletes. Have to think about delete harder.

Will work on a new patch based on all these comments now.

/be

Assignee

Comment 66

•

16 years ago

To be more precise, if the recorder witnesses a generateOwnShape, it purges guardedShapeTable but does not abort recording. New shapes for the given scope evolve unpredictably from this point on, and the first dependency on such a shape will be guarded again.

When the trace executes, the generateOwnShape need not leave trace, however. Perhaps there is no subsequent guard on a mispredicted shape. So it's premature to leave trace always. The re-guarded shape should mismatch.

This all suggests that generateOwnShape is too blunt an instrument, and we should build up more deterministic and trace-stable shape evolutionary paths. But that is for bug 497789.

/be
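The purge-then-re-guard behaviour described here can be modelled in a few lines. This is a sketch under assumptions, not the patch: `createGuardTable`, `guard` and `purge` are hypothetical names standing in for the recorder's guardedShapeTable handling, and objects are identified by plain strings.

```javascript
// Hypothetical model of a recording-time guarded-shape table.
function createGuardTable() {
  const guarded = new Set(); // objects whose shape is already guarded on this trace
  const emitted = [];        // pseudo-LIR guards that were actually emitted
  return {
    emitted,
    // Emit a shape guard unless this object is already covered on the trace.
    guard(objId) {
      if (guarded.has(objId)) return false;
      guarded.add(objId);
      emitted.push(`guard_shape ${objId}`);
      return true;
    },
    // Called when the recorder witnesses an unpredictable reshape of objId.
    purge(objId) {
      guarded.delete(objId);
    },
  };
}

const table = createGuardTable();
table.guard("d"); // first dependency on d's shape: guard emitted
table.guard("d"); // redundant on a linear trace: elided
table.purge("d"); // recorder saw an own-shape regeneration for d
table.guard("d"); // next dependency is guarded again
console.log(table.emitted.length);
```

Recording continues across the purge; only the elision is withdrawn, so the next shape-dependent operation on that object pays for a fresh guard.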

Jason Orendorff [:jorendorff]

Assignee

Comment 67

•

16 years ago

Attached patch patch, v6 (obsolete) — Details — Splinter Review

Passes the spanky new trace-test python3.1-driven tests. Hope it's ok to take this bug -- further work in followup bugs so we can have one patch landing per resolved bug still seems best. Julian, please keep up the great analysis, here or in new bugs. /be

Assignee: jseward → brendan

Attachment #394137 - Attachment is obsolete: true

Attachment #394567 - Flags: review?(jorendorff)

Comment 68

•

16 years ago

Comment on attachment 394567 [details] [diff] [review]
patch, v6

function B() {}
B.prototype.x = 1;
var d = new B;
var names = ['z', 'z', 'z', 'z', 'z', 'z', 'z', 'x'];
for (var i = 0; i < names.length; i++) {
    x = d.x;  // guard on shapeOf(d)
    d[names[i]] = 2;  // unpredicted shape change
    y = d.x;  // guard here is elided
}
assertEq(y, 2);  // Assertion failed: got 1, expected 2
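The testcase above relies on the shell's assertEq builtin. A standalone variant that runs in any JavaScript engine is sketched below; `assertEqual` and `runShadowingCase` are stand-in names introduced only for this illustration. In a correct engine the final write creates an own property x on d that shadows B.prototype.x, so the second read must see 2.

```javascript
// Local stand-in for the shell's assertEq builtin (an assumption for portability).
function assertEqual(actual, expected) {
  if (actual !== expected) {
    throw new Error(`Assertion failed: got ${actual}, expected ${expected}`);
  }
}

function runShadowingCase() {
  function B() {}
  B.prototype.x = 1;
  const d = new B();
  const names = ["z", "z", "z", "z", "z", "z", "z", "x"];
  let before;
  let after;
  for (let i = 0; i < names.length; i++) {
    before = d.x;     // found on the prototype until an own 'x' exists
    d[names[i]] = 2;  // last iteration adds own 'x', changing d's shape
    after = d.x;      // must observe the own property, not the prototype's
  }
  return { before, after };
}

const result = runShadowingCase();
assertEqual(result.before, 1);
assertEqual(result.after, 2);
```

The loop runs the same statements with 'z' seven times before switching to 'x', presumably so that a trace is recorded with the harmless key and then replayed with the shadowing one.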

Attachment #394567 - Flags: review?(jorendorff) → review-

Jason Orendorff [:jorendorff]

•

16 years ago

Attached patch patch, v7 (obsolete) — Details — Splinter Review

Attachment #394567 - Attachment is obsolete: true

Attachment #394617 - Flags: review?(jorendorff)

Assignee

Comment 73

•

16 years ago

Bugzilla interdiff lies. Here's the non-noise diff-of-patches output:

545c545,555
< @@ -10880,13 +10934,10 @@ TraceRecorder::prop(JSObject* obj, LIns*
---
> @@ -9894,6 +9948,9 @@ TraceRecorder::enterDeepBailCall()
>      // Tell nanojit not to discard or defer stack writes before this call.
>      LIns* guardRec = createGuardRecord(exit);
>      lir->insGuard(LIR_xbarrier, guardRec, guardRec);
> +
> +    // Forget about guarded shapes, since deep bailers can reshape the world.
> +    forgetGuardedShapes();
>  }
>
>  JS_REQUIRES_STACK void

/be

Assignee

•

16 years ago

Attached patch patch, v8 (obsolete) — Details — Splinter Review

As requested (cx->bailExit, not cx->deepBail). Passes good old js/tests and the new trace-test regime. Tryserver'ing now, expect a clean bill of health. Great to see daylight at the end of this tunnel (and it's not an oncoming train! ;-). /be

Attachment #394617 - Attachment is obsolete: true

Attachment #394617 - Flags: review?(jorendorff)

Nicholas Nethercote [inactive]

Assignee

Comment 78

•

16 years ago

(In reply to comment #61)
> (In reply to comment #60)
>
> Yes. I thought it odd that it produces different traces for
> deltablue, not merely shorter versions of the same ones.

Julian, if you have time could you please test the latest patch and confirm it doesn't do something unexpected on deltablue? Thanks,

/be

Comment 79

•

16 years ago

(In reply to comment #77) > Passes good old js/tests and the new trace-test regime. Can someone explain what the new trace-test regime is? I seemed to have missed the memo...

Comment 80

•

16 years ago

You have to install python from source and then run the python script in trace-test.

Robert Sayre

Reporter

Comment 81

•

16 years ago

(In reply to comment #80) > You have to install python from source and then run the python script in > trace-test. there should be one that works with python 2.3+ at this point, covering any recent OSX or linux.

Jason Orendorff [:jorendorff]

Comment 82

•

16 years ago

I tested patch v7 (not the latest) overnight and it still doesn't seem conclusively a win to me. I'll try patch v8 today.

One thing I've learned the hard way is that Linux is a really sucky platform to do measurement runs on, since the per-run variation is so high.

Compared to the patch v4 results, the gain for richards has fallen from 8.5% to 6.9%, and the 4% gain for crypto is now less than 1%. This might be measurement noise or it might reflect increasing conservatism in v7 resulting from efforts to make it completely correct.

I'll try with patch v8 on a quiet MacOS box. Am tired of dealing with Linux's measurement noise.

Are we sure that v8-crypto is entirely deterministic? (eg, it doesn't start off doing any random number generation nonsense that varies from run to run?) I ask because it seems to have a much higher measurement noise level than the rest of them (see below).

(50 run averages):

** TOTAL **:        -                 13182.4ms +/- 0.3%   13177.2ms +/- 0.3%

  v8:               -                 13182.4ms +/- 0.3%   13177.2ms +/- 0.3%
    crypto:         -                   948.8ms +/- 1.9%     941.8ms +/- 2.2%
    deltablue:      ??                 8140.9ms +/- 0.4%    8153.4ms +/- 0.4%
    earley-boyer:   *1.013x as slow*   1976.7ms +/- 0.4%    2002.4ms +/- 0.3%
    raytrace:       *1.012x as slow*   1312.3ms +/- 0.3%    1327.8ms +/- 0.4%
    richards:       1.069x as fast      803.7ms +/- 0.1%     751.7ms +/- 0.1%

Comment 83

•

16 years ago

(In reply to comment #79)
> (In reply to comment #77)
> > Passes good old js/tests and the new trace-test regime.
>
> Can someone explain what the new trace-test regime is? I seemed to have missed
> the memo...

$ make -n check
/usr/bin/python ../trace-test/trace-test.py \
    -x slow ./dist/bin/js

To add a new test, put some code in a file, js/src/trace-test/tests/*/*.js and hg add it. The new regime will automatically pick it up.

Assignee

Comment 84

•

16 years ago

$ grep -w random v8/crypto.js
while(rng_pptr < rng_psize) { // extract some randomness from Math.random()
t = Math.floor(65536 * Math.random());
// PKCS#1 (type 2, random) pad input string s to n bytes, and return a bigint
while(n > 2) { // random non-zero pad
// Undo PKCS#1 (type 2, random) padding and, if valid, return the plaintext
// Generate a new random private key B bits long, using public expt E

/be

Comment 85

•

16 years ago

(In reply to comment #77)
> Created an attachment (id=394756) [details]
> patch, v8

Numbers for v8-tests of patch v8, Mac Mini, Core 2 Duo, 2.13 GHz, 100 iterations. Looks like a clear win. Interestingly the patch helps more than just richards. Default-test-set numbers to follow.

  v8:               1.031x as fast    16845.0ms +/- 0.1%   16345.7ms +/- 0.1%
    crypto:         -                  1074.4ms +/- 0.8%    1066.8ms +/- 0.9%
    deltablue:      1.046x as fast    10225.3ms +/- 0.0%    9774.1ms +/- 0.1%
    earley-boyer:   -                  2742.4ms +/- 0.3%    2740.6ms +/- 0.3%
    raytrace:       *1.006x as slow*   1785.6ms +/- 0.2%    1795.7ms +/- 0.2%
    richards:       1.050x as fast     1017.3ms +/- 0.0%     968.4ms +/- 0.1%

Comment 86

•

16 years ago

(In reply to comment #77)
> Created an attachment (id=394756) [details]
> patch, v8

Numbers for default test set for patch v8, same setup as #85, 500 iterations. Seems like much of a muchness.

** TOTAL **:           1.001x as fast    1214.3ms +/- 0.0%   1213.1ms +/- 0.0%

  3d:                  1.003x as fast     188.6ms +/- 0.0%    188.1ms +/- 0.0%
    cube:              -                   51.0ms +/- 0.1%     51.0ms +/- 0.1%
    morph:             -                   39.2ms +/- 0.1%     39.2ms +/- 0.1%
    raytrace:          1.005x as fast      98.3ms +/- 0.0%     97.9ms +/- 0.1%

  access:              *1.005x as slow*   178.9ms +/- 0.1%    179.9ms +/- 0.0%
    binary-trees:      *1.040x as slow*    50.5ms +/- 0.1%     52.5ms +/- 0.1%
    fannkuch:          1.006x as fast      76.2ms +/- 0.0%     75.8ms +/- 0.1%
    nbody:             1.024x as fast      34.6ms +/- 0.1%     33.8ms +/- 0.1%
    nsieve:            *1.013x as slow*    17.6ms +/- 0.2%     17.8ms +/- 0.2%

  bitops:              1.010x as fast      47.8ms +/- 0.1%     47.4ms +/- 0.1%
    3bit-bits-in-byte: -                    2.1ms +/- 1.4%      2.1ms +/- 1.4%
    bits-in-byte:      1.002x as fast      11.0ms +/- 0.2%     11.0ms +/- 0.1%
    bitwise-and:       -                    3.2ms +/- 1.1%      3.2ms +/- 1.1%
    nsieve-bits:       1.013x as fast      31.4ms +/- 0.1%     31.0ms +/- 0.1%

  controlflow:         *1.017x as slow*    44.8ms +/- 0.1%     45.5ms +/- 0.1%
    recursive:         *1.017x as slow*    44.8ms +/- 0.1%     45.5ms +/- 0.1%

  crypto:              *1.011x as slow*    72.7ms +/- 0.2%     73.6ms +/- 0.2%
    aes:               ??                  41.0ms +/- 0.4%     41.1ms +/- 0.4%
    md5:               *1.031x as slow*    20.3ms +/- 0.2%     20.9ms +/- 0.1%
    sha1:              *1.010x as slow*    11.4ms +/- 0.4%     11.5ms +/- 0.4%

  date:                1.009x as fast     173.9ms +/- 0.0%    172.4ms +/- 0.0%
    format-tofte:      1.014x as fast      82.5ms +/- 0.1%     81.4ms +/- 0.1%
    format-xparb:      1.004x as fast      91.4ms +/- 0.0%     91.0ms +/- 0.0%

  math:                *1.003x as slow*    43.3ms +/- 0.2%     43.4ms +/- 0.2%
    cordic:            *1.011x as slow*    14.1ms +/- 0.2%     14.3ms +/- 0.3%
    partial-sums:      -                   19.9ms +/- 0.2%     19.8ms +/- 0.2%
    spectral-norm:     ??                   9.3ms +/- 0.4%      9.4ms +/- 0.4%

  regexp:              ??                  52.7ms +/- 0.1%     52.8ms +/- 0.1%
    dna:               ??                  52.7ms +/- 0.1%     52.8ms +/- 0.1%

  string:              1.004x as fast     411.5ms +/- 0.0%    409.9ms +/- 0.0%
    base64:            1.013x as fast      24.0ms +/- 0.1%     23.7ms +/- 0.2%
    fasta:             *1.004x as slow*    82.7ms +/- 0.1%     83.0ms +/- 0.0%
    tagcloud:          *1.002x as slow*   130.2ms +/- 0.0%    130.4ms +/- 0.0%
    unpack-code:       1.013x as fast     132.9ms +/- 0.1%    131.2ms +/- 0.0%
    validate-input:    -                   41.6ms +/- 0.1%     41.6ms +/- 0.1%

Comment 87

•

16 years ago

(In reply to comment #85)
> > patch, v8
>
> Numbers for v8-tests of patch v8, Mac Mini, Core 2 Duo, 2.13 GHz,
> 100 iterations. Looks like a clear win. Interestingly the patch
> helps more than just richards. Default-test-set numbers to follow.
>
> v8: 1.031x as fast 16845.0ms +/- 0.1% 16345.7ms +/- 0.1%
> crypto: - 1074.4ms +/- 0.8% 1066.8ms +/- 0.9%
> deltablue: 1.046x as fast 10225.3ms +/- 0.0% 9774.1ms +/- 0.1%
> earley-boyer: - 2742.4ms +/- 0.3% 2740.6ms +/- 0.3%
> raytrace: *1.006x as slow* 1785.6ms +/- 0.2% 1795.7ms +/- 0.2%
> richards: 1.050x as fast 1017.3ms +/- 0.0% 968.4ms +/- 0.1%

For the raytrace regression there, instruction counts are down around 1.5% but the I1 and to a lesser extent L2 miss rates are up. I'll investigate, but that requires first rebasing the frag profiling patch.

Assignee

Comment 88

•

16 years ago

Attached patch patch, v9 with dvander's lir->insAssert/LIR_dbreak patch (obsolete) — Details — Splinter Review

No LIR_dbreak-generated int3 traps in my testing. /be

Attachment #394756 - Attachment is obsolete: true

Attachment #394883 - Flags: review?(jorendorff)

Assignee

Comment 89

•

16 years ago

Comment on attachment 394883 [details] [diff] [review] patch, v9 with dvander's lir->insAssert/LIR_dbreak patch For the nanojit changes. /be

Attachment #394883 - Flags: review?(graydon)

Assignee

Updated

•

16 years ago

Status: NEW → ASSIGNED

Priority: -- → P1

Target Milestone: --- → mozilla1.9.2

Graydon Hoare :graydon

Updated

•

16 years ago

Attachment #394883 - Flags: review?(graydon) → review+

Graydon Hoare :graydon

Comment 90

•

16 years ago

Comment on attachment 394883 [details] [diff] [review] patch, v9 with dvander's lir->insAssert/LIR_dbreak patch Debug-break stuff? looks fine to me.

Assignee

Comment 91

•

16 years ago

Attached patch v9a, refreshed to tm tip (obsolete) — Details — Splinter Review

Attachment #394883 - Attachment is obsolete: true

Attachment #394944 - Flags: review?(jorendorff)

Attachment #394883 - Flags: review?(jorendorff)

Comment 92

•

16 years ago

(In reply to comment #90) > (From update of attachment 394883 [details] [diff] [review]) > Debug-break stuff? looks fine to me. Looks ok to me too. I tried changing the insAssert condition to something obviously false, and checked it really does trap. Which it does.

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 93

•

16 years ago

Fragment profiling results for patch v8 (which, aiui, is functionally equivalent to v9 and v9a).

Compared to baseline (which you can see in comment #9), the number of fragment starts increases from 11.836 million to 12.863 million. This is worrying, since my understanding was that the patch merely removed unused guards, and would not affect the higher level trace selection mechanism.

Indeed, this is new behavior. In version 4 of patch (see comment #51), which was the last version I fragprofiled, the number of frag starts was unaffected by the patch (11.836 mill).

The program still runs circa 5% faster because the traces are shorter overall, despite the extra 1 million trace-to-trace transfers.

-------------

Details: I think almost all the extra activity manifests itself in extra entries to FragID=000013. This trace is 50ish guards long. I just show the first three here:

Recording starting from ../SunSpider/tests/v8-richards.js:184@11 (FragID=000013)

00070: 331 callprop "run"
About to try emitting guard code for SideExit=0x825f9bc exitType=BRANCH
    start
    state = iparam 0 ecx
    sp = ld state[0]
    rp = ld state[4]
    cx = ld state[8]
    eos = ld state[12]
    eor = ld state[16]
    ld630 = ld state[880]
    $global0 = i2f ld630
    ld631 = ld state[872]
    $global1 = i2f ld631
    ld632 = ld state[888]
    $global2 = i2f ld632
    ld633 = ld state[864]
    $global3 = i2f ld633
    ld634 = ld state[840]
    $global4 = i2f ld634
    ld635 = ld state[856]
    $global5 = i2f ld635
    ld636 = ld state[936]
    $global6 = i2f ld636
    $args0 = ld sp[-24]
    $args1 = ld sp[-16]
    $arguments0 = ld sp[-8]
    $stack0 = ld sp[0]
    $stack1 = ld sp[8]
    $stack2 = ld sp[16]
    $arguments0 = ld sp[24]
    $var0 = ld sp[32]
    $stack0 = ld sp[40]
    map = ld $stack0[0]
    ops = ld map[0]
    ld637 = ld ops[12]
    ptr
    guard(native-map) = eq ld637, ptr
    xf463: xf guard(native-map) -> pc=0x81ee922 imacpc=(nil) sp+48 rp+4 (GuardID=001)
About to try emitting guard code for SideExit=0x825fa60 exitType=BRANCH
    shape = ld map[16]
    265
    guard_kshape = eq shape, 265
    xf464: xf guard_kshape -> pc=0x81ee922 imacpc=(nil) sp+48 rp+4 (GuardID=002)
About to try emitting guard code for SideExit=0x825fb04 exitType=BRANCH
    proto = ld $stack0[8]
    0
    eq398 = eq proto, 0
    xt359: xt eq398 -> pc=0x81ee922 imacpc=(nil) sp+48 rp+4 (GuardID=003)

Without the patch, the trace is executed 81898 times, and always goes to the end:

FragID=000013, total count 81898:
   Looped (057)     81898 (100.00%)

With patch v8, this fragment is executed 12 x more often. But all of those extra entries are thrown out at the first shape check:

FragID=000013, total count 1079398:
   GuardID=002     997500 (92.41%)
   Looped (052)     81898 ( 7.59%)

So it seems like the trace is being entered a lot more, but all those new entries fail the first shape check.

Jeff Walden [:Waldo]

Comment 94

•

16 years ago

(In reply to comment #76) > (In reply to comment #70) > > > > Should we just add this right away? > > yes. r=me A few days behind on this, but I think everyone should feel free, at any time, to add any tracing testcase to the suite, reviewed at the discretion of the committer. I don't think anyone will disagree, but it seems best to make sure everyone's on the same page with this.

Comment 95

•

16 years ago

at tm-tip, before this patch:

recorder: started(6), aborted(2), completed(34), different header(0), trees trashed(0), slot promoted(0), unstable loop variable(2), breaks(0), returns(0), unstableInnerCalls(1), blacklisted(0)
monitor: triggered(72), exits(72), type mismatch(0), global mismatch(0)

after:

recorder: started(4), aborted(2), completed(31), different header(0), trees trashed(0), slot promoted(0), unstable loop variable(1), breaks(0), returns(0), unstableInnerCalls(1), blacklisted(0)
monitor: triggered(66), exits(66), type mismatch(0), global mismatch(0)

I could have sworn there was something much different yesterday but now I can't reproduce it.

Assignee

Comment 96

•

16 years ago

Attached patch patch, v10 (obsolete) — Details — Splinter Review

•

16 years ago

Depends on: 513160

Assignee

Comment 102

•

16 years ago

Attached patch patch v10, refreshed (obsolete) — Details — Splinter Review

No change other than merging. We understand better what's going on. I'll write it up when I have time and see about a real fix. /be

Attachment #395192 - Attachment is obsolete: true

Updated

•

16 years ago

Depends on: 513407

Assignee

Comment 103

•

16 years ago

Attached patch patch v11 (obsolete) — Details — Splinter Review

Tested on top of patch for bug 471214 (see bug 471214 comment 73), I stopped in every JSScope::generateOwnShape and verified that each such shape was generated for a specific method assignment to a prototype object, or a scope branding on first call to a method of an as-yet unbranded prototype. These look minimal, although there is more to do in bug 497789 to share shape evolutionary paths.

However, this patch gives three aborts:

Abort recording of tree richards.js:190@11 at richards.js:531@30: No compatible inner tree.
Abort recording of tree base.js:159@23 at richards.js:470@47: Inner tree is trying to grow, abort outer recording.
Abort recording of tree base.js:159@23 at richards.js:337@70: Inner tree is trying to grow, abort outer recording.

Without this patch, with the patch for bug 471214, the aborts are:

Abort recording of tree richards.js:190@11 at richards.js:531@30: No compatible inner tree.
Abort recording of tree base.js:159@23 at richards.js:470@47: Inner tree is trying to grow, abort outer recording.

With no patches applied, the same two aborts present.

The reason testDeepPropertyShadowing gets two traces with the patch, one without, is that the guardedShapeTable eliminates redundant shape guards, so that the hits count on the guard's side exit's target meets the HOTEXIT threshold. Without the patch we get two guards, two side exits, and insufficient heat at the first's branch exit to keep recording.

I could use fragment profiler data, again -- Julian, I hope your travel went well enough! Thanks,

/be

Attachment #397381 - Attachment is obsolete: true

Assignee

Comment 104

•

16 years ago

(In reply to comment #103)
> The reason testDeepPropertyShadowing gets two traces with the patch, one
> without, is that the guardedShapeTable eliminates redundant shape guards, so
> that the hits count on the guard's side exit's target meets the HOTEXIT
> threshold. Without the patch we get two guards, two side exits, and
> insufficient heat at the first's branch exit to keep recording.

The relevant code is in AttemptToExtendTree:

    Fragment* c;
    if (!(c = anchor->target)) {
        Allocator& alloc = *JS_TRACE_MONITOR(cx).allocator;
        c = new (alloc) Fragment(cx->fp->regs->pc);
        c->root = anchor->from->root;
        debug_only_printf(...);
        anchor->target = c;
        ...

Where anchor is the VMSideExit* passed in. With redundant guards, we have two anchors for the same ip, so we see null anchor->target twice and create a new fragment, c, with zero hits count. With the patch, we have one guard, one side exit, one place to create and save anchor->target aka c, so we find it and its hits counter reaches the critical HOTEXIT threshold.

With the v11 patch on top of the bug 471214 patch, I wonder whether the fragment profiler still shows same compiled fragments, more executions of one particular one -- where the extra executions all branch-exit early. Need that profiler!

/be
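The counting effect described above can be illustrated with a toy simulation. Everything here is an assumption made for the sketch: the threshold value, the object shapes, and the idea that each exit through an anchor bumps its target's hit counter by one are stand-ins, since the quoted snippet elides what follows `anchor->target = c`.

```javascript
// Illustrative threshold only; not the engine's actual HOTEXIT value.
const HOT_EXIT_THRESHOLD = 3;

function makeSideExit(pc) {
  return { pc, target: null };
}

// Follows the outline of the quoted snippet: lazily create the target
// fragment for this anchor, then (assumed) count one more hit on it.
function noteExit(anchor) {
  if (anchor.target === null) {
    anchor.target = { pc: anchor.pc, hits: 0 };
  }
  anchor.target.hits += 1;
  return anchor.target.hits >= HOT_EXIT_THRESHOLD;
}

// Replays a sequence of exits (indices into `anchors`) and reports whether
// any anchor's target ever becomes hot enough to extend the tree.
function anyExitBecomesHot(anchors, exitSequence) {
  let hot = false;
  for (const index of exitSequence) {
    if (noteExit(anchors[index])) hot = true;
  }
  return hot;
}

// Two redundant guards at the same pc split four exits between them...
const withRedundantGuards = anyExitBecomesHot(
  [makeSideExit(0x10), makeSideExit(0x10)], [0, 1, 0, 1]);
// ...while a single guard sees all four exits through one anchor.
const withSingleGuard = anyExitBecomesHot([makeSideExit(0x10)], [0, 0, 0, 0]);

console.log(withRedundantGuards, withSingleGuard);
```

With the same number of exits, only the single-guard case crosses the threshold, which is the shape of the explanation given for the extra trace in testDeepPropertyShadowing.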

Assignee

Comment 105

•

16 years ago

Attached patch patch v12 (obsolete) — Details — Splinter Review

Attachment #397494 - Attachment is obsolete: true

Assignee

Comment 106

•

16 years ago

Interdiff lies, patch v12 is just a refresh. /be

Comment 107

•

16 years ago

(In reply to comment #105)
> Created an attachment (id=397582) [details]
> patch v12

Comparing (32099:c1a97865c476 + bug 471214 att 397899)
to (32099:c1a97865c476 + bug 471214 att 397899 + this bug patch v12)

Note that bug 471214 att 397899 is not the final version of that bug's patch, but at least does not appear to cause any perf regressions.

Summary: v12 is faster than baseline (as before), and has lower insn counts and I1 misses. However it still has different behaviour at the fragprofiling level -- entries increase from 11,836,222 to 12,863,820.

ten runs native: base 8.76   v12 8.36

base

==8163== I refs:        3,735,417,294
==8163== I1 misses:        32,483,178
==8163== L2i misses:            5,789
==8163== I1 miss rate:           0.86%
==8163== L2i miss rate:          0.00%
==8163==
==8163== D refs:        1,790,558,493 (1,187,668,439 rd + 602,890,054 wr)
==8163== D1 misses:           159,349 (      133,304 rd +      26,045 wr)
==8163== L2d misses:           41,320 (       20,611 rd +      20,709 wr)
==8163== D1 miss rate:            0.0% (          0.0%   +         0.0%  )
==8163== L2d miss rate:           0.0% (          0.0%   +         0.0%  )
==8163==
==8163== L2 refs:          32,642,527 (   32,616,482 rd +      26,045 wr)
==8163== L2 misses:            47,109 (       26,400 rd +      20,709 wr)
==8163== L2 miss rate:            0.0% (          0.0%   +         0.0%  )
==8163==
==8163== Branches:        541,992,592 (  541,898,572 cond +    94,020 ind)
==8163== Mispredicts:      11,523,392 (   11,497,063 cond +    26,329 ind)
==8163== Mispred rate:            2.1% (          2.1%     +      28.0%   )

v12

==8187== I refs:        3,594,439,441
==8187== I1 misses:        28,277,581
==8187== L2i misses:            5,794
==8187== I1 miss rate:           0.78%
==8187== L2i miss rate:          0.00%
==8187==
==8187== D refs:        1,737,612,513 (1,131,324,134 rd + 606,288,379 wr)
==8187== D1 misses:           191,409 (      163,599 rd +      27,810 wr)
==8187== L2d misses:           44,319 (       23,062 rd +      21,257 wr)
==8187== D1 miss rate:            0.0% (          0.0%   +         0.0%  )
==8187== L2d miss rate:           0.0% (          0.0%   +         0.0%  )
==8187==
==8187== L2 refs:          28,468,990 (   28,441,180 rd +      27,810 wr)
==8187== L2 misses:            50,113 (       28,856 rd +      21,257 wr)
==8187== L2 miss rate:            0.0% (          0.0%   +         0.0%  )
==8187==
==8187== Branches:        494,290,098 (  494,191,015 cond +    99,083 ind)
==8187== Mispredicts:      11,331,061 (   11,304,453 cond +    26,608 ind)
==8187== Mispred rate:            2.2% (          2.2%     +      26.8%   )

base

Total count = 11836222

       Entry counts       Entry counts       Static
     ------Self------   ----Cumulative---   Exits  IBytes  FragID
 0:  31.55%   3734848   31.55%    3734848      46    1043  000002
 1:  19.44%   2300548   50.99%    6035396     150    3315  000003
 2:  12.56%   1486448   63.55%    7521844     182    4107  000010
 3:   8.43%    997498   71.98%    8519342     119    2874  000016
 4:   4.38%    517997   76.35%    9037339     219    5047  000004
 5:   2.96%    349997   79.31%    9387336      93    2390  000023
 6:   2.77%    327598   82.08%    9714934      27     831  000008
 7:   2.74%    324446   84.82%   10039380     186    4280  000019
 8:   2.74%    323748   87.55%   10363128      53    1324  000017
 9:   2.49%    294348   90.04%   10657476      68    1729  000018

v12

Total count = 12863820

       Entry counts       Entry counts       Static
     ------Self------   ----Cumulative---   Exits  IBytes  FragID
 0:  29.03%   3734848   29.03%    3734848      45    1079  000002
 1:  17.88%   2300548   46.92%    6035396     142    3432  000003
 2:  11.56%   1486448   58.47%    7521844     176    4238  000010
 3:   8.39%   1079398   66.86%    8601242      52    1368  000013
 4:   7.75%    997498   74.62%    9598740     112    2933  000016
 5:   4.03%    517997   78.64%   10116737     211    5250  000004
 6:   2.72%    349997   81.37%   10466734      88    2424  000023
 7:   2.55%    327598   83.91%   10794332      27     855  000008
 8:   2.52%    324446   86.43%   11118778     180    4407  000018
 9:   2.52%    323748   88.95%   11442526      50    1310  000017
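To put the summary in relative terms, here is a small sketch that computes the percentage change between the base and v12 figures quoted in this comment. The dictionary keys and the `pct_change` helper are names made up for illustration; the numbers are copied from the comment, and no unit is assumed for the "ten runs native" timing.

```python
def pct_change(base: float, new: float) -> float:
    """Relative change from base to new, in percent (negative means lower)."""
    return (new - base) / base * 100.0


# Figures copied from the base and v12 runs in the comment above.
BASE = {
    "ten_runs_native": 8.76,
    "i_refs": 3_735_417_294,
    "i1_misses": 32_483_178,
    "frag_entries": 11_836_222,
}
V12 = {
    "ten_runs_native": 8.36,
    "i_refs": 3_594_439_441,
    "i1_misses": 28_277_581,
    "frag_entries": 12_863_820,
}

# One relative delta per metric, keyed the same way as the inputs.
deltas = {key: pct_change(BASE[key], V12[key]) for key in BASE}

for key, delta in deltas.items():
    print(f"{key:>16}: {delta:+.2f}%")
```

On these figures the instruction count and I1 misses go down while fragment entries go up, which is the discrepancy the summary points at.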

Assignee

Comment 108

•

16 years ago

Attached patch rebased patch v12, plus cx->create<JSScope>(ops) goodness (obsolete) — Details — Splinter Review

Attachment #397582 - Attachment is obsolete: true

Assignee

Comment 109

•

16 years ago

Attached patch rebased again, still essentially v12 plus tracking changes (obsolete) — Details — Splinter Review

Attachment #398548 - Attachment is obsolete: true

•

16 years ago

Depends on: 516069

Assignee

Comment 113

•

16 years ago

Splitting the patch up into dependency bugs. First one is bug 516069. /be

Assignee

Comment 114

•

16 years ago

Attached patch rebased to tm tip after patch for bug 516069 (obsolete) — Details — Splinter Review

Attachment #399910 - Attachment is obsolete: true

Assignee

Updated

•

16 years ago

Depends on: 516075

Assignee

Comment 115

•

16 years ago

Attached patch the residuum (obsolete) — Details — Splinter Review

Fragment profiling help welcome. I really hope this doesn't introduce anything unexpected. The LIR_dbreak-based assertions should confirm it's memoizing constant shape guards correctly.

The SubShapeOf(cx, obj, shape) built-in is used when a predictable shape is guarded, that is, one along OBJ_SCOPE(obj)'s property tree lineage at scope->lastProp, where the scope has not had its own overriding shape assigned. It tries to match the recording-time shape to an older property on the ancestor line, in case the object being used to access the wanted property has been extended with new properties since recording time. SubShapeOf helps when objects related by pure property extension are accessed from the same site.

The comment in TR::guardShape says what I think we must do next: figure out how to implement a PIC (polymorphic inline cache) instead of branch-exiting on unrelated shape access from the same site. v8/richards.js is all about this: it uses different JS constructors with prototypes, where the unrelated shapes have methods and properties named the same and at the same offset (slot), and it expects the VM to optimize.

Taking a branch exit means re-recording and growing the tree, but with the same code except for the guarded shape. Branch exit from an inner tree while recording an outer tree gets you an aborted outer tree, and we hope to re-record inner then outer with the new shape guarded in the inner, but again if the code is the same (same slot, etc.), then this is suboptimal compared to a PIC. And branch exits can reach MAX_BRANCHES and fail hard.

I'll file a followup bug if this patch passes review and frag-profiling muster, and get on with the PIC work tomorrow.

/be
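The two ideas in this comment can be modelled in a few lines. What follows is a toy sketch in Python, not the patch's C++ code. `PropNode`, `Obj`, `extend`, `slow_lookup`, and `PolymorphicInlineCache` are hypothetical names, shapes are plain integers, and a dictionary stands in for what a real PIC would do with patched machine code. `sub_shape_of` only mirrors the described behaviour of SubShapeOf (walk the ancestor line looking for the recording-time shape) under those assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PropNode:
    """Hypothetical property-tree node: one node per property added."""
    shape: int
    name: str
    slot: int
    parent: Optional["PropNode"] = None


def extend(parent: Optional[PropNode], shape: int, name: str) -> PropNode:
    """Add a property below parent; it takes the next free slot."""
    slot = 0 if parent is None else parent.slot + 1
    return PropNode(shape, name, slot, parent)


def sub_shape_of(last_prop: Optional[PropNode], recorded_shape: int) -> bool:
    """True if recorded_shape is the current shape or one of its ancestors."""
    node = last_prop
    while node is not None:
        if node.shape == recorded_shape:
            return True
        node = node.parent
    return False


def slow_lookup(last_prop: Optional[PropNode], name: str) -> int:
    """Uncached lookup: walk the lineage until the property name is found."""
    node = last_prop
    while node is not None:
        if node.name == name:
            return node.slot
        node = node.parent
    raise KeyError(name)


@dataclass
class Obj:
    last_prop: PropNode
    slots: List[object]


@dataclass
class PolymorphicInlineCache:
    """Per-site cache mapping each shape seen at the site to a slot."""
    name: str
    max_entries: int = 4
    entries: Dict[int, int] = field(default_factory=dict)
    misses: int = 0

    def get(self, obj: Obj) -> object:
        shape = obj.last_prop.shape
        slot = self.entries.get(shape)
        if slot is None:  # first time this shape reaches the site
            self.misses += 1
            slot = slow_lookup(obj.last_prop, self.name)
            if len(self.entries) < self.max_entries:
                self.entries[shape] = slot
        return obj.slots[slot]


# Lineage A: "run" (shape 1), later extended with "count" (shape 2).
a_run = extend(None, 1, "run")
a_count = extend(a_run, 2, "count")
# Lineage B: an unrelated tree that also has "run" in slot 0.
b_run = extend(None, 10, "run")

task_a = Obj(a_count, ["runA", 0])
task_b = Obj(b_run, ["runB"])

pic = PolymorphicInlineCache("run")
results = [pic.get(o) for o in (task_a, task_b, task_a, task_b)]
```

In this model `sub_shape_of(a_count, 1)` succeeds because shape 2 is a pure extension of shape 1, while `sub_shape_of(b_run, 1)` fails because lineage B is unrelated. The cache handles both lineages at one site after a single miss per shape, which is the case where the comment says a branch exit would re-record the same code.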

Attachment #400163 - Attachment is obsolete: true

Attachment #400655 - Flags: review?(jorendorff)

Assignee

Updated

•

16 years ago

Attachment #400655 - Attachment is obsolete: true

Attachment #400655 - Flags: review?(jorendorff)

Assignee

Comment 116

•

16 years ago

Comment on attachment 400655 [details] [diff] [review]
the residuum

SubShapeOf is badly broken (say trace-tests) but I have no time left today to work on this.

/be

Assignee

Updated

•

16 years ago

Depends on: 518448

Assignee

Updated

•

16 years ago

Depends on: TM:PIC