Closed Bug 483457 Opened 16 years ago Closed 7 years ago

IonMonkey: Maybe slower than baseline, probably slower than v8 and jsc on attached testcase

Categories

(Core :: JavaScript Engine, defect, P3)

x86
Windows XP


RESOLVED WORKSFORME

People

(Reporter: duncan.loveday, Assigned: Waldo)

References

Details

(Keywords: perf, regression, testcase)

Attachments

(4 files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090312 Minefield/3.2a1pre
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090312 Minefield/3.2a1pre

The attached runs much more slowly with JIT enabled than with it disabled. My results are 3-4 ms without JIT and anywhere from 20-50 ms with JIT.

Reproducible: Always

Steps to Reproduce:
1. Load the attached and press the button. Note the timing.
2. Do step 1 with "JIT content" disabled and enabled.
3. The biggest difference is observed after reloading the page (e.g. press ctrl-R).

Actual Results: Performance with JIT is slower.

Expected Results: Performance with JIT should be at least as fast as without.

This code is an extract from a much larger application. When it's running "in situ" I get much larger variation in the timings with JIT; I've even seen times of over a second. Possibly there's more scope for GC to interrupt it and produce much longer than usual timings?
Attached file Test case
I see a certain average increase in time (2x to 3x) after these checkins:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f3d5f4a980a0&tochange=859496f90e66

Tested on Windows XP.
I'm not very happy with this test case; I've just been trying to improve on it, but without success.

In my full app, in addition to the slower running of the function itself, what also happens is that after the function in the test case is run, the code goes on to execute "setTimeout('nextBitOfWork();', 0)". With JIT off the 'nextBitOfWork()' function starts running pretty much immediately, but with JIT on there are intermittent pauses in the range 500-1500 ms between issuing "setTimeout()" and the named function running. It's the pauses I really want to get to the bottom of - they are killing performance for me. But when I do the same thing in a small test case I don't get any pauses.

FWIW, the use of setTimeout() is via a variant of Neil Mix's JavaScript 1.7 trampolining library using generator/iterator loops. Does anyone have a suggestion as to why JIT would cause a delay in the execution of a function scheduled with setTimeout() in this way? And how could it be reproduced systematically - e.g. if it's to do with more GC activity, is there a way of forcing GC to run?

There are some scary numbers in my full app for certain functions (times in ms, non-JIT => JIT):

4    => 8.4
2.9  => 9.4
2.9  => 13.0
10.4 => 37.6
0.6  => 7.8
1.5  => 17.4

but they're not reproducible in small test programs.
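For reference, the general shape of that trampolining pattern is roughly the following (my own simplified sketch, not Neil Mix's actual library code; the worker function and its chunking are made up for illustration):

// Simplified sketch of generator-based trampolining with setTimeout()
// (JS 1.7-era generator semantics; in a page this needs
// <script type="application/javascript;version=1.7">).
function run(gen) {
  (function step() {
    try {
      gen.next();           // run the next chunk of work up to its yield
      setTimeout(step, 0);  // give the event loop a turn, then resume
    } catch (e) {
      if (!(e instanceof StopIteration))
        throw e;            // finished when the generator is exhausted
    }
  })();
}

// Made-up worker: yields between chunks so the UI stays responsive.
function doSomeWork(i) { /* one chunk of the app's work */ }
function nextBitOfWork() {
  for (var i = 0; i < 3; i++)
    yield doSomeWork(i);
}

run(nextBitOfWork());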
(In reply to comment #2)
> I see a certain average increase in time (2x to 3x) after these checkins:
> http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f3d5f4a980a0&tochange=859496f90e66

I meant that the problem suddenly worsened significantly.
With the latest trunk nightly, JIT disabled, I get an average of 10 ms. With JIT enabled before 28 Feb 2009 it was about 22 ms. With JIT enabled now it is about 55 ms.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: blocking1.9.1?
The slowdown of the test case appears to be due to the problem bug 478525 is trying to fix. The problem is that there are loops that grow objects while also accessing them. The growing part causes the shape to change. The accessing part causes side exits on a shape guard (the one generated at the very end of guardElemOp). Those shape guards are supposed to be unnecessary for this kind of access, and bug 478525 is going to remove them.

The time variance appears to be related to blacklisting. After a few runs, it gets faster, because the problematic object-building loops get blacklisted, so we get close to interpreter performance. Occasionally, the time blows up to a really big number. It seems to be related to tracer stuff and not GC. I want to wait until bug 478525 is fixed before trying to figure that out.
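To illustrate the kind of loop in question, here's a minimal sketch (my own illustration, not code from the attached test case):

// Illustrative only: a loop that grows an object while also reading from it.
// Every added property changes the object's shape, so a trace whose read is
// guarded on the shape seen during recording keeps side-exiting.
var obj = { p0: 0 };
var total = 0;
for (var i = 1; i < 1500; i++) {
  obj["p" + i] = i;              // growing: the shape changes each iteration
  total += obj["p" + (i - 1)];   // accessing: the recorded shape guard fails
}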
Depends on: 478525
(In reply to comment #6)
> Occasionally, the time blows up to a really big number. It seems to be
> related to tracer stuff and not GC.

Interesting... how big is a really big number? I haven't seen more than 50ms or so with the attached, but with the same code running as part of a bigger program I sometimes see much bigger numbers, a second or more.
(In reply to comment #7)
> (In reply to comment #6)
> > Occasionally, the time blows up to a really big number. It seems to be
> > related to tracer stuff and not GC.
>
> Interesting... how big is a really big number? I haven't seen more than 50ms
> or so with the attached, but with the same code running as part of a bigger
> program I sometimes see much bigger numbers, a second or more.

I got a 500ms run of the attached test case.
Thanks - strange as it may sound, that's a relief. As in, I'm glad someone other than me has seen that happen. It'd be nice if there were a systematic way of reproducing those exceptionally high times, just so we can be sure they've gone away when the various patches land.
Flags: blocking1.9.1? → blocking1.9.1+
This seems to be a lot better after changeset 9ae617e1d224 from bug 478525. Can you try it again once this gets out to nightlies?
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090318 Minefield/3.6a1pre ID:20090318053518

I can still reproduce the bug with the latest nightly and hourly on Windows.
I'd say there's a slight improvement.

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090312 Minefield/3.2a1pre
JIT off => 4ms
JIT on  => 40-45ms

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090318 Minefield/3.2a1pre
JIT off => 4ms
JIT on  => 35-40ms

Some way to go, but there is progress.
(In reply to comment #10)
> This seems to be a lot better after changeset 9ae617e1d224 from bug 478525.
> Can you try it again once this gets out to nightlies?

Actually I don't see that changeset on the bug history for 478525. Has it definitely landed?
> Actually I don't see that changeset on the bug history for 478525. Has it
> definitely landed?

Yes, I do see it now; revision != changeset.
The amount of actual execution, as in number of source-level statements executed, in this test case is *tiny*. That's why it runs so fast in the interpreter. Is it just that the implementation is over-eager to start tracing? Could a solution be as simple as saying "don't start tracing until you've executed some threshold amount of code" (assuming the interpreter could compute this statement count cheaply)? Or possibly until you've been running for a period of time? I know Sun's JVM has thresholds below which JIT is not used; they can even be tweaked with command line options.
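Roughly the kind of gating I have in mind - a sketch of the idea only, with a made-up threshold value; obviously the real thing would live inside the engine, not in script:

// Sketch of the idea only (not engine code): count executions of each loop
// edge in the interpreter and only hand the loop to the tracer once it has
// proven itself hot. The threshold value is made up.
var HOT_THRESHOLD = 200;
var loopHits = {};

function shouldStartRecording(loopId) {
  loopHits[loopId] = (loopHits[loopId] || 0) + 1;
  return loopHits[loopId] >= HOT_THRESHOLD;
}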
Priority: -- → P3
Assignee: general → jwalden+bmo
Sadly, there's no obvious improvement here after all the hard work that landed recently.
Attached file Shell testcase
Definitely still around, with numbers roughly approximating these in practice:

find-waldo-now:~/moz/js-tm/js/src jwalden$ ../../obj-i386-apple-darwin8.11.1/dist/bin/js ~/moz/inflight/10x.js
Time=6
find-waldo-now:~/moz/js-tm/js/src jwalden$ ../../obj-i386-apple-darwin8.11.1/dist/bin/js -j ~/moz/inflight/10x.js
Time=136
Attached file With loop counts
(In reply to comment #15)
> The amount of actual execution, as in number of source-level statements
> executed, in this test case is *tiny*.......

This is the same thing with some counts added to the various loops. The hottest loop in it is only traversed about 1500 times, and each traversal does a handful of assignments, comparisons, property and array accesses.
There is a lot of compilation going on, with a short runtime that doesn't leave a lot of time to recoup the compilation time. For a longer run the overhead drops to 4x (still too high of course).
This comment got long, so I'll start with a short take: this is probably unfixable in the short term without taking a 100ms+ hit on SunSpider. I recommend removing blocking status.

Details:

One thing I keep forgetting to say is that tests are a lot easier for me to work with if they output results in HTML instead of via an alert. Alerts mess up my detailed timers because the entire time the alert is up looks like JS execution time. Also, you can't copy and paste the output. Anyway, here's a version of the last test done that way. Also, I run the test 10 times so you can see that the jit runs faster than the interpreter after a few runs.

The initial slow runs are due to trace recording and compilation overhead. Recording and compilation of a trace takes about 200x as long as interpreter execution of the same bytecodes. Here are the loop counts:

cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

Given the 200x cost of the second loop iteration in the tracer, most of these loops should be expected to run 2-100x more slowly in a single test run in the tracer. Thus the 10x slowdown for a single run. Runs 2-4 are also a bit slow, probably because branch traces get added to some of the short loops, so they take longer to finish compiling.

The only solutions I see are:
- Increase HOTLOOP and HOTEXIT. I found 200 to be a better value. Of course, this may hurt certain other benchmarks.
- Make the recorder and compiler faster. This is a long-term project, though. Running them in a thread would probably help a lot, too.

My raw times:

no-jit          5 4 4 4 4 3 3 3 2 2 3 2 3 2 2 3 2 3 2 2 (avg 2.9)
jit             29 19 7 4 1 2 5 2 1 1 1 1 1 1 1 1 1 1 1 1 (avg 4.1)
retuned jit(*)  8 4 8 13 5 2 2 2 2 2 2 2 3 1 1 2 1 1 3 1 (avg 3.3)

(*) HOTLOOP=200, HOTEXIT=200, MAXEXIT=205. This takes a 200ms hit on SunSpider, about 140 of which is on fannkuch, the rest spread out in crypto, etc.
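To make the 2-100x estimate concrete, here is a back-of-the-envelope model of a single run (an illustration only; the 200x figure is the approximation above, and everything else is deliberately simplified):

// Rough cost model for the first run of a loop under the tracer, in units of
// "one interpreted iteration". Assumes recording/compiling costs ~200x an
// interpreted iteration and that compiled iterations afterwards are
// comparatively free.
function firstRunSlowdown(iterations) {
  var interpCost = iterations;   // interpret every iteration
  var traceCost = 1 + 200;       // one interpreted warm-up + record/compile
  return traceCost / interpCost;
}
print(firstRunSlowdown(2));    // ~100x slower: the cnt10-cnt13 loops above
print(firstRunSlowdown(109));  // ~2x slower: the cnt2/cnt3/cnt9 loops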
Point taken on HTML output instead of alert.

No matter how fast the recorder and compiler, there'll be some threshold of execution time below which their cost exceeds the potential gain - which I'm guessing is what your HOTLOOP/HOTEXIT thing is about. I suppose the trick is to be able to make a reasonable guess as to where the cost/benefit threshold is - which I'm guessing is your point about other benchmarks where tracing from the beginning is optimal. There might be a third way forward to add to the two you've given, namely to be smarter about when to start tracing and compiling. Not that I'm saying you aren't smart about it already, just that you might get even smarter!

Did you find anything to explain the occasional very long times as noted in comments 7 and 8?

Aside: I ran your latest test case a few times - very nice - then did right click->view page source to see what you'd done - OK - then closed the view page source window and ran the test a few more times and the performance was way slower. Ctrl-R and it's back to normal. Repeat the whole cycle and the same thing happens. Is that another issue?
FWIW, I can copy and paste from an alert on Windows. If you can't, I'd suggest that should be filed as a bug.
(In reply to comment #21)
> Aside: I ran your latest test case a few times - very nice - then did right
> click->view page source to see what you'd done - OK - then closed the view
> page source window and ran the test a few more times and the performance was
> way slower. Ctrl-R and it's back to normal. Repeat the whole cycle and the
> same thing happens. Is that another issue?

Actually, View Page Source is nothing to do with it. It's just that sometimes you get consistently slower runs, look at these two examples. All I did in between was press ctrl-R.

Times: 82 60 25 17 15 13 13 13 13 13 32 18 14 15 14 14 14 14 14 15 14 14 14 14 14 14 62 14 14 14 14 14 14 14 14 13 13 14 14 14 14 14 14 13 14 14 13 14 14 13 14 14 15 14 13 13 14 14 14 14 14 14 14 14 14 14 14 14 15 14 14 13 14 14 14 14 13 13 13 13 13 13 14 14 14 14 14 13 13 14 16 15 15 15 15 14 15 16 16 16 65 33 21 17 16 17 15 15 14 16 15 12 13 13 13 13 14 13 13 13 13 13 14 15 13 13 13 13 13 13 13 13 14 13 13 13 13 13 13 14 13 13 13 14 13 12 13 13 14 14 13 14 13 13 14 13 14 14 12 7
Total: 2480 (avg 15.5)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

Times: 38 29 11 7 3 2 8 3 1 1 1 2 2 2 2 1 1 2 2 2 27 9 6 2 2 2 2 1 1 1 1 1 1 2 2 2 3 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 1 1 1 2 2 2 2 3 2 2 2 1 2 2 2 2 2 3 2 2 2 2 2 1 2 2 2 2 2 1 1 49 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 1
Total: 469 (avg 2.6055555555555556)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0
...Here's an example where it seemed to spontaneously slow down. I just kept on clicking. Times: 34 27 9 6 2 3 3 1 3 2 1 1 2 1 1 1 2 2 2 1 27 10 7 3 2 4 1 1 2 2 1 1 1 2 2 1 2 2 2 1 2 2 2 1 1 1 2 2 1 1 1 2 2 2 2 2 1 1 1 1 28 9 6 2 3 3 1 2 2 2 1 1 1 1 1 2 2 2 1 1 2 1 1 1 1 2 2 2 1 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 1 1 2 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 3 1 2 2 2 2 2 2 2 3 88 29 19 17 15 14 13 15 13 14 13 13 13 13 15 14 14 14 14 14 13 13 14 13 13 13 13 14 14 14 13 13 14 14 14 13 14 14 13 14 13 13 13 14 14 13 13 13 13 13 13 13 14 13 13 13 14 14 14 14 13 14 14 13 13 14 14 62 13 13 14 14 14 14 13 14 14 14 14 13 14 13 13 13 14 14 14 14 14 14 14 14 14 13 14 14 14 13 14 14 13 14 13 14 14 14 14 14 14 14 15 14 14 13 13 14 14 14 14 14 15 14 14 14 14 14 14 14 14 14 14 14 14 14 15 14 14 14 14 14 14 14 15 14 15 14 14 14 14 15 14 14 14 14 14 14 15 14 14 14 14 14 14 15 14 15 14 16 15 14 15 16 10 8 9 9 9 8 9 8 7 7 7 8 8 8 7 11 14 14 14 14 15 15 15 16 16 13 14 13 15 14 14 14 15 15 16 15 16 16 16 15 16 16 16 16 15 15 16 16 16 15 14 15 14 14 14 14 16 16 15 14 14 14 16 16 16 14 14 14 Total: 3846 (avg 9.615) Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0
I reran the shell test through Shark again, and the 10x number still roughly holds. Shark says that 85% of the time with -j is spent in TraceRecorder::monitorRecording and js_MonitorLoopEdge; merely speeding up tracing and recording isn't going to make this slowdown go away. Couple that with the lateness in the cycle, and I think we have to punt this. The difference between 1-2ms and 25ms is negligible overall. Speeding that up will have no perceptible impact on browser speed unless this test's code is repeated, and if it's repeated we should start to win as we can avoid interpretation in favor of running the previously-JITted code. It would be better to be faster, but schedule and the nature of the test counsel leaving further work on this for later.
Flags: blocking1.9.1+ → blocking1.9.1?
Flags: blocking1.9.1? → wanted1.9.2+
I find the somewhat unpredictable nature of the slowdown more troubling than the slowdown itself. E.g. comment #19 suggests a slowdown of a factor of 4 even with longer run times, but that's not what we see in comment #20. And the occasional big numbers noted in comment #7 and comment #8 remain unexplained, as do my observations of variable response times in comment #23 and comment #24. I'm not disagreeing that this should be left until later, just making an observation.
Current JS shell numbers for the attached testcase:

Interp: 14.988 ms
-j:     12.501 ms
-m:     18.920 ms
-m -n:  23.949 ms

Looks like the tracejit actually got ahead of interp by a bit, but JM and JM+TI are still slower.
Blocks: 467263
Summary: TM: Slowdown by factor of up to 10 in this example → JM+TI: 2x slowdown of attached testcase compared to interpreter
Using the testcase attached by David Mandelin:

Firefox 19 (Ion+JM+TI) (20121022)
Times: 24 8 2 6 3 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Total: 75 (avg 3.75)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

Firefox 19 (Interpreter) (20121022)
Times: 9 8 8 8 8 7 6 7 6 6 6 6 7 6 6 6 7 6 7 6
Total: 136 (avg 6.8)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

Firefox 17 (JM+TI)
Times: 23 8 7 2 1 1 1 1 1 1 1 1 5 1 1 1 1 1 1 1
Total: 60 (avg 3)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

Chrome 22
Times: 0 16 0 0 0 0 0 0 0 16 0 0 0 0 0 0 0 0 0 0
Total: 32 (avg 1.6)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0
Taken at face value, Fx 19 is slower than Fx 17, but the question is whether that is repeatable, given the variation we've seen in the numbers from one test run to another.
I always get something around 60 on Fx17 and 70 on Fx19. When I disable Ion in about:config, Fx19 gets similar numbers to Fx17.
Mmh, this is kinda hard to judge. Apparently, all browsers made quite a jump here, causing many runs to result in 0 or 1. Still, numbers:

Firefox 25 (2013-07-18 Nightly)
Times: 4 3 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 0 1 0
Total: 20 (avg 1)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

Chrome 28:
Times: 1 1 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0
Total: 8 (avg 0.4)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

Safari 6.0.5:
Times: 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0
Total: 5 (avg 0.25)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

So, it looks like we take more time to warm up, and even then our times *seem* to be slightly higher, judging by the more common occurrence of ones in the results.

Also, it *looks* like we might be faster in baseline than in ion: the shell test always prints 2 or 3 without args or with --no-baseline, and always 1 with --no-ion. Of course, that could just be an artifact of baseline compilation being faster and ion not helping much with a very short-running test like this.
Forgot to update the subject to something slightly more fitting, if very nebulous.
Summary: JM+TI: 2x slowdown of attached testcase compared to interpreter → IonMonkey: Maybe slower than baseline, probably slower than v8 and jsc on attached testcase
When repeating the loop in |test()| 1000 times, I get the following results:

andre@VBdev:~$ mozjs /tmp/t.js
Time=796
andre@VBdev:~$ mozjs --no-ion /tmp/t.js
Time=1078
andre@VBdev:~$ mozjs --no-baseline /tmp/t.js
Time=2262

Therefore resolving as WFM.
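For anyone wanting to reproduce the measurement, the harness was presumably something along these lines (a sketch only - the exact modification isn't attached; test() refers to the function in the shell testcase):

// Sketch of a possible harness (the exact modification isn't attached):
// run the testcase's work 1000 times so the JITs have enough total work to
// amortize their compilation cost, then time the whole thing.
var start = Date.now();
for (var i = 0; i < 1000; i++)
  test();                       // test() as defined in the shell testcase
print("Time=" + (Date.now() - start));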
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME