Closed Bug 483457 Opened 16 years ago Closed 7 years ago

IonMonkey: Maybe slower than baseline, probably slower than v8 and jsc on attached testcase

Categories

(Core :: JavaScript Engine, defect, P3)

x86
Windows XP


RESOLVED WORKSFORME

People

(Reporter: duncan.loveday, Assigned: Waldo)

References

Details

(Keywords: perf, regression, testcase)

Attachments

(4 files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090312 Minefield/3.2a1pre
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090312 Minefield/3.2a1pre

The attached runs much more slowly with JIT enabled than with it disabled. My results are 3-4 ms without JIT and anywhere from 20-50 ms with JIT.

Reproducible: Always

Steps to Reproduce:
1. Load the attached and press the button. Note the timing.
2. Do step 1 with "JIT content" disabled and enabled.
3. The biggest difference is observed after reloading the page (e.g. press ctrl-R).

Actual Results: Performance with JIT is slower.

Expected Results: Performance with JIT should be at least as fast as without.

This code is an extract from a much larger application. When it's running "in situ" I get much larger variation in the timings with JIT; I've even seen times of over a second. Possibly there's more scope for GC to interrupt it and produce much longer than usual timings?
Attached file Test case
I see a certain average increase in time (2x to 3x) after these checkins:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f3d5f4a980a0&tochange=859496f90e66

Tested on Windows XP.
I'm not very happy with this test case; I've just been trying to improve on it, but without success.

In my full app, in addition to the slower running of the function itself, what also happens is that after the function in the test case is run, the code goes on to execute "setTimeout('nextBitOfWork();', 0)". With JIT off the 'nextBitOfWork()' function starts running pretty much immediately, but with JIT on there are intermittent pauses in the range 500-1500 ms between issuing "setTimeout()" and the named function running. It's the pauses I really want to get to the bottom of - they are killing performance for me. But when I do the same thing in a small test case I don't get any pauses.

FWIW, the use of setTimeout() is via a variant of Neil Mix's JavaScript 1.7 trampolining library using generator/iterator loops. Does anyone have a suggestion as to why JIT would cause a delay in the execution of a function scheduled with setTimeout() in this way? And how could it be reproduced systematically - e.g. if it's to do with more GC activity, is there a way of forcing GC to run?

There are some scary numbers in my full app for certain functions (times in ms, non-JIT => JIT):

4    => 8.4
2.9  => 9.4
2.9  => 13.0
10.4 => 37.6
0.6  => 7.8
1.5  => 17.4

but they're not reproducible in small test programs.
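For reference, the general shape of that trampolining pattern is roughly the following (my own simplified sketch, not Neil Mix's actual library code; the worker function and its chunking are made up for illustration):

// Simplified sketch of generator-based trampolining with setTimeout()
// (JS 1.7-era generator semantics; in a page this needs
// <script type="application/javascript;version=1.7">).
function run(gen) {
  (function step() {
    try {
      gen.next();           // run the next chunk of work up to its yield
      setTimeout(step, 0);  // give the event loop a turn, then resume
    } catch (e) {
      if (!(e instanceof StopIteration))
        throw e;            // finished when the generator is exhausted
    }
  })();
}

// Made-up worker: yields between chunks so the UI stays responsive.
function doSomeWork(i) { /* one chunk of the app's work */ }
function nextBitOfWork() {
  for (var i = 0; i < 3; i++)
    yield doSomeWork(i);
}

run(nextBitOfWork());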
(In reply to comment #2)
> I see a certain average increase in time (2x to 3x) after these checkins:
> http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f3d5f4a980a0&tochange=859496f90e66

I meant that the problem suddenly worsened significantly.
With the latest trunk nightly, JIT disabled, I get an average of 10 ms. With JIT enabled before 28 Feb 2009 it was about 22 ms. With JIT enabled now it is about 55 ms.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: blocking1.9.1?
The slowdown of the test case appears to be due to the problem bug 478525 is trying to fix. The problem is that there are loops that grow objects while also accessing them. The growing part causes the shape to change. The accessing part causes side exits on a shape guard (the one generated at the very end of guardElemOp). Those shape guards are supposed to be unnecessary for this kind of access, and bug 478525 is going to remove them.

The time variance appears to be related to blacklisting. After a few runs, it gets faster, because the problematic object-building loops get blacklisted, so we get close to interpreter performance. Occasionally, the time blows up to a really big number. It seems to be related to tracer stuff and not GC. I want to wait until bug 478525 is fixed before trying to figure that out.
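To illustrate the kind of loop in question, here's a minimal sketch (my own illustration, not code from the attached test case):

// Illustrative only: a loop that grows an object while also reading from it.
// Every added property changes the object's shape, so a trace whose read is
// guarded on the shape seen during recording keeps side-exiting.
var obj = { p0: 0 };
var total = 0;
for (var i = 1; i < 1500; i++) {
  obj["p" + i] = i;              // growing: the shape changes each iteration
  total += obj["p" + (i - 1)];   // accessing: the recorded shape guard fails
}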
Depends on: 478525
(In reply to comment #6)
> Occasionally, the time blows up to a really big number. It seems to be
> related to tracer stuff and not GC.

Interesting... how big is a really big number? I haven't seen more than 50ms or so with the attached, but with the same code running as part of a bigger program I sometimes see much bigger numbers, a second or more.
(In reply to comment #7)
> (In reply to comment #6)
> > Occasionally, the time blows up to a really big number. It seems to be
> > related to tracer stuff and not GC.
>
> Interesting... how big is a really big number? I haven't seen more than 50ms
> or so with the attached, but with the same code running as part of a bigger
> program I sometimes see much bigger numbers, a second or more.

I got a 500ms run of the attached test case.
Thanks - strange as it may sound, that's a relief. As in, I'm glad someone other than me has seen that happen. It'd be nice if there were a systematic way of reproducing those exceptionally high times, just so we can be sure they've gone away when the various patches land.
Flags: blocking1.9.1? → blocking1.9.1+
This seems to be a lot better after changeset 9ae617e1d224 from bug 478525. Can you try it again once this gets out to nightlies?
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090318 Minefield/3.6a1pre ID:20090318053518

I can still reproduce the bug with the latest nightly and hourly on Windows.
I'd say there's a slight improvement.

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090312 Minefield/3.2a1pre
JIT off => 4ms
JIT on  => 40-45ms

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090318 Minefield/3.2a1pre
JIT off => 4ms
JIT on  => 35-40ms

Some way to go, but there is progress.
(In reply to comment #10)
> This seems to be a lot better after changeset 9ae617e1d224 from bug 478525.
> Can you try it again once this gets out to nightlies?

Actually I don't see that changeset on the bug history for 478525. Has it definitely landed?
> Actually I don't see that changeset on the bug history for 478525. Has it
> definitely landed?

Yes, I do see it now; revision != changeset.
The amount of actual execution, as in number of source-level statements executed, in this test case is *tiny*. That's why it runs so fast in the interpreter. Is it just that the implementation is over-eager to start tracing? Could a solution be as simple as saying "don't start tracing until you've executed some threshold amount of code" (assuming the interpreter could compute this statement count cheaply)? Or possibly until you've been running for a period of time? I know Sun's JVM has thresholds below which JIT is not used; they can even be tweaked with command line options.
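Roughly the kind of gating I have in mind - a sketch of the idea only, with a made-up threshold value; obviously the real thing would live inside the engine, not in script:

// Sketch of the idea only (not engine code): count executions of each loop
// edge in the interpreter and only hand the loop to the tracer once it has
// proven itself hot. The threshold value is made up.
var HOT_THRESHOLD = 200;
var loopHits = {};

function shouldStartRecording(loopId) {
  loopHits[loopId] = (loopHits[loopId] || 0) + 1;
  return loopHits[loopId] >= HOT_THRESHOLD;
}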
Priority: -- → P3
Assignee: general → jwalden+bmo
Sadly, there's no obvious improvement here after all the hard work that landed recently.
Attached file Shell testcase
Definitely still around, with numbers roughly approximating these in practice:

find-waldo-now:~/moz/js-tm/js/src jwalden$ ../../obj-i386-apple-darwin8.11.1/dist/bin/js ~/moz/inflight/10x.js
Time=6
find-waldo-now:~/moz/js-tm/js/src jwalden$ ../../obj-i386-apple-darwin8.11.1/dist/bin/js -j ~/moz/inflight/10x.js
Time=136
Attached file With loop counts
(In reply to comment #15)
> The amount of actual execution, as in number of source-level statements
> executed, in this test case is *tiny*.......

This is the same thing with some counts added to the various loops. The hottest loop in it is only traversed about 1500 times, and each traversal does a handful of assignments, comparisons, property and array accesses.
There is a lot of compilation going on, with a short runtime that doesn't leave a lot of time to recoup the compilation time. For a longer run the overhead drops to 4x (still too high of course).
This comment got long, so I'll start with a short take: this is probably unfixable in the short term without taking a 100ms+ hit on SunSpider. I recommend removing blocking status.

Details:

One thing I keep forgetting to say is that tests are a lot easier for me to work with if they output results in HTML instead of via an alert. Alerts mess up my detailed timers because the entire time the alert is up looks like JS execution time. Also, you can't copy and paste the output. Anyway, here's a version of the last test done that way. Also, I run the test 10 times so you can see that the jit runs faster than the interpreter after a few runs.

The initial slow runs are due to trace recording and compilation overhead. Recording and compilation of a trace takes about 200x as long as interpreter execution of the same bytecodes. Here are the loop counts:

cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

Given the 200x cost of the second loop iteration in the tracer, most of these loops should be expected to run 2-100x more slowly in a single test run in the tracer. Thus the 10x slowdown for a single run. Runs 2-4 are also a bit slow, probably because branch traces get added to some of the short loops, so they take longer to finish compiling.

The only solutions I see are:
- Increase HOTLOOP and HOTEXIT. I found 200 to be a better value. Of course, this may hurt certain other benchmarks.
- Make the recorder and compiler faster. This is a long-term project, though. Running them in a thread would probably help a lot, too.

My raw times:

no-jit          5 4 4 4 4 3 3 3 2 2 3 2 3 2 2 3 2 3 2 2 (avg 2.9)
jit             29 19 7 4 1 2 5 2 1 1 1 1 1 1 1 1 1 1 1 1 (avg 4.1)
retuned jit(*)  8 4 8 13 5 2 2 2 2 2 2 2 3 1 1 2 1 1 3 1 (avg 3.3)

(*) HOTLOOP=200, HOTEXIT=200, MAXEXIT=205. This takes a 200ms hit on SunSpider, about 140 of which is on fannkuch, the rest spread out in crypto, etc.
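To make the 2-100x estimate concrete, here is a back-of-the-envelope model of a single run (an illustration only; the 200x figure is the approximation above, and everything else is deliberately simplified):

// Rough cost model for the first run of a loop under the tracer, in units of
// "one interpreted iteration". Assumes recording/compiling costs ~200x an
// interpreted iteration and that compiled iterations afterwards are
// comparatively free.
function firstRunSlowdown(iterations) {
  var interpCost = iterations;   // interpret every iteration
  var traceCost = 1 + 200;       // one interpreted warm-up + record/compile
  return traceCost / interpCost;
}
print(firstRunSlowdown(2));    // ~100x slower: the cnt10-cnt13 loops above
print(firstRunSlowdown(109));  // ~2x slower: the cnt2/cnt3/cnt9 loops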
Point taken on HTML output instead of alert.

No matter how fast the recorder and compiler, there'll be some threshold of execution time below which their cost exceeds the potential gain - which I'm guessing is what your HOTLOOP/HOTEXIT thing is about. I suppose the trick is to be able to make a reasonable guess as to where the cost/benefit threshold is - which I'm guessing is your point about other benchmarks where tracing from the beginning is optimal. There might be a third way forward to add to the two you've given, namely to be smarter about when to start tracing and compiling. Not that I'm saying you aren't smart about it already, just that you might get even smarter!

Did you find anything to explain the occasional very long times as noted in comments 7 and 8?

Aside: I ran your latest test case a few times - very nice - then did right click->view page source to see what you'd done - OK - then closed the view page source window and ran the test a few more times and the performance was way slower. Ctrl-R and it's back to normal. Repeat the whole cycle and the same thing happens. Is that another issue?
FWIW, I can copy and paste from an alert on Windows. If you can't, I'd suggest that should be filed as a bug.
(In reply to comment #21)
> Aside: I ran your latest test case a few times - very nice - then did right
> click->view page source to see what you'd done - OK - then closed the view
> page source window and ran the test a few more times and the performance was
> way slower. Ctrl-R and it's back to normal. Repeat the whole cycle and the
> same thing happens. Is that another issue?

Actually, View Page Source is nothing to do with it. It's just that sometimes you get consistently slower runs, look at these two examples. All I did in between was press ctrl-R.

Times: 82 60 25 17 15 13 13 13 13 13 32 18 14 15 14 14 14 14 14 15 14 14 14 14 14 14 62 14 14 14 14 14 14 14 14 13 13 14 14 14 14 14 14 13 14 14 13 14 14 13 14 14 15 14 13 13 14 14 14 14 14 14 14 14 14 14 14 14 15 14 14 13 14 14 14 14 13 13 13 13 13 13 14 14 14 14 14 13 13 14 16 15 15 15 15 14 15 16 16 16 65 33 21 17 16 17 15 15 14 16 15 12 13 13 13 13 14 13 13 13 13 13 14 15 13 13 13 13 13 13 13 13 14 13 13 13 13 13 13 14 13 13 13 14 13 12 13 13 14 14 13 14 13 13 14 13 14 14 12 7
Total: 2480 (avg 15.5)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

Times: 38 29 11 7 3 2 8 3 1 1 1 2 2 2 2 1 1 2 2 2 27 9 6 2 2 2 2 1 1 1 1 1 1 2 2 2 3 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 2 1 1 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 1 1 1 2 2 2 2 3 2 2 2 1 2 2 2 2 2 3 2 2 2 2 2 1 2 2 2 2 2 1 1 49 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 1
Total: 469 (avg 2.6055555555555556)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0
...Here's an example where it seemed to spontaneously slow down. I just kept on clicking. Times: 34 27 9 6 2 3 3 1 3 2 1 1 2 1 1 1 2 2 2 1 27 10 7 3 2 4 1 1 2 2 1 1 1 2 2 1 2 2 2 1 2 2 2 1 1 1 2 2 1 1 1 2 2 2 2 2 1 1 1 1 28 9 6 2 3 3 1 2 2 2 1 1 1 1 1 2 2 2 1 1 2 1 1 1 1 2 2 2 1 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 1 1 2 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 3 1 2 2 2 2 2 2 2 3 88 29 19 17 15 14 13 15 13 14 13 13 13 13 15 14 14 14 14 14 13 13 14 13 13 13 13 14 14 14 13 13 14 14 14 13 14 14 13 14 13 13 13 14 14 13 13 13 13 13 13 13 14 13 13 13 14 14 14 14 13 14 14 13 13 14 14 62 13 13 14 14 14 14 13 14 14 14 14 13 14 13 13 13 14 14 14 14 14 14 14 14 14 13 14 14 14 13 14 14 13 14 13 14 14 14 14 14 14 14 15 14 14 13 13 14 14 14 14 14 15 14 14 14 14 14 14 14 14 14 14 14 14 14 15 14 14 14 14 14 14 14 15 14 15 14 14 14 14 15 14 14 14 14 14 14 15 14 14 14 14 14 14 15 14 15 14 16 15 14 15 16 10 8 9 9 9 8 9 8 7 7 7 8 8 8 7 11 14 14 14 14 15 15 15 16 16 13 14 13 15 14 14 14 15 15 16 15 16 16 16 15 16 16 16 16 15 15 16 16 16 15 14 15 14 14 14 14 16 16 15 14 14 14 16 16 16 14 14 14 Total: 3846 (avg 9.615) Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0
I reran the shell test through Shark again, and the 10x number still roughly holds. Shark says that 85% of the time with -j is spent in TraceRecorder::monitorRecording and js_MonitorLoopEdge; merely speeding up tracing and recording isn't going to make this slowdown go away. Couple that with the lateness in the cycle, and I think we have to punt this. The difference between 1-2ms and 25ms is negligible overall. Speeding that up will have no perceptible impact on browser speed unless this test's code is repeated, and if it's repeated we should start to win as we can avoid interpretation in favor of running the previously-JITted code. It would be better to be faster, but schedule and the nature of the test counsel leaving further work on this for later.
Flags: blocking1.9.1+ → blocking1.9.1?
Flags: blocking1.9.1? → wanted1.9.2+
I find the somewhat unpredictable nature of the slowdown more troubling than the slowdown itself. E.g. comment #19 suggests a slowdown of a factor of 4 even with longer run times, but that's not what we see in comment #20. And the occasional big numbers noted in comment #7 and comment #8 remain unexplained, as do my observations of variable response times in comment #23 and comment #24. I'm not disagreeing that this should be left until later, just making an observation.
Current JS shell numbers for the attached testcase:

Interp: 14.988 ms
-j:     12.501 ms
-m:     18.920 ms
-m -n:  23.949 ms

Looks like the tracejit actually got ahead of interp by a bit, but JM and JM+TI are still slower.
Blocks: 467263
Summary: TM: Slowdown by factor of up to 10 in this example → JM+TI: 2x slowdown of attached testcase compared to interpreter
Using the testcase attached by David Mandelin:

Firefox 19 (Ion+JM+TI) (20121022)
Times: 24 8 2 6 3 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Total: 75 (avg 3.75)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

Firefox 19 (Interpreter) (20121022)
Times: 9 8 8 8 8 7 6 7 6 6 6 6 7 6 6 6 7 6 7 6
Total: 136 (avg 6.8)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

Firefox 17 (JM+TI)
Times: 23 8 7 2 1 1 1 1 1 1 1 1 5 1 1 1 1 1 1 1
Total: 60 (avg 3)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

Chrome 22
Times: 0 16 0 0 0 0 0 0 0 16 0 0 0 0 0 0 0 0 0 0
Total: 32 (avg 1.6)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0
Taken at face value, Fx 19 is slower than Fx 17, but the question is whether that is repeatable, given the variation we've seen in the numbers from one test run to another.
I always get something around 60 on Fx17 and 70 on Fx19. When I disable Ion in about:config, Fx19 gets similar numbers to Fx17.
Mmh, this is kinda hard to judge. Apparently, all browsers made quite a jump here, causing many runs to result in 0 or 1. Still, numbers:

Firefox 25 (2013-07-18 Nightly)
Times: 4 3 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 0 1 0
Total: 20 (avg 1)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

Chrome 28:
Times: 1 1 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0
Total: 8 (avg 0.4)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

Safari 6.0.5:
Times: 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0
Total: 5 (avg 0.25)
Counts: cnt1=7, cnt2=109, cnt3=109, cnt4=170, cnt5=27, cnt6=107, cnt7=18, cnt8=56, cnt9=109, cnt10=2, cnt11=2, cnt12=2, cnt13=2, cnt14=1524, cnt15=0, cnt16=0

So, it looks like we take more time to warm up, and even then our times *seem* to be slightly higher, judging by the more common occurrence of ones in the results.

Also, it *looks* like we might be faster in baseline than in ion: the shell test always prints 2 or 3 without args or with --no-baseline, and always 1 with --no-ion. Of course, that could just be an artifact of baseline compilation being faster and ion not helping much with a very short-running test like this.
Forgot to update the subject to something slightly more fitting, if very nebulous.
Summary: JM+TI: 2x slowdown of attached testcase compared to interpreter → IonMonkey: Maybe slower than baseline, probably slower than v8 and jsc on attached testcase
When repeating the loop in |test()| 1000 times, I get the following results:

andre@VBdev:~$ mozjs /tmp/t.js
Time=796
andre@VBdev:~$ mozjs --no-ion /tmp/t.js
Time=1078
andre@VBdev:~$ mozjs --no-baseline /tmp/t.js
Time=2262

Therefore resolving as WFM.
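For anyone wanting to reproduce the measurement, the harness was presumably something along these lines (a sketch only - the exact modification isn't attached; test() refers to the function in the shell testcase):

// Sketch of a possible harness (the exact modification isn't attached):
// run the testcase's work 1000 times so the JITs have enough total work to
// amortize their compilation cost, then time the whole thing.
var start = Date.now();
for (var i = 0; i < 1000; i++)
  test();                       // test() as defined in the shell testcase
print("Time=" + (Date.now() - start));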
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME