I get 12-13fps in t-m; Safari 4 gets closer to 40fps. Stats: recorder: started(116), aborted(104), completed(15), different header(1), trees trashed(0), slot promoted(0), unstable loop variable(0), breaks(0), returns(0), unstableInnerCalls(4), blacklisted(26) monitor: triggered(27), exits(27), type mismatch(0), global mismatch(4) Typical abort: abort: 6918: fp->scopeChain is not global or active call object Abort recording of tree http://nerget.com/fluidSim/pressure.js:290@137 at http://nerget.com/fluidSim/pressure.js:291@138: name.
Oh, and a profile shows 55% of the time spent in (not under) js_Interpret. Another 11% allocating doubles, another 15% on js_NativeGet, esp with Call objects. That adds up to 80% or so. 17% is painting, and 3% is sort of here and there (2/3 of that is minor stuff under js_Interpret, the remaining 1/3 is event loop overhead, I think). In any case, this is definitely in the right component, at least until we get this tracing...
I also see a lot of: Abort recording of tree https://bug508716.bugzilla.mozilla.org/attachment.cgi?id=392847:47@96 at https://bug508716.bugzilla.mozilla.org/attachment.cgi?id=392847:49@109: Inner tree is trying to grow, abort outer recording. Line 49 is: var d = field.getDensity(x, y) * 255 / 5; where |field| comes from this: displayFunc(new Field(dens, u, v)); and Field has a getDensity function. I have no idea why that's causing us trace aborts; perhaps something akin to bug 497789?
Or some sort of branding thing going on, perhaps? Brendan? The relevant guard seems to be : About to try emitting guard code for SideExit=0x17bb1ac exitType=BRANCH shape = ld map #0x8756 guard_kshape = eq shape, #0x8756 xf1512: xf guard_kshape -> pc=0x160ded9 imacpc=0x0 sp+8 rp+0 The other branch exit I see some of is: Abort recording of tree https://bug508716.bugzilla.mozilla.org/attachment.cgi?id=392846:62@120 at https://bug508716.bugzilla.mozilla.org/attachment.cgi?id=392846:64@138: Inner tree is trying to grow, abort outer recording. Line 64 is: var lastRow = (j - 1) * rowSize; Relevant guard: About to try emitting guard code for SideExit=0x17478e0 exitType=OVERFLOW eq134 = eq mul19, callDepth xt165: xt eq134 -> pc=0x16156d6 imacpc=0x0 sp+16 rp+0
The 2nd guard looks like upvar access. Paging dmandelin.
'Inner tree is trying to grow, abort outer recording.' is usually not a problem. In the normal case, we back off from tracing the outer loop for 32 iterations, by which time the inner tree hopefully has stabilized. We should be able to get the outer loop after one or a few rounds of that. The abort in comment 0 looks like the returned closures abort.
Yes, the data in comment 6 is with the returned closures patch applied (which makes the abort in comment 0 go away). In this case, we don't stabilize as far as I can see; I'm seeing the trying to grow aborts all through the log, and it's certainly running long enough that the outer tree ends up blacklisted as a result of all the aborts.
Hmmm. One thing I thought of is that we might generate so much code that we end up OOMing the code buffer and throwing it all away, but it sounds like you are saying that's not happening. I wonder if the inner loop code is so branchy that it can't stabilize in the number of backoffs we give. Based on: About to try emitting guard code for SideExit=0x17bb1ac exitType=BRANCH shape = ld map #0x8756 guard_kshape = eq shape, #0x8756 xf1512: xf guard_kshape -> pc=0x160ded9 imacpc=0x0 sp+8 rp+0 This looks like a guard from a property access. If a zillion shapes go through, then we would try to grow a lot of branch exits.
Yeah, that was my gut reaction too. Hence wondering about bug 497789 or branding or some such...
dvander, how does trace merging behave on this one?
The fix for bug 536564 seems to have helped quite a bit: with the latest tm nightly I get >30fps with black canvas and >26fps upon stirring up some green. Boris, can you regen some stats? Or perhaps someone can make a shell testcase. wish (along with Oliver) we had TraceVis built in... /be
(In reply to comment #11) > Yeah, that was my gut reaction too. Hence wondering about bug 497789 No .prototype in either .js file, so 497789 is not relevant. > or branding or some such... Indeed bug 536564 was all about unbranding to despecialize module pattern and power constructor (if that's what it is called) pattern code. /be
I should probably give this bug to someone. Volunteers? /be
I don't see any impact from bug 536564's fix on this testcase in terms of fps. I see about 22-30fps (depending on amount of green) both before and after that fix here. I'll try to get some stats.
I'm still seeing the abort from comment 5 on t-m tip.
And in particular, it's not clear to me why bug 536564 would lead to less branding here, offhand...
(In reply to comment #18) > And in particular, it's not clear to me why bug 536564 would lead to less > branding here, offhand... brendan: bz: we see this.foo = function... where the function is not a null closure, we unbrand this brendan: first thing in the [Field] ctor Comment 5 does not demonstrate shape regeneration per new Field, and I confirmed via dis('-r', FluidField) that Field unbrands |this|. Need to see what guards are exiting trace. /be
OK. So if I look at the trace exits, the vast majority are one of: leaving trace at http://nerget.com/fluidSim/pressure-display.js:49@124, op=call, lr=0x16a439c, exitType=BRANCH, sp=4, calldepth=0, cycles=0 leaving trace at http://nerget.com/fluidSim/pressure-display.js:49@115, op=callprop, lr=0x1652224, exitType=BRANCH, sp=1, calldepth=0, cycles=0 leaving trace at http://nerget.com/fluidSim/pressure-display.js:49@124, op=call, lr=0x16b3c04, exitType=BRANCH, sp=4, calldepth=0, cycles=0 leaving trace at http://nerget.com/fluidSim/pressure-display.js:49@115, op=callprop, lr=0x169c4a4, exitType=BRANCH, sp=1, calldepth=0, cycles=0 The relevant side exits seem to be: About to try emitting guard code for SideExit=0x16a439c exitType=BRANCH ... callee_obj->getPrivate() = int 357565760 ld1 = ld $stack14 eq1 = eq ld1, callee_obj->getPrivate() xf1: xf eq1 -> pc=0x1637ae8 imacpc=0x0 sp+32 rp+0 (GuardID=001) About to try emitting guard code for SideExit=0x16a439c exitType=BRANCH OBJ_GET_PARENT(cx, callee_obj) = int 357591200 ld2 = ld $stack14 eq2 = eq ld2, OBJ_GET_PARENT(cx, callee_obj) xf2: xf eq2 -> pc=0x1637ae8 imacpc=0x0 sp+32 rp+0 (GuardID=002) About to try emitting guard code for SideExit=0x1652224 exitType=BRANCH shape = ld map 37285 = int 37285 guard_kshape = eq shape, 37285 xf3: xf guard_kshape -> pc=0x1637adf imacpc=0x0 sp+8 rp+0 (GuardID=003) 0x16b3c04 is the function-identity and callee parent guard again, and 0x169c4a4 is another kshape guard. This is all on t-m tip.
When I say "vast majority", btw, of 135k total exits, about 91k are on the callee_obj guard and about 37k are on the kshape guard.
OK. The callee guard that fails is the Call object identity guard. See TraceRecorder::guardCallee the second guard it puts in. If I comment out that code block I see a bit of a win on the testase (maybe from 30fps to 36fps) and BRANCH exits more or less disappear from the log (283 exits out of a total of 10500).
With that code block commented out, profile looks like: 30% painting 58% under js_Interpret, almost all of it on trace. None inside js_Interpret. 4% vm_fault stuff Of the time under js_Interpret, about half is actually jit-generated code. The rest is js_DoubleToInt32 (10% of overall time), js_Array_dense_setelem_double (10% of overall time, about half creating new gcobjects), 6% js_UnboxDouble. Everything else is minor. With the code block left in, however, time in js_Interpret is 5.5%. Time under js_Interpret is 62%. So dropping this guard and not exiting on it would give us something like 10% faster performance (which is about reflected in the "just slightly higher" fps). To win more than that we need to at least stop all that double-to-int-and-back thrashing. And speed up the painting part (need separate bugs).
Created attachment 423311 [details] [diff] [review] patch to test This improves the all-or-nothing nature of flat closures by making an escaping function (funarg) flat if any of its upvars is invariant with initialization that dominates the closure. Any upvars not flattened must be accessed via Call objects but this wins (big, I think) over bailing to the all-upvars-need-Call-homes state. I will clean up and incorporate any testing feedback before soliciting review. /be
That patch may help a bit. Hard to tell; would need to profile; not a high priority right this second. But it should definitely go into a bug that blocks this one, not in this tracking bug.
(In reply to comment #26) > That patch may help a bit. With the parent guard elimination that depends on the changes in this patch, it should help a lot. But I need to add the elimination logic ;-)). > ... it should definitely go into a bug that blocks > this one, not in this tracking bug. Will file, thanks for reminding me. /be
Comment on attachment 423311 [details] [diff] [review] patch to test See bug 542002. /be
dmandelin got 48fps today (on tm tip, I think -- JaegerMonkey landed there earlier today). Want to get up to high 50s like the competition. ;-) /be
Over here, after the merge, I see 38fps on TM, and about 49-50 in Chrome and Safari. Opera latest is at 54. Doing a profile now.
Sorry that took a bit; shark seems to need the frame pointer to produce useful calltree output, so I had to recompile first. Looks like we're at about (on Mac; on Windows/Linux this might all be different because painting is so much of the profile): 55% painting (argb32_mark and argb32_image for the most part; mostly painting the canvas layer, but some painting of some thebes layer too, and a few percent in display list construction and such). 5% Gecko event processing (I was mousing about and whatnot) 40% JS engine. The JS engine time splits up as follows (all percentages that follow are percentages of JS engine time): 86.6% in jm-generated symbol-less code. 5.2% under stubs::BitOr (half self time, half calling ValueToECMAInt32Slow) 2% under PutImageData 1.2% under stubs::SetElem (looks like for the image data) Oh, all the above was without supervisor callstacks. If I put those back I get the usual 3.5% of vm_fault and whatnot. Upshot: The only way to make this faster is to generate something different codewise on the JM end and/or make the painting faster. Unfortunately, shark is not all that much use for the jit-generated stuff; it seems to have one entry per instruction or something, for the most part... The most common ones are (all self times, obviously): 1.7% add byte ptr [eax], al 1.5% jmp 0x4c86e883 1.5% jmp 0x1886e987 1.4% jmp 0x7486e22f 1.4% jmp 0xb886ea4d 1.3% a 9.5KB chunk of code. Looks like your typical code; lots of cmp, jmp, mov. 1.2% sldt dword ptr [eax] 1.1% sldt dword ptr [eax] 1.0% jmp 0xb086e4fa 0.7% mov dword ptr [eax], eax Anyway, the upshot is that we have jmps, we presumably have context switches (the sldt), and that's about all I can tel from this tool.
I suspect the bitor is '<expr> | 0' to convert a double to int, in which case simply exiting to the slowcase is probably very expensive relative to simply doing cvttsd2si (and associated checks)
(In reply to comment #32) > I suspect the bitor is '<expr> | 0' to convert a double to int, in which case > simply exiting to the slowcase is probably very expensive relative to simply > doing cvttsd2si (and associated checks) Thanks, I will file a bug.
The patch in bug 592631 gets us to 42fps or so on my hardware (depending on how much I move the mouse; can go as low as 37 and as high as 44).
Created attachment 471373 [details] Shell test case Here's a shell version. The timing loop is all the way at the bottom if you want to tweak it.
Results from the shell benchmark, time taken to render each frame (in a run of 1000 frames), all 32-bit builds on OSX10.6: TM tip 6.8 ms TM tip + bug 592631 6.5 ms JSC 7.0 ms
One problem here is that x/y in JM and JSC does not convert the result to int32 if it's 0. V8 does division differently (they try idiv first), so they end up with ints everywhere which is obviously faster. I'm working on a patch to speedup JSOP_DIV by making our double-to-int logic work when the result is 0.
The problem is that there are two different 0s, right? So you have to be a little careful with that.... In any case, dmandelin's numbers suggest that painting is the place we really need to improve here at this point.
(In reply to comment #39) > The problem is that there are two different 0s, right? So you have to be a > little careful with that.... Yes.. > In any case, dmandelin's numbers suggest that painting is the place we really > need to improve here at this point. Sure. I don't know how much this jsop_div thing helps. dmandelin's numbers show that SM is fast enough here. Maybe my patch is not worth it and I will just drop it then.
Jan: don't give up on account of just one benchmark! What bug are you using for your idiv work? /be
Yeah, I wouldn't give up on idiv yet, either. Although the problem does seem to be more on the gfx side, your x|0 patch makes a very noticeable difference here. Any JS wins we can find are worthwhile.
Right; I wasn't trying to discourage you. The 0 thing has come up on other testcases too, so if we can fix it, great. I just don't expect it to give us 10fps here. ;)
On my Mac, we spend 78% of the time in and under JS_CallFunctionValue on trunk. I guess that's just lack of JM. We spend 16% of the time (i.e. most of the rest) in BasicCanvasLayer::Paint just copying the canvas surface to the window, upscaled by by 4. Now the interesting thing is that BasicCanvasLayer::Paint uses EXTEND_PAD --- which is correct, but for cairo-quartz this always means taking a path involving CGPatterns which is quite likely to be ... not as fast as it could be.
> I guess that's just lack of JM. Yeah. Is there a way to deal with the CGPattern/EXTEND_PAD thing?
Yes. By treating it as EXTEND_NONE (which is completely reasonable, because as it turns out the current code is treating it as EXTEND_REPEAT, which is much more wrong), things get a little bit better. From 16% of my profile I get down to 13.5%. That's taken off about 15% of the paint time, actually more since I get more FPS now. Not great, but worth having. Filed bug 593270. It's not clear that we can do much better than that using Quartz, though. Now we're just doing a bog-standard CGContextDrawImage, which does scaling, and there's no caching available since the image is different every time. The good news is that we have some bigger guns in our graphics arsenal now. With D2D, D3D9 or GL enabled, the canvas compositing happens on the GPU. Indeed, in my Mac build if I set MOZ_ACCELERATED=11, painting goes from 13.5% to 2% of the profile. So I would really like to know how we perform in a JM build with MOZ_ACCELERATED=11 and/or D2D enabled!
(In reply to comment #46) > The good news is that we have some bigger guns in our graphics arsenal now. > With D2D, D3D9 or GL enabled, the canvas compositing happens on the GPU. > Indeed, in my Mac build if I set MOZ_ACCELERATED=11, painting goes from 13.5% > to 2% of the profile. Actually that was overselling the benefit, because on trunk GL-layers, the main thread ends up blocked waiting for the GPU for a while when we call glFlush at the end of each paint. In the profile I did, Shark hadn't sampled the main thread while it was blocked. Fixing that shows that GL layers isn't really making any difference to this benchmarks on Mac. We shouldn't actually have to call glFlush and block though. If I remove it, performance is great, although unfortunately nothing actually renders on the screen. I filed bug 593342 to look into that.
> So I would really like to know how we perform in a JM build with > MOZ_ACCELERATED=11 and/or D2D enabled! JM+TM + bug 592631, no moz_accel: 42fps, going down to 39 if I mouse JM+TM + bug 592631, moz_accel: 55fps, going down to 53 if I mouse This is without doing anything interesting with glFlush... ;) "Let's ship it".
Note, btw, that CPU usage is about 60% in the latter test, and 97% in the former. So we're definitely blocking on the flush; it's just _still_ faster than doing the composite ourselves. At least on my hardware. Just grinding out the numbers, without accel we take 24ms per frame, of which about half is painting. With accel we take 18ms per frame. So accel cut the painting time in half. If not having to glFlush will further cut it by a factor of 3 (good approximation per comment 46), then we'll be at 14ms per frame or 71fps, with no changes to js engine. At that point the idiv fix is likely to give a measurable improvement (since even a 2% change will be several fps).
(In reply to comment #43) > Right; I wasn't trying to discourage you. No problem. > The 0 thing has come up on other > testcases too, so if we can fix it, great. Problem is that the most recent version no longer does division; it multiplies instead.. Do you remember one of these other benchmarks where this matters? Fortunately, my current WIP-patch is very small and hackish. I can polish it when I know it helps somewhere.
> Do you remember one of these other benchmarks where this matters? Sorry, not offhand.... :(
Jeff: this bug is meta, although against JS Engine (for want of a "fluidsim" or _sui generis_ component). Please file a painting perf bug and make it block this one. Thanks, /be
We don't need a new bug, I filed a bug for the only clear painting issue here.
(In reply to comment #53) > We don't need a new bug, I filed a bug for the only clear painting issue here. Ok, then [painting-perf] goes there, not here. /be
Is this still being worked on? I've been checking back on the nighly builds since post @http://blog.mozilla.com/dmandelin/2010/09/08/presenting-jagermonkey/ The primary demo works great (~45-47FPS), but when the alternate drawing mode is used, frame rate takes a dive (~4-15FPS). P.S. MOZ_ACCELERATED=11 doesn't work here yet (Linux)
I don't know that anyone's profiled the alternate drawing mode yet. On Linux, if your drivers are good canvas is already accelerated via XRender. OF course most drivers on Linux are not good....
On the other hand, I dunno whether Render accelerates the canvas upscale, which is the bottleneck here. It might not.
For what it's worth, on a Mac I see the primary demo at 65fps and the alternate drawing mode at 55fps in a current trunk build. Chrome 9 dev on the same hardware is 55fps and 45fps respectively, for comparison. So I'd guess your Linux issues are all graphics.
Just retested Fx (Dec 10 build, from http://nightly.mozilla.org/js-preview.html): * primary mode: 44-47 FPS * alternate mode: 3-15 FPS You're right, it finally works with MOZ_ACCELERATED=11! Unfortunately, it works slower :( * primary mode: 41-44 FPS * alternate mode: 2-10 FPS Chromium 8.0.560.0 (from http://repos.fedorapeople.org/repos/spot/chromium/fedora-14/): * primary mode: 35 FPS (no variation while moving the mouse) * alternate mode: 15-24 FPS (animation appears smooth, even at 15) I also tried Chromium with --disable-accelerated-compositing (it's apparently enabled by default), but no major variation in speed
So, should I file another bug? Or is it a lost cause?
Probably worth filing one on the Linux graphics issue, yes. Make sure to include information about your Xorg in the bug report!
Added bug 620065
This does very well with TI. 50+ fps in the browser with default resolution 64, trunk is somewhere between 40-45. With resolution 128 TI+JM gets 35+ fps, slightly faster than Chrome, trunk is at 25 fps. For the shell test case: -m -n : 366 ms d8 : 405 ms -m -j -p : 512 ms -m : 721 ms -j : 950 ms
Are we fast yet? I'm going to throw this back into the pool. Jan, anyone: feel free to take and drive if there are short-term wins. TI is trying to land, so mark FIXED when you can -- no point keeping an unfixable bug around. /be
(In reply to Brendan Eich [:brendan] from comment #64) > TI is trying to land, so mark FIXED > when you can -- no point keeping an unfixable bug around. Just tested this on m-c: 60+ fps and a bit faster than Chrome. Nothing to do here.
But it's still only 6fps at "Resolution: 512". Not sure it's worth worrying about, but if we can get _that_ fast, that would rock. ;)
Plus ours looks better than Chrome's because we use bilinear scaling to scale up the canvas.