Bug 508716: Fluid dynamics simulator is slow
Status: RESOLVED FIXED (opened 15 years ago, closed 13 years ago)
Component: Core :: JavaScript Engine (defect)
Reporter: bzbarsky; Assignee: Unassigned
Keywords: meta
Attachments: 4 files, 1 obsolete file
Description (comment 0):
I get 12-13fps in t-m; Safari 4 gets closer to 40fps.
Stats:
recorder: started(116), aborted(104), completed(15), different header(1), trees trashed(0), slot promoted(0), unstable loop variable(0), breaks(0), returns(0), unstableInnerCalls(4), blacklisted(26)
monitor: triggered(27), exits(27), type mismatch(0), global mismatch(4)
Typical abort:
abort: 6918: fp->scopeChain is not global or active call object
Abort recording of tree http://nerget.com/fluidSim/pressure.js:290@137 at http://nerget.com/fluidSim/pressure.js:291@138: name.
Comment 1 • 15 years ago (Reporter)
Comment 2 • 15 years ago (Reporter)
Comment 3 • 15 years ago (Reporter)
Comment 4 • 15 years ago (Reporter)
Oh, and a profile shows 55% of the time spent in (not under) js_Interpret, another 11% allocating doubles, and another 15% in js_NativeGet, especially with Call objects.
That adds up to 80% or so. 17% is painting, and 3% is sort of here and there (2/3 of that is minor stuff under js_Interpret, the remaining 1/3 is event loop overhead, I think).
In any case, this is definitely in the right component, at least until we get this tracing...
Summary: Fluid dynamics simulators is slow → Fluid dynamics simulator is slow
Comment 5 • 15 years ago (Reporter)
I also see a lot of:
Abort recording of tree https://bug508716.bugzilla.mozilla.org/attachment.cgi?id=392847:47@96 at https://bug508716.bugzilla.mozilla.org/attachment.cgi?id=392847:49@109: Inner tree is trying to grow, abort outer recording.
Line 49 is:
var d = field.getDensity(x, y) * 255 / 5;
where |field| comes from this:
displayFunc(new Field(dens, u, v));
and Field has a getDensity function. I have no idea why that's causing us trace aborts; perhaps something akin to bug 497789?
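For reference, here is a minimal sketch of the pattern involved. This is a hypothetical reconstruction (Field and getDensity come from the comment above, but rowSize and the indexing are assumptions, not the actual fluidSim source): every frame allocates a fresh Field whose method is a new closure, which is exactly the kind of thing callee-identity and shape guards key on.

```javascript
// Hypothetical reconstruction of the pattern (NOT the actual fluidSim
// source): Field wraps the simulation arrays, and getDensity is a
// closure created anew for every Field instance.
function Field(dens, u, v) {
    var rowSize = 66; // assumed: 64-cell grid plus boundary cells
    this.getDensity = function (x, y) {
        return dens[(y + 1) * rowSize + (x + 1)];
    };
}

var dens = new Array(66 * 66).fill(0);
dens[1 * 66 + 1] = 5;                      // cell (0, 0)
var field = new Field(dens, null, null);
var d = field.getDensity(0, 0) * 255 / 5;  // the line-49 expression
```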
Depends on: 497789
Comment 6 • 15 years ago (Reporter)
Or some sort of branding thing going on, perhaps? Brendan? The relevant guard seems to be:
About to try emitting guard code for SideExit=0x17bb1ac exitType=BRANCH
shape = ld map[32]
#0x8756
guard_kshape = eq shape, #0x8756
xf1512: xf guard_kshape -> pc=0x160ded9 imacpc=0x0 sp+8 rp+0
The other branch exit I see some of is:
Abort recording of tree https://bug508716.bugzilla.mozilla.org/attachment.cgi?id=392846:62@120 at https://bug508716.bugzilla.mozilla.org/attachment.cgi?id=392846:64@138: Inner tree is trying to grow, abort outer recording.
Line 64 is:
var lastRow = (j - 1) * rowSize;
Relevant guard:
About to try emitting guard code for SideExit=0x17478e0 exitType=OVERFLOW
eq134 = eq mul19, callDepth
xt165: xt eq134 -> pc=0x16156d6 imacpc=0x0 sp+16 rp+0
Comment 7 • 15 years ago
The 2nd guard looks like upvar access. Paging dmandelin.
Comment 8 • 15 years ago
'Inner tree is trying to grow, abort outer recording.' is usually not a problem. In the normal case, we back off from tracing the outer loop for 32 iterations, by which time the inner tree hopefully has stabilized. We should be able to get the outer loop after one or a few rounds of that.
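To illustrate the outer/inner tree relationship being described, here is a sketch of the general loop shape only (not the actual solver code):

```javascript
// Illustrative nested-loop shape: the tracer compiles the inner loop as
// its own tree first; while that tree is still growing branch traces,
// recording of the outer loop aborts and backs off.
function relax(dens, rowSize, rows) {
    for (var j = 1; j < rows; j++) {            // outer tree
        var row = j * rowSize;
        for (var i = 1; i < rowSize - 1; i++) { // inner tree
            dens[row + i] = (dens[row + i - 1] + dens[row + i + 1]) / 2;
        }
    }
    return dens;
}
```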
The abort in comment 0 looks like the returned closures abort.
Comment 9 • 15 years ago (Reporter)
Yes, the data in comment 6 is with the returned closures patch applied (which makes the abort in comment 0 go away).
In this case, we don't stabilize as far as I can see; I'm seeing the trying to grow aborts all through the log, and it's certainly running long enough that the outer tree ends up blacklisted as a result of all the aborts.
Comment 10 • 15 years ago
Hmmm. One thing I thought of is that we might generate so much code that we end up OOMing the code buffer and throwing it all away, but it sounds like you are saying that's not happening. I wonder if the inner loop code is so branchy that it can't stabilize in the number of backoffs we give. Based on:
About to try emitting guard code for SideExit=0x17bb1ac exitType=BRANCH
shape = ld map[32]
#0x8756
guard_kshape = eq shape, #0x8756
xf1512: xf guard_kshape -> pc=0x160ded9 imacpc=0x0 sp+8 rp+0
This looks like a guard from a property access. If a zillion shapes go through, then we would try to grow a lot of branch exits.
Comment 11 • 15 years ago (Reporter)
Yeah, that was my gut reaction too. Hence wondering about bug 497789 or branding or some such...
Updated • 15 years ago
Assignee: general → brendan
Comment 12 • 15 years ago (Reporter)
dvander, how does trace merging behave on this one?
Component: JavaScript Engine → jemalloc
Updated • 15 years ago
Component: jemalloc → JavaScript Engine
Comment 13 • 15 years ago
The fix for bug 536564 seems to have helped quite a bit: with the latest tm nightly I get >30fps with black canvas and >26fps upon stirring up some green.
Boris, can you regen some stats? Or perhaps someone can make a shell testcase. I wish (along with Oliver) that we had TraceVis built in...
/be
Comment 14 • 15 years ago
(In reply to comment #11)
> Yeah, that was my gut reaction too. Hence wondering about bug 497789
No .prototype in either .js file, so 497789 is not relevant.
> or branding or some such...
Indeed bug 536564 was all about unbranding to despecialize module pattern and power constructor (if that's what it is called) pattern code.
/be
Comment 15 • 15 years ago
I should probably give this bug to someone. Volunteers?
/be
OS: Mac OS X → All
Hardware: x86 → All
Comment 16 • 15 years ago (Reporter)
I don't see any impact from bug 536564's fix on this testcase in terms of fps. I see about 22-30fps (depending on amount of green) both before and after that fix here. I'll try to get some stats.
Comment 17 • 15 years ago (Reporter)
I'm still seeing the abort from comment 5 on t-m tip.
Comment 18 • 15 years ago (Reporter)
And in particular, it's not clear to me why bug 536564 would lead to less branding here, offhand...
Comment 19 • 15 years ago
(In reply to comment #18)
> And in particular, it's not clear to me why bug 536564 would lead to less
> branding here, offhand...
brendan: bz: we see this.foo = function... where the function is not a null closure, we unbrand this
brendan: first thing in the [Field] ctor
Comment 5 does not demonstrate shape regeneration per new Field, and I confirmed via dis('-r', FluidField) that Field unbrands |this|. Need to see what guards are exiting trace.
/be
Comment 20 • 15 years ago (Reporter)
OK. So if I look at the trace exits, the vast majority are one of:
leaving trace at http://nerget.com/fluidSim/pressure-display.js:49@124, op=call, lr=0x16a439c, exitType=BRANCH, sp=4, calldepth=0, cycles=0
leaving trace at http://nerget.com/fluidSim/pressure-display.js:49@115, op=callprop, lr=0x1652224, exitType=BRANCH, sp=1, calldepth=0, cycles=0
leaving trace at http://nerget.com/fluidSim/pressure-display.js:49@124, op=call, lr=0x16b3c04, exitType=BRANCH, sp=4, calldepth=0, cycles=0
leaving trace at http://nerget.com/fluidSim/pressure-display.js:49@115, op=callprop, lr=0x169c4a4, exitType=BRANCH, sp=1, calldepth=0, cycles=0
The relevant side exits seem to be:
About to try emitting guard code for SideExit=0x16a439c exitType=BRANCH
...
callee_obj->getPrivate() = int 357565760
ld1 = ld $stack14[16]
eq1 = eq ld1, callee_obj->getPrivate()
xf1: xf eq1 -> pc=0x1637ae8 imacpc=0x0 sp+32 rp+0 (GuardID=001)
About to try emitting guard code for SideExit=0x16a439c exitType=BRANCH
OBJ_GET_PARENT(cx, callee_obj) = int 357591200
ld2 = ld $stack14[12]
eq2 = eq ld2, OBJ_GET_PARENT(cx, callee_obj)
xf2: xf eq2 -> pc=0x1637ae8 imacpc=0x0 sp+32 rp+0 (GuardID=002)
About to try emitting guard code for SideExit=0x1652224 exitType=BRANCH
shape = ld map[4]
37285 = int 37285
guard_kshape = eq shape, 37285
xf3: xf guard_kshape -> pc=0x1637adf imacpc=0x0 sp+8 rp+0 (GuardID=003)
0x16b3c04 is the function-identity and callee parent guard again, and 0x169c4a4 is another kshape guard.
This is all on t-m tip.
Comment 21 • 15 years ago (Reporter)
When I say "vast majority", btw, of 135k total exits, about 91k are on the callee_obj guard and about 37k are on the kshape guard.
Comment 22 • 15 years ago (Reporter)
OK. The callee guard that fails is the Call object identity guard; see the second guard TraceRecorder::guardCallee puts in.
If I comment out that code block, I see a bit of a win on the testcase (maybe from 30fps to 36fps), and BRANCH exits more or less disappear from the log (283 exits out of a total of 10500).
Comment 23 • 15 years ago (Reporter)
With that code block commented out, profile looks like:
30% painting
58% under js_Interpret, almost all of it on trace. None inside js_Interpret.
4% vm_fault stuff
Of the time under js_Interpret, about half is actually jit-generated code. The rest is js_DoubleToInt32 (10% of overall time), js_Array_dense_setelem_double (10% of overall time, about half creating new gcobjects), 6% js_UnboxDouble. Everything else is minor.
With the code block left in, however, time in js_Interpret is 5.5%. Time under js_Interpret is 62%. So dropping this guard and not exiting on it would give us something like 10% faster performance (which is about reflected in the "just slightly higher" fps).
To win more than that we need to at least stop all that double-to-int-and-back thrashing. And speed up the painting part (need separate bugs).
Comment 25 • 15 years ago
This improves the all-or-nothing nature of flat closures by making an escaping function (funarg) flat if any of its upvars is invariant with initialization that dominates the closure. Any upvars not flattened must be accessed via Call objects but this wins (big, I think) over bailing to the all-upvars-need-Call-homes state.
I will clean up and incorporate any testing feedback before soliciting review.
/be
Comment 26 • 15 years ago (Reporter)
That patch may help a bit. Hard to tell; would need to profile; not a high priority right this second. But it should definitely go into a bug that blocks this one, not in this tracking bug.
Comment 27 • 15 years ago
(In reply to comment #26)
> That patch may help a bit.
With the parent guard elimination that depends on the changes in this patch, it should help a lot. But I need to add the elimination logic ;-).
> ... it should definitely go into a bug that blocks
> this one, not in this tracking bug.
Will file, thanks for reminding me.
/be
Comment 28 • 15 years ago
Attachment #423311 - Attachment is obsolete: true
Comment 29 • 14 years ago
dmandelin got 48fps today (on tm tip, I think -- JaegerMonkey landed there earlier today). Want to get up to high 50s like the competition. ;-)
/be
Comment 30 • 14 years ago (Reporter)
Over here, after the merge, I see 38fps on TM, and about 49-50 in Chrome and Safari. Opera latest is at 54.
Doing a profile now.
Comment 31 • 14 years ago (Reporter)
Sorry that took a bit; shark seems to need the frame pointer to produce useful calltree output, so I had to recompile first.
Looks like we're at about (on Mac; on Windows/Linux this might all be different because painting is so much of the profile):
55% painting (argb32_mark and argb32_image for the most part; mostly painting the canvas layer, but some painting of some thebes layer too, and a few percent in display list construction and such).
5% Gecko event processing (I was mousing about and whatnot)
40% JS engine.
The JS engine time splits up as follows (all percentages that follow are percentages of JS engine time):
86.6% in jm-generated symbol-less code.
5.2% under stubs::BitOr (half self time, half calling ValueToECMAInt32Slow)
2% under PutImageData
1.2% under stubs::SetElem (looks like for the image data)
Oh, all the above was without supervisor callstacks. If I put those back I get the usual 3.5% of vm_fault and whatnot.
Upshot: The only way to make this faster is to generate something different codewise on the JM end and/or make the painting faster.
Unfortunately, shark is not all that much use for the jit-generated stuff; it seems to have one entry per instruction or something, for the most part... The most common ones are (all self times, obviously):
1.7% add byte ptr [eax], al
1.5% jmp 0x4c86e883
1.5% jmp 0x1886e987
1.4% jmp 0x7486e22f
1.4% jmp 0xb886ea4d
1.3% a 9.5KB chunk of code. Looks like your typical code; lots of cmp, jmp, mov.
1.2% sldt dword ptr [eax]
1.1% sldt dword ptr [eax]
1.0% jmp 0xb086e4fa
0.7% mov dword ptr [eax], eax
Anyway, the upshot is that we have jmps, we presumably have context switches (the sldt), and that's about all I can tell from this tool.
Comment 32 • 14 years ago
I suspect the bitor is '<expr> | 0' to convert a double to int, in which case exiting to the slow case is probably very expensive relative to simply doing cvttsd2si (and the associated checks).
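For reference, the `|0` idiom being discussed (a generic illustration; the benchmark's exact expression isn't quoted here):

```javascript
// '<expr> | 0' applies ToInt32: truncate toward zero, then wrap mod 2^32.
// A jit can handle the common case inline with cvttsd2si plus range
// checks instead of exiting to the slow case.
var i = 3.9 | 0;             // 3
var n = -3.9 | 0;            // -3 (truncation, not floor)
var big = 4294967296.5 | 0;  // 0  (wraps modulo 2^32)
```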
Comment 33 • 14 years ago
(In reply to comment #32)
> I suspect the bitor is '<expr> | 0' to convert a double to int, in which case
> simply exiting to the slowcase is probably very expensive relative to simply
> doing cvttsd2si (and associated checks)
Thanks, I will file a bug.
Comment 34 • 14 years ago (Reporter)
The patch in bug 592631 gets us to 42fps or so on my hardware (depending on how much I move the mouse; can go as low as 37 and as high as 44).
Comment 35 • 14 years ago
I Sharked Safari on this and got 47% of time in JavaScriptCore. Using your 50fps measurement, that corresponds to a JS run time of 9.4ms/frame.
Our "old" (before the x|0 patch) score of 38 fps, combined with 40% of time in JS, gives a JS run time of 10.5ms/frame.
I estimate our JS run time to be 8.8-9.2ms/frame after the x|0 patch, depending on whether I (a) use the fact that we spent 5.2% there before and compute 10.5ms/frame*(1-5.2/40) or (b) use the new fps of 42 with an estimated new JS fraction of 37%.
Anyway, at this point it seems more likely that we can improve our score by making the non-JS parts faster. But we should still make a shell benchmark and do a more direct comparison of the JS times to see if there we are taking any direct losses there.
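The arithmetic above is just ms/frame = 1000/fps scaled by the JS fraction of the profile; spelled out:

```javascript
// Re-deriving the per-frame JS times quoted in this comment.
function jsMsPerFrame(fps, jsFraction) {
    return (1000 / fps) * jsFraction; // total ms/frame times JS share
}
jsMsPerFrame(50, 0.47).toFixed(1); // "9.4"  -- Safari at 50fps
jsMsPerFrame(38, 0.40).toFixed(1); // "10.5" -- TM before the x|0 patch
jsMsPerFrame(42, 0.37).toFixed(1); // "8.8"  -- estimate (b) after it
```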
Comment 36 • 14 years ago
Here's a shell version. The timing loop is all the way at the bottom if you want to tweak it.
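The attachment itself isn't reproduced here, but a timing loop of the sort described would look roughly like this (simulateFrame is a stand-in for one solver step, not the real function name):

```javascript
// Sketch of a shell-style benchmark loop: run N frames, report ms/frame.
function simulateFrame() {
    var s = 0;                                  // stand-in for one solver step
    for (var i = 0; i < 1e4; i++) s += i * 0.5;
    return s;
}
function benchmark(frames) {
    var start = Date.now();
    for (var i = 0; i < frames; i++) simulateFrame();
    return (Date.now() - start) / frames;       // ms per frame
}
```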
Comment 37 • 14 years ago
Results from the shell benchmark, time taken to render each frame (in a run of 1000 frames), all 32-bit builds on OS X 10.6:
TM tip: 6.8 ms
TM tip + bug 592631: 6.5 ms
JSC: 7.0 ms
Comment 38 • 14 years ago
One problem here is that x/y in JM and JSC does not convert the result to int32 if it's 0. V8 does division differently (they try idiv first), so they end up with ints everywhere, which is obviously faster. I'm working on a patch to speed up JSOP_DIV by making our double-to-int logic work when the result is 0.
Comment 39 • 14 years ago (Reporter)
The problem is that there are two different 0s, right? So you have to be a little careful with that....
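(For the record, the two zeros in question, and why a naive double-to-int32 conversion of a zero result loses information:)

```javascript
// IEEE doubles have both +0 and -0; int32 has only one zero, so
// int-ifying a zero result can silently drop the sign.
var negZero = -1 / Infinity;  // -0
negZero === 0;                // true  -- === cannot tell them apart
1 / negZero === -Infinity;    // true  -- but division can
Object.is(negZero, -0);       // true
Object.is(negZero | 0, 0);    // true  -- |0 collapses -0 to +0
```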
In any case, dmandelin's numbers suggest that painting is the place we really need to improve here at this point.
Comment 40 • 14 years ago
(In reply to comment #39)
> The problem is that there are two different 0s, right? So you have to be a
> little careful with that....
Yes.
> In any case, dmandelin's numbers suggest that painting is the place we really
> need to improve here at this point.
Sure. I don't know how much this jsop_div thing helps. dmandelin's numbers show that SM is fast enough here. Maybe my patch is not worth it and I will just drop it then.
Comment 41 • 14 years ago
Jan: don't give up on account of just one benchmark! What bug are you using for your idiv work?
/be
Comment 42 • 14 years ago
Yeah, I wouldn't give up on idiv yet, either. Although the problem does seem to be more on the gfx side, your x|0 patch makes a very noticeable difference here. Any JS wins we can find are worthwhile.
Comment 43 • 14 years ago (Reporter)
Right; I wasn't trying to discourage you. The 0 thing has come up on other testcases too, so if we can fix it, great. I just don't expect it to give us 10fps here. ;)
On my Mac, we spend 78% of the time in and under JS_CallFunctionValue on trunk. I guess that's just lack of JM. We spend 16% of the time (i.e. most of the rest) in BasicCanvasLayer::Paint just copying the canvas surface to the window, upscaled by 4.
Now the interesting thing is that BasicCanvasLayer::Paint uses EXTEND_PAD --- which is correct, but for cairo-quartz this always means taking a path involving CGPatterns which is quite likely to be ... not as fast as it could be.
Comment 45 • 14 years ago (Reporter)
||
> I guess that's just lack of JM.
Yeah.
Is there a way to deal with the CGPattern/EXTEND_PAD thing?
Comment 46 • 14 years ago
Yes. By treating it as EXTEND_NONE (which is completely reasonable, because as it turns out the current code is treating it as EXTEND_REPEAT, which is much more wrong), things get a little bit better. From 16% of my profile I get down to 13.5%. That's taken off about 15% of the paint time, actually more since I get more FPS now. Not great, but worth having. Filed bug 593270.
It's not clear that we can do much better than that using Quartz, though. Now we're just doing a bog-standard CGContextDrawImage, which does scaling, and there's no caching available since the image is different every time.
The good news is that we have some bigger guns in our graphics arsenal now. With D2D, D3D9 or GL enabled, the canvas compositing happens on the GPU. Indeed, in my Mac build if I set MOZ_ACCELERATED=11, painting goes from 13.5% to 2% of the profile.
So I would really like to know how we perform in a JM build with MOZ_ACCELERATED=11 and/or D2D enabled!
Depends on: 593270
Depends on: 593342
Comment 47 • 14 years ago
(In reply to comment #46)
> The good news is that we have some bigger guns in our graphics arsenal now.
> With D2D, D3D9 or GL enabled, the canvas compositing happens on the GPU.
> Indeed, in my Mac build if I set MOZ_ACCELERATED=11, painting goes from 13.5%
> to 2% of the profile.
Actually that was overselling the benefit, because on trunk GL layers, the main thread ends up blocked waiting for the GPU for a while when we call glFlush at the end of each paint. In the profile I did, Shark hadn't sampled the main thread while it was blocked. Fixing that shows that GL layers isn't really making any difference to this benchmark on Mac.
We shouldn't actually have to call glFlush and block though. If I remove it, performance is great, although unfortunately nothing actually renders on the screen. I filed bug 593342 to look into that.
Comment 48 • 14 years ago (Reporter)
> So I would really like to know how we perform in a JM build with
> MOZ_ACCELERATED=11 and/or D2D enabled!
JM+TM + bug 592631, no moz_accel: 42fps, going down to 39 if I mouse
JM+TM + bug 592631, moz_accel: 55fps, going down to 53 if I mouse
This is without doing anything interesting with glFlush... ;)
"Let's ship it".
Comment 49 • 14 years ago (Reporter)
Note, btw, that CPU usage is about 60% in the latter test, and 97% in the former. So we're definitely blocking on the flush; it's just _still_ faster than doing the composite ourselves. At least on my hardware.
Just grinding out the numbers, without accel we take 24ms per frame, of which about half is painting. With accel we take 18ms per frame. So accel cut the painting time in half. If not having to glFlush will further cut it by a factor of 3 (good approximation per comment 46), then we'll be at 14ms per frame or 71fps, with no changes to js engine. At that point the idiv fix is likely to give a measurable improvement (since even a 2% change will be several fps).
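Spelling out that estimate (same approximations as above):

```javascript
// 42 fps ≈ 24 ms/frame without accel, about half of it painting.
var other = 12, paint = 12;           // ms
var accelPaint = paint / 2;           // 6 ms  -> 18 ms/frame ≈ 55 fps
var noFlushPaint = accelPaint / 3;    // 2 ms, per the comment-46 ratio
var projected = other + noFlushPaint; // 14 ms/frame
var fps = 1000 / projected;           // ≈ 71 fps
```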
Comment 50 • 14 years ago
(In reply to comment #43)
> Right; I wasn't trying to discourage you.
No problem.
> The 0 thing has come up on other
> testcases too, so if we can fix it, great.
Problem is that the most recent version no longer does division; it multiplies instead. Do you remember one of these other benchmarks where this matters?
Fortunately, my current WIP-patch is very small and hackish. I can polish it when I know it helps somewhere.
Comment 51 • 14 years ago (Reporter)
> Do you remember one of these other benchmarks where this matters?
Sorry, not offhand.... :(
Updated • 14 years ago
Whiteboard: [painting-perf]
Comment 52 • 14 years ago
Jeff: this bug is meta, although against JS Engine (for want of a "fluidsim" or _sui generis_ component). Please file a painting perf bug and make it block this one. Thanks,
/be
Keywords: meta
Comment 53 • 14 years ago
We don't need a new bug; I filed a bug for the only clear painting issue here.
Comment 54 • 14 years ago
(In reply to comment #53)
> We don't need a new bug, I filed a bug for the only clear painting issue here.
Ok, then [painting-perf] goes there, not here.
/be
Whiteboard: [painting-perf]
Comment 55 • 14 years ago
Is this still being worked on? I've been checking back on the nightly builds since the post at http://blog.mozilla.com/dmandelin/2010/09/08/presenting-jagermonkey/
The primary demo works great (~45-47FPS), but when the alternate drawing mode is used, frame rate takes a dive (~4-15FPS).
P.S. MOZ_ACCELERATED=11 doesn't work here yet (Linux)
Comment 56 • 14 years ago (Reporter)
I don't know that anyone's profiled the alternate drawing mode yet.
On Linux, if your drivers are good, canvas is already accelerated via XRender. Of course, most drivers on Linux are not good....
Comment 57 • 14 years ago (Reporter)
On the other hand, I dunno whether Render accelerates the canvas upscale, which is the bottleneck here. It might not.
Comment 58 • 14 years ago (Reporter)
For what it's worth, on a Mac I see the primary demo at 65fps and the alternate drawing mode at 55fps in a current trunk build. Chrome 9 dev on the same hardware is 55fps and 45fps respectively, for comparison.
So I'd guess your Linux issues are all graphics.
Comment 59 • 14 years ago
Just retested:
Fx (Dec 10 build, from http://nightly.mozilla.org/js-preview.html):
* primary mode: 44-47 FPS
* alternate mode: 3-15 FPS
You're right, it finally works with MOZ_ACCELERATED=11! Unfortunately, it works slower :(
* primary mode: 41-44 FPS
* alternate mode: 2-10 FPS
Chromium 8.0.560.0 (from http://repos.fedorapeople.org/repos/spot/chromium/fedora-14/):
* primary mode: 35 FPS (no variation while moving the mouse)
* alternate mode: 15-24 FPS (animation appears smooth, even at 15)
I also tried Chromium with --disable-accelerated-compositing (it's apparently enabled by default), but no major variation in speed
Comment 60 • 14 years ago
So, should I file another bug? Or is it a lost cause?
Comment 61 • 14 years ago (Reporter)
Probably worth filing one on the Linux graphics issue, yes. Make sure to include information about your Xorg in the bug report!
Comment 62 • 14 years ago
Added bug 620065
Comment 63 • 14 years ago
This does very well with TI: 50+ fps in the browser at the default resolution of 64, while trunk is somewhere between 40-45. At resolution 128, TI+JM gets 35+ fps, slightly faster than Chrome; trunk is at 25 fps.
For the shell test case:
-m -n : 366 ms
d8 : 405 ms
-m -j -p : 512 ms
-m : 721 ms
-j : 950 ms
Comment 64 • 13 years ago
Are we fast yet?
I'm going to throw this back into the pool. Jan, anyone: feel free to take and drive if there are short-term wins. TI is trying to land, so mark FIXED when you can -- no point keeping an unfixable bug around.
/be
Assignee: brendan → general
Comment 65 • 13 years ago
(In reply to Brendan Eich [:brendan] from comment #64)
> TI is trying to land, so mark FIXED
> when you can -- no point keeping an unfixable bug around.
Just tested this on m-c: 60+ fps and a bit faster than Chrome. Nothing to do here.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment 66 • 13 years ago (Reporter)
But it's still only 6fps at "Resolution: 512".
Not sure it's worth worrying about, but if we can get _that_ fast, that would rock. ;)
Plus ours looks better than Chrome's because we use bilinear scaling to scale up the canvas.