Bug 612930 (Open): opened 14 years ago, updated 3 months ago

HTML5 Sudoku slower than Opera and Chrome

Categories: Core :: JavaScript Engine, defect, P3
People: Reporter: jandem; Unassigned
References: Blocks 1 open bug
Keywords: perf
Whiteboard: [jsperf] ietestdrive
Attachments: 3 files

Attached file Shell test case
I get these numbers on the attached shell test case:

  v8:       1.321s
  -m:       4.356s
  -j:       4.805s
  -j -m:    4.306s
  -j -m -p: 4.322s

Browser test case in Opera/Safari:

  Opera 11: 1.743s
  Safari 5: 2.014s

For -m, Shark shows 12.5% under SetElem, 11.6% under GetElem, and 6-8% each for array_getProperty, js::StrictlyEqual, and js_NumberToStringWithBase.
Whiteboard: [jsperf]
This could be one of those places where we need dense arrays with properties.
I looked at the PICs output. A big problem here is that there are integers stored as doubles, which hit slow paths in the GetElem/SetElem ICs. If I add "|0" after board[..][..] on lines 153, 156, 157, and 163, time drops to 2.2 seconds with -m, while v8 improves to 1.270s. Wasn't there a bug about converting to int32 in SetElem? Or should we add a double path to the ICs? I will investigate where these doubles come from.
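For readers unfamiliar with the trick, here is a minimal sketch of the "|0" coercion described above; the board and indices are hypothetical, and the point is only the pattern, not the engine internals:

```javascript
// Integers that arrive as doubles (e.g. out of JSON.parse) can hit the
// slow path in the element ICs. "|0" applies ToInt32, pinning the value
// to an int32 representation without changing its numeric value.
var board = [[1.0, 2.0], [3.0, 4.0]]; // ints possibly stored as doubles
var row = 0, col = 1;

var raw = board[row][col];         // may flow through as a double
var coerced = board[row][col] | 0; // forced to int32; same value
```

The value is unchanged for any integer in int32 range; only the internal representation (and hence which IC path is taken) differs.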
For what it's worth, on Windows I see a current m-c browser build get about 5.2s (-m -j -p there, obviously) while IE9pp7 gets 1.2s.

I did a profile of the shell -m testcase (64-bit, Mac). Summary:

  46% in mjit-generated code
  27% under stubs::GetElem (7% self time, 11% calling js_ValueToString, 7% array_getProperty self time, 3% js_AtomizeString)
  10% under stubs::SetElem (all self time)
  6% under stubs::StrictEq (self time and js::StrictlyEqual; it looks like they use === and !== with abandon in this code)
  4% json_parse + json_stringify (round-tripping the board or something)
  2% stubs::NewArray
  2% array_slice
Whiteboard: [jsperf] → [jsperf] ietestdrive
OK, so the shell testcase on tm tip runs in about 3.0s on my mac. If I apply the patch from bug 604905, this drops to 2.6s (presumably due to nixing all that double-to-string crap). v8 shell is at 1.1s on the same hardware, fwiw.
Depends on: 604905
And with that patch applied, we have:

  55% mjit-generated
  16% stubs::GetElem (10% self time, 6% array_getProperty)
  12% stubs::SetElem (almost all self time)

Other stuff is about as before.
I see no calls to makeDenseArraySlow() in this benchmark, other than the single call for Array.prototype.
Depends on: 590161
(In reply to comment #2)
> I will investigate where these doubles come from.

OK, so it clones the board using this line of code:

  var board = JSON.parse(JSON.stringify(inputBoard));

If I add board[i][j] |= 0; to the loop right after that, I get 1.6s, 1.324s in V8. So JSON.parse should really try to convert numbers to int32; I will file a follow-up bug.
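A self-contained sketch of the workaround described above (inputBoard here is a made-up example, not the benchmark's real data):

```javascript
// The benchmark deep-clones the board via a JSON round-trip, which can
// leave integer cells stored as doubles. Walking the clone and applying
// "|= 0" (ToInt32) to each cell restores the int32 representation; the
// numeric values are unchanged.
var inputBoard = [[5, 3, 0], [6, 0, 0], [0, 9, 8]];
var board = JSON.parse(JSON.stringify(inputBoard));

for (var i = 0; i < board.length; i++) {
  for (var j = 0; j < board[i].length; j++) {
    board[i][j] |= 0; // coerce each cell back to int32
  }
}
```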
As far as tracing this goes, we end up with a bunch of typemap mismatches for reasons I'm not smart enough to comprehend (I vs U, I vs D, etc.), end up blacklisting a bunch of stuff, and then have on the order of 1e6 trace enters/exits during the run...
Depends on: 612989
With the patch for bug 612989 applied we're still 300 ms slower than V8. solveSudoku is called 5718 times. For this micro-benchmark:

----
function f() {
  var t0 = new Date;
  for (var x = 0; x < 5718; x++) {
    var possibilities = [[], [], [], [], [], [], [], [], []];
    for (var i = 0; i < 9; i++) {
      for (var j = 0; j < 9; j++) {
        possibilities[i][j] = [false, true, true, true, true, true, true, true, true, true];
      }
    }
  }
  print(new Date - t0);
}
f();
----

We get these numbers:

  -m -j -p: 154 ms
  v8: 18 ms

That's a 136 ms gap, almost half of the remaining slowdown vs V8; will measure again after bug 606477 lands.
Depends on: 606477
The rest of this benchmark is mostly about accessing arrays. Attached is an example which shows the kind of stuff the solver does over and over again. -m: 378 ms v8: 270 ms Can we make this as fast as V8? Perf will be close to Chrome (and maybe IE9) with this fixed.
> Can we make this as fast as V8? For what it's worth, that testcase is faster under -j than in v8 (170ms vs 200ms here; -m is at 320ms, as is -j -m -p)...
For the comment 10 benchmark, when running under -m all our time (well, 99.8% of it) is spent in mjit-generated code. So you'd need to stare at the instruction dump to see what can be improved; I just tried JMFLAGS=insns, but there are all sorts of jb instructions whose targets aren't clear in the instruction stream, so the control flow is... hard to follow. I bet TM does some CSE here, and possibly even some DCE, though maybe not.
(In reply to comment #12)
> I bet TM does some CSE here, and possibly even some DCE, though maybe not.

  for (var k = 0; k < 9; k++) {
    board[k][row];
    board[k][row];
    board[col][k];
  }

Lots of CSE: there are 6 GETELEMs; the middle two are completely removed because they are identical to the first two. Plus we only do the "is board an Array" test once. No high-level DCE, though, just the usual low-level Nanojit stuff. If you'd like me to write a fragile and unsound high-level DCE transformation, please file a follow-up bug ;) More seriously, should the tracer be used more for this code than it is?
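To make the CSE concrete, here is a source-level sketch of what the tracer effectively does to the loop above; this is an illustration in plain JavaScript, not engine code, and the names (board, row, col, use, sink) are all hypothetical:

```javascript
// A tiny board and a side-effecting consumer so the reads aren't dead.
var board = [[1, 2, 3], [4, 5, 6], [7, 8, 9]];
var row = 1, col = 2;
var sink = 0;
function use(v) { sink += v; }

// As written: six GETELEMs per iteration, two of them redundant.
for (var k = 0; k < 3; k++) {
  use(board[k][row]);
  use(board[k][row]); // identical to the read above
  use(board[col][k]);
}

// After CSE, conceptually: the duplicate read is computed once and reused.
for (var k = 0; k < 3; k++) {
  var t = board[k][row]; // shared subexpression, evaluated once
  use(t);
  use(t);
  use(board[col][k]);
}
```

Both loops produce identical results; the second simply avoids re-executing the duplicated element loads, which is legal here because nothing mutates board between the two reads.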
Nicholas, see comment 8. :( If we can get those typemaps to match so we don't abort+blacklist everything, I bet the tracer would kick ass here.
(In reply to comment #14)
> Nicholas, see comment 8. :( If we can get those typemaps to match so we don't
> abort+blacklist everything, I bet the tracer would kick ass here.

Btw, this test case uses arrays with booleans and then adds numbers to them. Could that be confusing the tracer? Might be an interesting test case for type inference too.
(In reply to comment #15)
> Btw, this test case uses arrays with booleans and then adds numbers to it.
> Could that be confusing the tracer? Might be an interesting test case for type
> inference too.

Where do the booleans come into play? The test case in comment 10 gives me these times:

  v8: 282
  -m: 322
  -j: 205
  -m (inference): 159

Since board and its component arrays are all packed (bug 604045), JM accesses them at a speed similar to Java --- the main difference is that we still need regalloc (bug 609899) to keep the locals in registers around the loop.
(In reply to comment #16)
> Where do the booleans come into play?

I meant the shell version of the benchmark in comment 0 (sorry, shouldn't have used "test case" here). The "possibilities" array has arrays of booleans, then it assigns numbers to index 10 and 11:

  var currentPos = possibilities[i][j];
  ...
  zoneRow = getZone(i) * 3;
  zoneCol = getZone(j) * 3;
  currentPos[10] = zoneRow;
  currentPos[11] = zoneCol;

> v8: 282
> -m: 322
> -j: 205
> -m (inference): 159

Impressive! The sudoku solver accesses board (and other arrays) a lot; if we can make that as fast as this, I guess we will beat Chrome and IE9.
(In reply to comment #17)
> I meant the shell version of the benchmark in comment 0. The "possibilities"
> array has arrays of booleans, then it assigns numbers to index 10 and 11:
>
>   var currentPos = possibilities[i][j];
>   ...
>   currentPos[10] = zoneRow;
>   currentPos[11] = zoneCol;

Ah, yeah, that will confuse the inference. It will know all these arrays are packed, but will not know whether the values read out of currentPos are ints or bools, as it does not distinguish between different elements of an array. When zoneRow and zoneCol are subsequently read out of currentPos (zoneRow = currentPos[10], etc.) they will be treated as int|bool, along with everywhere they are passed. This could be fixed by sticking type checks on these assignments or on calls which pass zoneRow/zoneCol around, to limit the effects of the imprecision. Earlier versions of bug 557407 did this, but I don't think this is critical for this testcase, as zoneRow/zoneCol don't look to be passed to very many places. (More important for the inference is better handling for JSON.parse, as it will not currently know anything about the contents of board.)
I modified the shell test case to clone the board explicitly, rather than use JSON, so the inference can understand it (should not be too hard to get the inference to understand JSON.parse natively, or even see through JSON.parse(JSON.stringify(blah)) calls). I get these times:

  v8: 1.05s
  -m: 1.31s
  -m (inference): 1.02s

I then modified the test case (attached) to reuse the ~470k arrays in possibilities and the ~250k arrays created by slice() in checkFreedoms. This decouples GC perf, to measure our time without considering the overhead of doing collections, allocating lots of objects, or taking faults from having a huge heap (the working set here is small). I get these times:

  v8: 1.07s
  -m: 1.12s
  -m (inference): .78s

With type inference and a generational GC this benchmark will look really good.
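The "clone the board explicitly" change can be sketched as follows; cloneBoard is a hypothetical name, and the actual modification in the attached test case may differ in detail:

```javascript
// Replace JSON.parse(JSON.stringify(inputBoard)) with a plain nested-loop
// deep copy. The inference can then see that every element of the clone
// has the same (int32) type as the source, with no JSON round-trip
// turning integers into doubles along the way.
function cloneBoard(inputBoard) {
  var board = [];
  for (var i = 0; i < inputBoard.length; i++) {
    var row = [];
    for (var j = 0; j < inputBoard[i].length; j++) {
      row[j] = inputBoard[i][j];
    }
    board[i] = row;
  }
  return board;
}
```

This is only a correct substitute because the board is a rectangular array of primitives; for arbitrary object graphs the JSON round-trip and a hand-written clone are not interchangeable.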
Keywords: perf
Added this testcase to the assorted page on AWFY (bug 649487, needs more tests!). Right now we're slower than V8, I think because we no longer track property types for objects produced by JSON (did not update after Waldo's JSON parser rewrite; should be easy to fix).
Just FYI, here are my test results with this IE Test Drive demo. My PC specs ==> http://pastebin.com/6j8as3XE

I ran HTML5 Sudoku (Level: EASY - # of games: 5000) 5 times:

  1.623s
  1.569s
  1.511s
  1.453s
  1.691s
(In reply to RNicoletto from comment #21)
> Just FYI, here you are my test results with this IE Test Drive demo.
> My PC specs ==> http://pastebin.com/6j8as3XE
>
> I run HTML5 Sudoku (Level: EASY - # of games: 5000) 5 times:
>
> 1.623s
> 1.569s
> 1.511s
> 1.453s
> 1.691s

If you are going to post results, you need to also include the results from at least one of the other browsers to make them meaningful. Chrome would probably be the most useful comparison, but IE and Opera would also be interesting to see. Remember to include their versions as well.
(In reply to Daniel Cater from comment #22)
> If you are going to post results, you need to also include the results from
> at least one of the other browsers to make them meaningful.

UHMMM... For some reason my comment has been truncated. OK, I'll re-post it completely.

Mozilla Firefox latest nightly + "layers.acceleration.force-enabled" = TRUE

  1.364s
  1.276s
  1.286s
  1.335s
  1.320s

Google Chrome Canary + "Override software rendering list" = ENABLED

  1.165s
  1.061s
  1.110s
  1.048s
  1.297s

Opera Browser 11.61

  2.091s
  2.034s
  2.060s
  2.171s
  2.143s
Win7:

  Nightly 27 - 0.91s
  Chrome 29 - 0.82s
  IE 10 - 1.07s
Assignee: general → nobody
Severity: normal → S3

"Minimal array testcase"
Nightly: 35ms
Chrome: 18ms

"Minimal array testcase" 100x loopcount
Nightly: 3300ms (https://share.firefox.dev/3AmOxWg)
Chrome: 1700ms

We are ~2x slower here.
Is this something that should be looked into further?

Flags: needinfo?(iireland)

The fact that this testcase doesn't actually do anything with the result of the loads is kind of weird. A really smart compiler could optimize this whole thing away.

This all looks vaguely similar to other array-related things I've looked at lately: we're spending time on bounds checks and unboxing, with no obviously bad code, and so is V8. Somewhere in all that, V8 is ending up a bit faster. There are enough similar-but-more-relevant benchmarks out there with the same problem to make me think that looking at this one specifically isn't a high priority.

Flags: needinfo?(iireland)
Priority: -- → P3