Open Bug 612930 Opened 14 years ago Updated 5 months ago

HTML5 Sudoku slower than Opera and Chrome

Categories: Core :: JavaScript Engine (defect)
Reporter: jandem
Assignee: Unassigned
Blocks: 2 open bugs
Keywords: perf
Whiteboard: [jsperf] ietestdrive
Attachments: 3 files

Attached file Shell test case
I get these numbers on the attached shell test case:

v8: 1.321s
-m: 4.356s
-j: 4.805s
-j -m: 4.306s
-j -m -p: 4.322s

Browser test case in Opera/Safari:

Opera 11: 1.743s
Safari 5: 2.014s

For -m, Shark shows 12.5% under SetElem, 11.6% under GetElem, and 6-8% each for array_getProperty, js::StrictlyEqual and js_NumberToStringWithBase.
Whiteboard: [jsperf]
This could be one of those places where we need dense arrays with properties.
I looked at the PICs output. A big problem here is that there are integers-stored-as-doubles, which hit slow paths in the GetElem/SetElem ICs. If I add "|0" after board[..][..] on lines 153, 156, 157 and 163, the time drops to 2.2 seconds with -m, while v8 improves to 1.270s.

Wasn't there a bug about converting to int32 in SetElem? Or should we add a double path to the ICs? I will investigate where these doubles come from.
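For reference, a minimal sketch of the "|0" coercion described above (illustrative only, not the actual solver code):
----
// Numbers coming out of JSON.parse may be stored as doubles; |0 coerces the
// read back to int32, so later element accesses stay on the integer fast path
// of the GetElem/SetElem ICs instead of the double-handling slow path.
var board = JSON.parse("[[1,2],[3,4]]");
var row = 0, col = 1;
var cur = board[row][col] | 0;   // int32 from here on
----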
For what it's worth, on Windows I see a current m-c browser build get about 5.2s (-m -j -p there, obviously) while IE9pp7 gets 1.2s.

I did a profile of the shell -m testcase (64-bit, mac).  Summary:

  46% in mjit-generated code
  27% under stubs::GetElem (7% self time, 11% calling js_ValueToString,
                            7% array_getProperty self time, 3% js_AtomizeString)
  10% under stubs::SetElem (all self time)
   6% under stubs::StrictEq (self time and js::StrictlyEqual; it looks like they
                             use === and !== with abandon in this code).
   4% json_parse+json_stringify (round-tripping the board or something)
   2% stubs::NewArray
   2% array_slice
Whiteboard: [jsperf] → [jsperf] ietestdrive
OK, so the shell testcase on tm tip runs in about 3.0s on my mac.  If I apply the patch from bug 604905, this drops to 2.6s (presumably due to nixing all that double-to-string crap).

v8 shell is at 1.1s on the same hardware, fwiw.
Depends on: 604905
And with that patch applied, we have:
  
  55% mjit-generated
  16% stubs::GetElem (10% self time, 6% array_getProperty)
  12% stubs::SetElem (almost all self time)
  

Other stuff is about as before.
I see no calls to makeDenseArraySlow() in this benchmark, other than the single call for Array.prototype.
Depends on: 590161
(In reply to comment #2)
> I will investigate where these doubles come from.

OK, so it clones the board using this line of code:

var board = JSON.parse(JSON.stringify(inputBoard));

If I add board[i][j] |= 0; to the loop right after that, I get 1.6s (1.324s in V8). So JSON.parse should really try to convert numbers to int32; I will file a follow-up bug.
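A sketch of that normalization, assuming the usual 9x9 board (the cloning line is quoted above; the loop bounds are an assumption):
----
// inputBoard is the board from the test case; the JSON round-trip can leave
// the cloned numbers stored as doubles, so force every cell to int32 up front.
var board = JSON.parse(JSON.stringify(inputBoard));
for (var i = 0; i < 9; i++) {
    for (var j = 0; j < 9; j++) {
        board[i][j] |= 0;   // int32-typed elements keep GetElem/SetElem fast
    }
}
----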
As far as tracing this goes, we end up with a bunch of typemap mismatches for reasons I'm not smart enough to comprehend (I vs U, I vs D, etc.), end up blacklisting a bunch of stuff, and then have on the order of 1e6 trace enters/exits during the run...
Depends on: 612989
With the patch for bug 612989 applied we're still 300 ms slower than V8. 

solveSudoku is called 5718 times. For this micro-benchmark:
----
function f() {
    var t0 = new Date;
    for (var x = 0; x < 5718; x++) {
        var possibilities = [[], [], [], [], [], [], [], [], []];
        for (var i = 0; i < 9; i++) {
            for (var j = 0; j < 9; j++) {
                possibilities[i][j] = [false, true, true, true, true, true, true, true, true, true];
            }
        }
    }
    print(new Date - t0);
}
f();
----
We get these numbers:

-m -j -p: 154 ms
v8: 18 ms

That's a 136 ms difference, almost half of the remaining slowdown vs V8; I will measure again after bug 606477 lands.
Depends on: 606477
The rest of this benchmark is mostly about accessing arrays. Attached is an example which shows the kind of stuff the solver does over and over again.

-m: 378 ms
v8: 270 ms

Can we make this as fast as V8? Perf will be close to Chrome (and maybe IE9) with this fixed.
> Can we make this as fast as V8?

For what it's worth, that testcase is faster under -j than in v8 (170ms vs 200ms here; -m is at 320ms, as is -j -m -p)...
For the comment 10 benchmark, when running under -m all our time (well, 99.8% of it) is spent in mjit-generated code.  So you'd need to stare at the instruction dump to see what can be improved; I just tried JMFLAGS=insns, but there are all sorts of jb calls whose targets aren't clear in the instruction stream, so the control flow is ... hard to follow.

I bet TM does some CSE here, and possibly even some DCE, though maybe not.
(In reply to comment #12)
> 
> I bet TM does some CSE here, and possibly even some DCE, though maybe not.

	for (var k=0; k<9; k++) {
	    board[k][row];
	    board[k][row];
	    board[col][k];
	}

Lots of CSE:  there are 6 GETELEMs; the middle two are completely removed because they are identical to the first two.  Plus we only do the "is board an Array" test once.

No high-level DCE, though, just the usual low-level Nanojit stuff.  If you'd like me to write a fragile and unsound high-level DCE transformation, please file a follow-up bug ;)

More seriously, should the tracer be used more for this code than it is?
Nicholas, see comment 8.  :(  If we can get those typemaps to match so we don't abort+blacklist everything, I bet the tracer would kick ass here.
(In reply to comment #14)
> Nicholas, see comment 8.  :(  If we can get those typemaps to match so we don't
> abort+blacklist everything, I bet the tracer would kick ass here.

Btw, this test case uses arrays with booleans and then adds numbers to them. Could that be confusing the tracer? Might be an interesting test case for type inference too.
(In reply to comment #15)
> Btw, this test case uses arrays with booleans and then adds numbers to it.
> Could that be confusing the tracer? Might be an interesting test case for type
> inference too..

Where do the booleans come into play?  The test case in comment 10 gives me these times:

v8: 282
-m: 322
-j: 205
-m (inference): 159

Since board and its component arrays are all packed (bug 604045), JM accesses them at a speed similar to Java; the main difference is that we still need regalloc (bug 609899) to keep the locals in registers around the loop.
(In reply to comment #16)
> Where do the booleans come into play?

I meant the shell version of the benchmark in comment 0 (sorry, shouldn't have used "test case" here). The "possibilities" array has arrays of booleans, then it assigns numbers to indexes 10 and 11:

var currentPos = possibilities[i][j];
...
zoneRow = getZone(i) * 3;
zoneCol = getZone(j) * 3;
currentPos[10] = zoneRow;
currentPos[11] = zoneCol;

> v8: 282
> -m: 322
> -j: 205
> -m (inference): 159
> 

Impressive! The sudoku solver accesses board (and other arrays) a lot; if we can make that as fast as this, I guess we will beat Chrome and IE9.
(In reply to comment #17)
> (In reply to comment #16)
> > Where do the booleans come into play?
> 
> I meant the shell version of the benchmark in comment 0 (sorry, shouldn't have
> used "test case" here). The "possibilities" array has arrays of booleans, then
> it assigns numbers to index 10 and 11:
> 
> var currentPos = possibilities[i][j];
> ...
> zoneRow = getZone(i) * 3;
> zoneCol = getZone(j) * 3;
> currentPos[10] = zoneRow;
> currentPos[11] = zoneCol;

Ah, yeah, that will confuse the inference.  It will know all these arrays are packed, but will not know whether the values read out of currentPos are ints or bools as it does not distinguish between different elements of an array.  When zoneRow and zoneCol are subsequently read out of currentPos (zoneRow = currentPos[10] etc.) they will be treated as int|bool, along with everywhere they are passed.

This could be fixed by sticking type checks on these assignments or on calls which pass zoneRow/zoneCol around, to limit the effects of the imprecision.  Earlier versions of bug 557407 did this, but I don't think this is critical for this testcase as zoneRow/zoneCol don't look to be passed to very many places.  (more important for the inference is better handling for JSON.parse, as it will not currently know anything about the contents of board).
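For illustration only (a script-level workaround, not the inference change discussed above), one way to avoid the int|bool mixing entirely is to keep the zone coordinates out of the boolean array; the names below are hypothetical:
----
// Instead of currentPos[10] = zoneRow and currentPos[11] = zoneCol, keep the
// flags array homogeneously boolean and store the zone values next to it.
var pos = {
    flags: [false, true, true, true, true, true, true, true, true, true],
    zoneRow: 0,
    zoneCol: 0
};
pos.zoneRow = 2 * 3;   // stays a plain int for the inference
pos.zoneCol = 1 * 3;
----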
I modified the shell test case to clone the board explicitly, rather than use JSON, so the inference can understand it (should not be too hard to get the inference to understand JSON.parse natively, or even see through JSON.parse(JSON.stringify(blah)) calls).  I get these times:

v8: 1.05s
-m: 1.31s
-m (inference): 1.02s
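
A minimal sketch of what "clone the board explicitly" could look like; the actual modified test case is attached rather than quoted, so the exact code here is an assumption:
----
// Nested-loop copy instead of JSON.parse(JSON.stringify(inputBoard)); the
// inference can then track exactly what ends up in board.
var board = [];
for (var i = 0; i < inputBoard.length; i++) {
    var row = [];
    for (var j = 0; j < inputBoard[i].length; j++)
        row[j] = inputBoard[i][j];
    board[i] = row;
}
----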

I then modified the test case (attached) to reuse the ~470k arrays in possibilities and the ~250k arrays created by slice() in checkFreedoms.  This decouples GC perf, so we can measure our time without the overhead of doing collections, allocating lots of objects, or taking faults from having a huge heap (the working set here is small).  I get these times:

v8: 1.07s
-m: 1.12s
-m (inference): .78s

With type inference and a generational GC this benchmark will look really good.
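For context, a rough sketch of the reuse idea, assuming solveSudoku rebuilds possibilities on every call (the modified test case is attached rather than quoted, so the details here are guesses):
----
// Allocate the 81 inner arrays once and reset them in place per call, instead
// of allocating ~470k fresh arrays over the run; allocation and GC pressure
// drop accordingly.
var possibilities = [];
for (var i = 0; i < 9; i++) {
    possibilities[i] = [];
    for (var j = 0; j < 9; j++)
        possibilities[i][j] = [false, true, true, true, true, true, true, true, true, true];
}

function resetPossibilities() {
    for (var i = 0; i < 9; i++) {
        for (var j = 0; j < 9; j++) {
            var p = possibilities[i][j];
            p[0] = false;
            for (var k = 1; k <= 9; k++)
                p[k] = true;
        }
    }
}
----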
Keywords: perf
Added this testcase to the assorted page on AWFY (bug 649487, needs more tests!).  Right now we're slower than V8, I think because we no longer track property types for objects produced by JSON (did not update after Waldo's JSON parser rewrite, should be easy to fix).
Just FYI, here are my test results with this IE Test Drive demo.
My PC specs ==> http://pastebin.com/6j8as3XE

I ran HTML5 Sudoku (Level: EASY - # of games: 5000) 5 times:

1.623s
1.569s
1.511s
1.453s
1.691s
(In reply to RNicoletto from comment #21)
> Just FYI, here you are my test results with this IE Test Drive demo.
> My PC specs ==> http://pastebin.com/6j8as3XE
> 
> I run HTML5 Sudoku (Level: EASY - # of games: 5000) 5 times:
> 
> 1.623s
> 1.569s
> 1.511s
> 1.453s
> 1.691s

If you are going to post results, you need to also include the results from at least one of the other browsers to make them meaningful. Chrome would probably be the most useful comparison but IE and Opera would also be interesting to see. Remember to include their versions as well.
(In reply to Daniel Cater from comment #22)
> If you are going to post results, you need to also include the results from
> at least one of the other browsers to make them meaningful. 

UHMMM... For some reason my comment has been truncated.

OK, I'll re-post it completely.

Mozilla Firefox latest nightly + "layers.acceleration.force-enabled" = TRUE
1.364s
1.276s
1.286s
1.335s
1.320s

Google Chrome Canary + "Override software rendering list" = ENABLED
1.165s
1.061s
1.110s
1.048s
1.297s

Opera Browser 11.61
2.091s
2.034s
2.060s
2.171s
2.143s
Win7
Nightly 27 - 0.91s
Chrome 29 - 0.82s
IE 10 - 1.07s
Assignee: general → nobody
Severity: normal → S3