562455 - TM: Web Worker test case performance regression with JIT

Reporter

Description

•

14 years ago

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.17 Safari/533.4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.3a5pre) Gecko/20100428 Minefield/3.7a5pre

**Test case may crash Firefox < 3.6.4**

The test is a library similar to Box2djs, run in a web worker to get it off the main thread.

When I originally wrote the test, Firefox 3.6.0 was able to handle the second "test level" on the worker test page at > 30fps (though it did crash occasionally). With Bug 538440 fixed, performance has degraded significantly (~2fps), much more so than is found running essentially the same script in the main thread (even though it is presumably JITed there as well):

http://extremelysatisfactorytotalitarianism.com/projects/experiments/2010/03/boxdemo/noworker.html

To see the old, faster behavior you'll have to either run 3.6.0 or disable JIT on some higher level, since the about:config preference doesn't work for workers yet (Bug 530641). Turning on Firebug Script/Console panels will demonstrate it at first, but if the test is reset performance degrades.

I apologize for the GWT-compiled test case, but I haven't been able to replicate the regression in any of my (much simpler) hand-written worker scripts. I can recompile and post a non-obfuscated version of the test if that would be helpful.

Reproducible: Always

Steps to Reproduce:
1. Run test case
2. Compare with test case run without JIT compilation, still within a worker.
3. Compare with test case run with JIT compilation, but not within a worker.


Expected Results:  
Ideally TraceMonkey would fall back to interpreted mode, but I would also expect the JITed behavior of a script run in a worker to approximate the JITed behavior of (almost) the same script run in the main thread.

Brendan Kenny

Reporter

Comment 1

•

14 years ago

Crashes with that page decreased to almost nothing with 3.6.4 (I think with some fix for JSString and workers, but I can't find it). I have not gotten it to crash for me in a nightly recently, but if the extra data point helps, here is a crash report for 3.6.4b:

http://crash-stats.mozilla.com/report/index/7dd2dcad-9cac-4d6b-aa9f-20edb2100428

Boris Zbarsky [:bzbarsky]

Comment 2

•

14 years ago

OK, with the worker testcase I see a lot more aborts, a lot less blacklisting.  It's hard to tell more than that offhand, since jitstats is so busted with workers...

Boris Zbarsky [:bzbarsky]

Comment 3

•

14 years ago

The extra aborts are disproportionately "no compatible inner tree" and "inner tree is trying to grow".

Brendan, a deobfuscated testcase would in fact be nice.

Boris Zbarsky [:bzbarsky]

Updated

•

14 years ago

Status: UNCONFIRMED → NEW

blocking1.9.2: --- → ?

blocking2.0: --- → ?

Ever confirmed: true

Keywords: regression

Andreas Gal :gal

Comment 4

•

14 years ago

I am probably overly conservative here, but I am going to hide this bug until we know that the crash cause has been fixed since we have a test case attached.

Group: core-security

Andreas Gal :gal

Updated

•

14 years ago

Whiteboard: [sg:investigate?]

Andreas Gal :gal

Comment 5

•

14 years ago

bz, can you attach a list of all aborts. either something makes us not trace an inner tree, or we somehow increased branchiness, causing the branch limit to kick in.

Boris Zbarsky [:bzbarsky]

Comment 6

•

14 years ago

Attached file 1200-some aborts, here you go — Details

Note that this is t-m close to tip.  I did verify it shows the problem.

Also note that the line numbers might not tell you much given the minified code.

Brendan Kenny

Reporter

Comment 7

•

14 years ago

Here is the same test with non-obfuscated code. The script files are quite large.

http://extremelysatisfactorytotalitarianism.com/projects/experiments/2010/04/boxfull/

It might not be relevant, but every time a different test level is selected or the "reset" button is clicked, the current worker is terminated and a new one is created (the test is for some worker emulation code).

The current crash might just be part of the larger performance issue, but it definitely seemed like a terminated worker was a necessary condition for the earlier threading/JSString crashes (my best guess is Bug 547814, but I'm not allowed to access it).

Boris Zbarsky [:bzbarsky]

Comment 8

•

14 years ago

> every time a different test level is selected or the "reset" button is clicked

I did neither in my testing, fwiw.

Andreas Gal :gal

Comment 9

•

14 years ago

#7: we fixed that one a while ago, so I think its ok if I grant you access. You might be right with your assumption. That looks very related.

Brendan Kenny

Reporter

Comment 10

•

14 years ago

Thanks. Of the two different crash sources I used to get, the JSString-related one was much more frequent. That suddenly ceased with 3.6.4, but I couldn't find any visible bugs in the change list that would account for the improvement.

Here's a report for that older type of crash
http://crash-stats.mozilla.com/report/index/bp-3ff73dce-1103-4f61-b965-8c8b42100407

Brendan Kenny

Reporter

Comment 11

•

14 years ago

I was able to get a crash this afternoon in the latest nightly.

http://crash-stats.mozilla.com/report/index/6fb2f37e-6685-46cd-9422-7fcc92100429

This may of course be a separate issue, just triggered by all the aborts and a bunch of terminated workers.

Andreas Gal :gal

Comment 12

•

14 years ago

That's a different and bad looking crash. Many aborts and possibly bug 562734 could cause that. We should split #11 off into its own bug and search for related stacks.

Mike Beltzner [:beltzner, not reading bugmail]

Comment 13

•

14 years ago

Can't tell if this is a bad regression/crash or not based on the comments so far, so minusing the blocking nomination for 1.9.2.x. Please renominate once we know more.

blocking1.9.2: ? → -

Boris Zbarsky [:bzbarsky]

Comment 14

•

14 years ago

The crash thing is not clear.

The 15x performance slowdown seems like a clear bad regression to me.  That, plus the indications that jit in workers is just not working right _somehow_....

blocking1.9.2: - → ?

Mike Beltzner [:beltzner, not reading bugmail]

Comment 15

•

14 years ago

Ben: didn't you check in the patch to make web workers work with the JIT? Do we want to back that change out, or figure out a fix for this bug? Either way, I'm not sure that it blocks.

bz: the 15x regression is on code within web workers, yes?

Mike Beltzner [:beltzner, not reading bugmail]

Comment 16

•

14 years ago

Need a decision for 1.9.2.5 - either back out the change that caused the regression (I suspect bug 538440) or fix it.

Blocks: 538440

blocking1.9.2: ? → .5+

Boris Zbarsky [:bzbarsky]

Comment 17

•

14 years ago

> bz: the 15x regression is on code within web workers, yes?

Yes.

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 18

•

14 years ago

(In reply to comment #15)
> Ben: didn't you check in the patch to make web workers work with the JIT?

Yep, that was bug 538440.

> Do we want to back that change out, or figure out a fix for this bug?

A user commented in bug 538440 comment 16 that the patch seemed to be working fine...

Boris Zbarsky [:bzbarsky]

Comment 19

•

14 years ago

Right.  It just seems that in this particular testcase it's not, somehow... would be good to figure out why.

Daniel Veditz [:dveditz]

Comment 20

•

14 years ago

(In reply to comment #12)
> That's a different and bad looking crash. Many aborts and possibly bug 562734
> could cause that. We should split #11 off into its own bug and search for
> related stacks.

Did a new bug ever get filed, or did the crashing turn out to be a duplicate of bug 562734 after all? It'd be good to keep this bug limited to the perf regression (which is not a security issue, but we could keep the bug hidden due to the testcase).

status1.9.1: --- → unaffected

status1.9.2: --- → wanted

Whiteboard: [sg:investigate?]

Robert Sayre

Updated

•

14 years ago

Whiteboard: [sg:nse]

Robert Sayre

Updated

•

14 years ago

blocking2.0: ? → final+

Daniel Veditz [:dveditz]

Updated

•

14 years ago

blocking1.9.2: .5+ → .6+

Whiteboard: [sg:nse] → [sg:nse][needs owner]

Brendan Eich [:brendan]

Comment 21

•

14 years ago

Ben, can you own this bug?

/be

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 22

•

14 years ago

I can investigate, sure. If it turns out to be something other than "oops, boolean logic is wrong" then I'll probably have to turn it over to a JIT guru though.

LegNeato

Comment 23

•

14 years ago

There doesn't look to be good movement on this, so looks like we will have to punt out of .7 and .11. Note there are crashes seen in 3.6.4, so it is not fixed (http://bit.ly/d3RTRn)

LegNeato

Updated

•

14 years ago

blocking1.9.2: .7+ → needed

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 24

•

14 years ago

Sorry, forgot about this. I just looked at the testcases linked above in today's 1.9.2 nightly on mac and js::TraceRecorder is all over the stack for the worker threads. JIT is clearly enabled but i don't see the normal |<unknown library>| samples that usually indicate actually running JIT'd code. In any case I haven't the foggiest idea about how to proceed, I think this needs a JS peer.

Mike Shaver (:shaver -- probably not reading bugmail closely)

Comment 25

•

14 years ago

If you have a debug build, can you set TMFLAGS=full and run the test above, recording the log that's sent to stderr?

Boris Zbarsky [:bzbarsky]

Comment 26

•

14 years ago

So....  That's been done already to some extent.  See comment 2, comment 3, comment 6.  What's needed is someone competent to analyze why we're aborting more in those particular ways.  Is the oracle busted for workers?  Is something else happening incorrectly in terms of guessing the right typemaps?  Something else?

Boris Zbarsky [:bzbarsky]

Comment 27

•

14 years ago

dvander says we have one oracle per tracemonitor, so per thread.

From irc:

> dvander: any idea what else could make a script running in a worker have issues
  finding peers?
> dvander: as compared to the exact same script running as part of a webpage?
<dvander> bz, hrm unless it's getting the wrong cx (and thus the wrong TM)
          somewhere, not off the top of my head

Obvious question: is this worker always running on the same thread?

Andreas Gal :gal

Comment 28

•

14 years ago

We should really make the code cache and oracle per-global.

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 29

•

14 years ago

(In reply to comment #27)
> Obvious question: is this worker always running on the same thread?

Workers are never guaranteed to run on the same thread or in the same JSContext. They always have the same global.

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 30

•

14 years ago

(In reply to comment #25)
> If you have a debug build, can you set TMFLAGS=full and run the test above,
> recording the log that's sent to stderr?

I have a 1.6 gb log for this... Any tips on how to extract something useful?

Boris Zbarsky [:bzbarsky]

Comment 31

•

14 years ago

> Workers are never guaranteed

Yes, I understand that.... but is this particular worker running on more than one thread?  For example, is it even running more than once?

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 32

•

14 years ago

Just got this testing:

http://crash-stats.mozilla.com/report/index/e437eea6-bc1a-41ce-8526-ea9b52100630

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 33

•

14 years ago

So I consistently see two threads being used for worker stuff when running this testcase. It's possible that two workers are being used, and it's also possible that this is the same worker being bounced from thread to thread. I can't tell because I can't find the script that actually launches the worker, this code is crazy. Building a debug version to trap it in C++ now.

David Mandelin [:dmandelin]

Comment 34

•

14 years ago

Question: what options are currently used with web workers? tracejit enabled only?

And, what do we want to do with this? Some easy ideas:

- Run methodjit on web workers, too? (Seems risky, but we could try it out and maybe it is ok.)
- Don't fix this for Fx4.

David Mandelin [:dmandelin]

Updated

•

14 years ago

Depends on: 584860

David Mandelin [:dmandelin]

Updated

•

14 years ago

Depends on: 615723

David Mandelin [:dmandelin]

Updated

•

14 years ago

Whiteboard: [sg:nse][needs owner] → [sg:nse][needs owner][cib-workers]

David Mandelin [:dmandelin]

Comment 35

•

14 years ago

As of a 12/27 nightly, the linked URL gets 30fps for me, and is strictly better than Fx3.6. Bug 584860 moved the tracejit data structures to compartments, so they stick properly with a given web worker and trace optimization works as designed. Let us know if any related issues pop up again.

Status: NEW → RESOLVED

Closed: 14 years ago

No longer depends on: 584860, 615723

Resolution: --- → DUPLICATE

Daniel Veditz [:dveditz]

Updated

•

12 years ago

Group: core-security