Last Comment Bug 562455 - TM: Web Worker test case performance regression with JIT
: TM: Web Worker test case performance regression with JIT
Status: RESOLVED DUPLICATE of bug 584860
[sg:nse][needs owner][cib-workers]
: regression
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: x86 Windows 7
: -- normal (vote)
: ---
Assigned To: general
:
:
Mentors:
http://extremelysatisfactorytotalitar...
Depends on:
Blocks: 538440
  Show dependency treegraph
 
Reported: 2010-04-28 13:50 PDT by Brendan Kenny
Modified: 2012-04-19 22:59 PDT (History)
13 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
final+
needed
wanted
unaffected


Attachments
1200-some aborts, here you go (455.89 KB, text/plain)
2010-04-28 19:46 PDT, Boris Zbarsky [:bz] (still a bit busy)
no flags Details

Description Brendan Kenny 2010-04-28 13:50:45 PDT
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.17 Safari/533.4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.3a5pre) Gecko/20100428 Minefield/3.7a5pre

**Test case may crash Firefox < 3.6.4**

The test is a library similar to Box2djs, run in a web worker to get it off the main thread.

When I originally wrote the test, Firefox 3.6.0 was able to handle the second "test level" on the worker test page at > 30fps (though it did crash occasionally). With Bug 538440 fixed, performance has degraded significantly (~2fps), much more so than is found running essentially the same script in the main thread (even though it is presumably JITed there as well):

http://extremelysatisfactorytotalitarianism.com/projects/experiments/2010/03/boxdemo/noworker.html

To see the old, faster behavior you'll have to either run 3.6.0 or disable JIT on some higher level, since the about:config preference doesn't work for workers yet (Bug 530641). Turning on Firebug Script/Console panels will demonstrate it at first, but if the test is reset performance degrades.

I apologize for the GWT-compiled test case, but I haven't been able to replicate the regression in any of my (much simpler) hand-written worker scripts. I can recompile and post a non-obfuscated version of the test if that would be helpful.

Reproducible: Always

Steps to Reproduce:
1. Run test case
2. Compare with test case run without JIT compilation, still within a worker.
3. Compare with test case run with JIT compilation, but not within a worker.


Expected Results:  
Ideally TraceMonkey would fall back to interpreted mode, but I would also expect the JITed behavior of a script run in a worker to approximate the JITed behavior of (almost) the same script run in the main thread.
Comment 1 Brendan Kenny 2010-04-28 14:09:04 PDT
Crashes with that page decreased to almost nothing with 3.6.4 (I think with some fix for JSString and workers, but I can't find it). I have not gotten it to crash for me in a nightly recently, but if the extra data point helps, here is a crash report for 3.6.4b:

http://crash-stats.mozilla.com/report/index/7dd2dcad-9cac-4d6b-aa9f-20edb2100428
Comment 2 Boris Zbarsky [:bz] (still a bit busy) 2010-04-28 19:09:30 PDT
OK, with the worker testcase I see a lot more aborts, a lot less blacklisting.  It's hard to tell more than that offhand, since jitstats is so busted with workers...
Comment 3 Boris Zbarsky [:bz] (still a bit busy) 2010-04-28 19:12:21 PDT
The extra aborts are disproportionately "no compatible inner tree" and "inner tree is trying to grow".

Brendan, a deobfuscated testcase would in fact be nice.
Comment 4 Andreas Gal :gal 2010-04-28 19:15:12 PDT
I am probably overly conservative here, but I am going to hide this bug until we know that the crash cause has been fixed since we have a test case attached.
Comment 5 Andreas Gal :gal 2010-04-28 19:17:35 PDT
bz, can you attach a list of all aborts. either something makes us not trace an inner tree, or we somehow increased branchiness, causing the branch limit to kick in.
Comment 6 Boris Zbarsky [:bz] (still a bit busy) 2010-04-28 19:46:14 PDT
Created attachment 442297 [details]
1200-some aborts, here you go

Note that this is t-m close to tip.  I did verify it shows the problem.

Also note that the line numbers might not tell you much given the minified code.
Comment 7 Brendan Kenny 2010-04-28 20:47:34 PDT
Here is the same test with non-obfuscated code. The script files are quite large.

http://extremelysatisfactorytotalitarianism.com/projects/experiments/2010/04/boxfull/

It might not be relevant, but every time a different test level is selected or the "reset" button is clicked, the current worker is terminated and a new one is created (the test is for some worker emulation code).

The current crash might just be part of the larger performance issue, but it definitely seemed like a terminated worker was a necessary condition for the earlier threading/JSString crashes (my best guess is Bug 547814, but I'm not allowed to access it).
Comment 8 Boris Zbarsky [:bz] (still a bit busy) 2010-04-28 21:01:19 PDT
> every time a different test level is selected or the "reset" button is clicked

I did neither in my testing, fwiw.
Comment 9 Andreas Gal :gal 2010-04-28 22:14:56 PDT
#7: we fixed that one a while ago, so I think its ok if I grant you access. You might be right with your assumption. That looks very related.
Comment 10 Brendan Kenny 2010-04-28 22:46:58 PDT
Thanks. Of the two different crash sources I used to get, the JSString-related one was much more frequent. That suddenly ceased with 3.6.4, but I couldn't find any visible bugs in the change list that would account for the improvement.

Here's a report for that older type of crash
http://crash-stats.mozilla.com/report/index/bp-3ff73dce-1103-4f61-b965-8c8b42100407
Comment 11 Brendan Kenny 2010-04-29 13:35:53 PDT
I was able to get a crash this afternoon in the latest nightly.

http://crash-stats.mozilla.com/report/index/6fb2f37e-6685-46cd-9422-7fcc92100429

This may of course be a separate issue, just triggered by all the aborts and a bunch of terminated workers.
Comment 12 Andreas Gal :gal 2010-04-29 14:11:50 PDT
That's a different and bad looking crash. Many aborts and possibly bug 562734 could cause that. We should split #11 off into its own bug and search for related stacks.
Comment 13 Mike Beltzner [:beltzner, not reading bugmail] 2010-05-03 10:01:27 PDT
Can't tell if this is a bad regression/crash or not based on the comments so far, so minusing the blocking nomination for 1.9.2.x. Please renominate once we know more.
Comment 14 Boris Zbarsky [:bz] (still a bit busy) 2010-05-03 10:08:49 PDT
The crash thing is not clear.

The 15x performance slowdown seems like a clear bad regression to me.  That, plus the indications that jit in workers is just not working right _somehow_....
Comment 15 Mike Beltzner [:beltzner, not reading bugmail] 2010-05-05 12:34:45 PDT
Ben: didn't you check in the patch to make web workers work with the JIT? Do we want to back that change out, or figure out a fix for this bug? Either way, I'm not sure that it blocks.

bz: the 15x regression is on code within web workers, yes?
Comment 16 Mike Beltzner [:beltzner, not reading bugmail] 2010-05-05 12:47:27 PDT
Need a decision for 1.9.2.5 - either back out the change that caused the regression (I suspect bug 538440) or fix it.
Comment 17 Boris Zbarsky [:bz] (still a bit busy) 2010-05-05 12:48:19 PDT
> bz: the 15x regression is on code within web workers, yes?

Yes.
Comment 18 Ben Turner (not reading bugmail, use the needinfo flag!) 2010-05-05 12:51:36 PDT
(In reply to comment #15)
> Ben: didn't you check in the patch to make web workers work with the JIT?

Yep, that was bug 538440.

> Do we want to back that change out, or figure out a fix for this bug?

A user commented in bug 538440 comment 16 that the patch seemed to be working fine...
Comment 19 Boris Zbarsky [:bz] (still a bit busy) 2010-05-05 12:54:38 PDT
Right.  It just seems that in this particular testcase it's not, somehow... would be good to figure out why.
Comment 20 Daniel Veditz [:dveditz] 2010-05-24 20:13:17 PDT
(In reply to comment #12)
> That's a different and bad looking crash. Many aborts and possibly bug 562734
> could cause that. We should split #11 off into its own bug and search for
> related stacks.

Did a new bug ever get filed, or did the crashing turn out to be a duplicate of bug 562734 after all? It'd be good to keep this bug limited to the perf regression (which is not a security issue, but we could keep the bug hidden due to the testcase).
Comment 21 Brendan Eich [:brendan] 2010-06-21 12:54:57 PDT
Ben, can you own this bug?

/be
Comment 22 Ben Turner (not reading bugmail, use the needinfo flag!) 2010-06-21 14:01:15 PDT
I can investigate, sure. If it turns out to be something other than "oops, boolean logic is wrong" then I'll probably have to turn it over to a JIT guru though.
Comment 23 christian 2010-06-29 13:17:19 PDT
There doesn't look to be good movement on this, so looks like we will have to punt out of .7 and .11. Note there are crashes seen in 3.6.4, so it is not fixed (http://bit.ly/d3RTRn)
Comment 24 Ben Turner (not reading bugmail, use the needinfo flag!) 2010-06-29 14:24:53 PDT
Sorry, forgot about this. I just looked at the testcases linked above in today's 1.9.2 nightly on mac and js::TraceRecorder is all over the stack for the worker threads. JIT is clearly enabled but i don't see the normal |<unknown library>| samples that usually indicate actually running JIT'd code. In any case I haven't the foggiest idea about how to proceed, I think this needs a JS peer.
Comment 25 Mike Shaver (:shaver -- probably not reading bugmail closely) 2010-06-29 14:26:48 PDT
If you have a debug build, can you set TMFLAGS=full and run the test above, recording the log that's sent to stderr?
Comment 26 Boris Zbarsky [:bz] (still a bit busy) 2010-06-30 13:32:51 PDT
So....  That's been done already to some extent.  See comment 2, comment 3, comment 6.  What's needed is someone competent to analyze why we're aborting more in those particular ways.  Is the oracle busted for workers?  Is something else happening incorrectly in terms of guessing the right typemaps?  Something else?
Comment 27 Boris Zbarsky [:bz] (still a bit busy) 2010-06-30 13:49:23 PDT
dvander says we have one oracle per tracemonitor, so per thread.

From irc:

> dvander: any idea what else could make a script running in a worker have issues
  finding peers?
> dvander: as compared to the exact same script running as part of a webpage?
<dvander> bz, hrm unless it's getting the wrong cx (and thus the wrong TM)
          somewhere, not off the top of my head

Obvious question: is this worker always running on the same thread?
Comment 28 Andreas Gal :gal 2010-06-30 13:57:06 PDT
We should really make the code cache and oracle per-global.
Comment 29 Ben Turner (not reading bugmail, use the needinfo flag!) 2010-06-30 14:08:17 PDT
(In reply to comment #27)
> Obvious question: is this worker always running on the same thread?

Workers are never guaranteed to run on the same thread or in the same JSContext. They always have the same global.
Comment 30 Ben Turner (not reading bugmail, use the needinfo flag!) 2010-06-30 14:08:55 PDT
(In reply to comment #25)
> If you have a debug build, can you set TMFLAGS=full and run the test above,
> recording the log that's sent to stderr?

I have a 1.6 gb log for this... Any tips on how to extract something useful?
Comment 31 Boris Zbarsky [:bz] (still a bit busy) 2010-06-30 15:33:51 PDT
> Workers are never guaranteed

Yes, I understand that.... but is this particular worker running on more than one thread?  For example, is it even running more than once?
Comment 32 Ben Turner (not reading bugmail, use the needinfo flag!) 2010-06-30 16:02:42 PDT
Just got this testing:

http://crash-stats.mozilla.com/report/index/e437eea6-bc1a-41ce-8526-ea9b52100630
Comment 33 Ben Turner (not reading bugmail, use the needinfo flag!) 2010-06-30 16:23:34 PDT
So I consistently see two threads being used for worker stuff when running this testcase. It's possible that two workers are being used, and it's also possible that this is the same worker being bounced from thread to thread. I can't tell because I can't find the script that actually launches the worker, this code is crazy. Building a debug version to trap it in C++ now.
Comment 34 David Mandelin [:dmandelin] 2010-11-19 16:44:10 PST
Question: what options are currently used with web workers? tracejit enabled only?

And, what do we want to do with this? Some easy ideas:

- Run methodjit on web workers, too? (Seems risky, but we could try it out and maybe it is ok.)
- Don't fix this for Fx4.
Comment 35 David Mandelin [:dmandelin] 2010-12-27 16:08:50 PST
As of a 12/27 nightly, the linked URL gets 30fps for me, and is strictly better than Fx3.6. Bug 584860 moved the tracejit data structures to compartments, so they stick properly with a given web worker and trace optimization works as designed. Let us know if any related issues pop up again.

*** This bug has been marked as a duplicate of bug 584860 ***

Note You need to log in before you can comment on or make changes to this bug.