Closed Bug 562455 Opened 14 years ago Closed 14 years ago

TM: Web Worker test case performance regression with JIT

Categories

(Core :: JavaScript Engine, defect)

x86
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 584860
Tracking Status
blocking2.0 --- final+
blocking1.9.2 --- needed
status1.9.2 --- wanted
status1.9.1 --- unaffected

People

(Reporter: bckenny, Unassigned)

References

()

Details

(Keywords: regression, Whiteboard: [sg:nse][needs owner][cib-workers])

Attachments

(1 file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.17 Safari/533.4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.3a5pre) Gecko/20100428 Minefield/3.7a5pre

**Test case may crash Firefox < 3.6.4**

The test is a library similar to Box2djs, run in a web worker to get it off the main thread.

When I originally wrote the test, Firefox 3.6.0 was able to handle the second "test level" on the worker test page at > 30fps (though it did crash occasionally). With Bug 538440 fixed, performance has degraded significantly (~2fps), much more so than is found running essentially the same script in the main thread (even though it is presumably JITed there as well):

http://extremelysatisfactorytotalitarianism.com/projects/experiments/2010/03/boxdemo/noworker.html

To see the old, faster behavior you'll have to either run 3.6.0 or disable JIT on some higher level, since the about:config preference doesn't work for workers yet (Bug 530641). Turning on Firebug Script/Console panels will demonstrate it at first, but if the test is reset performance degrades.

I apologize for the GWT-compiled test case, but I haven't been able to replicate the regression in any of my (much simpler) hand-written worker scripts. I can recompile and post a non-obfuscated version of the test if that would be helpful.

Reproducible: Always

Steps to Reproduce:
1. Run test case
2. Compare with test case run without JIT compilation, still within a worker.
3. Compare with test case run with JIT compilation, but not within a worker.


Expected Results:  
Ideally TraceMonkey would fall back to interpreted mode, but I would also expect the JITed behavior of a script run in a worker to approximate the JITed behavior of (almost) the same script run in the main thread.
Crashes with that page decreased to almost nothing with 3.6.4 (I think with some fix for JSString and workers, but I can't find it). I have not gotten it to crash for me in a nightly recently, but if the extra data point helps, here is a crash report for 3.6.4b:

http://crash-stats.mozilla.com/report/index/7dd2dcad-9cac-4d6b-aa9f-20edb2100428
OK, with the worker testcase I see a lot more aborts, a lot less blacklisting.  It's hard to tell more than that offhand, since jitstats is so busted with workers...
The extra aborts are disproportionately "no compatible inner tree" and "inner tree is trying to grow".

Brendan, a deobfuscated testcase would in fact be nice.
Status: UNCONFIRMED → NEW
blocking1.9.2: --- → ?
blocking2.0: --- → ?
Ever confirmed: true
Keywords: regression
I am probably overly conservative here, but I am going to hide this bug until we know that the crash cause has been fixed since we have a test case attached.
Group: core-security
Whiteboard: [sg:investigate?]
bz, can you attach a list of all aborts. either something makes us not trace an inner tree, or we somehow increased branchiness, causing the branch limit to kick in.
Note that this is t-m close to tip.  I did verify it shows the problem.

Also note that the line numbers might not tell you much given the minified code.
Here is the same test with non-obfuscated code. The script files are quite large.

http://extremelysatisfactorytotalitarianism.com/projects/experiments/2010/04/boxfull/

It might not be relevant, but every time a different test level is selected or the "reset" button is clicked, the current worker is terminated and a new one is created (the test is for some worker emulation code).

The current crash might just be part of the larger performance issue, but it definitely seemed like a terminated worker was a necessary condition for the earlier threading/JSString crashes (my best guess is Bug 547814, but I'm not allowed to access it).
> every time a different test level is selected or the "reset" button is clicked

I did neither in my testing, fwiw.
#7: we fixed that one a while ago, so I think its ok if I grant you access. You might be right with your assumption. That looks very related.
Thanks. Of the two different crash sources I used to get, the JSString-related one was much more frequent. That suddenly ceased with 3.6.4, but I couldn't find any visible bugs in the change list that would account for the improvement.

Here's a report for that older type of crash
http://crash-stats.mozilla.com/report/index/bp-3ff73dce-1103-4f61-b965-8c8b42100407
I was able to get a crash this afternoon in the latest nightly.

http://crash-stats.mozilla.com/report/index/6fb2f37e-6685-46cd-9422-7fcc92100429

This may of course be a separate issue, just triggered by all the aborts and a bunch of terminated workers.
That's a different and bad looking crash. Many aborts and possibly bug 562734 could cause that. We should split #11 off into its own bug and search for related stacks.
Can't tell if this is a bad regression/crash or not based on the comments so far, so minusing the blocking nomination for 1.9.2.x. Please renominate once we know more.
blocking1.9.2: ? → -
The crash thing is not clear.

The 15x performance slowdown seems like a clear bad regression to me.  That, plus the indications that jit in workers is just not working right _somehow_....
blocking1.9.2: - → ?
Ben: didn't you check in the patch to make web workers work with the JIT? Do we want to back that change out, or figure out a fix for this bug? Either way, I'm not sure that it blocks.

bz: the 15x regression is on code within web workers, yes?
Need a decision for 1.9.2.5 - either back out the change that caused the regression (I suspect bug 538440) or fix it.
Blocks: 538440
blocking1.9.2: ? → .5+
> bz: the 15x regression is on code within web workers, yes?

Yes.
(In reply to comment #15)
> Ben: didn't you check in the patch to make web workers work with the JIT?

Yep, that was bug 538440.

> Do we want to back that change out, or figure out a fix for this bug?

A user commented in bug 538440 comment 16 that the patch seemed to be working fine...
Right.  It just seems that in this particular testcase it's not, somehow... would be good to figure out why.
(In reply to comment #12)
> That's a different and bad looking crash. Many aborts and possibly bug 562734
> could cause that. We should split #11 off into its own bug and search for
> related stacks.

Did a new bug ever get filed, or did the crashing turn out to be a duplicate of bug 562734 after all? It'd be good to keep this bug limited to the perf regression (which is not a security issue, but we could keep the bug hidden due to the testcase).
Whiteboard: [sg:investigate?]
Whiteboard: [sg:nse]
blocking2.0: ? → final+
blocking1.9.2: .5+ → .6+
Whiteboard: [sg:nse] → [sg:nse][needs owner]
Ben, can you own this bug?

/be
I can investigate, sure. If it turns out to be something other than "oops, boolean logic is wrong" then I'll probably have to turn it over to a JIT guru though.
There doesn't look to be good movement on this, so looks like we will have to punt out of .7 and .11. Note there are crashes seen in 3.6.4, so it is not fixed (http://bit.ly/d3RTRn)
blocking1.9.2: .7+ → needed
Sorry, forgot about this. I just looked at the testcases linked above in today's 1.9.2 nightly on mac and js::TraceRecorder is all over the stack for the worker threads. JIT is clearly enabled but i don't see the normal |<unknown library>| samples that usually indicate actually running JIT'd code. In any case I haven't the foggiest idea about how to proceed, I think this needs a JS peer.
If you have a debug build, can you set TMFLAGS=full and run the test above, recording the log that's sent to stderr?
So....  That's been done already to some extent.  See comment 2, comment 3, comment 6.  What's needed is someone competent to analyze why we're aborting more in those particular ways.  Is the oracle busted for workers?  Is something else happening incorrectly in terms of guessing the right typemaps?  Something else?
dvander says we have one oracle per tracemonitor, so per thread.

From irc:

> dvander: any idea what else could make a script running in a worker have issues
  finding peers?
> dvander: as compared to the exact same script running as part of a webpage?
<dvander> bz, hrm unless it's getting the wrong cx (and thus the wrong TM)
          somewhere, not off the top of my head

Obvious question: is this worker always running on the same thread?
We should really make the code cache and oracle per-global.
(In reply to comment #27)
> Obvious question: is this worker always running on the same thread?

Workers are never guaranteed to run on the same thread or in the same JSContext. They always have the same global.
(In reply to comment #25)
> If you have a debug build, can you set TMFLAGS=full and run the test above,
> recording the log that's sent to stderr?

I have a 1.6 gb log for this... Any tips on how to extract something useful?
> Workers are never guaranteed

Yes, I understand that.... but is this particular worker running on more than one thread?  For example, is it even running more than once?
So I consistently see two threads being used for worker stuff when running this testcase. It's possible that two workers are being used, and it's also possible that this is the same worker being bounced from thread to thread. I can't tell because I can't find the script that actually launches the worker, this code is crazy. Building a debug version to trap it in C++ now.
Question: what options are currently used with web workers? tracejit enabled only?

And, what do we want to do with this? Some easy ideas:

- Run methodjit on web workers, too? (Seems risky, but we could try it out and maybe it is ok.)
- Don't fix this for Fx4.
Depends on: 584860
Depends on: 615723
Whiteboard: [sg:nse][needs owner] → [sg:nse][needs owner][cib-workers]
As of a 12/27 nightly, the linked URL gets 30fps for me, and is strictly better than Fx3.6. Bug 584860 moved the tracejit data structures to compartments, so they stick properly with a given web worker and trace optimization works as designed. Let us know if any related issues pop up again.
Status: NEW → RESOLVED
Closed: 14 years ago
No longer depends on: 584860, 615723
Resolution: --- → DUPLICATE
Group: core-security
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: