Closed Bug 736152 Opened 12 years ago Closed 6 years ago

"Good Primes" web worker test hangs browser and is slow to update

Categories

(Core :: DOM: Workers, defect, P5)

13 Branch
x86
macOS
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox13 + wontfix
firefox14 - ---
firefox26 --- unaffected
firefox37 - affected
firefox38 - affected
firefox39 - affected
firefox40 - affected
firefox-esr31 --- unaffected
firefox-esr38 - affected
firefox58 --- unaffected
firefox59 --- unaffected
firefox60 --- unaffected

People

(Reporter: cpeterson, Unassigned)

References

(Depends on 1 open bug, )

Details

(Keywords: hang, perf, regression, Whiteboard: [Snappy:p3])

Attachments

(1 file)

STR:
1. In Firefox 14, load http://htmlfive.appspot.com/static/primes-good.html
2. Press "Go!" button
3. Try switching tabs

AR:
Firefox 14 pegs the CPU and the GUI is unresponsive. The prime number updates infrequently.

ER:
Firefox 10, 11, 12, 13, and Chrome peg the CPU, but the GUI is still responsive. The prime number updates very quickly.
After bisecting Nightly builds, I believe this problem was introduced in Nightly 12a1 (2012-01-16). This problem "went away" in Aurora 12a2 (2012-03-01), but Nightly builds are still affected.
Summary: "Good Primes" web worker test hangs browser and is slow to update in Firefox 14 → "Good Primes" web worker test hangs browser and is slow to update
Whiteboard: [Snappy]
bent, khuey suggested I CC you on this web worker regression.

This bug has been reproduced on Mac OS X and Linux.
Based on the dates, this must be a regression from bug 598482 which landed in the 2012-01-16 nightly and was backed out on Aurora on 2012-02-29 (bug 722641).

Nominating for tracking because this is a regression in Firefox 13, from a patchset that was already backed out of Firefox 12.
Blocks: 598482
Version: Trunk → 13 Branch
As soon as we get the slow script dialog things because nice and snappy again (even if you click continue and not stop script).
Huh.  That's ... really odd.  We should still be updating every 17ms or whenever the refresh driver fires.  And before that change we wouldn't have invalidated anyway until the div reflowed, which would still happen off the refresh driver.

It's possible that all the worker events swamp the event queue, but I wouldn't have expected that to be any different before either.

Most interestingly, I can't reproduce reliably.  The nightly I just tried showed the problem once, but my self-built opt build from m-c (no local patches) works just fine....
Actually, for me things are fine _until_ the slow script dialog comes up.  Once it comes up (once so far), everything is hosed.
Workers end up triggering the operation callback on the main thread in certain situations.  I wonder if that's happening here.
I wonder why the slow-script dialog is coming up. The event handler is extremely short so we should never spend the 10 seconds or so which are required to get the slow-script dialog to show.

But I suspect the problem is somehow related to swamping the event loop with message events.
Since this is a suspected regression from bug 598482, sending over to you bz.
Assignee: nobody → bzbarsky
OK, so I may have found a way to reproduce this.

There are a few things I understand about it, and some I don't yet.  All comments below are with a current 13 beta build.

1)  When workers fire their events, they use pure JSAPI to do so.  That means that they do not ever trigger nsJSContext::ScriptEvaluated.  What _that_ means is that if nsJSContext::DOMOperationCallback happens to trigger while the worker event handler is running, it will set the mOperationCallbackTime on the nsJSContext.  If no script runs on the page other than the worker callbacks and the operation callback happens to trigger _again_ after at least 10 seconds have passed, it will trigger the slow-script dialog.  This is just a bug we should fix.  I don't see it filed; filed bug 751458 on this.

2)  While the slow-script dialog is up, we continue to process events.  That includes worker events.  So we'll keep updating the counter under the modal dialog.  Luckily, DOMOperationCallback checks for a modal dialog being up and doesn't try to reenter.  This is probably an OK behavior; not processing the worker events and letting them queue up would be much worse.  Pausing the worker (and killing it off entirely if the user says "stop script", perhaps) would be better.  I filed bug 751457 on this.

3)  When I first click the Go button, the numbers go slowly for a little bit (less than a second) before starting to go fast.  This _may_ be due to the worker thrashing the event loop, since there are lots more primes down there in the small numbers.  During this time my browser is reasonably responsive.  This part I can see being due to bug 598482, because the refresh ticks may be getting swamped by the worker events.

4)  The numbers start to go slowly again before the slow-script dialog comes up.  Again, no idea why, but it may just be the worker events and refresh driver competing with the modal dialog spinup for the event loop?  Or something weird about the whole modal thing?  Not sure.  During this time my browser is definitely very laggy.  Once the dialog is up, the behavior settles down again: responsive browser, quickly updating text, per item 2 above.

I'm still trying to reproduce the GUI unresponsiveness.  I can see it how that _might_ happen if for some reason the main thread is being slower compared to the worker thread than on my machine, but I don't see how that would have gotten worse from bug 598482, really.
Interesting.  So comparing a nightly from before bug 598482 to one after:

* The older nightly brings up the slow script dialog much faster.  I don't understand why.
* The newer nightly allows clicking away the slow script dialog much faster.  Again, no
  idea why.
So I think the right plan of action is to do a try build with the fix for bug 751458 and see whether Chris sees the problem described in this bug with that build.....
Depends on: 751458
Oh, and in a _debug_ build I get a more or less complete hang on this testcase, because the main thread is much slower and the worker is not.

I put in a printf in the worker code to see the number of queued up events on the main thread.  In my debug build, this quickly hit about 300,000 before I killed it.  In the opt build, it got up to about 10,000 during that initial slow-updating second or so, then dropped to right about 0 and stayed there as the main thread caught up.  I can push it back into the several thousand range by doing CPU-intensive things on the main thread (e.g. holding down Ctrl-tab to switch between tabs), but that's about it.

In any case, I pushed a build as described in comment 12 to try.  It should appear at https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/bzbarsky@mozilla.com-d152b5f4ee58/
One other note.  With that debugging patch if I take a current nightly and look at the number of pending events (assuming I'm logging them correctly), I see it hang around 0 until the slow script dialog tries to come up.  Then the number zooms up to about 100,000 (while the web page updates slowly) until the slow script dialog appears.  After that the number drops back down to 0.

With a build from before bug 598482 landed, I see the slow script dialog come up pretty quickly, as I said above.  During this time the number of pending events is low.  But once the dialog is up it starts climbing fast to about 100,000, then drops again.  Once I managed to get it to go up to 500,000 instead in a matter of seconds, but I haven't been able to reproduce that.  In this case, while the number of worker events is rising we're overwhelmingly spending our time painting, with the paints coming in off the mac event loop.

So now I think I know what happened in terms of bug 598482.  Before that patch, on this particular testcase, we would invalidate on every single DOM mutation, because the DOM mutations this testcase uses destroy frames and right now frame destruction invalidates.  We would synchronously send the invalidate to the OS.  Then, at least on Mac, even when we had a ton of worker events in the queue we would still service the paint events the OS sent us, because those went in a _different_ task queue and the Mac event loop mode makes sure to service that task queue even if there's lots of pending stuff in the nsThread's XPCOM event queue.  So we would end up painting, even though there were lots of worker events pending.  And the paints would flush layout as needed.  So at least the page itself would update.  I'm not sure about other UI interaction, because I can only reproduce the "lots of events in the queue" state in an opt build during that transition period when the modal dialog is coming up.  During this time, the UI is definitely _not_ responsive in either the old build or the new build.  It paints fine, but doesn't so much respond to mouse clicks, say.

The old sync-invalidation setup plus forced paint is exactly what we were trying to get rid of with bug 598482...

The real issue is that when the event loop is stuffed we really want to prioritize user-visible events of various sorts over the worker events.  We used to get that somewhat for free for _paints_ because the paint events lived in a different task queue.  And we'll get there again (but better!) when nfroyd finishes up his stuff for prioritizing events.  I'm not quite sure what, if anything, we want to do here in the meantime.
One thing that we could do is bug 609042 comment 9. That should prevent flooding the main thread event queue with worker messages.

Of course, an even better solution to that problem might be to use nfroyd's priorities to give worker messages low priority.
> One thing that we could do is bug 609042 comment 9.

Yes, that's what I was just thinking this morning.  That or some other setup that caps how long workers process events for, or caps the number of worker events in the main-thread event queue.

> Of course, an even better solution to that problem might be to use nfroyd's priorities 

Yes.

The other thing we should strongly consider doing, with nfroyd's stuff or not, is to give refresh ticks a _high_ priority.  We know they won't happen too often (modulo XPCOM timer suck), and that when they should happen they really need to happen...
Adding qawanted to see if we're able to reproduce the STR in comment 0. We'd like to know

1) On what configuration it occurred
2) How frequently we were able to repro
3) Whether or not the slow script dialog came up

We may untrack depending on the frequency of reproducibility.
Keywords: qawanted
I think the worker parts here shouldn't block anything. We've had those issues as long as we've had workers. I think bug 609042 is enough to track that issue.

If I understand correctly that leaves the fact that bug 598482 caused "animations" to get choppy when the event loop is flooded. That doesn't seem like a huge problem to me since user interaction has always been bad in that case.

Of course, it might affect perceived performance which is bad.

So doing the last paragraph of comment 17 sounds like it could be a good idea, as long as it's not too much work (fluffy definition I know)
Doing the last part of comment 17 either means no longer using XPCOM timers for the refresh driver and instead using something that will get notified even when there are lots of pending XPCOM events, or adding a flag to XPCOM timers to make them "high priority" and adding some way to post events to the front of the main thread XPCOM event queue.  And then hoping that no one misuses those new APIs...  :(
I'm not worried about people misusing the API given that it's temporary until we get the new priority stuff, right?

I'll leave the decision to you of if you think it's worth doing this for now or not.
Quite honestly, I think for beta none of those are really viable: too risky.  Imo.
I compared bz's private build to Nightly 2012-01-15 (pre-regression build) and 2012-05-03 (today's build). Here is a summary of what I saw:

* Nightly 2012-01-15:
** The prime numbers update continuously.
** Switching tabs takes about 5 seconds.
** The Unresponsive Script dialog pops up after about 30 seconds.

* Nightly 2012-05-03:
** The prime numbers updates are about 2 seconds apart.
** Switching tabs takes about 5 seconds.
** The Unresponsive Script dialog pops up after about 15 seconds. Curiously, after clicking "Continue" on the Unresponsive Script dialog, the prime numbers update continuously, yet tab switching is then instantaneous. Perhaps the number of web worker events drops off as the prime numbers become further apart, allowing the main thread to process UI events sooner?

* bz's build:
** The prime numbers updates are about 2 seconds apart.
** Switching tabs takes about 5 seconds.
** The Unresponsive Script dialog never pops up! After about 20 seconds, the prime numbers update continuously and tab switching is instantaneous (the same behavior I saw on the 2012-05-03 build).
Cool, I think that matches expectations based on what we understand about this bug so far.

So it seems like the only actual effect of the regression here is that we'll in some cases paint less often. "Some cases" means in situations where the event loop is flooded and UI interaction (such as switching tabs) works poorly anyway.

So I would say that this bug only affects perceived performance. We should actually be counting prime numbers just as fast, faster even, now compared to before bug 598482. However we'll paint the results less often.

That's not to say that this isn't important. Perceived performance is almost as important as actual performance. But I'm not sure that this will hit a lot of websites, and the risks in taking a fix are pretty high.

So that leaves us with a few choices:

* Back out bug 598482. I don't think we should do this. Mostly based on hoping that
  this won't affect many sites.
* Write up a fix and land on trunk (and maybe aurora).
* Do nothing and wait for nfroyd's proper fix.

I don't have much of an opinion regarding which of the last two is better.
I think I agree with the analysis in comment 24 (though I do think we should consider landing bug 751458 on aurora and maybe even beta).

Nathan, do you have any idea of timeframes yet?  Would creating some sort of simplified mechanism to solve some of the issues here (prioritizing refresh ticks, deprioritizing worker events) be a good stepping-stone on your project, or just a distraction?
Testing on Ubuntu 11.10 64-bit:

Firefox 10.0: 
* Counter is continuous
* Stop Script Warning after 30 seconds
* UI very slow to respond
* Clicking "stop script" does not improve responsiveness

Firefox 11.0: 
* Counter is continuous
* No stop script warning
* UI very slow to respond (same as Firefox 10)

Firefox 12.0: 
* Counter is continuous
* No stop script warning
* UI slow to respond (seems better than Firefox 10 and 11)

Firefox 13.0b2: 
* Counter is NOT continuous
* Stop script warning after 15 seconds
* UI responds as normal
* Clicking "stop script" makes counter continuous
* Stop Script Warning reprompts every 15 seconds

Firefox 14.0a2 Aurora 2012-05-08: 
* Same as Firefox 13.0b2

Firefox 15.0a1 Nightly 2012-05-08:
* Counter starts out slow but becomes continuous after about 5 seconds
* No stop script warning appears
* UI responds as normal

Steps:
1. Install Firefox
2. Start with a new profile
3. Navigate to http://htmlfive.appspot.com/static/primes-good.html
4. Click GO
5. Wait for counter to reach 50K
6. Open a new tab
7. Switch to first tab
8. Close second tab
9. Wait for 30 seconds
10. Kill Firefox

To try to answer your questions, Alex (as per comment 18):
> 1) On what configuration it occurred

Seems to happen in some form or another across all versions going back to Firefox 10, though Nightly seems to do the best.

> 2) How frequently we were able to repro

The results I indicate in this bug are 100% reproducible.

> 3) Whether or not the slow script dialog came up

Appears in Firefox 10, 14, and 15.
(In reply to Boris Zbarsky (:bz) from comment #25)
> I think I agree with the analysis in comment 24 (though I do think we should
> consider landing bug 751458 on aurora and maybe even beta).

We've approved bug 751458 for Aurora 14, but given the lack of dupes and the risk of regression, we're going to leave it unfixed in FF13.

Since all remaining options are too risky for beta, we'll wontfix for that release.

(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #26)
> > 3) Whether or not the slow script dialog came up
> 
> Appears in Firefox 10, 14, and 15.

Thanks for these very clear results Anthony. I'm also very interested in the differing experiences across these versions, and the fact that this bug appears to be fixed on FF15.
Over to default owner, but we really just need to fix bug 609042.
Assignee: bzbarsky → nobody
Depends on: 609042
Removing qawanted given comment 27. Please re-add if there is something else QA can do.
Keywords: qawanted
I don't believe there's anything actionable here at this point, and this isn't a critical issue post-release of FF13. Untracking.
Whiteboard: [Snappy] → [Snappy:p3]
This hang is no longer reproducible (as of Firefox 26).
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
The OP behaviour is still manifested in FF 36.0.1 on Ubuntu 14.04. The tabs are barely switching. Google Chrome, however, handles tab switching/scrolling nicely.
I tested all Firefox versions from 28.0 through 40.0a1 with a new profile.
* Firefox 28.0 through 31.0 are immediately responsive and does not result in any jankiness
* Firefox 32.0 through 40.0 results in several seconds of initial jankiness before becoming responsive.
* Firefox 40.0 with e10s on results in a few seconds shorter jankiness period than with e10s off.

I also tested this with my daily use profile which resulted in the browser being completely hung to the point where I had to kill the process and restart Firefox.

Based on these results it looks like there might be a regression in Firefox 32. I'm reopening this bug and will try to narrow down the range. 

Thank you @dansmith for calling this out.

[Tracking Requested - why for this release]: nominating to track since this is a performance regression which puts us at a competitive disadvantage (ie. Chrome is faster).
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Regression range for mozilla-central
====================================
Last Good: 2014-05-15 [e4843f4f08a7]
First Bad: 2014-05-16 [58c5a3427997
Pushlog:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=e4843f4f08a7&tochange=58c5a3427997

Regression range for mozilla-inbound
====================================
Last Good: 2014-05-13 [d6a0c7c471e2]
First Bad: 2014-05-14 [1d2bbffee6ea]
Pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=d6a0c7c471e2&tochange=1d2bbffee6ea
khuey or bent, is this something you can help us find an owner for? 

Maybe this recent work from Anthony be helpful in either figuring out how significant the problem is or leading to a fix.
Flags: needinfo?(khuey)
Flags: needinfo?(bent.mozilla)
Keywords: hang, perf
Please see comment 28. Nothing has changed there, so I have no idea why this bug was closed, or why it is now reopened ;)
Flags: needinfo?(bent.mozilla)
(In reply to Ben Turner [:bent] (use the needinfo flag!) from comment #36)
> Please see comment 28. Nothing has changed there, so I have no idea why this
> bug was closed, or why it is now reopened ;)

I think it makes sense to keep this bug report open until bug 609042 is resolved so we can properly test that it fixes *this* issue.
That makes sense. I talked with :bent and it sounds like this may or may not have been accidentally fixed at some point between two and three years ago. Since we don't think it's affecting overall performance, I am not tracking this for now. 

Please do renominate the bug, if it seems to be a general problem or it becomes a priority.
The link is dead, but I get the same problem with the same example here: https://html.spec.whatwg.org/demos/workers/primes/page.html
Priority: -- → P5
This bug is no longer a problem now that we have e10s and multiple content processes.

I retested with the link from comment 39. Also the prime number counter updates quickly, unlike the behavior I described in comment 0 when reporting the bug.
Status: REOPENED → RESOLVED
Closed: 10 years ago6 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: