Closed Bug 965309 Opened 12 years ago Closed 11 years ago

Worker hang at shutdown with XHR

Categories

(Core :: DOM: Workers, defect)

defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla32
Tracking Status
firefox29 - wontfix
firefox30 + fixed
firefox31 + fixed
firefox32 + fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: ttaubert, Assigned: khuey)

References

Details

(Keywords: hang, regression)

Attachments

(1 file, 5 obsolete files)

With a slightly modified Firefox that doesn't start the SessionWorker early but spawns it on shutdown, I can reproduce a worker hang in 1 of 10-20 tries. This happens only when Proxy::Teardown() is called by a SyncTeardownRunnable. mOutstandingSendCount=1 is one which means we dispatch a XHRUnpinRunnable that successfully calls Unpin(). On normal shutdowns, the XHR's refcount is 4 in Unpin(). When hanging, the refcount is 5 in Unpin(). Looks like something is holding a reference to the XHR that then hangs the shutdown process.
It's the |autoSyncLoop.ref().Run()| call at the very end of XMLHttpRequest::SendInternal() that's hanging. Not sure if that's obvious but I wanted to mention it.
The LoadStartDetectionRunnable would usually create a ProxyCompleteRunnable that then calls StopSyncLoop() but because mOutstandingSendCount is set to zero in Proxy::Teardown() that will never happen.
Re-adding this hunk doesn't solve the hang, the XHR is probably canceled after we started spinning.
How about this? We could also create specialized runnable only for that task that omits Unpin() to keep the assertion. Or extend ProxyCompleteRunnable to take a boolean. This does fix the hang for me.
Attachment #8368018 - Flags: feedback?(khuey)
Attachment #8368018 - Attachment is obsolete: true
Attachment #8368018 - Flags: feedback?(khuey)
Blocks: 1005487
Could you imagine some STR we could use to reproduce this? I would like to see when this has been really regressed.
Flags: needinfo?(ttaubert)
I don't know. I'm honestly not really interested to know what has regressed this and if this is a regression. We should just fix it.
Assignee: nobody → ttaubert
Status: NEW → ASSIGNED
Oh, a clear STR is in bug 1006478. Just enable "Clear History when closing Firefox" and start and quit the browser a couple of times.
When spawning a worker on shutdown, this worker mostly loads some resources using require() which is implemented by a sync XHR inside the worker. If the proxy is released and torn down before the LoadStartDetectionRunnable has run we currently don't stop the XHR's sync loop. This patch fixes it by making SyncTeardownRunnables dispatch a MainThreadStopSyncLoopRunnable after tearing down the proxy. This fixes the shutdown hang for me.
Attachment #8420058 - Flags: review?(khuey)
Attachment #8420058 - Flags: review?(bent.mozilla)
Bumping the priority a bit. Bug 1005487 has a lot of duplicates and this seems to hit quite a lot of users.
Severity: normal → major
Review ping? This affects Firefox 30 and in case reviewers agree that this is safe enough to uplift we really would like to do that.
We're not really interested in a regression window. This has probably been around since Worker XHRs have been rewritten not too long ago.
Comment on attachment 8420058 [details] [diff] [review] 0001-Bug-965309-Stop-the-SyncLoop-of-a-running-sync-XHR-i.patch Review of attachment 8420058 [details] [diff] [review]: ----------------------------------------------------------------- Sorry I took so long here. I wanted to convince myself this would actually happen, so I wrote this test. ::: dom/workers/XMLHttpRequest.cpp @@ +569,5 @@ > + if (mProxy->mSyncLoopTarget) { > + nsRefPtr<MainThreadStopSyncLoopRunnable> runnable = > + new MainThreadStopSyncLoopRunnable(mWorkerPrivate, > + mProxy->mSyncLoopTarget.forget(), > + true); Indent these at the same level as mWorkerPrivate please.
Attachment #8420058 - Flags: review?(khuey)
Attachment #8420058 - Flags: review?(bent.mozilla)
Attachment #8420058 - Flags: review+
I would like to let this sit on tinderbox for a couple days before we uplift. Usually that's enough to shake out any race conditions in this stuff that we missed.
Comment on attachment 8420058 [details] [diff] [review] 0001-Bug-965309-Stop-the-SyncLoop-of-a-running-sync-XHR-i.patch Review of attachment 8420058 [details] [diff] [review]: ----------------------------------------------------------------- ::: dom/workers/XMLHttpRequest.cpp @@ +569,5 @@ > + if (mProxy->mSyncLoopTarget) { > + nsRefPtr<MainThreadStopSyncLoopRunnable> runnable = > + new MainThreadStopSyncLoopRunnable(mWorkerPrivate, > + mProxy->mSyncLoopTarget.forget(), > + true); And actually, the boolean should be false, not true.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #16) > Sorry I took so long here. I wanted to convince myself this would actually > happen, so I wrote this test. Yay, I had a hard time coming up with a test for this, thanks! (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #17) > I would like to let this sit on tinderbox for a couple days before we > uplift. Usually that's enough to shake out any race conditions in this > stuff that we missed. Sure, that sounds reasonable.
Backed out for m-4 failures: https://hg.mozilla.org/integration/mozilla-inbound/rev/0aeea1609e53 https://hg.mozilla.org/integration/mozilla-inbound/rev/ded6d2a3f18c Assertion failure: cx, at /builds/slave/m-in-osx64-d-00000000000000000/build/dom/workers/WorkerRunnable.cpp:284 TEST-UNEXPECTED-FAIL | /tests/dom/workers/test/test_terminateSyncXHR.html | application terminated with exit code 1 PROCESS-CRASH | /tests/dom/workers/test/test_terminateSyncXHR.html | application crashed [@ mozilla::dom::workers::WorkerRunnable::Run()] Return code: 1
These failures affect Linux and OS X 10.6 - Windows and 10.8 seem fine. I can't reproduce that on my Linux 64-bit VM.
I am seeing this happen even when "clear history when Firefox closes" is turned off on 30b7. The problem seems to be getting worse.
Sorry, slight error on my part. Private browsing mode was enabled and on first glance the "clear history when Firefox closes" box was unchecked. Not being in Private browsing mode and unchecking "clear history when Firefox closes" does allow proper shutdown. Apologies.
Does that assertion happen reliably on the platforms in comment 22?
Flags: needinfo?(ttaubert)
Seems to happen quite reliably, yes. Only Linux 32 debug didn't fail for some reason. Android 2.* crashes as well, but not Android 4. https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=97e18fbba4b6
Flags: needinfo?(ttaubert)
On OS X, I can make it crash quite reliably after ~30s with: mach mochitest-plain dom/workers/test/test_terminateSyncXHR.html --run-until-failure --repeat 1000
I pushed https://tbpl.mozilla.org/?tree=Try&rev=dc862998755a with the patch from bug 1010784 but that seems to make this overshoot too far the opposite direction (in that sometimes we're not shutting down the sync loop when we should be).
Aha, it was really hard to get my Linux VM to reproduce the crash. It immediately happens however as soon as I set the number of CPUs to one. Now I can reproduce it always after ~5 runs of the test.
The problem seems to be that there is a small window between when we dispatch the MainThreadStopSyncLoopRunnable and when it actually runs. If the sync XHR completes in between those two events then |mWorkerPrivate->ScheduleDeletion(WorkerPrivate::WorkerRan)| is called by the RuntimeService. This will process pending events including the MainThreadStopSyncLoopRunnable which still runs even if we try to Cancel() it and thus makes us crash.
So why does StopSyncLoopRunnable::Cancel() call Run() even though it should be cancelled? Can we do |if (mWorkerPrivate->GetJSContext()) { Run(); }| to only call Run() when the JSContext is still available? That seems to fix the crashes for me locally.
Flags: needinfo?(khuey)
(In reply to Tim Taubert [:ttaubert] from comment #31) > Can we do |if (mWorkerPrivate->GetJSContext()) { Run(); }| to only call > Run() when the JSContext is still available? That seems to fix the crashes > for me locally. So what do you think about this approach? Seems to work fine on try: https://tbpl.mozilla.org/?tree=Try&rev=2c352b67a350
Attachment #8420058 - Attachment is obsolete: true
Attachment #8429038 - Flags: review?(khuey)
Flags: needinfo?(khuey)
Triage drive-by: This is the last week of Beta landings for FF30 - need to know if this is low risk enough to take to branches this week or if we have to wait another cycle.
Flags: needinfo?(khuey)
Comment on attachment 8429038 [details] [diff] [review] 0001-Bug-965309-Stop-the-SyncLoop-of-a-running-sync-XHR-i.patch, v2 Review of attachment 8429038 [details] [diff] [review]: ----------------------------------------------------------------- No, this is just wallpaper.
Attachment #8429038 - Flags: review?(khuey) → review-
Assignee: ttaubert → khuey
Flags: needinfo?(khuey)
Attached patch Patch (obsolete) — Splinter Review
This is the correct fix, I think.
Attachment #8429038 - Attachment is obsolete: true
Attachment #8430318 - Flags: review?(bent.mozilla)
Comment on attachment 8430318 [details] [diff] [review] Patch Review of attachment 8430318 [details] [diff] [review]: ----------------------------------------------------------------- r=me with this: ::: dom/workers/XMLHttpRequest.cpp @@ +335,5 @@ > XMLHttpRequest* aXHRPrivate, uint32_t aChannelId) > : MainThreadProxyRunnable(aWorkerPrivate, aProxy), > mXMLHttpRequestPrivate(aXHRPrivate), mChannelId(aChannelId) > + { > + mProxy->mSyncLoopTarget = nullptr; Let's move this to LoadStartDetectionRunnable::Run @@ +950,5 @@ > AddRemoveEventListeners(false, false); > mXHR->Abort(); > > if (mOutstandingSendCount) { > + MOZ_ASSERT(mWorkerPrivate); This seems unnecessary
Attachment #8430318 - Flags: review?(bent.mozilla) → review+
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Attachment #8427394 - Attachment is obsolete: true
Attachment #8430318 - Attachment is obsolete: true
Attachment #8430816 - Flags: review+
What's the chance of getting this to beta? I see lots of tests in the patch - is this safe to uplift?
Flags: needinfo?(khuey)
I am comfortable landing this on beta. I'm not convinced that it fixes every possible way that we can hang, but I am convinced that it does fix at least one possible way that we can hang, and that it will not regress things further.
Flags: needinfo?(khuey)
Comment on attachment 8430816 [details] [diff] [review] Patch (as landed) [Approval Request Comment] Bug caused by (feature/regressing bug #): Unknown User impact if declined: Shutdown hangs when using Clear Private Data at shutdown (Bug 1005487) Testing completed (on m-c, etc.): On m-c, has tests, has been tested quite a bit in our record and replay debugging setup. Risk to taking this patch (and alternatives if risky): This is low risk. I am confident that this will not regress anything. I am slightly less confident that this will fix all of the shutdown hang problems, but it does fix the main one we've identified. String or IDL/UUID changes made by this patch: N/A NB: We have to take bug 1010784 with this too, as this patch depends on that one.
Attachment #8430816 - Flags: approval-mozilla-beta?
Attachment #8430816 - Flags: approval-mozilla-aurora?
Attachment #8430816 - Flags: approval-mozilla-beta?
Attachment #8430816 - Flags: approval-mozilla-beta+
Attachment #8430816 - Flags: approval-mozilla-aurora?
Attachment #8430816 - Flags: approval-mozilla-aurora+
Thanks for helping and taking over, Kyle! Glad to see this fixed :)
No worries. This is still mostly your patch, I just moved things one level down the callstack. I figured it was better to just write it down rather than make you suffer through a couple more review cycles :P
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: