Closed
Bug 965309
Opened 12 years ago
Closed 11 years ago
Worker hang at shutdown with XHR
Categories
(Core :: DOM: Workers, defect)
Core
DOM: Workers
Tracking
()
People
(Reporter: ttaubert, Assigned: khuey)
References
Details
(Keywords: hang, regression)
Attachments
(1 file, 5 obsolete files)
5.97 KB,
patch
|
khuey
:
review+
lsblakk
:
approval-mozilla-aurora+
lsblakk
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
With a slightly modified Firefox that doesn't start the SessionWorker early but spawns it on shutdown, I can reproduce a worker hang in 1 of 10-20 tries.
This happens only when Proxy::Teardown() is called by a SyncTeardownRunnable. mOutstandingSendCount=1 is one which means we dispatch a XHRUnpinRunnable that successfully calls Unpin().
On normal shutdowns, the XHR's refcount is 4 in Unpin(). When hanging, the refcount is 5 in Unpin(). Looks like something is holding a reference to the XHR that then hangs the shutdown process.
Reporter | ||
Comment 1•12 years ago
|
||
It's the |autoSyncLoop.ref().Run()| call at the very end of XMLHttpRequest::SendInternal() that's hanging. Not sure if that's obvious but I wanted to mention it.
Reporter | ||
Comment 2•12 years ago
|
||
The LoadStartDetectionRunnable would usually create a ProxyCompleteRunnable that then calls StopSyncLoop() but because mOutstandingSendCount is set to zero in Proxy::Teardown() that will never happen.
Assignee | ||
Comment 3•12 years ago
|
||
I suspect that the hunk at http://hg.mozilla.org/mozilla-central/rev/5cae5d9ef313#l4.24 regressed this.
Reporter | ||
Comment 4•12 years ago
|
||
Re-adding this hunk doesn't solve the hang, the XHR is probably canceled after we started spinning.
Reporter | ||
Comment 5•12 years ago
|
||
How about this? We could also create specialized runnable only for that task that omits Unpin() to keep the assertion. Or extend ProxyCompleteRunnable to take a boolean.
This does fix the hang for me.
Attachment #8368018 -
Flags: feedback?(khuey)
Reporter | ||
Updated•11 years ago
|
Attachment #8368018 -
Attachment is obsolete: true
Attachment #8368018 -
Flags: feedback?(khuey)
Comment 7•11 years ago
|
||
Could you imagine some STR we could use to reproduce this? I would like to see when this has been really regressed.
Flags: needinfo?(ttaubert)
Keywords: regressionwindow-wanted
Reporter | ||
Comment 8•11 years ago
|
||
I don't know. I'm honestly not really interested to know what has regressed this and if this is a regression. We should just fix it.
status-firefox29:
--- → affected
status-firefox30:
--- → affected
status-firefox31:
--- → affected
status-firefox32:
--- → affected
tracking-firefox29:
--- → ?
tracking-firefox30:
--- → ?
tracking-firefox31:
--- → ?
tracking-firefox32:
--- → ?
Flags: needinfo?(ttaubert)
Reporter | ||
Updated•11 years ago
|
Assignee: nobody → ttaubert
Status: NEW → ASSIGNED
Reporter | ||
Comment 9•11 years ago
|
||
Oh, a clear STR is in bug 1006478. Just enable "Clear History when closing Firefox" and start and quit the browser a couple of times.
Reporter | ||
Comment 10•11 years ago
|
||
When spawning a worker on shutdown, this worker mostly loads some resources using require() which is implemented by a sync XHR inside the worker. If the proxy is released and torn down before the LoadStartDetectionRunnable has run we currently don't stop the XHR's sync loop.
This patch fixes it by making SyncTeardownRunnables dispatch a MainThreadStopSyncLoopRunnable after tearing down the proxy.
This fixes the shutdown hang for me.
Attachment #8420058 -
Flags: review?(khuey)
Attachment #8420058 -
Flags: review?(bent.mozilla)
Updated•11 years ago
|
Reporter | ||
Comment 11•11 years ago
|
||
Bumping the priority a bit. Bug 1005487 has a lot of duplicates and this seems to hit quite a lot of users.
Severity: normal → major
Comment 12•11 years ago
|
||
Reporter | ||
Comment 13•11 years ago
|
||
Review ping? This affects Firefox 30 and in case reviewers agree that this is safe enough to uplift we really would like to do that.
Reporter | ||
Comment 14•11 years ago
|
||
We're not really interested in a regression window. This has probably been around since Worker XHRs have been rewritten not too long ago.
Keywords: regressionwindow-wanted
Assignee | ||
Comment 15•11 years ago
|
||
Assignee | ||
Comment 16•11 years ago
|
||
Comment on attachment 8420058 [details] [diff] [review]
0001-Bug-965309-Stop-the-SyncLoop-of-a-running-sync-XHR-i.patch
Review of attachment 8420058 [details] [diff] [review]:
-----------------------------------------------------------------
Sorry I took so long here. I wanted to convince myself this would actually happen, so I wrote this test.
::: dom/workers/XMLHttpRequest.cpp
@@ +569,5 @@
> + if (mProxy->mSyncLoopTarget) {
> + nsRefPtr<MainThreadStopSyncLoopRunnable> runnable =
> + new MainThreadStopSyncLoopRunnable(mWorkerPrivate,
> + mProxy->mSyncLoopTarget.forget(),
> + true);
Indent these at the same level as mWorkerPrivate please.
Attachment #8420058 -
Flags: review?(khuey)
Attachment #8420058 -
Flags: review?(bent.mozilla)
Attachment #8420058 -
Flags: review+
Assignee | ||
Comment 17•11 years ago
|
||
I would like to let this sit on tinderbox for a couple days before we uplift. Usually that's enough to shake out any race conditions in this stuff that we missed.
Assignee | ||
Comment 18•11 years ago
|
||
Comment on attachment 8420058 [details] [diff] [review]
0001-Bug-965309-Stop-the-SyncLoop-of-a-running-sync-XHR-i.patch
Review of attachment 8420058 [details] [diff] [review]:
-----------------------------------------------------------------
::: dom/workers/XMLHttpRequest.cpp
@@ +569,5 @@
> + if (mProxy->mSyncLoopTarget) {
> + nsRefPtr<MainThreadStopSyncLoopRunnable> runnable =
> + new MainThreadStopSyncLoopRunnable(mWorkerPrivate,
> + mProxy->mSyncLoopTarget.forget(),
> + true);
And actually, the boolean should be false, not true.
Reporter | ||
Comment 19•11 years ago
|
||
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #16)
> Sorry I took so long here. I wanted to convince myself this would actually
> happen, so I wrote this test.
Yay, I had a hard time coming up with a test for this, thanks!
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #17)
> I would like to let this sit on tinderbox for a couple days before we
> uplift. Usually that's enough to shake out any race conditions in this
> stuff that we missed.
Sure, that sounds reasonable.
Reporter | ||
Comment 20•11 years ago
|
||
Reporter | ||
Comment 21•11 years ago
|
||
Backed out for m-4 failures:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0aeea1609e53
https://hg.mozilla.org/integration/mozilla-inbound/rev/ded6d2a3f18c
Assertion failure: cx, at /builds/slave/m-in-osx64-d-00000000000000000/build/dom/workers/WorkerRunnable.cpp:284
TEST-UNEXPECTED-FAIL | /tests/dom/workers/test/test_terminateSyncXHR.html | application terminated with exit code 1
PROCESS-CRASH | /tests/dom/workers/test/test_terminateSyncXHR.html | application crashed [@ mozilla::dom::workers::WorkerRunnable::Run()]
Return code: 1
Reporter | ||
Comment 22•11 years ago
|
||
These failures affect Linux and OS X 10.6 - Windows and 10.8 seem fine. I can't reproduce that on my Linux 64-bit VM.
Comment 23•11 years ago
|
||
I am seeing this happen even when "clear history when Firefox closes" is turned off on 30b7. The problem seems to be getting worse.
Comment 24•11 years ago
|
||
Sorry, slight error on my part. Private browsing mode was enabled and on first glance the "clear history when Firefox closes" box was unchecked. Not being in Private browsing mode and unchecking "clear history when Firefox closes" does allow proper shutdown. Apologies.
Assignee | ||
Comment 25•11 years ago
|
||
Does that assertion happen reliably on the platforms in comment 22?
Flags: needinfo?(ttaubert)
Reporter | ||
Comment 26•11 years ago
|
||
Seems to happen quite reliably, yes. Only Linux 32 debug didn't fail for some reason. Android 2.* crashes as well, but not Android 4.
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=97e18fbba4b6
Flags: needinfo?(ttaubert)
Reporter | ||
Comment 27•11 years ago
|
||
On OS X, I can make it crash quite reliably after ~30s with:
mach mochitest-plain dom/workers/test/test_terminateSyncXHR.html --run-until-failure --repeat 1000
Assignee | ||
Comment 28•11 years ago
|
||
I pushed https://tbpl.mozilla.org/?tree=Try&rev=dc862998755a with the patch from bug 1010784 but that seems to make this overshoot too far the opposite direction (in that sometimes we're not shutting down the sync loop when we should be).
Reporter | ||
Comment 29•11 years ago
|
||
Aha, it was really hard to get my Linux VM to reproduce the crash. It immediately happens however as soon as I set the number of CPUs to one. Now I can reproduce it always after ~5 runs of the test.
Reporter | ||
Comment 30•11 years ago
|
||
The problem seems to be that there is a small window between when we dispatch the MainThreadStopSyncLoopRunnable and when it actually runs. If the sync XHR completes in between those two events then |mWorkerPrivate->ScheduleDeletion(WorkerPrivate::WorkerRan)| is called by the RuntimeService.
This will process pending events including the MainThreadStopSyncLoopRunnable which still runs even if we try to Cancel() it and thus makes us crash.
Reporter | ||
Comment 31•11 years ago
|
||
So why does StopSyncLoopRunnable::Cancel() call Run() even though it should be cancelled?
Can we do |if (mWorkerPrivate->GetJSContext()) { Run(); }| to only call Run() when the JSContext is still available? That seems to fix the crashes for me locally.
Flags: needinfo?(khuey)
Reporter | ||
Comment 32•11 years ago
|
||
(In reply to Tim Taubert [:ttaubert] from comment #31)
> Can we do |if (mWorkerPrivate->GetJSContext()) { Run(); }| to only call
> Run() when the JSContext is still available? That seems to fix the crashes
> for me locally.
So what do you think about this approach? Seems to work fine on try:
https://tbpl.mozilla.org/?tree=Try&rev=2c352b67a350
Attachment #8420058 -
Attachment is obsolete: true
Attachment #8429038 -
Flags: review?(khuey)
Flags: needinfo?(khuey)
Comment 33•11 years ago
|
||
Triage drive-by: This is the last week of Beta landings for FF30 - need to know if this is low risk enough to take to branches this week or if we have to wait another cycle.
Flags: needinfo?(khuey)
Assignee | ||
Comment 34•11 years ago
|
||
Comment on attachment 8429038 [details] [diff] [review]
0001-Bug-965309-Stop-the-SyncLoop-of-a-running-sync-XHR-i.patch, v2
Review of attachment 8429038 [details] [diff] [review]:
-----------------------------------------------------------------
No, this is just wallpaper.
Attachment #8429038 -
Flags: review?(khuey) → review-
Assignee | ||
Updated•11 years ago
|
Assignee: ttaubert → khuey
Flags: needinfo?(khuey)
Assignee | ||
Comment 35•11 years ago
|
||
This is the correct fix, I think.
Attachment #8429038 -
Attachment is obsolete: true
Attachment #8430318 -
Flags: review?(bent.mozilla)
Comment on attachment 8430318 [details] [diff] [review]
Patch
Review of attachment 8430318 [details] [diff] [review]:
-----------------------------------------------------------------
r=me with this:
::: dom/workers/XMLHttpRequest.cpp
@@ +335,5 @@
> XMLHttpRequest* aXHRPrivate, uint32_t aChannelId)
> : MainThreadProxyRunnable(aWorkerPrivate, aProxy),
> mXMLHttpRequestPrivate(aXHRPrivate), mChannelId(aChannelId)
> + {
> + mProxy->mSyncLoopTarget = nullptr;
Let's move this to LoadStartDetectionRunnable::Run
@@ +950,5 @@
> AddRemoveEventListeners(false, false);
> mXHR->Abort();
>
> if (mOutstandingSendCount) {
> + MOZ_ASSERT(mWorkerPrivate);
This seems unnecessary
Attachment #8430318 -
Flags: review?(bent.mozilla) → review+
Assignee | ||
Comment 37•11 years ago
|
||
Comment 38•11 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Assignee | ||
Comment 39•11 years ago
|
||
Attachment #8427394 -
Attachment is obsolete: true
Attachment #8430318 -
Attachment is obsolete: true
Attachment #8430816 -
Flags: review+
Comment 40•11 years ago
|
||
What's the chance of getting this to beta? I see lots of tests in the patch - is this safe to uplift?
Flags: needinfo?(khuey)
Assignee | ||
Comment 41•11 years ago
|
||
I am comfortable landing this on beta. I'm not convinced that it fixes every possible way that we can hang, but I am convinced that it does fix at least one possible way that we can hang, and that it will not regress things further.
Flags: needinfo?(khuey)
Assignee | ||
Comment 42•11 years ago
|
||
Comment on attachment 8430816 [details] [diff] [review]
Patch (as landed)
[Approval Request Comment]
Bug caused by (feature/regressing bug #): Unknown
User impact if declined: Shutdown hangs when using Clear Private Data at shutdown (Bug 1005487)
Testing completed (on m-c, etc.): On m-c, has tests, has been tested quite a bit in our record and replay debugging setup.
Risk to taking this patch (and alternatives if risky): This is low risk. I am confident that this will not regress anything. I am slightly less confident that this will fix all of the shutdown hang problems, but it does fix the main one we've identified.
String or IDL/UUID changes made by this patch: N/A
NB: We have to take bug 1010784 with this too, as this patch depends on that one.
Attachment #8430816 -
Flags: approval-mozilla-beta?
Attachment #8430816 -
Flags: approval-mozilla-aurora?
Updated•11 years ago
|
Attachment #8430816 -
Flags: approval-mozilla-beta?
Attachment #8430816 -
Flags: approval-mozilla-beta+
Attachment #8430816 -
Flags: approval-mozilla-aurora?
Attachment #8430816 -
Flags: approval-mozilla-aurora+
Comment 43•11 years ago
|
||
Comment 44•11 years ago
|
||
Flags: in-testsuite+
Updated•11 years ago
|
status-b2g-v1.4:
--- → fixed
status-b2g-v2.0:
--- → fixed
Reporter | ||
Comment 45•11 years ago
|
||
Thanks for helping and taking over, Kyle! Glad to see this fixed :)
Assignee | ||
Comment 46•11 years ago
|
||
No worries. This is still mostly your patch, I just moved things one level down the callstack. I figured it was better to just write it down rather than make you suffer through a couple more review cycles :P
You need to log in
before you can comment on or make changes to this bug.
Description
•