Closed Bug 957693 Opened 6 years ago Closed 6 years ago

Investigate the mid-December spike in mochitest shutdown timeouts across all platforms

Categories

(Core :: DOM: Workers, defect, blocker)

defect
Not set
blocker

Tracking

()

RESOLVED FIXED
mozilla29
Tracking Status
firefox27 --- unaffected
firefox28 --- unaffected
firefox29 --- affected
firefox-esr24 --- unaffected
b2g-v1.2 --- unaffected
b2g-v1.3 --- unaffected

People

(Reporter: RyanVM, Assigned: khuey)

References

Details

Attachments

(2 files, 1 obsolete file)

Bug 918759 is very noisy, so mccr8 and I felt it would be better to have a new bug filed for investigating the mid-December spike in mochitest shutdown timeouts rather than trying to do it in the middle of that one.

What we know:
1. Affects all platforms
2. Appears to have started around 17-Dec
3. Only affects trunk (bug 918759 is used for other timeouts as well, but this big spike hasn't appeared on other branches)

I'm going to work on finding a narrower regression range and mccr8 is going to try to figure out where the hang is occurring in the code.
Try bisection indicates that bug 914762 is the cause of this.

Revert to the push prior: https://tbpl.mozilla.org/?tree=Try&rev=1bc27db7ebdf
Revert to the push for bug 914762: https://tbpl.mozilla.org/?tree=Try&rev=977a6c8e890b

Ben, can you please take a look? These shutdown hangs are the #1 orange on trunk by a 4-fold margin.
Assignee: continuation → bent.mozilla
Blocks: 914762
Flags: needinfo?(bent.mozilla)
Thanks for doing the bisection, Ryan!
FWIW, given the high frequency of these timeouts, I'm very likely to attempt backing out bug 914762 before too long if there's a lack of progress here.
Severity: normal → blocker
Component: General → DOM: Workers
Try run of the backout:
https://tbpl.mozilla.org/?tree=Try&rev=a8810f09b56d

I also had to include bug 951671 in the backout due to it depending pretty heavily on bug 914762.
I think I caught this or something closely related in the replay VM.
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #4)
> Try run of the backout:
> https://tbpl.mozilla.org/?tree=Try&rev=a8810f09b56d

Now with less bustage hopefully:
https://tbpl.mozilla.org/?tree=Try&rev=943ef17a7b14
Assignee: bent.mozilla → khuey
Flags: needinfo?(bent.mozilla)
Attached patch PatchSplinter Review
This patch fixes 5 small bugs in the worker code.

1. ScriptExecutorRunnable shuts down a sync loop, but does not override the Cancel method.  Thus we end up with a wedged sync loop if it is in the queue when the worker begins to shut down.  This is the cause of this bug.
2. NotifyRunnable asserts that aStatus is (among other things) not Closing, which is bogus.  This is timing dependent and triggered by test_onLine which is brand new.
3. AutoSyncLoopHolder's dtor does not destroy the sync loop, so if we ever rely on it to clean up we will fatally assert when we try to shutdown the worker (since the worker thinks it still has a sync queue outstanding).
4. We used to throw away ProxyCompleteRunnables for sync XHRs if a control runnable processed during the sync loop for the send call caused the XHR to cancel.  Instead we relied on the RAII helpers.  This was because sync queues did not obey the RunWhenClearing flag.  Now they do honor the replacement for RunWhenClearing, so we try to clean up twice.  The solution is to remove the early return and actually run the sync loop.
5. A bit of botched indentation.
Attachment #8358679 - Flags: review?(bent.mozilla)
Comment on attachment 8358679 [details] [diff] [review]
Patch

Review of attachment 8358679 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good! Thanks for digging here.

::: dom/workers/WorkerPrivate.cpp
@@ +4885,5 @@
> +  MOZ_ASSERT(mSyncLoopStack.Length() - 1 == aLoopIndex);
> +
> +  if (!aThread) {
> +    nsCOMPtr<nsIThreadInternal> thread = do_QueryInterface(mThread);
> +    MOZ_ASSERT(thread);

It's probably overkill but you should probably MOZ_ASSERT(mThread == static_cast<nsIThread*>(thread.get())) here too
Attachment #8358679 - Flags: review?(bent.mozilla) → review+
And at least one of the m4 retriggers had the shutdown timeouts: https://tbpl.mozilla.org/php/getParsedLog.php?id=32850996&tree=Mozilla-Inbound
Are you saying that I made that worse?

There is no worker code running on any of the threads in that log, so that seems like an unrelated issue.
(In reply to Wes Kocher (:KWierso) from comment #11)
> And at least one of the m4 retriggers had the shutdown timeouts:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=32850996&tree=Mozilla-
> Inbound

This is also something different I haven't seen before.
https://hg.mozilla.org/mozilla-central/rev/5cae5d9ef313
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
(Due to comments 10-15)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch Moar Patch (obsolete) — Splinter Review
Attachment #8359479 - Flags: review?(bent.mozilla)
Attached patch Moar PatchSplinter Review
Attachment #8359479 - Attachment is obsolete: true
Attachment #8359479 - Flags: review?(bent.mozilla)
Attachment #8359481 - Flags: review?(bent.mozilla)
Attachment #8359481 - Flags: review?(bent.mozilla) → review+
https://hg.mozilla.org/mozilla-central/rev/3d91cf687759
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Depends on: CVE-2014-1537
You need to log in before you can comment on or make changes to this bug.