Closed
Bug 951120
Opened 11 years ago
Closed 11 years ago
Thread manager shutdown hang when running metro mochitests
Categories
(Firefox for Metro Graveyard :: Tests, defect, P2)
Tracking
(firefox28 fixed, firefox29 fixed)
RESOLVED
FIXED
Firefox 29
People
(Reporter: jimm, Assigned: jimm)
References
Details
(Whiteboard: [beta28][defect] p=8 [qa-])
Attachments
(2 files, 1 obsolete file)
3.25 KB, patch | bbondy: review+ | Details | Diff | Splinter Review
11.48 KB, patch | bbondy: review+ | Details | Diff | Splinter Review
+++ This bug was initially created as a clone of Bug #918759 +++
https://tbpl.mozilla.org/php/getParsedLog.php?id=28138006&tree=Mozilla-Inbound
Ubuntu ASAN VM 12.04 x64 mozilla-inbound opt test mochitest-1 on 2013-09-19 23:06:49 PDT for push 76aa567b4255
slave: tst-linux64-ec2-313
TEST-UNEXPECTED-FAIL | Shutdown | application timed out after 330 seconds with no output
Updated•11 years ago
Blocks: 950986
Component: General → Tests
Keywords: intermittent-failure
OS: Linux → Windows 8.1
Product: Core → Firefox for Metro
Summary: Intermittent TEST-UNEXPECTED-FAIL | Shutdown | application timed out after 330 seconds with no output (Ubuntu VM) → Thread manager shutdown hang when running metro mochitests
Whiteboard: [triage]
Version: 26 Branch → Trunk
Updated•11 years ago
Blocks: metrov1backlog
Updated•11 years ago
Whiteboard: [triage] → [triage] [defect] p=0
Updated•11 years ago
Whiteboard: [triage] [defect] p=0 → [defect] p=0
Comment 12•11 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=32701025&tree=B2g-Inbound
Any idea what's going on here, Jim? This is happening more than the bug comments indicate due to this requiring manual starring.
Comment 13•11 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #12)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=32701025&tree=B2g-Inbound
>
> Any idea what's going on here, Jim? This is happening more than the bug
> comments indicate due to this requiring manual starring.
Not sure what's causing this, and I'm not sure how to debug it. Maybe repeated runs locally to try to reproduce, and if so, attach a debugger to the hang. AFAIK we aren't seeing these types of hangs on user machines, so this isn't blocking; as such it won't get much attention for a few weeks.
Comment 14•11 years ago
Given that we've tracked down the cause of the other mochitest hangs to a DOM Workers patch, is it conceivable that this is the same issue after all? See bug 957693.
Comment 15•11 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #14)
> Given that we've tracked down the cause of the other mochitest hangs to a
> DOM Workers patch, is it conceivable that this is the same issue after all?
> See bug 957693.
Sure. Let's do that backout or get bent on the bug, land, and see. It would be great if it was, although I think our timeouts stretch back farther than December.
Comment 16•11 years ago
Well, with the number of mochitest-metro-chrome retriggers on khuey's push ( https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=5cae5d9ef313 ) that came back orange from a shutdown crash, I'm going to guess this didn't fix this one, at the very least.
https://tbpl.mozilla.org/php/getParsedLog.php?id=32850733&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=32850713&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=32850721&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=32850752&tree=Mozilla-Inbound
Comment 17•11 years ago
Looking at the recent history in bug 918759, this appears to be far and away our most frequent mochitest shutdown hang. Can we please find someone to help look into this?
Flags: needinfo?(jmathies)
Comment 18•11 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #17)
> Looking at the recent history in bug 918759, this appears to be far and away
> our most frequent mochitest shutdown hang. Can we please find someone to
> help look into this?
so I guess the thread worker changes didn't help?
Flags: needinfo?(jmathies)
Comment 19•11 years ago
For this hang, no.
Comment 20•11 years ago
I just did a bunch of mochitest-mc retriggers and we're hitting this timeout roughly 50% of the time. If we don't get a handle on this soon, we're going to have to hide the suite by default per the visibility guidelines.
Assignee | ||
Updated•11 years ago
|
Whiteboard: [defect] p=0 → [triage][defect] p=0
Comment 21•11 years ago
Did bug 957693 have any impact on the hangs we were seeing on other test suites?
Comment 22•11 years ago
(In reply to Jim Mathies [:jimm] from comment #21)
> Did bug 957693 have any impact on the hangs we were seeing on other test
> suites?
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #19)
> For this hang, no.
That was an implicit "yes" to your question :). It made a big difference in the other suites.
Comment 23•11 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #22)
> (In reply to Jim Mathies [:jimm] from comment #21)
> > Did bug 957693 have any impact on the hangs we were seeing on other test
> > suites?
>
> (In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #19)
> > For this hang, no.
>
> That was an implicit "yes" to your question :). It made a big difference in
> the other suites.
Ah joy, then it's probably not the same cause. My first guess would be omtc. Maybe we could turn off some big ticket items like omtc and apzc to see if we can isolate it.
Updated•11 years ago
Whiteboard: [triage][defect] p=0 → [beta28][defect] p=0
Comment 24•11 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #22)
> (In reply to Jim Mathies [:jimm] from comment #21)
> > Did bug 957693 have any impact on the hangs we were seeing on other test
> > suites?
>
> (In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #19)
> > For this hang, no.
>
> That was an implicit "yes" to your question :). It made a big difference in
> the other suites.
It must not have fixed the problem 100% of the time; I'm still seeing a lot of non-metro failures.
Comment 25•11 years ago
Pushed a patch that disables dom workers, we'll start with the most likely culprit -
https://tbpl.mozilla.org/?tree=Try&showall=0&rev=dfd97e2a88ce
Comment 26•11 years ago
(In reply to Jim Mathies [:jimm] from comment #24)
> It must not have fixed the problem 100% of the time, I'm still seeing a lot
> of non-metro failures.
No, of course it didn't. It's a false argument to say that it was supposed to. Bug 918759 existed long before the mid-December spike that we know was workers-related. You said yourself in this bug that this hang predates that spike as well.
Since that fix landed, we've already had another fix land for a completely different shutdown hang (bug 951908). And yes, there are other failures in that bug (that's what happens with dumping ground bugs), but until the Workers hang was fixed, they were drowned out by the noise.
That said, since bug 957693 landed, the failure rate in bug 918759 is less than *half* of what it was before. So yes, it made a dramatic improvement. Did it fix everything? No. Was it expected to? No.
The data doesn't lie, mochitest-mc is dominating these timeouts at this point.
http://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=918759&startday=2014-01-11&endday=2014-01-18&tree=trunk
Comment 27•11 years ago
(In reply to Jim Mathies [:jimm] from comment #25)
> Pushed a patch that disables dom workers, we'll start with the most likely
> culprit -
>
> https://tbpl.mozilla.org/?tree=Try&showall=0&rev=dfd97e2a88ce
Disable omtc/apzc -
https://tbpl.mozilla.org/?tree=Try&showall=0&rev=3bd15146ca1b
Comment 28•11 years ago
From what I can tell, this regressed right around 12/16. I do see some APZC and Metro stuff that landed around then. I'm going to try narrowing it down on Try. Will let you know.
Comment 29•11 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #28)
> From what I can tell, this regressed right around 12/16. I do see some APZC
> and Metro stuff that landed around then. I'm going to try narrowing it down
> on Try. Will let you know.
It wasn't a regression, it was an unmasking. :)
https://bugzilla.mozilla.org/show_bug.cgi?id=920952
Comment 30•11 years ago
Ah, crap. I can apply that patch locally to try to bisect, though. Think that would work?
The interesting thing, FWIW, is that the patch from bug 920952 was uplifted to Aurora where we also run the tests, but we don't see the hangs there. So it still may be a Gecko29 regression (which started on 12-6). So the range may not be too bad.
Comment 31•11 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #30)
> Ah, crap. I can apply that patch locally to try to bisect, though. Think
> that would work?
>
> The interesting thing, FWIW, is that the patch from bug 920952 was uplifted
> to Aurora where we also run the tests, but we don't see the hangs there. So
> it still may be a Gecko29 regression (which started on 12-6). So the range
> may not be too bad.
Interesting, I didn't realize it was mc only. I was under the impression this went back a lot farther and affected all trees. We pushed that fflush change to aurora as well, so it wouldn't be masked there.
Comment 32•11 years ago
Maybe it does. I see a few instances on Aurora too looking at the OF stats. I'll try applying that patch locally and see if I can find a range. Just might take longer than I was hoping :)
Comment 33•11 years ago
(In reply to Jim Mathies [:jimm] from comment #27)
> (In reply to Jim Mathies [:jimm] from comment #25)
> > Pushed a patch that disables dom workers, we'll start with the most likely
> > culprit -
> >
> > https://tbpl.mozilla.org/?tree=Try&showall=0&rev=dfd97e2a88ce
>
> Disable omtc/apzc -
>
> https://tbpl.mozilla.org/?tree=Try&showall=0&rev=3bd15146ca1b
Disabling workers helped, but it wasn't a silver bullet. Disabling apzc and omtc didn't seem to have an impact.
Comment 34•11 years ago
For posterity, APZC disabled failed 20% of the time and Workers disabled failed 17% of the time.
Comment 35•11 years ago
So far, on the earlier reverts (without bug 920952 applied on top), I'm hitting occasional 1000s timeouts, but mostly just bug 900641. Even if we assume that every 1000s timeout is what we're hitting in this bug now, we still weren't failing at anywhere near the rate we are now. So we may be able to investigate what has made this so much worse since December 16 too.
Comment 36•11 years ago
Actually, some of those runs are still hitting 15+%, so I think we're in the same ballpark the runs above would indicate we'd expect to be in.
Comment 37•11 years ago
On a run from late September, I hit it twice in 100 attempts. So still there, but definitely something that's gotten worse over time.
Comment 38•11 years ago
I'm trying to find the thread that hangs. Unfortunately my first attempt fixed the bug. :) Just pushed another less chatty patch, we'll see what happens.
Comment 39•11 years ago
Good, because I actually can't go back much further without some sorcery due to the change to rev2 build slaves awhile back. Not impossible to work around, but definitely upping the difficulty level a few notches.
Updated•11 years ago
Whiteboard: [beta28][defect] p=0 → [beta28][defect] p=8
Comment 40•11 years ago
This addresses the issue in metro mochitest runs. In the while loop here we call NS_ProcessNextEvent with the default mayWait parameter of true, but during shutdown that's ignored, which results in a really tight loop and, I'm guessing, a perpetual lock on the event queue. Yielding in the loop frees things up.
In mochitest runs we were failing on average around 20% of the time. In these runs with the patch I get no failures -
https://tbpl.mozilla.org/?tree=Try&showall=0&rev=bb4a5ce2cc87
Pushed a full mochitest run as well to check for issues on other platforms -
https://tbpl.mozilla.org/?tree=Try&showall=0&rev=5c108f3a4345
Assignee: nobody → jmathies
Updated•11 years ago
Status: NEW → ASSIGNED
Priority: -- → P2
QA Contact: jbecerra
Updated•11 years ago
Attachment #8362692 -
Flags: review?(ehsan)
Comment 41•11 years ago
Comment on attachment 8362692 [details] [diff] [review]
loosen thread shutdown event loop v.1
Review of attachment 8362692 [details] [diff] [review]:
-----------------------------------------------------------------
I think Benjamin would want to review this himself.
::: xpcom/threads/nsThread.cpp
@@ +489,4 @@
> // Process events on the current thread until we receive a shutdown ACK.
> + while (!context.shutdownAck) {
> + NS_ProcessNextEvent(context.joiningThread, false);
> + PR_Sleep(PR_INTERVAL_NO_WAIT);
I would much prefer if you just explicitly called Sleep(0) here on Windows and did not change the behavior on other platforms.
Attachment #8362692 -
Flags: review?(ehsan) → review?(benjamin)
Comment 42•11 years ago
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #41)
> Comment on attachment 8362692 [details] [diff] [review]
> loosen thread shutdown event loop v.1
>
> Review of attachment 8362692 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> I think Benjamin would want to review this himself.
No problem, thanks for flagging him.
> I would much prefer if you just explicitly called Sleep(0) here on Windows
> and did not change the behavior on other platforms.
I think I'd like to keep this cross and see if it fixes the shutdown timeout problems we see on platforms besides Windows too.
Comment 43•11 years ago
Actually what concerns me about this bug is comment 40: "In the while loop here we call NS_ProcessNextEvent with the default mayWait parameter of true, but in shutdowns that's ignored."
Why is it ignored? That really doesn't make any sense: the whole point of this loop is to block on receiving a new XPCOM-or-native event. In shutdown we don't expect to receive any more native events, but we should continue to block for receiving XPCOM events.
Updated•11 years ago
Flags: needinfo?(jmathies)
Comment 44•11 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #43)
> Actually what concerns me about this bug is comment 40: "In the while loop
> here we call NS_ProcessNextEvent with the default mayWait parameter of true,
> but in shutdowns that's ignored."
>
> Why is it ignored? That really doesn't make any sense: the whole point of
> this loop is to block on receiving a new XPCOM-or-native event. In shutdown
> we don't expect to receive any more native events, but we should continue to
> block for receiving XPCOM events.
Looks like this stretches back to the original thread manager work in bug 326273 (1.61) -
http://bonsai.mozilla.org/cvslog.cgi?file=mozilla/xpcom/threads/nsThread.cpp&root=/cvsroot
http://hg.mozilla.org/mozilla-central/annotate/45a22c4901bb/xpcom/threads/nsThread.cpp#l629
I don't see any comments about this in the reviews. (You approved the original patch! :)
Flags: needinfo?(jmathies)
Comment 45•11 years ago
In this rev there was an mActive flag; then it looks like Darin updated that to a method in the landing.
https://bugzilla.mozilla.org/attachment.cgi?id=218120&action=diff
Generally speaking, it looks like we assumed not waiting was OK to do. So this event-processing loop has been 'tight' for a long time.
Comment 46•11 years ago
An alternative fix would be to remove the !ShuttingDown() check here, and allow GetEvent to wait. I'd bet that would fix this as well.
http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#629
Flags: needinfo?(benjamin)
Comment 47•11 years ago
I'm actually confused again. ShuttingDown() should never be true for the main-thread nsThread. It is only used when calling nsIThread::Shutdown for one of the non-main threads.
Are you saying that a non-main thread is in this very tight loop? If so, I think we should consider removing all of the calls to ShuttingDown() from ::ProcessNextEvent and see whether that causes any problems. Or, if it does cause problems, do it only for non-main threads. The widget system shouldn't be involved in this at all, if I'm understanding it correctly.
The sleep hack is very hacky and I'd like to avoid it if possible.
Flags: needinfo?(benjamin)
Comment 48•11 years ago
Hmm, yes, you're right about the loop. I misinterpreted the loop here as being on the non-main thread, but we're calling it with the joining thread, which is the main thread. Hence it should wait in GetEvent.
So now I'm totally confused as to why my patch fixes this. :) The PR_Sleep call triggers a yield, and that frees up the thread we're waiting on for some reason, which isn't happening without it. In the stacks, the main thread is sitting in GetEvent -> PR_Wait, which appears right. The remaining non-main threads are also caught up in GetEvent -> PR_Wait. Why they are stuck there after we've posted an event is unknown.
Unfortunately adding any debug output to the non-main thread fixes the problem as well, which makes debugging this kinda tough.
Comment 49•11 years ago
One thing I was able to debug. I added a flag in nsThreadShutdownEvent such that the main thread would know and report when MessageLoop::current()->Quit() was called. This wasn't happening. So for some reason that posted event never gets processed. Yielding the main thread makes it happen.
Updated•11 years ago
Attachment #8362692 -
Flags: review?(benjamin)
Comment 50•11 years ago
I think I understand what's going wrong here. Occasionally from within the NS_ProcessNextEvent call (which calls nsThread::ProcessNextEvent) for the main thread, we fire an nsIThreadObserver call to nsBaseAppShell::OnProcessNextEvent, and in this, we may process a native event which can result in a call to NS_ProcessPendingEvents. This is what I'm seeing in my logs and explains why the stacks look the way they do. The target non-main thread of the Shutdown call has already terminated and sent a nsThreadShutdownAckEvent to the main thread. This gets processed before the call to GetEvent in nsThread::ProcessNextEvent, and if the app doesn't receive any additional main thread events, we hang in GetEvent forever.
This may be a problem with our app shell implementation or it might be more common. It looks like desktop could also get caught up in this although the behavior there may be sufficiently different (additional main thread events received during the Shutdown calls?) such that it doesn't happen.
Updated•11 years ago
Attachment #8362692 -
Attachment is obsolete: true
Comment 51•11 years ago
Attachment #8365528 -
Flags: review?(netzen)
Updated•11 years ago
Attachment #8365528 -
Attachment is patch: true
Comment 52•11 years ago
This isn't necessary for the fix, but I threw it in here. While debugging I noticed we get some ICoreWindow events during shutdown, so I cleaned up our handlers.
Attachment #8365529 -
Flags: review?(netzen)
Updated•11 years ago
Attachment #8365529 -
Flags: review?(netzen) → review+
Updated•11 years ago
Attachment #8365528 -
Flags: review?(netzen) → review+
Comment 55•11 years ago
https://hg.mozilla.org/mozilla-central/rev/3cd6de67ba6b
https://hg.mozilla.org/mozilla-central/rev/1a6b54cc507b
When you feel that this has baked long enough on trunk, feel free to [checkin-needed-aurora] this and I'll land it a=metro-only :)
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 29
Comment 56•11 years ago
No occurrences on branches where this has landed :)
Comment 57•11 years ago
Comment on attachment 8365528 [details] [diff] [review]
don't process native events in shutdown v.1
[Approval Request Comment]
Bug caused by (feature/regressing bug #): been there from the beginning.
User impact if declined: hangs on shutdown / test runs.
Testing completed (on m-c, etc.): yes
Risk to taking this patch (and alternatives if risky): testing on mc indicates everything is working as expected. no user issues reported. I think this is safe to move up to 28.
String or IDL/UUID changes made by this patch: none.
Attachment #8365528 -
Flags: approval-mozilla-aurora?
Comment 58•11 years ago
Comment on attachment 8365529 [details] [diff] [review]
cleanup events in framework view
[Approval Request Comment]
Bug caused by (feature/regressing bug #): been there from the beginning.
User impact if declined: hangs on shutdown / test runs.
Testing completed (on m-c, etc.): yes
Risk to taking this patch (and alternatives if risky): testing on mc indicates everything is working as expected. no user issues reported. I think this is safe to move up to 28.
String or IDL/UUID changes made by this patch: none.
Attachment #8365529 -
Flags: approval-mozilla-aurora?
Updated•11 years ago
Attachment #8365528 -
Flags: approval-mozilla-aurora?
Comment 59•11 years ago
Comment on attachment 8365529 [details] [diff] [review]
cleanup events in framework view
We have a=metro-only while 28 is on Aurora :)
Attachment #8365529 -
Flags: approval-mozilla-aurora?
Updated•11 years ago
Whiteboard: [beta28][defect] p=8 → [beta28][defect] p=8 [checkin-needed-aurora a=metro-only]
Comment 60•11 years ago
https://hg.mozilla.org/releases/mozilla-aurora/rev/7e64908ef411
https://hg.mozilla.org/releases/mozilla-aurora/rev/5b9c5ed94b04
status-firefox28:
--- → fixed
status-firefox29:
--- → fixed
Whiteboard: [beta28][defect] p=8 [checkin-needed-aurora a=metro-only] → [beta28][defect] p=8
Comment 61•11 years ago
Could you please give some guidance in order for the QA to verify this? Thanks!
Flags: needinfo?(jmathies)
Comment 62•11 years ago
(In reply to Manuela Muntean [:Manuela] [QA] from comment #61)
> Could you please give some guidance in order for the QA to verify this?
> Thanks!
This has been verified in our test runs since the landing - no timeouts occurring. There's nothing else here for qa to verify.
Flags: needinfo?(jmathies)
Comment 63•11 years ago
Thanks, Jim! Marking this [qa-].
Whiteboard: [beta28][defect] p=8 → [beta28][defect] p=8 [qa-]