Bug 951120 (Closed) - Opened 11 years ago, Closed 11 years ago

Thread manager shutdown hang when running metro mochitests

Categories

(Firefox for Metro Graveyard :: Tests, defect, P2)

All
Windows 8.1
defect

Tracking

(firefox28 fixed, firefox29 fixed)

RESOLVED FIXED
Firefox 29
Tracking Status
firefox28 --- fixed
firefox29 --- fixed

People

(Reporter: jimm, Assigned: jimm)

References

Details

(Whiteboard: [beta28][defect] p=8 [qa-])

Attachments

(2 files, 1 obsolete file)

+++ This bug was initially created as a clone of Bug #918759 +++

https://tbpl.mozilla.org/php/getParsedLog.php?id=28138006&tree=Mozilla-Inbound
Ubuntu ASAN VM 12.04 x64 mozilla-inbound opt test mochitest-1 on 2013-09-19 23:06:49 PDT for push 76aa567b4255
slave: tst-linux64-ec2-313

TEST-UNEXPECTED-FAIL | Shutdown | application timed out after 330 seconds with no output
Blocks: 950986
Component: General → Tests
OS: Linux → Windows 8.1
Product: Core → Firefox for Metro
Summary: Intermittent TEST-UNEXPECTED-FAIL | Shutdown | application timed out after 330 seconds with no output (Ubuntu VM) → Thread manager shutdown hang when running metro mochitests
Whiteboard: [triage]
Version: 26 Branch → Trunk
No longer blocks: 950986
Blocks: 918759
Whiteboard: [triage] → [triage] [defect] p=0
Blocks: metrobacklog
No longer blocks: metrov1backlog
Whiteboard: [triage] [defect] p=0 → [defect] p=0
https://tbpl.mozilla.org/php/getParsedLog.php?id=32701025&tree=B2g-Inbound

Any idea what's going on here, Jim? This is happening more often than the bug comments indicate, since these failures require manual starring.
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #12)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=32701025&tree=B2g-Inbound
>
> Any idea what's going on here, Jim? This is happening more than the bug
> comments indicate due to this requiring manual starring.

I'm not sure what's causing this, and I'm not sure how to debug it. Maybe repeated runs locally to try to reproduce, and if it does reproduce, attach a debugger to look at the hang. AFAIK we aren't seeing these types of hangs on user machines, so this isn't blocking; as such it won't get much attention for a few weeks.
Given that we've tracked down the cause of the other mochitest hangs to a DOM Workers patch, is it conceivable that this is the same issue after all? See bug 957693.
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #14)
> Given that we've tracked down the cause of the other mochitest hangs to a
> DOM Workers patch, is it conceivable that this is the same issue after all?
> See bug 957693.

Sure. Let's do that backout or get bent on the bug, land it, and see. It would be great if that were the cause, although I think our timeouts stretch back farther than December.
Looking at the recent history in bug 918759, this appears to be far and away our most frequent mochitest shutdown hang. Can we please find someone to help look into this?
Flags: needinfo?(jmathies)
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #17)
> Looking at the recent history in bug 918759, this appears to be far and away
> our most frequent mochitest shutdown hang. Can we please find someone to
> help look into this?

So I guess the thread worker changes didn't help?
Flags: needinfo?(jmathies)
For this hang, no.
I just did a bunch of mochitest-mc retriggers and we're hitting this timeout roughly 50% of the time. If we don't get a handle on this soon, we're going to have to hide the suite by default per the visibility guidelines.
Whiteboard: [defect] p=0 → [triage][defect] p=0
Did bug 957693 have any impact on the hangs we were seeing on other test suites?
Blocks: metrov1backlog
No longer blocks: metrobacklog
(In reply to Jim Mathies [:jimm] from comment #21)
> Did bug 957693 have any impact on the hangs we were seeing on other test
> suites?

(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #19)
> For this hang, no.

That was an implicit "yes" to your question :). It made a big difference in the other suites.
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #22)
> (In reply to Jim Mathies [:jimm] from comment #21)
> > Did bug 957693 have any impact on the hangs we were seeing on other test
> > suites?
>
> (In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #19)
> > For this hang, no.
>
> That was an implicit "yes" to your question :). It made a big difference in
> the other suites.

Ah joy, then it's probably not the same cause. My first guess would be omtc. Maybe we could turn off some big ticket items like omtc and apzc to see if we can isolate it.
Whiteboard: [triage][defect] p=0 → [beta28][defect] p=0
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #22)
> (In reply to Jim Mathies [:jimm] from comment #21)
> > Did bug 957693 have any impact on the hangs we were seeing on other test
> > suites?
>
> (In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #19)
> > For this hang, no.
>
> That was an implicit "yes" to your question :). It made a big difference in
> the other suites.

It must not have fixed the problem 100% of the time; I'm still seeing a lot of non-metro failures.
Pushed a patch that disables dom workers; we'll start with the most likely culprit -

https://tbpl.mozilla.org/?tree=Try&showall=0&rev=dfd97e2a88ce
(In reply to Jim Mathies [:jimm] from comment #24)
> It must not have fixed the problem 100% of the time, I'm still seeing a lot
> of non-metro failures.

No, of course it didn't. It's a false argument to say that it was supposed to. Bug 918759 existed long before the mid-December spike that we know was workers-related. You said yourself in this bug that this hang predates that spike as well. Since that fix landed, we've already had another fix land for a completely different shutdown hang (bug 951908). And yes, there are other failures in that bug (that's what happens with dumping ground bugs), but until the Workers hang was fixed, they were drowned out by the noise.

That said, since bug 957693 landed, the failure rate in bug 918759 is less than *half* of what it was before. So yes, it made a dramatic improvement. Did it fix everything? No. Was it expected to? No.

The data doesn't lie: mochitest-mc is dominating these timeouts at this point.
http://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=918759&startday=2014-01-11&endday=2014-01-18&tree=trunk
(In reply to Jim Mathies [:jimm] from comment #25)
> Pushed a patch that disables dom workers, we'll start with the most likely
> culprit -
>
> https://tbpl.mozilla.org/?tree=Try&showall=0&rev=dfd97e2a88ce

Disable omtc/apzc -

https://tbpl.mozilla.org/?tree=Try&showall=0&rev=3bd15146ca1b
From what I can tell, this regressed right around 12/16. I do see some APZC and Metro stuff that landed around then. I'm going to try narrowing it down on Try. Will let you know.
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #28)
> From what I can tell, this regressed right around 12/16. I do see some APZC
> and Metro stuff that landed around then. I'm going to try narrowing it down
> on Try. Will let you know.

It wasn't a regression, it was an unmasking. :)
https://bugzilla.mozilla.org/show_bug.cgi?id=920952
Ah, crap. I can apply that patch locally to try to bisect, though. Think that would work? The interesting thing, FWIW, is that the patch from bug 920952 was uplifted to Aurora where we also run the tests, but we don't see the hangs there. So it still may be a Gecko29 regression (which started on 12-6). So the range may not be too bad.
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #30)
> Ah, crap. I can apply that patch locally to try to bisect, though. Think
> that would work?
>
> The interesting thing, FWIW, is that the patch from bug 920952 was uplifted
> to Aurora where we also run the tests, but we don't see the hangs there. So
> it still may be a Gecko29 regression (which started on 12-6). So the range
> may not be too bad.

Interesting, I didn't realize it was mc only. I was under the impression this went back a lot farther and affected all trees. We pushed that fflush change to Aurora as well, so it wouldn't be masked there.
Maybe it does. I see a few instances on Aurora too looking at the OF stats. I'll try applying that patch locally and see if I can find a range. Just might take longer than I was hoping :)
(In reply to Jim Mathies [:jimm] from comment #27)
> (In reply to Jim Mathies [:jimm] from comment #25)
> > Pushed a patch that disables dom workers, we'll start with the most likely
> > culprit -
> >
> > https://tbpl.mozilla.org/?tree=Try&showall=0&rev=dfd97e2a88ce
>
> Disable omtc/apzc -
>
> https://tbpl.mozilla.org/?tree=Try&showall=0&rev=3bd15146ca1b

Disabling workers helped, but it wasn't a silver bullet. Disabling apzc and omtc didn't seem to have an impact.
For posterity, APZC disabled failed 20% of the time and Workers disabled failed 17% of the time.
So far, on the earlier reverts (without bug 920952 applied on top), I'm hitting occasional 1000s timeouts, but mostly just bug 900641. Even if we assume that every 1000s timeout is what we're hitting in this bug now, we still weren't failing at anywhere near the rate we are now. So we may be able to investigate what has made this so much worse since December 16 too.
Actually, some of those runs are hitting 15+% still, so I think we are in the same ballpark as the runs above would indicate we would expect to be.
On a run from late September, I hit it twice in 100 attempts. So still there, but definitely something that's gotten worse over time.
I'm trying to find the thread that hangs. Unfortunately my first attempt fixed the bug. :) Just pushed another, less chatty patch; we'll see what happens.
Good, because I actually can't go back much further without some sorcery due to the change to rev2 build slaves awhile back. Not impossible to work around, but definitely upping the difficulty level a few notches.
Whiteboard: [beta28][defect] p=0 → [beta28][defect] p=8
This addresses the issue in metro mochitest runs. In the while loop here we call NS_ProcessNextEvent with the default mayWait parameter of true, but in shutdowns that's ignored. This results in a really tight loop and, I'm guessing, a perpetual lock on the event queue. Yielding in the loop frees things up.

In mochitest runs we were failing on average around 20% of the time. In these runs with the patch I get no failures:
https://tbpl.mozilla.org/?tree=Try&showall=0&rev=bb4a5ce2cc87

Pushed a full mochitest run as well to check for issues on other platforms:
https://tbpl.mozilla.org/?tree=Try&showall=0&rev=5c108f3a4345
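For readers following along, the heart of the patch (the same lines quoted in Ehsan's review below) looks roughly like this; the closing brace and the trailing annotations are mine, not part of the diff:

  // Process events on the current thread until we receive a shutdown ACK.
  while (!context.shutdownAck) {
    NS_ProcessNextEvent(context.joiningThread, false);  // mayWait = false
    PR_Sleep(PR_INTERVAL_NO_WAIT);                      // yield so the loop doesn't spin at full speed
  }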
Assignee: nobody → jmathies
Blocks: metrov1it23
No longer blocks: metrov1backlog
Status: NEW → ASSIGNED
Priority: -- → P2
QA Contact: jbecerra
Attachment #8362692 - Flags: review?(ehsan)
Comment on attachment 8362692 [details] [diff] [review]
loosen thread shutdown event loop v.1

Review of attachment 8362692 [details] [diff] [review]:
-----------------------------------------------------------------

I think Benjamin would want to review this himself.

::: xpcom/threads/nsThread.cpp
@@ +489,4 @@
>   // Process events on the current thread until we receive a shutdown ACK.
> + while (!context.shutdownAck) {
> +   NS_ProcessNextEvent(context.joiningThread, false);
> +   PR_Sleep(PR_INTERVAL_NO_WAIT);

I would much prefer if you just explicitly called Sleep(0) here on Windows and did not change the behavior on other platforms.
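One possible reading of that suggestion, sketched for illustration only (no such patch was attached here, and the exact shape is an assumption; XP_WIN is the usual Gecko define for Windows-only code):

  while (!context.shutdownAck) {
#ifdef XP_WIN
    NS_ProcessNextEvent(context.joiningThread, false);
    ::Sleep(0);  // explicit yield on Windows only
#else
    NS_ProcessNextEvent(context.joiningThread);  // unchanged: mayWait defaults to true
#endif
  }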
Attachment #8362692 - Flags: review?(ehsan) → review?(benjamin)
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #41)
> Comment on attachment 8362692 [details] [diff] [review]
> loosen thread shutdown event loop v.1
>
> Review of attachment 8362692 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> I think Benjamin would want to review this himself.

No problem, thanks for flagging him.

> I would much prefer if you just explicitly called Sleep(0) here on Windows
> and did not change the behavior on other platforms.

I think I'd like to keep this cross-platform and see if it fixes the shutdown timeout problems we see on platforms besides Windows too.
Actually what concerns me about this bug is comment 40: "In the while loop here we call NS_ProcessNextEvent with the default mayWait parameter of true, but in shutdowns that's ignored."

Why is it ignored? That really doesn't make any sense: the whole point of this loop is to block on receiving a new XPCOM-or-native event. In shutdown we don't expect to receive any more native events, but we should continue to block for receiving XPCOM events.
Flags: needinfo?(jmathies)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #43)
> Actually what concerns me about this bug is comment 40: "In the while loop
> here we call NS_ProcessNextEvent with the default mayWait parameter of true,
> but in shutdowns that's ignored."
>
> Why is it ignored? That really doesn't make any sense: the whole point of
> this loop is to block on receiving a new XPCOM-or-native event. In shutdown
> we don't expect to receive any more native events, but we should continue to
> block for receiving XPCOM events.

Looks like this stretches back to the original thread manager work in bug 326273 (1.61):
http://bonsai.mozilla.org/cvslog.cgi?file=mozilla/xpcom/threads/nsThread.cpp&root=/cvsroot
http://hg.mozilla.org/mozilla-central/annotate/45a22c4901bb/xpcom/threads/nsThread.cpp#l629

I don't see any comments about this in the reviews. (You approved the original patch! :)
Flags: needinfo?(jmathies)
In this rev there was an mActive flag; then it looks like Darin updated that to a method in the landing.

https://bugzilla.mozilla.org/attachment.cgi?id=218120&action=diff

Generally speaking it looks like we assumed that not waiting here was OK to do. So this event processing loop has been 'tight' for a long time.
An alternative fix would be to remove the !ShuttingDown() check here, and allow GetEvent to wait. I'd bet that would fix this as well.

http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#629
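For context, a paraphrased sketch of the spot that MXR link points at; the exact lines in the tree may differ, and the local names here are illustrative, but ShuttingDown() and GetEvent are the pieces being discussed:

  // nsThread::ProcessNextEvent (paraphrased): during shutdown the
  // !ShuttingDown() term forces a non-blocking GetEvent, which is what turns
  // the Shutdown join loop into a tight spin.
  nsCOMPtr<nsIRunnable> event;
  mEvents.GetEvent(mayWait && !ShuttingDown(), getter_AddRefs(event));

  // The alternative fix floated above would drop the !ShuttingDown() term so
  // GetEvent could block waiting for the shutdown ACK:
  //   mEvents.GetEvent(mayWait, getter_AddRefs(event));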
Flags: needinfo?(benjamin)
I'm actually confused again. ShuttingDown() should never be true for the main-thread nsThread. It is only used when calling nsIThread::Shutdown for one of the non-main threads. Are you saying that a non-main thread is in this very tight loop? If so, I think we should consider removing all of the calls to ShuttingDown() from ::ProcessNextEvent and see whether that causes any problems. Or, if it does cause problems, do it only for non-main threads. The widget system shouldn't be involved in this at all, if I'm understanding it correctly. The sleep hack is very hacky and I'd like to avoid it if possible.
Flags: needinfo?(benjamin)
Hmm, yes, you're right about the loop. I misinterpreted the loop here as running on the non-main thread, but we're calling it with the joining thread, which is the main thread. Hence it should wait in GetEvent.

So now I'm totally confused as to why my patch fixes this. :) The PR_Sleep call triggers a yield, and that frees up the thread we're waiting on for some reason, which isn't happening without it.

In the stacks, the main thread is sitting in GetEvent -> PR_Wait, which appears right. The remaining non-main threads are also caught up in GetEvent -> PR_Wait. Why they are stuck there after we've posted an event is unknown. Unfortunately adding any debug output to the non-main thread fixes the problem as well, which makes debugging this kinda tough.
One thing I was able to debug. I added a flag in nsThreadShutdownEvent such that the main thread would know and report when MessageLoop::current()->Quit() was called. This wasn't happening. So for some reason that posted event never gets processed. Yielding the main thread makes it happen.
Attachment #8362692 - Flags: review?(benjamin)
I think I understand what's going wrong here. Occasionally from within the NS_ProcessNextEvent call (which calls nsThread::ProcessNextEvent) for the main thread, we fire an nsIThreadObserver call to nsBaseAppShell::OnProcessNextEvent, and in this, we may process a native event which can result in a call to NS_ProcessPendingEvents. This is what I'm seeing in my logs and explains why the stacks look the way they do.

The target non-main thread of the Shutdown call has already terminated and sent a nsThreadShutdownAckEvent to the main thread. This gets processed before the call to GetEvent in nsThread::ProcessNextEvent, and if the app doesn't receive any additional main thread events, we hang in GetEvent forever.

This may be a problem with our app shell implementation or it might be more common. It looks like desktop could also get caught up in this, although the behavior there may be sufficiently different (additional main thread events received during the Shutdown calls?) such that it doesn't happen.
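To make that interleaving easier to follow, here is a minimal, self-contained C++ simulation of the pattern described above. It is not Gecko code: EventQueue, WaitForEvent, and DrainPending are made-up stand-ins for the main thread's event queue, GetEvent with mayWait = true, and the OnProcessNextEvent / NS_ProcessPendingEvents re-entrancy.

  // The "ACK" the join loop is waiting for gets consumed by a nested drain,
  // so a subsequent blocking wait would never wake up.
  #include <condition_variable>
  #include <deque>
  #include <functional>
  #include <iostream>
  #include <mutex>
  #include <thread>

  struct EventQueue {
    std::mutex m;
    std::condition_variable cv;
    std::deque<std::function<void()>> q;

    void Post(std::function<void()> e) {
      { std::lock_guard<std::mutex> lk(m); q.push_back(std::move(e)); }
      cv.notify_one();
    }

    // Blocking wait for the next event (stand-in for GetEvent, mayWait = true).
    std::function<void()> WaitForEvent() {
      std::unique_lock<std::mutex> lk(m);
      cv.wait(lk, [this] { return !q.empty(); });  // blocks until an event exists
      auto e = std::move(q.front());
      q.pop_front();
      return e;
    }

    // Non-blocking drain (stand-in for NS_ProcessPendingEvents).
    void DrainPending() {
      for (;;) {
        std::function<void()> e;
        {
          std::lock_guard<std::mutex> lk(m);
          if (q.empty()) return;
          e = std::move(q.front());
          q.pop_front();
        }
        e();
      }
    }
  };

  int main() {
    EventQueue mainQueue;
    bool shutdownAck = false;

    // The thread being shut down finishes and posts its "shutdown ACK" event.
    std::thread target([&] { mainQueue.Post([&] { shutdownAck = true; }); });
    target.join();  // ensure the ACK is already queued, as in the bug

    // Main-thread join loop, analogous to "while (!context.shutdownAck) ...".
    while (!shutdownAck) {
      // Nested drain fired from the observer hook: this is where the ACK can
      // be consumed, flipping shutdownAck *before* the blocking wait below.
      mainQueue.DrainPending();

      // Without this re-check, the loop would fall through to WaitForEvent()
      // even though the ACK was already handled; with no further events it
      // would block forever, which is the hang seen in the metro mochitests.
      if (shutdownAck) break;

      auto e = mainQueue.WaitForEvent();
      e();
    }

    std::cout << "shutdown ACK observed; no hang\n";
    return 0;
  }

The re-check before the blocking wait is the part the real loop was missing: once the nested drain has consumed the ACK, nothing else is coming, so blocking at that point matches the GetEvent -> PR_Wait stacks quoted earlier.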
Attachment #8362692 - Attachment is obsolete: true
Attachment #8365528 - Flags: review?(netzen)
Attachment #8365528 - Attachment is patch: true
This isn't necessary for the fix, but I threw it in here. While debugging I noticed we get some ICoreWindow events during shutdown, so I cleaned up our handlers.
Attachment #8365529 - Flags: review?(netzen)
Attachment #8365529 - Flags: review?(netzen) → review+
Attachment #8365528 - Flags: review?(netzen) → review+
https://hg.mozilla.org/mozilla-central/rev/3cd6de67ba6b
https://hg.mozilla.org/mozilla-central/rev/1a6b54cc507b

When you feel that this has baked long enough on trunk, feel free to [checkin-needed-aurora] this and I'll land it a=metro-only :)
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 29
No occurrences on branches where this has landed :)
Comment on attachment 8365528 [details] [diff] [review]
don't process native events in shutdown v.1

[Approval Request Comment]
Bug caused by (feature/regressing bug #): been there from the beginning.
User impact if declined: hangs on shutdown / test runs.
Testing completed (on m-c, etc.): yes
Risk to taking this patch (and alternatives if risky): testing on mc indicates everything is working as expected. no user issues reported. I think this is safe to move up to 28.
String or IDL/UUID changes made by this patch: none.
Attachment #8365528 - Flags: approval-mozilla-aurora?
Comment on attachment 8365529 [details] [diff] [review]
cleanup events in framework view

[Approval Request Comment]
Bug caused by (feature/regressing bug #): been there from the beginning.
User impact if declined: hangs on shutdown / test runs.
Testing completed (on m-c, etc.): yes
Risk to taking this patch (and alternatives if risky): testing on mc indicates everything is working as expected. no user issues reported. I think this is safe to move up to 28.
String or IDL/UUID changes made by this patch: none.
Attachment #8365529 - Flags: approval-mozilla-aurora?
Attachment #8365528 - Flags: approval-mozilla-aurora?
Comment on attachment 8365529 [details] [diff] [review]
cleanup events in framework view

We have a=metro-only while 28 is on Aurora :)
Attachment #8365529 - Flags: approval-mozilla-aurora?
Whiteboard: [beta28][defect] p=8 → [beta28][defect] p=8 [checkin-needed-aurora a=metro-only]
Could you please give some guidance in order for the QA to verify this? Thanks!
Flags: needinfo?(jmathies)
(In reply to Manuela Muntean [:Manuela] [QA] from comment #61)
> Could you please give some guidance in order for the QA to verify this?
> Thanks!

This has been verified in our test runs since the landing - no timeouts occurring. There's nothing else here for qa to verify.
Flags: needinfo?(jmathies)
Thanks, Jim! Marking this [qa-].
Whiteboard: [beta28][defect] p=8 → [beta28][defect] p=8 [qa-]