Thread manager shutdown hang when running metro mochitests

RESOLVED FIXED in Firefox 28

Status

P2
normal
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: jimm, Assigned: jimm)

Tracking

Trunk
Firefox 29
All
Windows 8.1
Dependency tree / graph

Firefox Tracking Flags

(firefox28 fixed, firefox29 fixed)

Details

(Whiteboard: [beta28][defect] p=8 [qa-])

Attachments

(2 attachments, 1 obsolete attachment)

(Assignee)

Description

5 years ago
+++ This bug was initially created as a clone of Bug #918759 +++

https://tbpl.mozilla.org/php/getParsedLog.php?id=28138006&tree=Mozilla-Inbound

Ubuntu ASAN VM 12.04 x64 mozilla-inbound opt test mochitest-1 on 2013-09-19 23:06:49 PDT for push 76aa567b4255

slave: tst-linux64-ec2-313

TEST-UNEXPECTED-FAIL | Shutdown | application timed out after 330 seconds with no output
Comment hidden (Treeherder Robot)
(Assignee)

Updated

5 years ago
Blocks: 950986
Component: General → Tests
Keywords: intermittent-failure
OS: Linux → Windows 8.1
Product: Core → Firefox for Metro
Summary: Intermittent TEST-UNEXPECTED-FAIL | Shutdown | application timed out after 330 seconds with no output (Ubuntu VM) → Thread manager shutdown hang when running metro mochitests
Whiteboard: [triage]
Version: 26 Branch → Trunk
(Assignee)

Updated

5 years ago
No longer blocks: 950986
(Assignee)

Updated

5 years ago
Blocks: 918759
Blocks: 838081
Whiteboard: [triage] → [triage] [defect] p=0
Blocks: 861680
No longer blocks: 838081
Whiteboard: [triage] [defect] p=0 → [defect] p=0
https://tbpl.mozilla.org/php/getParsedLog.php?id=32701025&tree=B2g-Inbound

Any idea what's going on here, Jim? This is happening more than the bug comments indicate due to this requiring manual starring.
(Assignee)

Comment 13

5 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #12)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=32701025&tree=B2g-Inbound
> 
> Any idea what's going on here, Jim? This is happening more than the bug
> comments indicate due to this requiring manual starring.

Not sure what's causing this, and I'm not sure how to debug it. Maybe repeated runs locally to try to reproduce and, if so, attach a debugger to the hang. AFAIK we aren't seeing these types of hangs on user machines, so this isn't blocking; as such it won't get much attention for a few weeks.
Given that we've tracked down the cause of the other mochitest hangs to a DOM Workers patch, is it conceivable that this is the same issue after all? See bug 957693.
(Assignee)

Comment 15

5 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #14)
> Given that we've tracked down the cause of the other mochitest hangs to a
> DOM Workers patch, is it conceivable that this is the same issue after all?
> See bug 957693.

Sure. Let's do that backout (or get a fix on the bug), land it, and see. It would be great if it was, although I think our timeouts stretch back farther than December.
Looking at the recent history in bug 918759, this appears to be far and away our most frequent mochitest shutdown hang. Can we please find someone to help look into this?
Flags: needinfo?(jmathies)
(Assignee)

Comment 18

5 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #17)
> Looking at the recent history in bug 918759, this appears to be far and away
> our most frequent mochitest shutdown hang. Can we please find someone to
> help look into this?

so I guess the thread worker changes didn't help?
Flags: needinfo?(jmathies)
For this hang, no.
I just did a bunch of mochitest-mc retriggers and we're hitting this timeout roughly 50% of the time. If we don't get a handle on this soon, we're going to have to hide the suite by default per the visibility guidelines.
(Assignee)

Updated

5 years ago
Whiteboard: [defect] p=0 → [triage][defect] p=0
(Assignee)

Comment 21

5 years ago
Did bug 957693 have any impact on the hangs we were seeing on other test suites?
Blocks: 838081
No longer blocks: 861680
(In reply to Jim Mathies [:jimm] from comment #21)
> Did bug 957693  have any impact on the hangs we were seeing on other test
> suites?

(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #19)
> For this hang, no.

That was an implicit "yes" to your question :). It made a big difference in the other suites.
(Assignee)

Comment 23

5 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #22)
> (In reply to Jim Mathies [:jimm] from comment #21)
> > Did bug 957693  have any impact on the hangs we were seeing on other test
> > suites?
> 
> (In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #19)
> > For this hang, no.
> 
> That was an implicit "yes" to your question :). It made a big difference in
> the other suites.

Ah joy, then it's probably not the same cause. My first guess would be omtc. Maybe we could turn off some big ticket items like omtc and apzc to see if we can isolate it.
Whiteboard: [triage][defect] p=0 → [beta28][defect] p=0
(Assignee)

Comment 24

5 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #22)
> (In reply to Jim Mathies [:jimm] from comment #21)
> > Did bug 957693  have any impact on the hangs we were seeing on other test
> > suites?
> 
> (In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #19)
> > For this hang, no.
> 
> That was an implicit "yes" to your question :). It made a big difference in
> the other suites.

It must not have fixed the problem 100% of the time; I'm still seeing a lot of non-metro failures.
(Assignee)

Comment 25

5 years ago
Pushed a patch that disables DOM workers; we'll start with the most likely culprit - 

https://tbpl.mozilla.org/?tree=Try&showall=0&rev=dfd97e2a88ce
(In reply to Jim Mathies [:jimm] from comment #24)
> It must not have fixed the problem 100% of the time, I'm still seeing a lot
> of non-metro failures.

No, of course it didn't. It's a false argument to say that it was supposed to. Bug 918759 existed long before the mid-December spike that we know was workers-related. You said yourself in this bug that this hang predates that spike as well.

Since that fix landed, we've already had another fix land for a completely different shutdown hang (bug 951908). And yes, there are other failures in that bug (that's what happens with dumping ground bugs), but until the Workers hang was fixed, they were drowned out by the noise.

That said, since bug 957693 landed, the failure rate in bug 918759 is less than *half* of what it was before. So yes, it made a dramatic improvement. Did it fix everything? No. Was it expected to? No.

The data doesn't lie; mochitest-mc is dominating these timeouts at this point.
http://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=918759&startday=2014-01-11&endday=2014-01-18&tree=trunk
(Assignee)

Comment 27

5 years ago
(In reply to Jim Mathies [:jimm] from comment #25)
> Pushed a patch that disables dom workers, we'll start with the most likely
> culprit - 
> 
> https://tbpl.mozilla.org/?tree=Try&showall=0&rev=dfd97e2a88ce

Disable omtc/apzc - 

https://tbpl.mozilla.org/?tree=Try&showall=0&rev=3bd15146ca1b
From what I can tell, this regressed right around 12/16. I do see some APZC and Metro stuff that landed around then. I'm going to try narrowing it down on Try. Will let you know.
(Assignee)

Comment 29

5 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #28)
> From what I can tell, this regressed right around 12/16. I do see some APZC
> and Metro stuff that landed around then. I'm going to try narrowing it down
> on Try. Will let you know.

It wasn't a regression, it was an unmasking. :)

https://bugzilla.mozilla.org/show_bug.cgi?id=920952
Ah, crap. I can apply that patch locally to try to bisect, though. Think that would work?

The interesting thing, FWIW, is that the patch from bug 920952 was uplifted to Aurora where we also run the tests, but we don't see the hangs there. So it still may be a Gecko29 regression (which started on 12-6). So the range may not be too bad.
(Assignee)

Comment 31

5 years ago
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #30)
> Ah, crap. I can apply that patch locally to try to bisect, though. Think
> that would work?
> 
> The interesting thing, FWIW, is that the patch from bug 920952 was uplifted
> to Aurora where we also run the tests, but we don't see the hangs there. So
> it still may be a Gecko29 regression (which started on 12-6). So the range
> may not be too bad.

Interesting, I didn't realize it was mc only. I was under the impression this went back a lot farther and affected all trees. We pushed that fflush change to Aurora as well, so it wouldn't be masked there.
Maybe it does. I see a few instances on Aurora too looking at the OF stats. I'll try applying that patch locally and see if I can find a range. Just might take longer than I was hoping :)
(Assignee)

Comment 33

5 years ago
(In reply to Jim Mathies [:jimm] from comment #27)
> (In reply to Jim Mathies [:jimm] from comment #25)
> > Pushed a patch that disables dom workers, we'll start with the most likely
> > culprit - 
> > 
> > https://tbpl.mozilla.org/?tree=Try&showall=0&rev=dfd97e2a88ce
> 
> Disable omtc/apzc - 
> 
> https://tbpl.mozilla.org/?tree=Try&showall=0&rev=3bd15146ca1b

Disabling workers helped, but it wasn't a silver bullet. Disabling apzc and omtc didn't seem to have an impact.
For posterity, APZC disabled failed 20% of the time and Workers disabled failed 17% of the time.
So far, on the earlier reverts (without bug 920952 applied on top), I'm hitting occasional 1000s timeouts, but mostly just bug 900641. Even if we assume that every 1000s timeout is what we're hitting in this bug now, we still weren't failing at anywhere near the rate we are now. So we may be able to investigate what has made this so much worse since December 16 too.
Actually, some of those runs are still hitting 15+%, so I think we're in the same ballpark the runs above would indicate.
On a run from late September, I hit it twice in 100 attempts. So still there, but definitely something that's gotten worse over time.
(Assignee)

Comment 38

5 years ago
I'm trying to find the thread that hangs. Unfortunately my first attempt fixed the bug. :) Just pushed another less chatty patch, we'll see what happens.
Good, because I actually can't go back much further without some sorcery due to the change to rev2 build slaves a while back. Not impossible to work around, but definitely upping the difficulty level a few notches.
Whiteboard: [beta28][defect] p=0 → [beta28][defect] p=8
(Assignee)

Comment 40

5 years ago
Created attachment 8362692 [details] [diff] [review]
loosen thread shutdown event loop v.1

This addresses the issue in metro mochitest runs. In the while loop here we call NS_ProcessNextEvent with the default mayWait parameter of true, but in shutdowns that's ignored, which results in a really tight loop and, I'm guessing, a perpetual lock on the event queue. Yielding in the loop frees things up.

In mochitest runs we were failing on average around 20% of the time. In these runs with the patch I get no failures -
https://tbpl.mozilla.org/?tree=Try&showall=0&rev=bb4a5ce2cc87

Pushed a full mochitest run as well to check for issues on other platforms - 
https://tbpl.mozilla.org/?tree=Try&showall=0&rev=5c108f3a4345
Assignee: nobody → jmathies
Blocks: 961497
No longer blocks: 838081
Status: NEW → ASSIGNED
Priority: -- → P2
QA Contact: jbecerra
(Assignee)

Updated

5 years ago
Attachment #8362692 - Flags: review?(ehsan)

Comment 41

5 years ago
Comment on attachment 8362692 [details] [diff] [review]
loosen thread shutdown event loop v.1

Review of attachment 8362692 [details] [diff] [review]:
-----------------------------------------------------------------

I think Benjamin would want to review this himself.

::: xpcom/threads/nsThread.cpp
@@ +489,4 @@
>    // Process events on the current thread until we receive a shutdown ACK.
> +  while (!context.shutdownAck) {
> +    NS_ProcessNextEvent(context.joiningThread, false);
> +    PR_Sleep(PR_INTERVAL_NO_WAIT);

I would much prefer if you just explicitly called Sleep(0) here on Windows and did not change the behavior on other platforms.
Attachment #8362692 - Flags: review?(ehsan) → review?(benjamin)
(Assignee)

Comment 42

5 years ago
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #41)
> Comment on attachment 8362692 [details] [diff] [review]
> loosen thread shutdown event loop v.1
> 
> Review of attachment 8362692 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I think Benjamin would want to review this himself.

No problem, thanks for flagging him.

> I would much prefer if you just explicitly called Sleep(0) here on Windows
> and did not change the behavior on other platforms.

I think I'd like to keep this cross and see if it fixes the shutdown timeout problems we see on platforms besides Windows too.
Actually what concerns me about this bug is comment 40: "In the while loop here we call NS_ProcessNextEvent with the default mayWait parameter of true, but in shutdowns that's ignored."

Why is it ignored? That really doesn't make any sense: the whole point of this loop is to block on receiving a new XPCOM-or-native event. In shutdown we don't expect to receive any more native events, but we should continue to block for receiving XPCOM events.

Updated

5 years ago
Flags: needinfo?(jmathies)
(Assignee)

Comment 44

5 years ago
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #43)
> Actually what concerns me about this bug is comment 40: "In the while loop
> here we call NS_ProcessNextEvent with the default mayWait parameter of true,
> but in shutdowns that's ignored."
> 
> Why is it ignored? That really doesn't make any sense: the whole point of
> this loop is to block on receiving a new XPCOM-or-native event. In shutdown
> we don't expect to receive any more native events, but we should continue to
> block for receiving XPCOM events.

Looks like this stretches back to the original thread manager work in bug 326273 (1.61) -

http://bonsai.mozilla.org/cvslog.cgi?file=mozilla/xpcom/threads/nsThread.cpp&root=/cvsroot

http://hg.mozilla.org/mozilla-central/annotate/45a22c4901bb/xpcom/threads/nsThread.cpp#l629

I don't see any comments about this in the reviews. (You approved the original patch! :)
Flags: needinfo?(jmathies)
(Assignee)

Comment 45

5 years ago
In this rev there was an mActive flag; then it looks like Darin updated that to a method in the landing.

https://bugzilla.mozilla.org/attachment.cgi?id=218120&action=diff

Generally speaking, it looks like we assumed not waiting was OK to do. So this event-processing loop has been 'tight' for a long time.
(Assignee)

Comment 46

5 years ago
An alternative fix would be to remove the !ShuttingDown() check here, and allow GetEvent to wait. I'd bet that would fix this as well.

http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#629
Flags: needinfo?(benjamin)
I'm actually confused again. ShuttingDown() should never be true for the main-thread nsThread. It is only used when calling nsIThread::Shutdown for one of the non-main threads.

Are you saying that a non-main thread is in this very tight loop? If so, I think we should consider removing all of the calls to ShuttingDown() from ::ProcessNextEvent and see whether that causes any problems. Or, if it does cause problems, do it only for non-main threads. The widget system shouldn't be involved in this at all, if I'm understanding it correctly.

The sleep hack is very hacky and I'd like to avoid it if possible.
Flags: needinfo?(benjamin)
(Assignee)

Comment 48

5 years ago
Hmm, yes, you're right about the loop. I misinterpreted the loop here as being on the non-main thread, but we're calling it with the joining thread, which is the main thread. Hence it should wait in GetEvent.

So now I'm totally confused as to why my patch fixes this. :) The PR_Sleep call triggers a yield, and that frees up the thread we're waiting on for some reason, which isn't happening without it. In the stacks, the main thread is sitting in GetEvent -> PR_Wait, which appears right. The remaining non-main threads are also caught up in GetEvent -> PR_Wait. Why they are stuck there after we've posted an event is unknown.

Unfortunately adding any debug output to the non-main thread fixes the problem as well, which makes debugging this kinda tough.
(Assignee)

Comment 49

5 years ago
One thing I was able to debug. I added a flag in nsThreadShutdownEvent such that the main thread would know and report when MessageLoop::current()->Quit() was called. This wasn't happening. So for some reason that posted event never gets processed. Yielding the main thread makes it happen.
(Assignee)

Updated

5 years ago
Attachment #8362692 - Flags: review?(benjamin)
(Assignee)

Comment 50

5 years ago
I think I understand what's going wrong here. Occasionally, from within the NS_ProcessNextEvent call (which calls nsThread::ProcessNextEvent) for the main thread, we fire an nsIThreadObserver call to nsBaseAppShell::OnProcessNextEvent, and in this we may process a native event, which can result in a call to NS_ProcessPendingEvents. This is what I'm seeing in my logs and explains why the stacks look the way they do. The target non-main thread of the Shutdown call has already terminated and sent an nsThreadShutdownAckEvent to the main thread. This gets processed before the call to GetEvent in nsThread::ProcessNextEvent, and if the app doesn't receive any additional main-thread events, we hang in GetEvent forever.

This may be a problem with our app shell implementation or it might be more common. It looks like desktop could also get caught up in this although the behavior there may be sufficiently different (additional main thread events received during the Shutdown calls?) such that it doesn't happen.
(Assignee)

Updated

5 years ago
Attachment #8362692 - Attachment is obsolete: true
(Assignee)

Comment 51

5 years ago
Created attachment 8365528 [details] [diff] [review]
don't process native events in shutdown  v.1
Attachment #8365528 - Flags: review?(netzen)
(Assignee)

Updated

5 years ago
Attachment #8365528 - Attachment is patch: true
(Assignee)

Comment 52

5 years ago
Created attachment 8365529 [details] [diff] [review]
cleanup events in framework view

This isn't necessary for this fix, but I threw it in here. While debugging I noticed we get some ICoreWindow events during shutdown, so I cleaned up our handlers.
Attachment #8365529 - Flags: review?(netzen)
Attachment #8365529 - Flags: review?(netzen) → review+
Attachment #8365528 - Flags: review?(netzen) → review+
https://hg.mozilla.org/mozilla-central/rev/3cd6de67ba6b
https://hg.mozilla.org/mozilla-central/rev/1a6b54cc507b

When you feel that this has baked long enough on trunk, feel free to [checkin-needed-aurora] this and I'll land it a=metro-only :)
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 29
No occurrences on branches where this has landed :)
(Assignee)

Comment 57

5 years ago
Comment on attachment 8365528 [details] [diff] [review]
don't process native events in shutdown  v.1

[Approval Request Comment]
Bug caused by (feature/regressing bug #): been there from the beginning.
User impact if declined: hangs on shutdown / test runs.
Testing completed (on m-c, etc.): yes
Risk to taking this patch (and alternatives if risky): testing on mc indicates everything is working as expected. no user issues reported. I think this is safe to move up to 28.
String or IDL/UUID changes made by this patch: none.
Attachment #8365528 - Flags: approval-mozilla-aurora?
(Assignee)

Comment 58

5 years ago
Comment on attachment 8365529 [details] [diff] [review]
cleanup events in framework view

[Approval Request Comment]
Bug caused by (feature/regressing bug #): been there from the beginning.
User impact if declined: hangs on shutdown / test runs.
Testing completed (on m-c, etc.): yes
Risk to taking this patch (and alternatives if risky): testing on mc indicates everything is working as expected. no user issues reported. I think this is safe to move up to 28.
String or IDL/UUID changes made by this patch: none.
Attachment #8365529 - Flags: approval-mozilla-aurora?
Attachment #8365528 - Flags: approval-mozilla-aurora?
Comment on attachment 8365529 [details] [diff] [review]
cleanup events in framework view

We have a=metro-only while 28 is on Aurora :)
Attachment #8365529 - Flags: approval-mozilla-aurora?
Whiteboard: [beta28][defect] p=8 → [beta28][defect] p=8 [checkin-needed-aurora a=metro-only]
https://hg.mozilla.org/releases/mozilla-aurora/rev/7e64908ef411
https://hg.mozilla.org/releases/mozilla-aurora/rev/5b9c5ed94b04
status-firefox28: --- → fixed
status-firefox29: --- → fixed
Whiteboard: [beta28][defect] p=8 [checkin-needed-aurora a=metro-only] → [beta28][defect] p=8
Could you please give some guidance in order for the QA to verify this? Thanks!
Flags: needinfo?(jmathies)
(Assignee)

Comment 62

5 years ago
(In reply to Manuela Muntean [:Manuela] [QA] from comment #61)
> Could you please give some guidance in order for the QA to verify this?
> Thanks!

This has been verified in our test runs since the landing - no timeouts occurring. There's nothing else here for qa to verify.
Flags: needinfo?(jmathies)
Thanks, Jim! Marking this [qa-].
Whiteboard: [beta28][defect] p=8 → [beta28][defect] p=8 [qa-]