Closed Bug 986113 Opened 6 years ago Closed 6 years ago

B2G child shutdown: MOZ_Assert: Assertion failure: work_queue_.empty(), at message_loop.cc:393

Categories

(Core :: Graphics, defect)

x86
macOS
defect
Not set

Tracking

()

RESOLVED WORKSFORME
blocking-b2g 1.4+
Tracking Status
e10s + ---

People

(Reporter: gwagner, Assigned: nical)

References

(Blocks 1 open bug)

Details

I have seen this a couple of times now on B2G debug builds. We also hit it during tests.
For example: https://tbpl.mozilla.org/php/getParsedLog.php?id=36455607&tree=B2g-Inbound&full=1#error2
blocking-b2g: --- → 1.4?
Please request gaia approval for 1.4 on this.
Meant aurora approval
Ehsan, can you give us more information about this assertions? How bad is it? Lets wait on the domain experts before we make a blocking decision.
Flags: needinfo?(ehsan)
It's very bad. :(

For one thing, it means that we leak those objects.  This is fine I guess if we are hitting this assertion during shutdown.

What's more worrying is that the work queue is not empty.  This means that some of the work queue items were never processed before the thread is shutting down, which means that some work which we expected to happen never does happen.  Gregor, can you reproduce this?  If yes, can you please catch this under the debugger and see what the type of the Task objects left in the work queue is?  That should give us an idea about the nature of the problem.

Note that backing out bug 976363 alone is not an option because it will turn mochitest-2 perma orange because of leaks.  That would also require backing out bug 935778 which I would like to avoid at all costs.
Flags: needinfo?(ehsan)
Flags: needinfo?(benjamin)
Flags: needinfo?(anygregor)
(Also note that this is a debug only assertion, and it just means that we now *know* about this problem.  The underlying problem has been happening all along regardless -- bug 976363 just helped uncover it.)
Does the assert come with a stack? I don't see any of the normal crashreporter stuff in the log. It would be nice to know what kind of threads are still running, if you catch this in a debugger.

If this is only at shutdown I'm not sure I care all that much; it probably doesn't need to block in that case.
I'm seeing this in e10s testing too.
Blocks: e10s-tests
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #6)
> Does the assert come with a stack? I don't see any of the normal
> crashreporter stuff in the log. It would be nice to know what kind of
> threads are still running, if you catch this in a debugger.
> 
> If this is only at shutdown I'm not sure I care all that much; it probably
> doesn't need to block in that case.

This happens on the emulator and we don't record stack traces for a child process after we run all the tests successfully. Andrew knows more about this.
Maybe we can try to get a stack trace on try with your patch that enables this?
Flags: needinfo?(anygregor) → needinfo?(ahalberstadt)
(In reply to comment #7)
> I'm seeing this in e10s testing too.

Do you have a stack and/or can you repro?
This is the output in e10s. I'm wondering if the assertion happens in the child process. The stack isn't too helpful.
https://tbpl.mozilla.org/php/getParsedLog.php?id=36478575&full=1&branch=holly#error0

There's also this very scary looking assertion:
firefox: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 || !robust' failed.
The patch in bug 969146 should get the crash stack dumped in the log, but bug 983489 is blocking me from landing it.
Flags: needinfo?(ahalberstadt)
(In reply to Andrew Halberstadt [:ahal] from comment #11)
> The patch in bug 969146 should get the crash stack dumped in the log, but
> bug 983489 is blocking me from landing it.

I can land the patch on pine but it doesn't apply currently.
I pushed the patch to pine. We have multiple crash stacks but I am not sure if we have one that shows this bug.

https://tbpl.mozilla.org/?tree=Pine&showall=1&rev=f8bd5adbe110
See Also: → 986836
(In reply to Bill McCloskey (:billm) from comment #10)
> This is the output in e10s. I'm wondering if the assertion happens in the
> child process. The stack isn't too helpful.
> https://tbpl.mozilla.org/php/getParsedLog.
> php?id=36478575&full=1&branch=holly#error0
> 
> There's also this very scary looking assertion:
> firefox: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion
> `(-(e)) != 3 || !robust' failed.

This stack is very interesting.  I think one process here is aborting from libc's abort(), and the other one is aborting with the assertion in the summary of this bug.  We have no stacks for the latter.  But the former process seems to be trying to post an IPC message to the MessageLoop *while* we're shutting down on the main thread:

18:42:58     INFO -  18  libxul.so!LockImpl::Lock() [lock_impl_posix.cc:3d266881bda3 : 41 + 0x4]
18:42:58     INFO -      rip = 0x00007fc2c94a055d   rsp = 0x00007fc2bab3e830
18:42:58     INFO -      rbp = 0x00007fc2d0213f3e
18:42:58     INFO -      Found by: stack scanning
18:42:58     INFO -  19  libxul.so!tracked_objects::Tracked::SetBirthPlace(tracked_objects::Location const&) [tracked.cc:3d266881bda3 : 73 + 0x4]
18:42:58     INFO -      rip = 0x00007fc2c94b044c   rsp = 0x00007fc2bab3e840
18:42:58     INFO -      rbp = 0x00007fc2d0213f3e
18:42:58     INFO -      Found by: stack scanning
18:42:58     INFO -  20  libxul.so!MessageLoop::PostTask_Helper(tracked_objects::Location const&, Task*, int, bool) [lock.h:3d266881bda3 : 43 + 0x4]
18:42:58     INFO -      rip = 0x00007fc2c94ad9a6   rsp = 0x00007fc2bab3e860
18:42:58     INFO -      rbp = 0x00007fc2d0213f3e
18:42:58     INFO -      Found by: stack scanning
18:42:58     INFO -  21  libxul.so!mozilla::ipc::MessageChannel::RefCountedTask::AddRef() [MessageChannel.h:3d266881bda3 : 386 + 0x18]
18:42:58     INFO -      rip = 0x00007fc2c94bcac6   rsp = 0x00007fc2bab3e8b0
18:42:58     INFO -      rbp = 0x00007fc2d0213f3e
18:42:58     INFO -      Found by: stack scanning
18:42:58     INFO -  22  libxul.so!mozilla::ipc::MessageChannel::OnMessageReceivedFromLink(IPC::Message const&) [MessageChannel.cpp:3d266881bda3 : 551 + 0x4]
18:42:58     INFO -      rip = 0x00007fc2c94c5f60   rsp = 0x00007fc2bab3e8e0
18:42:58     INFO -      rbp = 0x00007fc2d0213f3e
18:42:58     INFO -      Found by: stack scanning
18:42:58     INFO -  23  libxul.so!mozilla::ipc::ProcessLink::OnMessageReceived(IPC::Message const&) [MessageLink.cpp:3d266881bda3 : 259 + 0xb]
18:42:58     INFO -      rip = 0x00007fc2c94c60ce   rsp = 0x00007fc2bab3e940
18:42:58     INFO -      rbp = 0x00007fc2d0213f3e
18:42:58     INFO -      Found by: stack scanning
18:42:58     INFO -  24  libxul.so!IPC::Channel::ChannelImpl::ProcessIncomingMessages() [ipc_channel_posix.cc:3d266881bda3 : 608 + 0xc]
18:42:58     INFO -      rip = 0x00007fc2c94a5a79   rsp = 0x00007fc2bab3e970
18:42:58     INFO -      rbp = 0x00007fc2d0213f3e
18:42:58     INFO -      Found by: stack scanning

This is definitely broken.  It may or may not be the only problem to solve here, but why is the IPC thread doing work while the main thread is shutting down?
Flags: needinfo?(bent.mozilla)
(In reply to Gregor Wagner [:gwagner] from comment #13)
> I pushed the patch to pine. We have multiple crash stacks but I am not sure
> if we have one that shows this bug.
> 
> https://tbpl.mozilla.org/?tree=Pine&showall=1&rev=f8bd5adbe110

The assertions happen on most of the test runs, but I can't find a stack in the ones that I looked at.  Andrew, your patch seems to not be working as expected.
Flags: needinfo?(ahalberstadt)
This only seems to happen in some of the debug mochitests as far as I can see. The crashes in opt mochitest are bug 983489. The stacks in the debug mochitests seem to be bug 930884 which is odd because that bug should be fixed. Maybe you aren't seeing the stack for the MOZ_ASSERT because the libEGL one happens first?
Flags: needinfo?(ahalberstadt)
blocking-b2g: 1.4? → 1.4+
Assignee: nobody → overholt
(In reply to Andrew Halberstadt [:ahal] from comment #16)
> This only seems to happen in some of the debug mochitests as far as I can
> see. The crashes in opt mochitest are bug 983489. The stacks in the debug
> mochitests seem to be bug 930884 which is odd because that bug should be
> fixed. Maybe you aren't seeing the stack for the MOZ_ASSERT because the
> libEGL one happens first?

MOZ_ASSERT is a debug only macro, so it's totally expected to not see that in opt builds.  Looks like bug 930884 is not fixed?  I'm not really sure how to make progress here, we need more info before this bug is actionable.
Apparently bug 930884 was crashing in the parent process, and that got fixed, but that crash can still happen in a child process? I filed bug 987251 to continue investigating it.

To clarify, these two bugs are potentially blocking us from getting a crash stack. Though I'll see if I can reproduce this locally and try to figure a way of extracting a stack manually.
Depends on: 987251, 983489
Depends on: 986836
Flags: needinfo?(benjamin)
I tried to debug this locally and I got a different MOZ_ASSERT [1] on shutdown for which a stack trace *did* appear (in addition to the libEGL one). I'm not sure if this is related to the MOZ_ASSERT here or not.. but I'm confused why I would see a stack for that locally, but we wouldn't see a stack for this MOZ_ASSERT in production. Maybe the "check_for_crashes" call in my patch is happening too early, before gecko has completely shutdown and we are missing the crash. To test this theory I added a sleep:

https://tbpl.mozilla.org/?tree=Try&rev=110c73e019bb

[1] F/MOZ_Assert(  311): Assertion failure: mUsedShmems.empty(), at /home/ahal/hg/mozilla-central/gfx/layers/ipc/ISurfaceAllocator.cpp:51
(In reply to Andrew Halberstadt [:ahal] from comment #19)
> I tried to debug this locally and I got a different MOZ_ASSERT [1] on
> shutdown for which a stack trace *did* appear (in addition to the libEGL
> one). I'm not sure if this is related to the MOZ_ASSERT here or not.. but
> I'm confused why I would see a stack for that locally, but we wouldn't see a
> stack for this MOZ_ASSERT in production. Maybe the "check_for_crashes" call
> in my patch is happening too early, before gecko has completely shutdown and
> we are missing the crash. To test this theory I added a sleep:
> 
> https://tbpl.mozilla.org/?tree=Try&rev=110c73e019bb
> 
> [1] F/MOZ_Assert(  311): Assertion failure: mUsedShmems.empty(), at
> /home/ahal/hg/mozilla-central/gfx/layers/ipc/ISurfaceAllocator.cpp:51

I'm not sure if I can read this log properly?  Which process is each one of these assertions coming from?  The parent or the child?  Surely they must happen on two different processes as they're both fatal...
Flags: needinfo?(ahalberstadt)
Which log? I didn't paste any. I have no idea why I saw two crash stacks locally (though not for this MOZ_ASSERT) but that try run only seems to have one. I also don't know if fixing bug 987251 and bug 983489 will help expose the stack here or if we aren't seeing it for some other reason.
Flags: needinfo?(ahalberstadt)
(In reply to comment #21)
> Which log? I didn't paste any.

The logs in the try run!

> I have no idea why I saw two crash stacks
> locally (though not for this MOZ_ASSERT) but that try run only seems to have
> one. I also don't know if fixing bug 987251 and bug 983489 will help expose the
> stack here or if we aren't seeing it for some other reason.

Hmm.. Can you please help me understand which code is responsible for capturing the stacks here?  MOZ_ASSERT itself doesn't unfortunately.

FWIW, I'm guessing that fixing those bugs will fix the underlying issue here, because they both look like potential symptoms of the underlying problem here.  I don't think fixing either of those will get us stacks per se but that could be because my understanding of how we capture these stack traces is incorrect.
There's only one stack in the try logs :), I was talking about what I was seeing locally. I'll try to explain more clearly:

1) When MOZ_CRASHREPORTER_SHUTDOWN is set, we quit the application when a child process crash happens:
http://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#135

2) My patch in the try run, makes b2g mochitests pass in MOZ_CRASHREPORTER_SHUTDOWN:
https://hg.mozilla.org/try/rev/a9c2ddbf2832

3) When a crash happens, a minidump should be saved in 'path/to/profile/minidumps'. This is on crashreporter and has nothing to do with the test harnesses.

4) We call mozcrash.check_for_crashes() after a test run, on a timeout, or when there is an exception. This basically just checks for minidumps in the profile and passes them on to minidump_stackwalk (which is what prints the stack):
http://mxr.mozilla.org/mozilla-central/source/testing/mozbase/mozcrash/mozcrash/mozcrash.py#22

My theory is that the work_queue.empty MOZ_ASSERT causes a child process crash which triggers this line [1]. Then we hit bug 983489 which causes another crash in a different child process (?) and somehow erases the original minidump or prevents it from being stackwalked (?).

However when I ran debug mochitests locally, I witnessed two separate stacks in the log. This makes me question my original theory. In short, I don't know what is happening :).

[1] http://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#139
(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #14)
> This is definitely broken.  It may or may not be the only problem to solve
> here, but why is the IPC thread doing work while the main thread is shutting
> down?

Dunno, shutdown ordering is hard? :)

Currently I think we kill the IO thread here:

  http://mxr.mozilla.org/mozilla-central/source/xpcom/build/nsXPComInit.cpp#1034

Maybe it just needs to be moved earlier.
Flags: needinfo?(bent.mozilla)
This is still waiting on bug 986836 which Milan said is going to start moving at the earliest tomorrow.
I've been investigating this from the e10s end, and I think I have a better idea what's going on. The problem seems to happen when the child process tries to shut down the image bridge. We call ImageBridgeChild::DestroyBridge() from the main thread. That posts a task to the image bridge thread that calls StopImageBridgeSync. That function tries to send a sync message to the parent process.

All of this stuff happens very late in shutdown. If the parent then dies, the child's send call will fail and it will run some IPC to post a task somewhere to report the error. However, we never actually process that task because we're so late in shutdown. Instead, we end up with this assertion. The task that remains in the queue is MessageChannel::OnNotifyMaybeChannelError.

It's hard to know if the same problem is affecting b2g. However, I think both b2g and e10s would benefit from a fix for bug 924622, which proposes to fix the shutdown code in graphics so that it happens earlier. That would solve the image bridge bug I'm seeing and probably fix a bunch of other things.

Also, I should mention that it's pretty easy to reproduce this problem in e10s by putting sleep calls in the right places.
Depends on: 924622
Mass tracking-e10s flag change. Filter bugmail on "2be0fcce-e36a-4e2c-aa80-0e3d33eb5406".
tracking-e10s: --- → +
Gregor, there has been a bit of progress on the list of dependencies here.  Can you please retest this when you get a chance?  Thanks!
Flags: needinfo?(anygregor)
(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #28)
> Gregor, there has been a bit of progress on the list of dependencies here. 
> Can you please retest this when you get a chance?  Thanks!

Sure, I just merged to pine again. https://tbpl.mozilla.org/?tree=Pine&showall=1
Flags: needinfo?(anygregor)
Re-triage - this doesn't have proof of being a user facing crash on devices, it only happens in debug build situations & tests. So renoming to reevaluate.
blocking-b2g: 1.4+ → 1.4?
(In reply to Jason Smith [:jsmith] from comment #30)
> Re-triage - this doesn't have proof of being a user facing crash on devices,
> it only happens in debug build situations & tests. So renoming to reevaluate.

This might as well be a major source of orange like bug 965677.
Assignee: overholt → anygregor
blocking-b2g: 1.4? → 1.4+
They are all gfx related as far as I can tell. Milan is a better owner/driver here.
Assignee: anygregor → milan
Blocks: 997367
(In reply to Gregor Wagner [:gwagner] from comment #32)
> They are all gfx related as far as I can tell. Milan is a better
> owner/driver here.

So, does it mean, that "graphics" component would make sense? 
Milan, may we have your comments for this bug? Thank you.
Flags: needinfo?(milan)
It may very well be the case.  We are looking at shutdown issues in bug 924622, but it is slow going.
Flags: needinfo?(milan)
Component: IPC → Graphics
Milan

Are we fixing this issue in 1.4? We blocked on it so please help with next steps.
Flags: needinfo?(milan)
These shutdown issues are expensive to fix, we've been at it for weeks, with 7-8 landings that got backed out, so best I can offer is to keep working on it.  Bug 924622 is where the work is, if that works on central, then we can proceed.
Flags: needinfo?(milan)
Assigning to :nical as a placeholder once we clear bug 924622.
Assignee: milan → nical.bugzilla
Now that bug 924622 has landed in 2.0, is this assertion still present there?
Flags: needinfo?(anygregor)
(In reply to Milan Sreckovic [:milan] from comment #38)
> Now that bug 924622 has landed in 2.0, is this assertion still present there?

This assertion is no longer there.
We moved to bug 1017781 (https://tbpl.mozilla.org/php/getParsedLog.php?id=40756411&tree=Pine&full=1)
And bug 1017719 which is not gfx related.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(anygregor) → needinfo?(milan)
Resolution: --- → WORKSFORME
Flags: needinfo?(milan)
You need to log in before you can comment on or make changes to this bug.