Closed
Bug 986113
Opened 11 years ago
Closed 10 years ago
B2G child shutdown: MOZ_Assert: Assertion failure: work_queue_.empty(), at message_loop.cc:393
Categories
(Core :: Graphics, defect)
Tracking
()
People
(Reporter: gwagner, Assigned: nical)
References
(Blocks 1 open bug)
Details
I have seen this a couple of times now on B2G debug builds. We also hit it during tests.
For example: https://tbpl.mozilla.org/php/getParsedLog.php?id=36455607&tree=B2g-Inbound&full=1#error2
Reporter | ||
Updated•11 years ago
|
blocking-b2g: --- → 1.4?
Comment 1•11 years ago
|
||
Please request gaia approval for 1.4 on this.
Comment 2•11 years ago
|
||
Meant aurora approval
Reporter | ||
Comment 3•11 years ago
|
||
Ehsan, can you give us more information about this assertions? How bad is it? Lets wait on the domain experts before we make a blocking decision.
Flags: needinfo?(ehsan)
Comment 4•11 years ago
|
||
It's very bad. :(
For one thing, it means that we leak those objects. This is fine I guess if we are hitting this assertion during shutdown.
What's more worrying is that the work queue is not empty. This means that some of the work queue items were never processed before the thread is shutting down, which means that some work which we expected to happen never does happen. Gregor, can you reproduce this? If yes, can you please catch this under the debugger and see what the type of the Task objects left in the work queue is? That should give us an idea about the nature of the problem.
Note that backing out bug 976363 alone is not an option because it will turn mochitest-2 perma orange because of leaks. That would also require backing out bug 935778 which I would like to avoid at all costs.
Flags: needinfo?(ehsan)
Flags: needinfo?(benjamin)
Flags: needinfo?(anygregor)
Comment 5•11 years ago
|
||
(Also note that this is a debug only assertion, and it just means that we now *know* about this problem. The underlying problem has been happening all along regardless -- bug 976363 just helped uncover it.)
Comment 6•11 years ago
|
||
Does the assert come with a stack? I don't see any of the normal crashreporter stuff in the log. It would be nice to know what kind of threads are still running, if you catch this in a debugger.
If this is only at shutdown I'm not sure I care all that much; it probably doesn't need to block in that case.
I'm seeing this in e10s testing too.
Blocks: e10s-tests
Reporter | ||
Comment 8•11 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #6)
> Does the assert come with a stack? I don't see any of the normal
> crashreporter stuff in the log. It would be nice to know what kind of
> threads are still running, if you catch this in a debugger.
>
> If this is only at shutdown I'm not sure I care all that much; it probably
> doesn't need to block in that case.
This happens on the emulator and we don't record stack traces for a child process after we run all the tests successfully. Andrew knows more about this.
Maybe we can try to get a stack trace on try with your patch that enables this?
Flags: needinfo?(anygregor) → needinfo?(ahalberstadt)
Comment 9•11 years ago
|
||
(In reply to comment #7)
> I'm seeing this in e10s testing too.
Do you have a stack and/or can you repro?
This is the output in e10s. I'm wondering if the assertion happens in the child process. The stack isn't too helpful.
https://tbpl.mozilla.org/php/getParsedLog.php?id=36478575&full=1&branch=holly#error0
There's also this very scary looking assertion:
firefox: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 || !robust' failed.
Comment 11•11 years ago
|
||
The patch in bug 969146 should get the crash stack dumped in the log, but bug 983489 is blocking me from landing it.
Flags: needinfo?(ahalberstadt)
Reporter | ||
Comment 12•11 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #11)
> The patch in bug 969146 should get the crash stack dumped in the log, but
> bug 983489 is blocking me from landing it.
I can land the patch on pine but it doesn't apply currently.
Reporter | ||
Comment 13•11 years ago
|
||
I pushed the patch to pine. We have multiple crash stacks but I am not sure if we have one that shows this bug.
https://tbpl.mozilla.org/?tree=Pine&showall=1&rev=f8bd5adbe110
Comment 14•11 years ago
|
||
(In reply to Bill McCloskey (:billm) from comment #10)
> This is the output in e10s. I'm wondering if the assertion happens in the
> child process. The stack isn't too helpful.
> https://tbpl.mozilla.org/php/getParsedLog.
> php?id=36478575&full=1&branch=holly#error0
>
> There's also this very scary looking assertion:
> firefox: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion
> `(-(e)) != 3 || !robust' failed.
This stack is very interesting. I think one process here is aborting from libc's abort(), and the other one is aborting with the assertion in the summary of this bug. We have no stacks for the latter. But the former process seems to be trying to post an IPC message to the MessageLoop *while* we're shutting down on the main thread:
18:42:58 INFO - 18 libxul.so!LockImpl::Lock() [lock_impl_posix.cc:3d266881bda3 : 41 + 0x4]
18:42:58 INFO - rip = 0x00007fc2c94a055d rsp = 0x00007fc2bab3e830
18:42:58 INFO - rbp = 0x00007fc2d0213f3e
18:42:58 INFO - Found by: stack scanning
18:42:58 INFO - 19 libxul.so!tracked_objects::Tracked::SetBirthPlace(tracked_objects::Location const&) [tracked.cc:3d266881bda3 : 73 + 0x4]
18:42:58 INFO - rip = 0x00007fc2c94b044c rsp = 0x00007fc2bab3e840
18:42:58 INFO - rbp = 0x00007fc2d0213f3e
18:42:58 INFO - Found by: stack scanning
18:42:58 INFO - 20 libxul.so!MessageLoop::PostTask_Helper(tracked_objects::Location const&, Task*, int, bool) [lock.h:3d266881bda3 : 43 + 0x4]
18:42:58 INFO - rip = 0x00007fc2c94ad9a6 rsp = 0x00007fc2bab3e860
18:42:58 INFO - rbp = 0x00007fc2d0213f3e
18:42:58 INFO - Found by: stack scanning
18:42:58 INFO - 21 libxul.so!mozilla::ipc::MessageChannel::RefCountedTask::AddRef() [MessageChannel.h:3d266881bda3 : 386 + 0x18]
18:42:58 INFO - rip = 0x00007fc2c94bcac6 rsp = 0x00007fc2bab3e8b0
18:42:58 INFO - rbp = 0x00007fc2d0213f3e
18:42:58 INFO - Found by: stack scanning
18:42:58 INFO - 22 libxul.so!mozilla::ipc::MessageChannel::OnMessageReceivedFromLink(IPC::Message const&) [MessageChannel.cpp:3d266881bda3 : 551 + 0x4]
18:42:58 INFO - rip = 0x00007fc2c94c5f60 rsp = 0x00007fc2bab3e8e0
18:42:58 INFO - rbp = 0x00007fc2d0213f3e
18:42:58 INFO - Found by: stack scanning
18:42:58 INFO - 23 libxul.so!mozilla::ipc::ProcessLink::OnMessageReceived(IPC::Message const&) [MessageLink.cpp:3d266881bda3 : 259 + 0xb]
18:42:58 INFO - rip = 0x00007fc2c94c60ce rsp = 0x00007fc2bab3e940
18:42:58 INFO - rbp = 0x00007fc2d0213f3e
18:42:58 INFO - Found by: stack scanning
18:42:58 INFO - 24 libxul.so!IPC::Channel::ChannelImpl::ProcessIncomingMessages() [ipc_channel_posix.cc:3d266881bda3 : 608 + 0xc]
18:42:58 INFO - rip = 0x00007fc2c94a5a79 rsp = 0x00007fc2bab3e970
18:42:58 INFO - rbp = 0x00007fc2d0213f3e
18:42:58 INFO - Found by: stack scanning
This is definitely broken. It may or may not be the only problem to solve here, but why is the IPC thread doing work while the main thread is shutting down?
Flags: needinfo?(bent.mozilla)
Comment 15•11 years ago
|
||
(In reply to Gregor Wagner [:gwagner] from comment #13)
> I pushed the patch to pine. We have multiple crash stacks but I am not sure
> if we have one that shows this bug.
>
> https://tbpl.mozilla.org/?tree=Pine&showall=1&rev=f8bd5adbe110
The assertions happen on most of the test runs, but I can't find a stack in the ones that I looked at. Andrew, your patch seems to not be working as expected.
Flags: needinfo?(ahalberstadt)
Comment 16•11 years ago
|
||
This only seems to happen in some of the debug mochitests as far as I can see. The crashes in opt mochitest are bug 983489. The stacks in the debug mochitests seem to be bug 930884 which is odd because that bug should be fixed. Maybe you aren't seeing the stack for the MOZ_ASSERT because the libEGL one happens first?
Flags: needinfo?(ahalberstadt)
Updated•11 years ago
|
blocking-b2g: 1.4? → 1.4+
Updated•11 years ago
|
Assignee: nobody → overholt
Comment 17•11 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #16)
> This only seems to happen in some of the debug mochitests as far as I can
> see. The crashes in opt mochitest are bug 983489. The stacks in the debug
> mochitests seem to be bug 930884 which is odd because that bug should be
> fixed. Maybe you aren't seeing the stack for the MOZ_ASSERT because the
> libEGL one happens first?
MOZ_ASSERT is a debug only macro, so it's totally expected to not see that in opt builds. Looks like bug 930884 is not fixed? I'm not really sure how to make progress here, we need more info before this bug is actionable.
Comment 18•11 years ago
|
||
Apparently bug 930884 was crashing in the parent process, and that got fixed, but that crash can still happen in a child process? I filed bug 987251 to continue investigating it.
To clarify, these two bugs are potentially blocking us from getting a crash stack. Though I'll see if I can reproduce this locally and try to figure a way of extracting a stack manually.
Updated•11 years ago
|
Flags: needinfo?(benjamin)
Comment 19•11 years ago
|
||
I tried to debug this locally and I got a different MOZ_ASSERT [1] on shutdown for which a stack trace *did* appear (in addition to the libEGL one). I'm not sure if this is related to the MOZ_ASSERT here or not.. but I'm confused why I would see a stack for that locally, but we wouldn't see a stack for this MOZ_ASSERT in production. Maybe the "check_for_crashes" call in my patch is happening too early, before gecko has completely shutdown and we are missing the crash. To test this theory I added a sleep:
https://tbpl.mozilla.org/?tree=Try&rev=110c73e019bb
[1] F/MOZ_Assert( 311): Assertion failure: mUsedShmems.empty(), at /home/ahal/hg/mozilla-central/gfx/layers/ipc/ISurfaceAllocator.cpp:51
Comment 20•11 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #19)
> I tried to debug this locally and I got a different MOZ_ASSERT [1] on
> shutdown for which a stack trace *did* appear (in addition to the libEGL
> one). I'm not sure if this is related to the MOZ_ASSERT here or not.. but
> I'm confused why I would see a stack for that locally, but we wouldn't see a
> stack for this MOZ_ASSERT in production. Maybe the "check_for_crashes" call
> in my patch is happening too early, before gecko has completely shutdown and
> we are missing the crash. To test this theory I added a sleep:
>
> https://tbpl.mozilla.org/?tree=Try&rev=110c73e019bb
>
> [1] F/MOZ_Assert( 311): Assertion failure: mUsedShmems.empty(), at
> /home/ahal/hg/mozilla-central/gfx/layers/ipc/ISurfaceAllocator.cpp:51
I'm not sure if I can read this log properly? Which process is each one of these assertions coming from? The parent or the child? Surely they must happen on two different processes as they're both fatal...
Updated•11 years ago
|
Flags: needinfo?(ahalberstadt)
Comment 21•11 years ago
|
||
Which log? I didn't paste any. I have no idea why I saw two crash stacks locally (though not for this MOZ_ASSERT) but that try run only seems to have one. I also don't know if fixing bug 987251 and bug 983489 will help expose the stack here or if we aren't seeing it for some other reason.
Flags: needinfo?(ahalberstadt)
Comment 22•11 years ago
|
||
(In reply to comment #21)
> Which log? I didn't paste any.
The logs in the try run!
> I have no idea why I saw two crash stacks
> locally (though not for this MOZ_ASSERT) but that try run only seems to have
> one. I also don't know if fixing bug 987251 and bug 983489 will help expose the
> stack here or if we aren't seeing it for some other reason.
Hmm.. Can you please help me understand which code is responsible for capturing the stacks here? MOZ_ASSERT itself doesn't unfortunately.
FWIW, I'm guessing that fixing those bugs will fix the underlying issue here, because they both look like potential symptoms of the underlying problem here. I don't think fixing either of those will get us stacks per se but that could be because my understanding of how we capture these stack traces is incorrect.
Comment 23•11 years ago
|
||
There's only one stack in the try logs :), I was talking about what I was seeing locally. I'll try to explain more clearly:
1) When MOZ_CRASHREPORTER_SHUTDOWN is set, we quit the application when a child process crash happens:
http://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#135
2) My patch in the try run, makes b2g mochitests pass in MOZ_CRASHREPORTER_SHUTDOWN:
https://hg.mozilla.org/try/rev/a9c2ddbf2832
3) When a crash happens, a minidump should be saved in 'path/to/profile/minidumps'. This is on crashreporter and has nothing to do with the test harnesses.
4) We call mozcrash.check_for_crashes() after a test run, on a timeout, or when there is an exception. This basically just checks for minidumps in the profile and passes them on to minidump_stackwalk (which is what prints the stack):
http://mxr.mozilla.org/mozilla-central/source/testing/mozbase/mozcrash/mozcrash/mozcrash.py#22
My theory is that the work_queue.empty MOZ_ASSERT causes a child process crash which triggers this line [1]. Then we hit bug 983489 which causes another crash in a different child process (?) and somehow erases the original minidump or prevents it from being stackwalked (?).
However when I ran debug mochitests locally, I witnessed two separate stacks in the log. This makes me question my original theory. In short, I don't know what is happening :).
[1] http://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#139
(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #14)
> This is definitely broken. It may or may not be the only problem to solve
> here, but why is the IPC thread doing work while the main thread is shutting
> down?
Dunno, shutdown ordering is hard? :)
Currently I think we kill the IO thread here:
http://mxr.mozilla.org/mozilla-central/source/xpcom/build/nsXPComInit.cpp#1034
Maybe it just needs to be moved earlier.
Flags: needinfo?(bent.mozilla)
Comment 25•11 years ago
|
||
This is still waiting on bug 986836 which Milan said is going to start moving at the earliest tomorrow.
I've been investigating this from the e10s end, and I think I have a better idea what's going on. The problem seems to happen when the child process tries to shut down the image bridge. We call ImageBridgeChild::DestroyBridge() from the main thread. That posts a task to the image bridge thread that calls StopImageBridgeSync. That function tries to send a sync message to the parent process.
All of this stuff happens very late in shutdown. If the parent then dies, the child's send call will fail and it will run some IPC to post a task somewhere to report the error. However, we never actually process that task because we're so late in shutdown. Instead, we end up with this assertion. The task that remains in the queue is MessageChannel::OnNotifyMaybeChannelError.
It's hard to know if the same problem is affecting b2g. However, I think both b2g and e10s would benefit from a fix for bug 924622, which proposes to fix the shutdown code in graphics so that it happens earlier. That would solve the image bridge bug I'm seeing and probably fix a bunch of other things.
Also, I should mention that it's pretty easy to reproduce this problem in e10s by putting sleep calls in the right places.
Comment 27•11 years ago
|
||
Mass tracking-e10s flag change. Filter bugmail on "2be0fcce-e36a-4e2c-aa80-0e3d33eb5406".
tracking-e10s:
--- → +
Comment 28•11 years ago
|
||
Gregor, there has been a bit of progress on the list of dependencies here. Can you please retest this when you get a chance? Thanks!
Flags: needinfo?(anygregor)
Reporter | ||
Comment 29•11 years ago
|
||
(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #28)
> Gregor, there has been a bit of progress on the list of dependencies here.
> Can you please retest this when you get a chance? Thanks!
Sure, I just merged to pine again. https://tbpl.mozilla.org/?tree=Pine&showall=1
Flags: needinfo?(anygregor)
Comment 30•11 years ago
|
||
Re-triage - this doesn't have proof of being a user facing crash on devices, it only happens in debug build situations & tests. So renoming to reevaluate.
blocking-b2g: 1.4+ → 1.4?
Reporter | ||
Comment 31•11 years ago
|
||
(In reply to Jason Smith [:jsmith] from comment #30)
> Re-triage - this doesn't have proof of being a user facing crash on devices,
> it only happens in debug build situations & tests. So renoming to reevaluate.
This might as well be a major source of orange like bug 965677.
Updated•11 years ago
|
Assignee: overholt → anygregor
blocking-b2g: 1.4? → 1.4+
Reporter | ||
Comment 32•11 years ago
|
||
They are all gfx related as far as I can tell. Milan is a better owner/driver here.
Assignee: anygregor → milan
Comment 33•10 years ago
|
||
(In reply to Gregor Wagner [:gwagner] from comment #32)
> They are all gfx related as far as I can tell. Milan is a better
> owner/driver here.
So, does it mean, that "graphics" component would make sense?
Milan, may we have your comments for this bug? Thank you.
Flags: needinfo?(milan)
Comment 34•10 years ago
|
||
It may very well be the case. We are looking at shutdown issues in bug 924622, but it is slow going.
Flags: needinfo?(milan)
Updated•10 years ago
|
Component: IPC → Graphics
Comment 35•10 years ago
|
||
Milan
Are we fixing this issue in 1.4? We blocked on it so please help with next steps.
Flags: needinfo?(milan)
Comment 36•10 years ago
|
||
These shutdown issues are expensive to fix, we've been at it for weeks, with 7-8 landings that got backed out, so best I can offer is to keep working on it. Bug 924622 is where the work is, if that works on central, then we can proceed.
Flags: needinfo?(milan)
Comment 37•10 years ago
|
||
Assigning to :nical as a placeholder once we clear bug 924622.
Assignee: milan → nical.bugzilla
Comment 38•10 years ago
|
||
Now that bug 924622 has landed in 2.0, is this assertion still present there?
Flags: needinfo?(anygregor)
Reporter | ||
Comment 39•10 years ago
|
||
(In reply to Milan Sreckovic [:milan] from comment #38)
> Now that bug 924622 has landed in 2.0, is this assertion still present there?
This assertion is no longer there.
We moved to bug 1017781 (https://tbpl.mozilla.org/php/getParsedLog.php?id=40756411&tree=Pine&full=1)
And bug 1017719 which is not gfx related.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(anygregor) → needinfo?(milan)
Resolution: --- → WORKSFORME
Updated•10 years ago
|
Flags: needinfo?(milan)
You need to log in
before you can comment on or make changes to this bug.
Description
•