<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 5

•

11 years ago

(Also note that this is a debug only assertion, and it just means that we now *know* about this problem. The underlying problem has been happening all along regardless -- bug 976363 just helped uncover it.)

Benjamin Smedberg

Comment 6

•

11 years ago

Does the assert come with a stack? I don't see any of the normal crashreporter stuff in the log. It would be nice to know what kind of threads are still running, if you catch this in a debugger. If this is only at shutdown I'm not sure I care all that much; it probably doesn't need to block in that case.

Comment 7

•

11 years ago

I'm seeing this in e10s testing too.

Blocks: e10s-tests

Reporter

Comment 8

•

11 years ago

(In reply to Benjamin Smedberg [:bsmedberg] from comment #6) > Does the assert come with a stack? I don't see any of the normal > crashreporter stuff in the log. It would be nice to know what kind of > threads are still running, if you catch this in a debugger. > > If this is only at shutdown I'm not sure I care all that much; it probably > doesn't need to block in that case. This happens on the emulator and we don't record stack traces for a child process after we run all the tests successfully. Andrew knows more about this. Maybe we can try to get a stack trace on try with your patch that enables this?

Flags: needinfo?(anygregor) → needinfo?(ahalberstadt)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 9

•

11 years ago

(In reply to comment #7) > I'm seeing this in e10s testing too. Do you have a stack and/or can you repro?

Comment 10

•

11 years ago

This is the output in e10s. I'm wondering if the assertion happens in the child process. The stack isn't too helpful. https://tbpl.mozilla.org/php/getParsedLog.php?id=36478575&full=1&branch=holly#error0 There's also this very scary looking assertion: firefox: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 || !robust' failed.

Comment 11

•

11 years ago

The patch in bug 969146 should get the crash stack dumped in the log, but bug 983489 is blocking me from landing it.

Flags: needinfo?(ahalberstadt)

Reporter

Comment 12

•

11 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #11) > The patch in bug 969146 should get the crash stack dumped in the log, but > bug 983489 is blocking me from landing it. I can land the patch on pine but it doesn't apply currently.

Reporter

Comment 13

•

11 years ago

I pushed the patch to pine. We have multiple crash stacks but I am not sure if we have one that shows this bug. https://tbpl.mozilla.org/?tree=Pine&showall=1&rev=f8bd5adbe110

Reporter

Updated

•

11 years ago

Comment 14

•

11 years ago

(In reply to Bill McCloskey (:billm) from comment #10) > This is the output in e10s. I'm wondering if the assertion happens in the > child process. The stack isn't too helpful. > https://tbpl.mozilla.org/php/getParsedLog. > php?id=36478575&full=1&branch=holly#error0 > > There's also this very scary looking assertion: > firefox: pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion > `(-(e)) != 3 || !robust' failed. This stack is very interesting. I think one process here is aborting from libc's abort(), and the other one is aborting with the assertion in the summary of this bug. We have no stacks for the latter. But the former process seems to be trying to post an IPC message to the MessageLoop *while* we're shutting down on the main thread: 18:42:58 INFO - 18 libxul.so!LockImpl::Lock() [lock_impl_posix.cc:3d266881bda3 : 41 + 0x4] 18:42:58 INFO - rip = 0x00007fc2c94a055d rsp = 0x00007fc2bab3e830 18:42:58 INFO - rbp = 0x00007fc2d0213f3e 18:42:58 INFO - Found by: stack scanning 18:42:58 INFO - 19 libxul.so!tracked_objects::Tracked::SetBirthPlace(tracked_objects::Location const&) [tracked.cc:3d266881bda3 : 73 + 0x4] 18:42:58 INFO - rip = 0x00007fc2c94b044c rsp = 0x00007fc2bab3e840 18:42:58 INFO - rbp = 0x00007fc2d0213f3e 18:42:58 INFO - Found by: stack scanning 18:42:58 INFO - 20 libxul.so!MessageLoop::PostTask_Helper(tracked_objects::Location const&, Task*, int, bool) [lock.h:3d266881bda3 : 43 + 0x4] 18:42:58 INFO - rip = 0x00007fc2c94ad9a6 rsp = 0x00007fc2bab3e860 18:42:58 INFO - rbp = 0x00007fc2d0213f3e 18:42:58 INFO - Found by: stack scanning 18:42:58 INFO - 21 libxul.so!mozilla::ipc::MessageChannel::RefCountedTask::AddRef() [MessageChannel.h:3d266881bda3 : 386 + 0x18] 18:42:58 INFO - rip = 0x00007fc2c94bcac6 rsp = 0x00007fc2bab3e8b0 18:42:58 INFO - rbp = 0x00007fc2d0213f3e 18:42:58 INFO - Found by: stack scanning 18:42:58 INFO - 22 libxul.so!mozilla::ipc::MessageChannel::OnMessageReceivedFromLink(IPC::Message const&) [MessageChannel.cpp:3d266881bda3 : 551 + 0x4] 18:42:58 INFO - rip = 0x00007fc2c94c5f60 rsp = 0x00007fc2bab3e8e0 18:42:58 INFO - rbp = 0x00007fc2d0213f3e 18:42:58 INFO - Found by: stack scanning 18:42:58 INFO - 23 libxul.so!mozilla::ipc::ProcessLink::OnMessageReceived(IPC::Message const&) [MessageLink.cpp:3d266881bda3 : 259 + 0xb] 18:42:58 INFO - rip = 0x00007fc2c94c60ce rsp = 0x00007fc2bab3e940 18:42:58 INFO - rbp = 0x00007fc2d0213f3e 18:42:58 INFO - Found by: stack scanning 18:42:58 INFO - 24 libxul.so!IPC::Channel::ChannelImpl::ProcessIncomingMessages() [ipc_channel_posix.cc:3d266881bda3 : 608 + 0xc] 18:42:58 INFO - rip = 0x00007fc2c94a5a79 rsp = 0x00007fc2bab3e970 18:42:58 INFO - rbp = 0x00007fc2d0213f3e 18:42:58 INFO - Found by: stack scanning This is definitely broken. It may or may not be the only problem to solve here, but why is the IPC thread doing work while the main thread is shutting down?

Flags: needinfo?(bent.mozilla)

Comment 15

•

11 years ago

(In reply to Gregor Wagner [:gwagner] from comment #13) > I pushed the patch to pine. We have multiple crash stacks but I am not sure > if we have one that shows this bug. > > https://tbpl.mozilla.org/?tree=Pine&showall=1&rev=f8bd5adbe110 The assertions happen on most of the test runs, but I can't find a stack in the ones that I looked at. Andrew, your patch seems to not be working as expected.

Flags: needinfo?(ahalberstadt)

Comment 16

•

11 years ago

This only seems to happen in some of the debug mochitests as far as I can see. The crashes in opt mochitest are bug 983489. The stacks in the debug mochitests seem to be bug 930884 which is odd because that bug should be fixed. Maybe you aren't seeing the stack for the MOZ_ASSERT because the libEGL one happens first?

Flags: needinfo?(ahalberstadt)

bhavana bajaj [:bajaj]

Updated

•

11 years ago

blocking-b2g: 1.4? → 1.4+

bhavana bajaj [:bajaj]

Updated

•

11 years ago

Assignee: nobody → overholt

Comment 17

•

11 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #16) > This only seems to happen in some of the debug mochitests as far as I can > see. The crashes in opt mochitest are bug 983489. The stacks in the debug > mochitests seem to be bug 930884 which is odd because that bug should be > fixed. Maybe you aren't seeing the stack for the MOZ_ASSERT because the > libEGL one happens first? MOZ_ASSERT is a debug only macro, so it's totally expected to not see that in opt builds. Looks like bug 930884 is not fixed? I'm not really sure how to make progress here, we need more info before this bug is actionable.

Comment 18

•

11 years ago

Apparently bug 930884 was crashing in the parent process, and that got fixed, but that crash can still happen in a child process? I filed bug 987251 to continue investigating it. To clarify, these two bugs are potentially blocking us from getting a crash stack. Though I'll see if I can reproduce this locally and try to figure a way of extracting a stack manually.

Depends on: 987251, 983489

Reporter

Updated

•

11 years ago

Depends on: 986836

Benjamin Smedberg

Updated

•

11 years ago

Flags: needinfo?(benjamin)

Comment 19

•

11 years ago

I tried to debug this locally and I got a different MOZ_ASSERT [1] on shutdown for which a stack trace *did* appear (in addition to the libEGL one). I'm not sure if this is related to the MOZ_ASSERT here or not.. but I'm confused why I would see a stack for that locally, but we wouldn't see a stack for this MOZ_ASSERT in production. Maybe the "check_for_crashes" call in my patch is happening too early, before gecko has completely shutdown and we are missing the crash. To test this theory I added a sleep: https://tbpl.mozilla.org/?tree=Try&rev=110c73e019bb [1] F/MOZ_Assert( 311): Assertion failure: mUsedShmems.empty(), at /home/ahal/hg/mozilla-central/gfx/layers/ipc/ISurfaceAllocator.cpp:51

Comment 20

•

11 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #19) > I tried to debug this locally and I got a different MOZ_ASSERT [1] on > shutdown for which a stack trace *did* appear (in addition to the libEGL > one). I'm not sure if this is related to the MOZ_ASSERT here or not.. but > I'm confused why I would see a stack for that locally, but we wouldn't see a > stack for this MOZ_ASSERT in production. Maybe the "check_for_crashes" call > in my patch is happening too early, before gecko has completely shutdown and > we are missing the crash. To test this theory I added a sleep: > > https://tbpl.mozilla.org/?tree=Try&rev=110c73e019bb > > [1] F/MOZ_Assert( 311): Assertion failure: mUsedShmems.empty(), at > /home/ahal/hg/mozilla-central/gfx/layers/ipc/ISurfaceAllocator.cpp:51 I'm not sure if I can read this log properly? Which process is each one of these assertions coming from? The parent or the child? Surely they must happen on two different processes as they're both fatal...

Updated

•

11 years ago

Flags: needinfo?(ahalberstadt)

Comment 21

•

11 years ago

Which log? I didn't paste any. I have no idea why I saw two crash stacks locally (though not for this MOZ_ASSERT) but that try run only seems to have one. I also don't know if fixing bug 987251 and bug 983489 will help expose the stack here or if we aren't seeing it for some other reason.

Flags: needinfo?(ahalberstadt)

Comment 22

•

11 years ago

(In reply to comment #21) > Which log? I didn't paste any. The logs in the try run! > I have no idea why I saw two crash stacks > locally (though not for this MOZ_ASSERT) but that try run only seems to have > one. I also don't know if fixing bug 987251 and bug 983489 will help expose the > stack here or if we aren't seeing it for some other reason. Hmm.. Can you please help me understand which code is responsible for capturing the stacks here? MOZ_ASSERT itself doesn't unfortunately. FWIW, I'm guessing that fixing those bugs will fix the underlying issue here, because they both look like potential symptoms of the underlying problem here. I don't think fixing either of those will get us stacks per se but that could be because my understanding of how we capture these stack traces is incorrect.

Ben Turner (not reading bugmail, use the needinfo flag!)

Comment 23

•

11 years ago

There's only one stack in the try logs :), I was talking about what I was seeing locally. I'll try to explain more clearly: 1) When MOZ_CRASHREPORTER_SHUTDOWN is set, we quit the application when a child process crash happens: http://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#135 2) My patch in the try run, makes b2g mochitests pass in MOZ_CRASHREPORTER_SHUTDOWN: https://hg.mozilla.org/try/rev/a9c2ddbf2832 3) When a crash happens, a minidump should be saved in 'path/to/profile/minidumps'. This is on crashreporter and has nothing to do with the test harnesses. 4) We call mozcrash.check_for_crashes() after a test run, on a timeout, or when there is an exception. This basically just checks for minidumps in the profile and passes them on to minidump_stackwalk (which is what prints the stack): http://mxr.mozilla.org/mozilla-central/source/testing/mozbase/mozcrash/mozcrash/mozcrash.py#22 My theory is that the work_queue.empty MOZ_ASSERT causes a child process crash which triggers this line [1]. Then we hit bug 983489 which causes another crash in a different child process (?) and somehow erases the original minidump or prevents it from being stackwalked (?). However when I ran debug mochitests locally, I witnessed two separate stacks in the log. This makes me question my original theory. In short, I don't know what is happening :). [1] http://mxr.mozilla.org/mozilla-central/source/b2g/chrome/content/shell.js#139

Comment 24

•

11 years ago

(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #14) > This is definitely broken. It may or may not be the only problem to solve > here, but why is the IPC thread doing work while the main thread is shutting > down? Dunno, shutdown ordering is hard? :) Currently I think we kill the IO thread here: http://mxr.mozilla.org/mozilla-central/source/xpcom/build/nsXPComInit.cpp#1034 Maybe it just needs to be moved earlier.

Flags: needinfo?(bent.mozilla)

Andrew Overholt [:overholt]

Comment 25

•

11 years ago

This is still waiting on bug 986836 which Milan said is going to start moving at the earliest tomorrow.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 26

•

11 years ago

I've been investigating this from the e10s end, and I think I have a better idea what's going on. The problem seems to happen when the child process tries to shut down the image bridge. We call ImageBridgeChild::DestroyBridge() from the main thread. That posts a task to the image bridge thread that calls StopImageBridgeSync. That function tries to send a sync message to the parent process. All of this stuff happens very late in shutdown. If the parent then dies, the child's send call will fail and it will run some IPC to post a task somewhere to report the error. However, we never actually process that task because we're so late in shutdown. Instead, we end up with this assertion. The task that remains in the queue is MessageChannel::OnNotifyMaybeChannelError. It's hard to know if the same problem is affecting b2g. However, I think both b2g and e10s would benefit from a fix for bug 924622, which proposes to fix the shutdown code in graphics so that it happens earlier. That would solve the image bridge bug I'm seeing and probably fix a bunch of other things. Also, I should mention that it's pretty easy to reproduce this problem in e10s by putting sleep calls in the right places.

Chris Peterson [:cpeterson]

Updated

•

11 years ago

Depends on: 924622

Comment 27

•

11 years ago

Mass tracking-e10s flag change. Filter bugmail on "2be0fcce-e36a-4e2c-aa80-0e3d33eb5406".

tracking-e10s: --- → +

Comment 28

•

11 years ago

Gregor, there has been a bit of progress on the list of dependencies here. Can you please retest this when you get a chance? Thanks!

Flags: needinfo?(anygregor)

Reporter

Comment 29

•

11 years ago

(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #28) > Gregor, there has been a bit of progress on the list of dependencies here. > Can you please retest this when you get a chance? Thanks! Sure, I just merged to pine again. https://tbpl.mozilla.org/?tree=Pine&showall=1

Flags: needinfo?(anygregor)

Jason Smith [:jsmith]

Comment 30

•

11 years ago

Re-triage - this doesn't have proof of being a user facing crash on devices, it only happens in debug build situations & tests. So renoming to reevaluate.

blocking-b2g: 1.4+ → 1.4?

Preeti Raghunath(:Preeti)

Reporter

Comment 31

•

11 years ago

(In reply to Jason Smith [:jsmith] from comment #30) > Re-triage - this doesn't have proof of being a user facing crash on devices, > it only happens in debug build situations & tests. So renoming to reevaluate. This might as well be a major source of orange like bug 965677.

Updated

•

11 years ago

Assignee: overholt → anygregor

blocking-b2g: 1.4? → 1.4+

(inactive) Jim Chen [:jchen] [:darchons]

Reporter

Comment 32

•

11 years ago

They are all gfx related as far as I can tell. Milan is a better owner/driver here.

Assignee: anygregor → milan

Updated

•

11 years ago

Blocks: 997191

Chiajung Hung [:chiajung]

Updated

•

11 years ago

Blocks: 997367

Kai-Chih Hu [:khu]

Comment 33

•

10 years ago

(In reply to Gregor Wagner [:gwagner] from comment #32) > They are all gfx related as far as I can tell. Milan is a better > owner/driver here. So, does it mean, that "graphics" component would make sense? Milan, may we have your comments for this bug? Thank you.

Flags: needinfo?(milan)

Milan Sreckovic [:milan] (needinfo for best results)

Comment 34

•

10 years ago

It may very well be the case. We are looking at shutdown issues in bug 924622, but it is slow going.

Flags: needinfo?(milan)

Kai-Chih Hu [:khu]

Updated

•

10 years ago

Component: IPC → Graphics

Preeti Raghunath(:Preeti)

Comment 35

•

10 years ago

Milan Are we fixing this issue in 1.4? We blocked on it so please help with next steps.

Flags: needinfo?(milan)

Milan Sreckovic [:milan] (needinfo for best results)

Comment 36

•

10 years ago

These shutdown issues are expensive to fix, we've been at it for weeks, with 7-8 landings that got backed out, so best I can offer is to keep working on it. Bug 924622 is where the work is, if that works on central, then we can proceed.

Flags: needinfo?(milan)

Milan Sreckovic [:milan] (needinfo for best results)

Comment 37

•

10 years ago

Assigning to :nical as a placeholder once we clear bug 924622.

Assignee: milan → nical.bugzilla

Milan Sreckovic [:milan] (needinfo for best results)

Comment 38

•

10 years ago

Now that bug 924622 has landed in 2.0, is this assertion still present there?

Flags: needinfo?(anygregor)