Closed Bug 1205164 Opened 9 years ago Closed 9 years ago

MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool interferes w/debugging in e10s

Categories

(Core :: Audio/Video, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla44
Tracking Status
e10s + ---
firefox43 --- affected
firefox44 --- fixed

People

(Reporter: jib, Assigned: gcp)

Details

Attachments

(4 files, 2 obsolete files)

Since Bug 1104616, this MOZ_ASSERT is making it difficult for me to debug GetUserMedia running in e10s. Any breakpoint set and hit in the content process while a gUM stream is active, is immediately followed by this assert in the chrome process, crashing the chrome process:

> IPDL Background (40)#0	0x0000000104c17fe7 in mozilla::ShmemBuffer mozilla::ShmemPool::Get<mozilla::camera::CamerasParent>(mozilla::camera::CamerasParent*, unsigned long) at /Users/Jan/moz/mozilla-central/dom/media/systemservices/ShmemPool.cpp:83
> #1	0x0000000104c0900d in mozilla::camera::CamerasParent::DeliverFrameOverIPC(mozilla::camera::CaptureEngine, int, mozilla::ShmemBuffer, unsigned char*, int, unsigned int, long long, long long) at /Users/Jan/moz/mozilla-central/dom/media/systemservices/CamerasParent.cpp:161
> #2	0x0000000104c299ec in mozilla::camera::DeliverFrameRunnable::Run() at /Users/Jan/moz/mozilla-central/dom/media/systemservices/CamerasParent.cpp:117
> #3	0x000000010173efb3 in nsThread::ProcessNextEvent(bool, bool*) at /Users/Jan/moz/mozilla-central/xpcom/threads/nsThread.cpp:874
> #4	0x00000001017c1787 in NS_ProcessNextEvent(nsIThread*, bool) at /Users/Jan/moz/mozilla-central/xpcom/glue/nsThreadUtils.cpp:277
> #5	0x0000000101e4f6d1 in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) at /Users/Jan/moz/mozilla-central/ipc/glue/MessagePump.cpp:355
> #6	0x0000000101db93f5 in MessageLoop::RunInternal() at /Users/Jan/moz/mozilla-central/ipc/chromium/src/base/message_loop.cc:234
> #7	0x0000000101db9305 in MessageLoop::RunHandler() at /Users/Jan/moz/mozilla-central/ipc/chromium/src/base/message_loop.cc:227
> #8	0x0000000101db92ad in MessageLoop::Run() at /Users/Jan/moz/mozilla-central/ipc/chromium/src/base/message_loop.cc:201
> #9	0x000000010173d316 in nsThread::ThreadFunc(void*) at /Users/Jan/moz/mozilla-central/xpcom/threads/nsThread.cpp:363
> #10	0x00000001013807ef in _pt_root at /Users/Jan/moz/mozilla-central/nsprpub/pr/src/pthreads/ptthread.c:212
> #11	0x00007fff8dfd5899 in _pthread_body ()
> #12	0x00007fff8dfd572a in _pthread_start ()
> #13	0x00007fff8dfd9fc9 in thread_start ()

Changing the code to log and fail instead seems tolerated by the calling code.
Bug 1205164 - remove MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool, as it interfered w/debugging in e10s.
Attachment #8661614 - Flags: review?(gpascutto)
Comment on attachment 8661614 [details]
MozReview Request: Bug 1205164 - remove MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool, as it interfered w/debugging in e10s.

https://reviewboard.mozilla.org/r/19415/#review17385

This looks like a genuine bug that should be investigated. Hiding the assert sounds like a very bad idea.
Attachment #8661614 - Flags: review?(gpascutto)
When I say "crashes" I just mean that the assert hits in debug build, which is hard to get out of in my debug work flow. If the content process is blocked and the chrome process is running, I imagine it's hard for DeliverFrameOverIPC to keep doing its job.
I understand but you should still not hit this assert, AFAIK. If the two processes get desynced due to load they should drop frames, not get into an unknown/unwanted state.
Not a fix for this bug, but noticed this logging was missing.
Attachment #8662427 - Flags: review?(rjesup)
FWIW this did not reproduce for me on Linux. Will try on Mac.
Attachment #8662427 - Flags: review?(rjesup) → review+
Note that as mentioned in irc, this patch trades an assert for plain green output from getUserMedia past the point where the assert used to hit (which is actually ok for most debugging). Again, this only seems to happen when the content process is paused in the debugger.
No longer blocks: 912342
(In reply to Jan-Ivar Bruaroey [:jib] from comment #9)
> Note that as mentioned in irc, this patch trades an assert for plain green
> output from getUserMedia past the point where the assert used to hit (which
> is actually ok for most debugging). 

Still need to understand the underlying cause for the memory allocation failing. It's not obvious to my (why) this should fail in chrome because content is blocked.
Confirmed on Mac OS X. Sending SIGSTOP to plugin-container while capture is running causes an instant failure of Shmem allocation. So this is platform specific.
Relevant log:

[Parent 22005] WARNING: didn't get an id (ipc/rcv) timed out 10004003
: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 591

So this means that Shmem allocation does need cross-process comms, at least on Mac/mach. If that is true then the patch here to just return failure seems OK.
I guess it might be worth seeing if we can recover gracefully from this, i.e. allow the video to get unstuck if the debugger continues.
So the original code should already recover from this. Frames are dropped until Shmem allocation starts working again. Because some frame deliveries can be queued up, we might temporarily use more of the ShmemPool, but debugging suggests this part normalizes quickly and goes back to 1 Shmem in use. 

I don't really see anything wrong with the code here that would explain the green frames. Is it possible these are generated elsewhere if frame delivery stops for a certain period if time?

I guess I'll have to debug in MediaStreamGraph next?
Flags: needinfo?(rjesup)
I'd check that if the shmem alloc fails that it's not passing a frame with an empty set of data.  This *smells* like something returning success when it actually failed.
Flags: needinfo?(rjesup)
Brad - IIRC, you updated the Mac Shmem implementation to use Mac-native methods; I imagine this is part of why it reacts differently here.  Are there any other implications of this behavior to shmem/IPC?
Flags: needinfo?(blassey.bugs)
>Brad - IIRC, you updated the Mac Shmem implementation to use Mac-native methods; I imagine this is part of 
>why it reacts differently here.

This is already covered in the preceding comments (comment 10-12): Mac will fail to alloc Shmem if one of the two processes is blocked, by design. But as far as I can tell (with the patches here) the code handles and recovers from this by just stopping frame delivery until the allocations start succeeding again (comment 13-14).

I can double-check if the frame data really makes it across after the recovery but I haven't seen any indication that that is the issue.
Flags: needinfo?(blassey.bugs)
So the data going into the shmem is the image data, and the data coming out is...zeroes.

A bit more experimenting seems to indicate this goes away if ShmemPool is restricted to 1 buffer. Investigating further.
From debugging, my current suspicion is that the Shmem that was active at the moment the child gets stopped/resumed can silently get corrupted and be unusable, without this being visible through the Shmem API.

For example, we will log this failure:
[Child 76438] WARNING: SendMessage failed 0x10000004 (ipc/send) timed out
: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 273

But I see no evidence in the C++ code that that error is actually handled.

From my logging, the exact failure case is when the child was just resumed, Shmem allocation in the parent will succeed, but the resulting Shmem segment will always show up empty in the child. That would correlate with the above error, which can only happen when the child is running (again, i.e. resumed).
Hmm, the code above does use a send/receive combo, so failure to send the reply in the Child would be perceived as a failure to receive in the parent, which is handled.

That said, there appears to be no numbering of message, the validity of the data is checked by

  if (msg.GetDataLength() != sizeof(mach_port_t)) {

So there seems ample opportunity for this to go wrong.
Confirmed that the problem is that the Mac Shmem implementation doesn't number its messages, causing it to get confused if the replies don't come in the order it expects (or if messages are dropped):

[Parent 8808] WARNING: Serials do not match up 1 vs 9: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 610
[Parent 8808] WARNING: Serials do not match up 2 vs 10: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 610
[Parent 8808] WARNING: Serials do not match up 3 vs 11: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 610
Attachment #8664283 - Flags: review?(blassey.bugs)
Attachment #8662467 - Attachment is obsolete: true
Attachment #8664284 - Flags: review?(rjesup) → review+
Comment on attachment 8664283 [details] [diff] [review]
Detect message races in Mach Shmem implementation

Review of attachment 8664283 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/SharedMemoryBasic_mach.cpp
@@ +271,5 @@
> +  // Construct the reply message, echoing the serial, and
> +  // adding the port
> +  unsigned char msgbuff[sizeof(port) + sizeof(uint64_t)];
> +  memcpy(msgbuff, &port, sizeof(port));
> +  memcpy(msgbuff + sizeof(port), serial, sizeof(uint64_t));

let's use a struct rather than this memcpy stuff.
Attachment #8664283 - Flags: review?(blassey.bugs) → review-
Attachment #8664283 - Attachment is obsolete: true
Attachment #8665402 - Flags: review?(blassey.bugs) → review+
Assignee: nobody → gpascutto
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: