MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool interferes w/debugging in e10s

RESOLVED FIXED in Firefox 44

Status

()

Core
Audio/Video
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: jib, Assigned: gcp)

Tracking

Trunk
mozilla44
Points:
---

Firefox Tracking Flags

(e10s+, firefox43 affected, firefox44 fixed)

Details

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(4 attachments, 2 obsolete attachments)

Since Bug 1104616, this MOZ_ASSERT is making it difficult for me to debug GetUserMedia running in e10s. Any breakpoint set and hit in the content process while a gUM stream is active, is immediately followed by this assert in the chrome process, crashing the chrome process:

> IPDL Background (40)#0	0x0000000104c17fe7 in mozilla::ShmemBuffer mozilla::ShmemPool::Get<mozilla::camera::CamerasParent>(mozilla::camera::CamerasParent*, unsigned long) at /Users/Jan/moz/mozilla-central/dom/media/systemservices/ShmemPool.cpp:83
> #1	0x0000000104c0900d in mozilla::camera::CamerasParent::DeliverFrameOverIPC(mozilla::camera::CaptureEngine, int, mozilla::ShmemBuffer, unsigned char*, int, unsigned int, long long, long long) at /Users/Jan/moz/mozilla-central/dom/media/systemservices/CamerasParent.cpp:161
> #2	0x0000000104c299ec in mozilla::camera::DeliverFrameRunnable::Run() at /Users/Jan/moz/mozilla-central/dom/media/systemservices/CamerasParent.cpp:117
> #3	0x000000010173efb3 in nsThread::ProcessNextEvent(bool, bool*) at /Users/Jan/moz/mozilla-central/xpcom/threads/nsThread.cpp:874
> #4	0x00000001017c1787 in NS_ProcessNextEvent(nsIThread*, bool) at /Users/Jan/moz/mozilla-central/xpcom/glue/nsThreadUtils.cpp:277
> #5	0x0000000101e4f6d1 in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) at /Users/Jan/moz/mozilla-central/ipc/glue/MessagePump.cpp:355
> #6	0x0000000101db93f5 in MessageLoop::RunInternal() at /Users/Jan/moz/mozilla-central/ipc/chromium/src/base/message_loop.cc:234
> #7	0x0000000101db9305 in MessageLoop::RunHandler() at /Users/Jan/moz/mozilla-central/ipc/chromium/src/base/message_loop.cc:227
> #8	0x0000000101db92ad in MessageLoop::Run() at /Users/Jan/moz/mozilla-central/ipc/chromium/src/base/message_loop.cc:201
> #9	0x000000010173d316 in nsThread::ThreadFunc(void*) at /Users/Jan/moz/mozilla-central/xpcom/threads/nsThread.cpp:363
> #10	0x00000001013807ef in _pt_root at /Users/Jan/moz/mozilla-central/nsprpub/pr/src/pthreads/ptthread.c:212
> #11	0x00007fff8dfd5899 in _pthread_body ()
> #12	0x00007fff8dfd572a in _pthread_start ()
> #13	0x00007fff8dfd9fc9 in thread_start ()

Changing the code to log and fail instead seems tolerated by the calling code.
Created attachment 8661614 [details]
MozReview Request: Bug 1205164 - remove MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool, as it interfered w/debugging in e10s.

Bug 1205164 - remove MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool, as it interfered w/debugging in e10s.
Attachment #8661614 - Flags: review?(gpascutto)
(Assignee)

Comment 2

2 years ago
Comment on attachment 8661614 [details]
MozReview Request: Bug 1205164 - remove MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool, as it interfered w/debugging in e10s.

https://reviewboard.mozilla.org/r/19415/#review17385

This looks like a genuine bug that should be investigated. Hiding the assert sounds like a very bad idea.
Attachment #8661614 - Flags: review?(gpascutto)
(Assignee)

Comment 3

2 years ago
https://reviewboard.mozilla.org/r/19415/#review17387
When I say "crashes" I just mean that the assert hits in debug build, which is hard to get out of in my debug work flow. If the content process is blocked and the chrome process is running, I imagine it's hard for DeliverFrameOverIPC to keep doing its job.
(Assignee)

Comment 5

2 years ago
I understand but you should still not hit this assert, AFAIK. If the two processes get desynced due to load they should drop frames, not get into an unknown/unwanted state.
Blocks: 912342
(Assignee)

Comment 6

2 years ago
Created attachment 8662427 [details] [diff] [review]
Make sure ShmemPool high water mark is logged in all allocators

Not a fix for this bug, but noticed this logging was missing.
Attachment #8662427 - Flags: review?(rjesup)
(Assignee)

Comment 7

2 years ago
FWIW this did not reproduce for me on Linux. Will try on Mac.
tracking-e10s: --- → +
(Assignee)

Comment 8

2 years ago
Created attachment 8662467 [details] [diff] [review]
Fix ShmemPool logic on buffer reallocation

Updated

2 years ago
Attachment #8662427 - Flags: review?(rjesup) → review+
Note that as mentioned in irc, this patch trades an assert for plain green output from getUserMedia past the point where the assert used to hit (which is actually ok for most debugging). Again, this only seems to happen when the content process is paused in the debugger.
No longer blocks: 912342
(Assignee)

Comment 10

2 years ago
(In reply to Jan-Ivar Bruaroey [:jib] from comment #9)
> Note that as mentioned in irc, this patch trades an assert for plain green
> output from getUserMedia past the point where the assert used to hit (which
> is actually ok for most debugging). 

Still need to understand the underlying cause for the memory allocation failing. It's not obvious to my (why) this should fail in chrome because content is blocked.
(Assignee)

Comment 11

2 years ago
Confirmed on Mac OS X. Sending SIGSTOP to plugin-container while capture is running causes an instant failure of Shmem allocation. So this is platform specific.
(Assignee)

Comment 12

2 years ago
Relevant log:

[Parent 22005] WARNING: didn't get an id (ipc/rcv) timed out 10004003
: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 591

So this means that Shmem allocation does need cross-process comms, at least on Mac/mach. If that is true then the patch here to just return failure seems OK.
(Assignee)

Comment 13

2 years ago
I guess it might be worth seeing if we can recover gracefully from this, i.e. allow the video to get unstuck if the debugger continues.
(Assignee)

Comment 14

2 years ago
So the original code should already recover from this. Frames are dropped until Shmem allocation starts working again. Because some frame deliveries can be queued up, we might temporarily use more of the ShmemPool, but debugging suggests this part normalizes quickly and goes back to 1 Shmem in use. 

I don't really see anything wrong with the code here that would explain the green frames. Is it possible these are generated elsewhere if frame delivery stops for a certain period if time?

I guess I'll have to debug in MediaStreamGraph next?
Flags: needinfo?(rjesup)
I'd check that if the shmem alloc fails that it's not passing a frame with an empty set of data.  This *smells* like something returning success when it actually failed.
Flags: needinfo?(rjesup)
Brad - IIRC, you updated the Mac Shmem implementation to use Mac-native methods; I imagine this is part of why it reacts differently here.  Are there any other implications of this behavior to shmem/IPC?
Flags: needinfo?(blassey.bugs)
(Assignee)

Comment 17

2 years ago
>Brad - IIRC, you updated the Mac Shmem implementation to use Mac-native methods; I imagine this is part of 
>why it reacts differently here.

This is already covered in the preceding comments (comment 10-12): Mac will fail to alloc Shmem if one of the two processes is blocked, by design. But as far as I can tell (with the patches here) the code handles and recovers from this by just stopping frame delivery until the allocations start succeeding again (comment 13-14).

I can double-check if the frame data really makes it across after the recovery but I haven't seen any indication that that is the issue.
Flags: needinfo?(blassey.bugs)
(Assignee)

Comment 18

2 years ago
So the data going into the shmem is the image data, and the data coming out is...zeroes.

A bit more experimenting seems to indicate this goes away if ShmemPool is restricted to 1 buffer. Investigating further.
(Assignee)

Comment 19

2 years ago
From debugging, my current suspicion is that the Shmem that was active at the moment the child gets stopped/resumed can silently get corrupted and be unusable, without this being visible through the Shmem API.

For example, we will log this failure:
[Child 76438] WARNING: SendMessage failed 0x10000004 (ipc/send) timed out
: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 273

But I see no evidence in the C++ code that that error is actually handled.

From my logging, the exact failure case is when the child was just resumed, Shmem allocation in the parent will succeed, but the resulting Shmem segment will always show up empty in the child. That would correlate with the above error, which can only happen when the child is running (again, i.e. resumed).
(Assignee)

Comment 20

2 years ago
Hmm, the code above does use a send/receive combo, so failure to send the reply in the Child would be perceived as a failure to receive in the parent, which is handled.

That said, there appears to be no numbering of message, the validity of the data is checked by

  if (msg.GetDataLength() != sizeof(mach_port_t)) {

So there seems ample opportunity for this to go wrong.
(Assignee)

Comment 21

2 years ago
Confirmed that the problem is that the Mac Shmem implementation doesn't number its messages, causing it to get confused if the replies don't come in the order it expects (or if messages are dropped):

[Parent 8808] WARNING: Serials do not match up 1 vs 9: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 610
[Parent 8808] WARNING: Serials do not match up 2 vs 10: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 610
[Parent 8808] WARNING: Serials do not match up 3 vs 11: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 610
(Assignee)

Comment 22

2 years ago
Created attachment 8664283 [details] [diff] [review]
Detect message races in Mach Shmem implementation
Attachment #8664283 - Flags: review?(blassey.bugs)
(Assignee)

Comment 23

2 years ago
Created attachment 8664284 [details] [diff] [review]
Fix ShmemPool detection of failed allocations
Attachment #8664284 - Flags: review?(rjesup)
(Assignee)

Updated

2 years ago
Attachment #8662467 - Attachment is obsolete: true

Updated

2 years ago
Attachment #8664284 - Flags: review?(rjesup) → review+
Comment on attachment 8664283 [details] [diff] [review]
Detect message races in Mach Shmem implementation

Review of attachment 8664283 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/SharedMemoryBasic_mach.cpp
@@ +271,5 @@
> +  // Construct the reply message, echoing the serial, and
> +  // adding the port
> +  unsigned char msgbuff[sizeof(port) + sizeof(uint64_t)];
> +  memcpy(msgbuff, &port, sizeof(port));
> +  memcpy(msgbuff + sizeof(port), serial, sizeof(uint64_t));

let's use a struct rather than this memcpy stuff.
Attachment #8664283 - Flags: review?(blassey.bugs) → review-
(Assignee)

Comment 25

2 years ago
Created attachment 8665402 [details] [diff] [review]
Detect message races in Mach Shmem implementation
Attachment #8665402 - Flags: review?(blassey.bugs)
(Assignee)

Updated

2 years ago
Attachment #8664283 - Attachment is obsolete: true
Attachment #8665402 - Flags: review?(blassey.bugs) → review+
(Assignee)

Comment 26

2 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&revision=99ccced4cd52
(Assignee)

Comment 27

2 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/4cec5429fa92b68a0dcd39486898ffe9642583b5
Bug 1205164 - Make sure ShmemPool high water mark is logged in all allocators. r=jesup

https://hg.mozilla.org/integration/mozilla-inbound/rev/5c1d38c87429a9725a2b8fc6dda2a1c8332c9490
Bug 1205164 - Fix ShmemPool detection of failed allocations. r=jesup

https://hg.mozilla.org/integration/mozilla-inbound/rev/616b45c16382a11122535a2be35c6f6356d3e8a5
Bug 1205164 - Detect message races in Mach Shmem implementation. r=blassey
https://hg.mozilla.org/mozilla-central/rev/4cec5429fa92
https://hg.mozilla.org/mozilla-central/rev/5c1d38c87429
https://hg.mozilla.org/mozilla-central/rev/616b45c16382
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox44: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
Assignee: nobody → gpascutto
You need to log in before you can comment on or make changes to this bug.