Closed
Bug 1205164
Opened 9 years ago
Closed 9 years ago
MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool interferes w/debugging in e10s
Categories
(Core :: Audio/Video, defect)
Core
Audio/Video
Tracking
()
RESOLVED
FIXED
mozilla44
People
(Reporter: jib, Assigned: gcp)
Details
Attachments
(4 files, 2 obsolete files)
40 bytes,
text/x-review-board-request
|
Details | |
1.18 KB,
patch
|
jesup
:
review+
|
Details | Diff | Splinter Review |
5.92 KB,
patch
|
jesup
:
review+
|
Details | Diff | Splinter Review |
3.79 KB,
patch
|
blassey
:
review+
|
Details | Diff | Splinter Review |
Since Bug 1104616, this MOZ_ASSERT is making it difficult for me to debug GetUserMedia running in e10s. Any breakpoint set and hit in the content process while a gUM stream is active, is immediately followed by this assert in the chrome process, crashing the chrome process: > IPDL Background (40)#0 0x0000000104c17fe7 in mozilla::ShmemBuffer mozilla::ShmemPool::Get<mozilla::camera::CamerasParent>(mozilla::camera::CamerasParent*, unsigned long) at /Users/Jan/moz/mozilla-central/dom/media/systemservices/ShmemPool.cpp:83 > #1 0x0000000104c0900d in mozilla::camera::CamerasParent::DeliverFrameOverIPC(mozilla::camera::CaptureEngine, int, mozilla::ShmemBuffer, unsigned char*, int, unsigned int, long long, long long) at /Users/Jan/moz/mozilla-central/dom/media/systemservices/CamerasParent.cpp:161 > #2 0x0000000104c299ec in mozilla::camera::DeliverFrameRunnable::Run() at /Users/Jan/moz/mozilla-central/dom/media/systemservices/CamerasParent.cpp:117 > #3 0x000000010173efb3 in nsThread::ProcessNextEvent(bool, bool*) at /Users/Jan/moz/mozilla-central/xpcom/threads/nsThread.cpp:874 > #4 0x00000001017c1787 in NS_ProcessNextEvent(nsIThread*, bool) at /Users/Jan/moz/mozilla-central/xpcom/glue/nsThreadUtils.cpp:277 > #5 0x0000000101e4f6d1 in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) at /Users/Jan/moz/mozilla-central/ipc/glue/MessagePump.cpp:355 > #6 0x0000000101db93f5 in MessageLoop::RunInternal() at /Users/Jan/moz/mozilla-central/ipc/chromium/src/base/message_loop.cc:234 > #7 0x0000000101db9305 in MessageLoop::RunHandler() at /Users/Jan/moz/mozilla-central/ipc/chromium/src/base/message_loop.cc:227 > #8 0x0000000101db92ad in MessageLoop::Run() at /Users/Jan/moz/mozilla-central/ipc/chromium/src/base/message_loop.cc:201 > #9 0x000000010173d316 in nsThread::ThreadFunc(void*) at /Users/Jan/moz/mozilla-central/xpcom/threads/nsThread.cpp:363 > #10 0x00000001013807ef in _pt_root at /Users/Jan/moz/mozilla-central/nsprpub/pr/src/pthreads/ptthread.c:212 > #11 0x00007fff8dfd5899 in _pthread_body () > #12 0x00007fff8dfd572a in _pthread_start () > #13 0x00007fff8dfd9fc9 in thread_start () Changing the code to log and fail instead seems tolerated by the calling code.
Reporter | ||
Comment 1•9 years ago
|
||
Bug 1205164 - remove MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool, as it interfered w/debugging in e10s.
Attachment #8661614 -
Flags: review?(gpascutto)
Assignee | ||
Comment 2•9 years ago
|
||
Comment on attachment 8661614 [details] MozReview Request: Bug 1205164 - remove MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool, as it interfered w/debugging in e10s. https://reviewboard.mozilla.org/r/19415/#review17385 This looks like a genuine bug that should be investigated. Hiding the assert sounds like a very bad idea.
Attachment #8661614 -
Flags: review?(gpascutto)
Assignee | ||
Comment 3•9 years ago
|
||
https://reviewboard.mozilla.org/r/19415/#review17387
Reporter | ||
Comment 4•9 years ago
|
||
When I say "crashes" I just mean that the assert hits in debug build, which is hard to get out of in my debug work flow. If the content process is blocked and the chrome process is running, I imagine it's hard for DeliverFrameOverIPC to keep doing its job.
Assignee | ||
Comment 5•9 years ago
|
||
I understand but you should still not hit this assert, AFAIK. If the two processes get desynced due to load they should drop frames, not get into an unknown/unwanted state.
Assignee | ||
Comment 6•9 years ago
|
||
Not a fix for this bug, but noticed this logging was missing.
Attachment #8662427 -
Flags: review?(rjesup)
Assignee | ||
Comment 7•9 years ago
|
||
FWIW this did not reproduce for me on Linux. Will try on Mac.
Updated•9 years ago
|
tracking-e10s:
--- → +
Assignee | ||
Comment 8•9 years ago
|
||
Updated•9 years ago
|
Attachment #8662427 -
Flags: review?(rjesup) → review+
Reporter | ||
Comment 9•9 years ago
|
||
Note that as mentioned in irc, this patch trades an assert for plain green output from getUserMedia past the point where the assert used to hit (which is actually ok for most debugging). Again, this only seems to happen when the content process is paused in the debugger.
Assignee | ||
Comment 10•9 years ago
|
||
(In reply to Jan-Ivar Bruaroey [:jib] from comment #9) > Note that as mentioned in irc, this patch trades an assert for plain green > output from getUserMedia past the point where the assert used to hit (which > is actually ok for most debugging). Still need to understand the underlying cause for the memory allocation failing. It's not obvious to my (why) this should fail in chrome because content is blocked.
Assignee | ||
Comment 11•9 years ago
|
||
Confirmed on Mac OS X. Sending SIGSTOP to plugin-container while capture is running causes an instant failure of Shmem allocation. So this is platform specific.
Assignee | ||
Comment 12•9 years ago
|
||
Relevant log: [Parent 22005] WARNING: didn't get an id (ipc/rcv) timed out 10004003 : file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 591 So this means that Shmem allocation does need cross-process comms, at least on Mac/mach. If that is true then the patch here to just return failure seems OK.
Assignee | ||
Comment 13•9 years ago
|
||
I guess it might be worth seeing if we can recover gracefully from this, i.e. allow the video to get unstuck if the debugger continues.
Assignee | ||
Comment 14•9 years ago
|
||
So the original code should already recover from this. Frames are dropped until Shmem allocation starts working again. Because some frame deliveries can be queued up, we might temporarily use more of the ShmemPool, but debugging suggests this part normalizes quickly and goes back to 1 Shmem in use. I don't really see anything wrong with the code here that would explain the green frames. Is it possible these are generated elsewhere if frame delivery stops for a certain period if time? I guess I'll have to debug in MediaStreamGraph next?
Flags: needinfo?(rjesup)
Comment 15•9 years ago
|
||
I'd check that if the shmem alloc fails that it's not passing a frame with an empty set of data. This *smells* like something returning success when it actually failed.
Flags: needinfo?(rjesup)
Comment 16•9 years ago
|
||
Brad - IIRC, you updated the Mac Shmem implementation to use Mac-native methods; I imagine this is part of why it reacts differently here. Are there any other implications of this behavior to shmem/IPC?
Flags: needinfo?(blassey.bugs)
Assignee | ||
Comment 17•9 years ago
|
||
>Brad - IIRC, you updated the Mac Shmem implementation to use Mac-native methods; I imagine this is part of >why it reacts differently here. This is already covered in the preceding comments (comment 10-12): Mac will fail to alloc Shmem if one of the two processes is blocked, by design. But as far as I can tell (with the patches here) the code handles and recovers from this by just stopping frame delivery until the allocations start succeeding again (comment 13-14). I can double-check if the frame data really makes it across after the recovery but I haven't seen any indication that that is the issue.
Flags: needinfo?(blassey.bugs)
Assignee | ||
Comment 18•9 years ago
|
||
So the data going into the shmem is the image data, and the data coming out is...zeroes. A bit more experimenting seems to indicate this goes away if ShmemPool is restricted to 1 buffer. Investigating further.
Assignee | ||
Comment 19•9 years ago
|
||
From debugging, my current suspicion is that the Shmem that was active at the moment the child gets stopped/resumed can silently get corrupted and be unusable, without this being visible through the Shmem API. For example, we will log this failure: [Child 76438] WARNING: SendMessage failed 0x10000004 (ipc/send) timed out : file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 273 But I see no evidence in the C++ code that that error is actually handled. From my logging, the exact failure case is when the child was just resumed, Shmem allocation in the parent will succeed, but the resulting Shmem segment will always show up empty in the child. That would correlate with the above error, which can only happen when the child is running (again, i.e. resumed).
Assignee | ||
Comment 20•9 years ago
|
||
Hmm, the code above does use a send/receive combo, so failure to send the reply in the Child would be perceived as a failure to receive in the parent, which is handled. That said, there appears to be no numbering of message, the validity of the data is checked by if (msg.GetDataLength() != sizeof(mach_port_t)) { So there seems ample opportunity for this to go wrong.
Assignee | ||
Comment 21•9 years ago
|
||
Confirmed that the problem is that the Mac Shmem implementation doesn't number its messages, causing it to get confused if the replies don't come in the order it expects (or if messages are dropped): [Parent 8808] WARNING: Serials do not match up 1 vs 9: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 610 [Parent 8808] WARNING: Serials do not match up 2 vs 10: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 610 [Parent 8808] WARNING: Serials do not match up 3 vs 11: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 610
Assignee | ||
Comment 22•9 years ago
|
||
Attachment #8664283 -
Flags: review?(blassey.bugs)
Assignee | ||
Comment 23•9 years ago
|
||
Attachment #8664284 -
Flags: review?(rjesup)
Assignee | ||
Updated•9 years ago
|
Attachment #8662467 -
Attachment is obsolete: true
Updated•9 years ago
|
Attachment #8664284 -
Flags: review?(rjesup) → review+
Comment 24•9 years ago
|
||
Comment on attachment 8664283 [details] [diff] [review] Detect message races in Mach Shmem implementation Review of attachment 8664283 [details] [diff] [review]: ----------------------------------------------------------------- ::: ipc/glue/SharedMemoryBasic_mach.cpp @@ +271,5 @@ > + // Construct the reply message, echoing the serial, and > + // adding the port > + unsigned char msgbuff[sizeof(port) + sizeof(uint64_t)]; > + memcpy(msgbuff, &port, sizeof(port)); > + memcpy(msgbuff + sizeof(port), serial, sizeof(uint64_t)); let's use a struct rather than this memcpy stuff.
Attachment #8664283 -
Flags: review?(blassey.bugs) → review-
Assignee | ||
Comment 25•9 years ago
|
||
Attachment #8665402 -
Flags: review?(blassey.bugs)
Assignee | ||
Updated•9 years ago
|
Attachment #8664283 -
Attachment is obsolete: true
Updated•9 years ago
|
Attachment #8665402 -
Flags: review?(blassey.bugs) → review+
Assignee | ||
Comment 26•9 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=99ccced4cd52
Assignee | ||
Comment 27•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/4cec5429fa92b68a0dcd39486898ffe9642583b5 Bug 1205164 - Make sure ShmemPool high water mark is logged in all allocators. r=jesup https://hg.mozilla.org/integration/mozilla-inbound/rev/5c1d38c87429a9725a2b8fc6dda2a1c8332c9490 Bug 1205164 - Fix ShmemPool detection of failed allocations. r=jesup https://hg.mozilla.org/integration/mozilla-inbound/rev/616b45c16382a11122535a2be35c6f6356d3e8a5 Bug 1205164 - Detect message races in Mach Shmem implementation. r=blassey
Comment 28•9 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/4cec5429fa92 https://hg.mozilla.org/mozilla-central/rev/5c1d38c87429 https://hg.mozilla.org/mozilla-central/rev/616b45c16382
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox44:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
Reporter | ||
Updated•9 years ago
|
Assignee: nobody → gpascutto
You need to log in
before you can comment on or make changes to this bug.
Description
•