Closed Bug 1205164 Opened 9 years ago Closed 9 years ago

MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool interferes w/debugging in e10s

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla44

Tracking Flags:

Tracking

Status

e10s

---

firefox43

---

affected

firefox44

---

fixed

People

(Reporter: jib, Assigned: gcp)

Details

Attachments

(4 files, 2 obsolete files)

MozReview Request: Bug 1205164 - remove MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool, as it interfered w/debugging in e10s. 9 years ago Jan-Ivar Bruaroey [:jib] (needinfo? me) 40 bytes, text/x-review-board-request		Details
Make sure ShmemPool high water mark is logged in all allocators 9 years ago Gian-Carlo Pascutto [:gcp] 1.18 KB, patch	jesup : review+	Details \| Diff \| Splinter Review
Fix ShmemPool logic on buffer reallocation 9 years ago Gian-Carlo Pascutto [:gcp] 2.02 KB, patch		Details \| Diff \| Splinter Review
Detect message races in Mach Shmem implementation 9 years ago Gian-Carlo Pascutto [:gcp] 3.50 KB, patch	blassey : review-	Details \| Diff \| Splinter Review
Fix ShmemPool detection of failed allocations 9 years ago Gian-Carlo Pascutto [:gcp] 5.92 KB, patch	jesup : review+	Details \| Diff \| Splinter Review
Detect message races in Mach Shmem implementation 9 years ago Gian-Carlo Pascutto [:gcp] 3.79 KB, patch	blassey : review+	Details \| Diff \| Splinter Review

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Reporter

Description

•

9 years ago

Since Bug 1104616, this MOZ_ASSERT is making it difficult for me to debug GetUserMedia running in e10s. Any breakpoint set and hit in the content process while a gUM stream is active, is immediately followed by this assert in the chrome process, crashing the chrome process:

> IPDL Background (40)#0	0x0000000104c17fe7 in mozilla::ShmemBuffer mozilla::ShmemPool::Get<mozilla::camera::CamerasParent>(mozilla::camera::CamerasParent*, unsigned long) at /Users/Jan/moz/mozilla-central/dom/media/systemservices/ShmemPool.cpp:83
> #1	0x0000000104c0900d in mozilla::camera::CamerasParent::DeliverFrameOverIPC(mozilla::camera::CaptureEngine, int, mozilla::ShmemBuffer, unsigned char*, int, unsigned int, long long, long long) at /Users/Jan/moz/mozilla-central/dom/media/systemservices/CamerasParent.cpp:161
> #2	0x0000000104c299ec in mozilla::camera::DeliverFrameRunnable::Run() at /Users/Jan/moz/mozilla-central/dom/media/systemservices/CamerasParent.cpp:117
> #3	0x000000010173efb3 in nsThread::ProcessNextEvent(bool, bool*) at /Users/Jan/moz/mozilla-central/xpcom/threads/nsThread.cpp:874
> #4	0x00000001017c1787 in NS_ProcessNextEvent(nsIThread*, bool) at /Users/Jan/moz/mozilla-central/xpcom/glue/nsThreadUtils.cpp:277
> #5	0x0000000101e4f6d1 in mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) at /Users/Jan/moz/mozilla-central/ipc/glue/MessagePump.cpp:355
> #6	0x0000000101db93f5 in MessageLoop::RunInternal() at /Users/Jan/moz/mozilla-central/ipc/chromium/src/base/message_loop.cc:234
> #7	0x0000000101db9305 in MessageLoop::RunHandler() at /Users/Jan/moz/mozilla-central/ipc/chromium/src/base/message_loop.cc:227
> #8	0x0000000101db92ad in MessageLoop::Run() at /Users/Jan/moz/mozilla-central/ipc/chromium/src/base/message_loop.cc:201
> #9	0x000000010173d316 in nsThread::ThreadFunc(void*) at /Users/Jan/moz/mozilla-central/xpcom/threads/nsThread.cpp:363
> #10	0x00000001013807ef in _pt_root at /Users/Jan/moz/mozilla-central/nsprpub/pr/src/pthreads/ptthread.c:212
> #11	0x00007fff8dfd5899 in _pthread_body ()
> #12	0x00007fff8dfd572a in _pthread_start ()
> #13	0x00007fff8dfd9fc9 in thread_start ()

Changing the code to log and fail instead seems tolerated by the calling code.

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Reporter

Comment 1

•

9 years ago

Attached file MozReview Request: Bug 1205164 - remove MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool, as it interfered w/debugging in e10s. — Details

Bug 1205164 - remove MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool, as it interfered w/debugging in e10s.

Attachment #8661614 - Flags: review?(gpascutto)

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 2

•

9 years ago

Comment on attachment 8661614 [details]
MozReview Request: Bug 1205164 - remove MOZ_ASSERT(Shmem.IsWritable()) in ShmemPool, as it interfered w/debugging in e10s.

https://reviewboard.mozilla.org/r/19415/#review17385

This looks like a genuine bug that should be investigated. Hiding the assert sounds like a very bad idea.

Attachment #8661614 - Flags: review?(gpascutto)

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 3

•

9 years ago

https://reviewboard.mozilla.org/r/19415/#review17387

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Reporter

Comment 4

•

9 years ago

When I say "crashes" I just mean that the assert hits in debug build, which is hard to get out of in my debug work flow. If the content process is blocked and the chrome process is running, I imagine it's hard for DeliverFrameOverIPC to keep doing its job.

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 5

•

9 years ago

I understand but you should still not hit this assert, AFAIK. If the two processes get desynced due to load they should drop frames, not get into an unknown/unwanted state.

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Reporter

Updated

•

9 years ago

Blocks: 912342

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 6

•

9 years ago

Attached patch Make sure ShmemPool high water mark is logged in all allocators — Details — Splinter Review

Not a fix for this bug, but noticed this logging was missing.

Attachment #8662427 - Flags: review?(rjesup)

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 7

•

9 years ago

FWIW this did not reproduce for me on Linux. Will try on Mac.

Brad Lassey [:blassey] (use needinfo?)

Updated

•

9 years ago

tracking-e10s: --- → +

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 8

•

9 years ago

Attached patch Fix ShmemPool logic on buffer reallocation (obsolete) — Details — Splinter Review

Randell Jesup [:jesup] (needinfo me)

Updated

•

9 years ago

Attachment #8662427 - Flags: review?(rjesup) → review+

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Reporter

Comment 9

•

9 years ago

Note that as mentioned in irc, this patch trades an assert for plain green output from getUserMedia past the point where the assert used to hit (which is actually ok for most debugging). Again, this only seems to happen when the content process is paused in the debugger.

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Reporter

Updated

•

9 years ago

No longer blocks: 912342

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 10

•

9 years ago

(In reply to Jan-Ivar Bruaroey [:jib] from comment #9)
> Note that as mentioned in irc, this patch trades an assert for plain green
> output from getUserMedia past the point where the assert used to hit (which
> is actually ok for most debugging). 

Still need to understand the underlying cause for the memory allocation failing. It's not obvious to my (why) this should fail in chrome because content is blocked.

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 11

•

9 years ago

Confirmed on Mac OS X. Sending SIGSTOP to plugin-container while capture is running causes an instant failure of Shmem allocation. So this is platform specific.

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 12

•

9 years ago

Relevant log:

[Parent 22005] WARNING: didn't get an id (ipc/rcv) timed out 10004003
: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 591

So this means that Shmem allocation does need cross-process comms, at least on Mac/mach. If that is true then the patch here to just return failure seems OK.

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 13

•

9 years ago

I guess it might be worth seeing if we can recover gracefully from this, i.e. allow the video to get unstuck if the debugger continues.

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 14

•

9 years ago

So the original code should already recover from this. Frames are dropped until Shmem allocation starts working again. Because some frame deliveries can be queued up, we might temporarily use more of the ShmemPool, but debugging suggests this part normalizes quickly and goes back to 1 Shmem in use. 

I don't really see anything wrong with the code here that would explain the green frames. Is it possible these are generated elsewhere if frame delivery stops for a certain period if time?

I guess I'll have to debug in MediaStreamGraph next?

Flags: needinfo?(rjesup)

Randell Jesup [:jesup] (needinfo me)

Comment 15

•

9 years ago

I'd check that if the shmem alloc fails that it's not passing a frame with an empty set of data.  This *smells* like something returning success when it actually failed.

Flags: needinfo?(rjesup)

Randell Jesup [:jesup] (needinfo me)

Comment 16

•

9 years ago

Brad - IIRC, you updated the Mac Shmem implementation to use Mac-native methods; I imagine this is part of why it reacts differently here.  Are there any other implications of this behavior to shmem/IPC?

Flags: needinfo?(blassey.bugs)

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 17

•

9 years ago

>Brad - IIRC, you updated the Mac Shmem implementation to use Mac-native methods; I imagine this is part of 
>why it reacts differently here.

This is already covered in the preceding comments (comment 10-12): Mac will fail to alloc Shmem if one of the two processes is blocked, by design. But as far as I can tell (with the patches here) the code handles and recovers from this by just stopping frame delivery until the allocations start succeeding again (comment 13-14).

I can double-check if the frame data really makes it across after the recovery but I haven't seen any indication that that is the issue.

Flags: needinfo?(blassey.bugs)

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 18

•

9 years ago

So the data going into the shmem is the image data, and the data coming out is...zeroes.

A bit more experimenting seems to indicate this goes away if ShmemPool is restricted to 1 buffer. Investigating further.

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 19

•

9 years ago

From debugging, my current suspicion is that the Shmem that was active at the moment the child gets stopped/resumed can silently get corrupted and be unusable, without this being visible through the Shmem API.

For example, we will log this failure:
[Child 76438] WARNING: SendMessage failed 0x10000004 (ipc/send) timed out
: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 273

But I see no evidence in the C++ code that that error is actually handled.

From my logging, the exact failure case is when the child was just resumed, Shmem allocation in the parent will succeed, but the resulting Shmem segment will always show up empty in the child. That would correlate with the above error, which can only happen when the child is running (again, i.e. resumed).

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 20

•

9 years ago

Hmm, the code above does use a send/receive combo, so failure to send the reply in the Child would be perceived as a failure to receive in the parent, which is handled.

That said, there appears to be no numbering of message, the validity of the data is checked by

  if (msg.GetDataLength() != sizeof(mach_port_t)) {

So there seems ample opportunity for this to go wrong.

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 21

•

9 years ago

Confirmed that the problem is that the Mac Shmem implementation doesn't number its messages, causing it to get confused if the replies don't come in the order it expects (or if messages are dropped):

[Parent 8808] WARNING: Serials do not match up 1 vs 9: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 610
[Parent 8808] WARNING: Serials do not match up 2 vs 10: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 610
[Parent 8808] WARNING: Serials do not match up 3 vs 11: file /Users/morbo/hg/mozilla-central/ipc/glue/SharedMemoryBasic_mach.cpp, line 610

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 22

•

9 years ago

Attached patch Detect message races in Mach Shmem implementation (obsolete) — Details — Splinter Review

Attachment #8664283 - Flags: review?(blassey.bugs)

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 23

•

9 years ago

Attached patch Fix ShmemPool detection of failed allocations — Details — Splinter Review

Attachment #8664284 - Flags: review?(rjesup)

Gian-Carlo Pascutto [:gcp]

Assignee

Updated

•

9 years ago

Attachment #8662467 - Attachment is obsolete: true

Randell Jesup [:jesup] (needinfo me)

Updated

•

9 years ago

Attachment #8664284 - Flags: review?(rjesup) → review+

Brad Lassey [:blassey] (use needinfo?)

Comment 24

•

9 years ago

Comment on attachment 8664283 [details] [diff] [review]
Detect message races in Mach Shmem implementation

Review of attachment 8664283 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/SharedMemoryBasic_mach.cpp
@@ +271,5 @@
> +  // Construct the reply message, echoing the serial, and
> +  // adding the port
> +  unsigned char msgbuff[sizeof(port) + sizeof(uint64_t)];
> +  memcpy(msgbuff, &port, sizeof(port));
> +  memcpy(msgbuff + sizeof(port), serial, sizeof(uint64_t));

let's use a struct rather than this memcpy stuff.

Attachment #8664283 - Flags: review?(blassey.bugs) → review-

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 25

•

9 years ago

Attached patch Detect message races in Mach Shmem implementation — Details — Splinter Review

Attachment #8665402 - Flags: review?(blassey.bugs)

Gian-Carlo Pascutto [:gcp]

Assignee

Updated

•

9 years ago

Attachment #8664283 - Attachment is obsolete: true

Brad Lassey [:blassey] (use needinfo?)

Updated

•

9 years ago

Attachment #8665402 - Flags: review?(blassey.bugs) → review+

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 26

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=99ccced4cd52

Gian-Carlo Pascutto [:gcp]

Assignee

Comment 27

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/4cec5429fa92b68a0dcd39486898ffe9642583b5
Bug 1205164 - Make sure ShmemPool high water mark is logged in all allocators. r=jesup

https://hg.mozilla.org/integration/mozilla-inbound/rev/5c1d38c87429a9725a2b8fc6dda2a1c8332c9490
Bug 1205164 - Fix ShmemPool detection of failed allocations. r=jesup

https://hg.mozilla.org/integration/mozilla-inbound/rev/616b45c16382a11122535a2be35c6f6356d3e8a5
Bug 1205164 - Detect message races in Mach Shmem implementation. r=blassey

Carsten Book [:Tomcat]

Comment 28

•

9 years ago

https://hg.mozilla.org/mozilla-central/rev/4cec5429fa92
https://hg.mozilla.org/mozilla-central/rev/5c1d38c87429
https://hg.mozilla.org/mozilla-central/rev/616b45c16382

Status: NEW → RESOLVED

Closed: 9 years ago

status-firefox44: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla44

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Reporter

Updated

•

9 years ago

Assignee: nobody → gpascutto

You need to log in before you can comment on or make changes to this bug.