Open Bug 1535335 Opened 6 months ago Updated 3 months ago

Crash in [@ mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::PRemoteDecoderManagerChild::SendPRemoteDecoderConstructor]

Categories

(Core :: Audio/Video: Playback, defect, P1, critical)

Unspecified
macOS
defect

Tracking


Tracking Status
firefox66 --- unaffected
firefox67 --- fix-optional
firefox68 - fix-optional

People

(Reporter: marcia, Assigned: mjf)

References

Details

(Keywords: crash, regression)

Crash Data

This bug is for crash report bp-69fa5ed6-217f-46ad-9a83-0c1d90190304.

Seen while looking at 67 nightly crash stats: https://bit.ly/2T8PmGZ. These crashes seem to have started in build 20190222081112 and are predominantly on Mac, although there is one Windows crash.

The Moz Crash reason for all is: IPDL error: "constructor for actor failed". abort()ing as a result.

(100.0% in signature vs 00.26% overall) moz_crash_reason = IPDL error: "constructor for actor failed". abort()ing as a result.
(100.0% in signature vs 02.96% overall) Module "libplugin_child_interpose.dylib" = true
(100.0% in signature vs 03.17% overall) Module "libsandbox.1.dylib" = true
(100.0% in signature vs 03.48% overall) reason = EXC_BAD_ACCESS / KERN_INVALID_ADDRESS

Top 10 frames of crashing thread:

0 XUL MOZ_Crash mfbt/Assertions.h:314
1 XUL mozilla::ipc::FatalError ipc/glue/ProtocolUtils.cpp:264
2 XUL mozilla::ipc::IProtocol::HandleFatalError const ipc/glue/ProtocolUtils.cpp:440
3 XUL mozilla::PRemoteDecoderManagerChild::SendPRemoteDecoderConstructor ipc/glue/ProtocolUtils.cpp:431
4 XUL mozilla::RemoteVideoDecoderChild::InitIPDL dom/media/ipc/RemoteVideoDecoder.cpp:130
5 XUL mozilla::detail::RunnableFunction<mozilla::RemoteDecoderModule::CreateVideoDecoder dom/media/ipc/RemoteDecoderModule.cpp:112
6 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1162
7 XUL NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:474
8 XUL mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:303
9 XUL nsThread::ThreadFunc ipc/chromium/src/base/message_loop.cc:315

could these crashes be related to bug 1500596?

Flags: needinfo?(mfroman)

This crash signature looks a lot like the one in Bug 1529020, which is related to Quick Heal security software breaking our IPC connections. I wonder if we're seeing something similar on OS X as well?

Flags: needinfo?(mfroman)
Priority: -- → P1

Nils, can you please find an assignee for this P1 and also inform if this is on target for a fix in 67?

Flags: needinfo?(drno)

Nils is on PTO for the next week, in case we do not hear from him sooner. Michael, is this something you have familiarity with and/or know someone who could investigate?

Flags: needinfo?(mfroman)

I'll take a look to see if I can see anything that might be causing this.

Flags: needinfo?(mfroman)

I'm assigning myself.

Assignee: nobody → mfroman
Flags: needinfo?(drno)

[Tracking Requested - why for this release]: fairly frequent crash on OSX

Any update here Michael?

Flags: needinfo?(mfroman)

I'm trying, but since I can't repro here, I'm mostly taking guesses on fixes. I tried a fix in Bug 1539030 hoping it would help, but according to crash reports it did not. I'll continue poring over code hoping to find something to go on.

Flags: needinfo?(mfroman)
See Also: → 1540288

Michael, any updates on this?

Flags: needinfo?(mfroman)

No, not really. Jean-Yves and I talked last night, and he came up with one change I'll be making today or tomorrow in the hopes that it helps. If the proposed change doesn't fix the issue, I'm at a loss without STR.

I will add that it appears to be a very early RDD startup crash, or an IPC channel failure. All the crash reports have 'IPDL error: "constructor for actor failed"' in them. That message comes from 3 places, [1], [2], [3]. [2] and [3] would be errors from deserializing IPDL params, which seems unlikely. [1] results from an issue sending the message PRemoteDecoderManager::Msg_PRemoteDecoderConstructor. This seems most likely. It also points to an early startup crash, but I've had zero luck reproducing it locally on Win, macOS, or Linux.

[1] https://searchfox.org/mozilla-central/source/__GENERATED__/ipc/ipdl/PRemoteDecoderManagerChild.cpp#155
[2] https://searchfox.org/mozilla-central/source/__GENERATED__/ipc/ipdl/PRemoteDecoderManagerChild.cpp#167
[3] https://searchfox.org/mozilla-central/source/__GENERATED__/ipc/ipdl/PRemoteDecoderManagerChild.cpp#178

Flags: needinfo?(mfroman)

I can add some specific youtube URLs and have QA give it a try if that helps. One comment says "This bug occurs only with few Youtube URLs same behavior when refreshing... "

Strong correlation on MacOS to 10.11 - (93.64% in signature vs 00.95% overall) platform_pretty_version = OS X 10.11 [99.62% vs 25.23% if platform = Mac OS X].

(In reply to Marcia Knous [:marcia - needinfo? me] from comment #12)

I can add some specific youtube URLs and have QA give it a try if that helps. One comment says "This bug occurs only with few Youtube URLs same behavior when refreshing... "

Strong correlation on MacOS to 10.11 - (93.64% in signature vs 00.95% overall) platform_pretty_version = OS X 10.11 [99.62% vs 25.23% if platform = Mac OS X].

The recent spike is caused by bug 1525086 and bug 1540288 flipped the pref (security.sandbox.rdd.mac.earlyinit) to false to disable the feature for now. 1540288 landed 18 hours ago, so we should see that spike come back down quickly. Then we're back to the unknown cause.

It will almost certainly be a YouTube link for Nightly, since that is the most likely (maybe only) place to get AV1 content, which is the only time the RDD process is launched and used. If someone in QA can reliably reproduce, I can make a build with lots of startup logging on the RDD process to help us focus on the cause.

These crashes seem to have stopped after buildid 20190402083512. I'll continue to monitor for a week or so.

The spike was nightly-only and on beta the number of crashes after beta 6 went down to a trickle. Adjusting 67 flags accordingly.

(In reply to Michael Froman [:mjf] from comment #14)

These crashes seem to have stopped after buildid 20190402083512. I'll continue to monitor for a week or so.

Builds after 20190402083512 did not completely eliminate the crash in Nightly. I see a crash on Linux for 20190409095652 that happened yesterday. Frequency is way down, but unfortunately not gone.

I just hit a crash that looks similar and immediately after restarting the crashing tab, I got the page telling me I need to restart the browser. I wonder if this is similar to the crashes fixed by bug 1366808 and if we need a similar fix for the RDD process. The issue bug 1366808 fixed was that the child process executable can be updated while the browser is still running and then when the parent launches a new child process, it has a different build ID compared to the long running parent process.

(In reply to Haik Aftandilian [:haik] from comment #17)

I just hit a crash that looks similar and immediately after restarting the crashing tab, I got the page telling me I need to restart the browser. I wonder if this is similar to the crashes fixed by bug 1366808 and if we need a similar fix for the RDD process. The issue bug 1366808 fixed was that the child process executable can be updated while the browser is still running and then when the parent launches a new child process, it has a different build ID compared to the long running parent process.

I see we do have the call to channel->SendBuildIDsMatchMessage() in RDDParent::Init() to check for build ID mismatches. But this was the first time I've hit the RDD crash, and I also very rarely see the page prompting me to restart the browser due to build ID mismatches. So I wonder if the crashes relate to updates.

My crash (apparently with an erroneous top stack frame) is crash report d81ecd2c-0646-4dcc-9449-94f0f0190415

(In reply to Haik Aftandilian [:haik] from comment #17)

I just hit a crash that looks similar and immediately after restarting the crashing tab, I got the page telling me I need to restart the browser. I wonder if this is similar to the crashes fixed by bug 1366808 and if we need a similar fix for the RDD process. The issue bug 1366808 fixed was that the child process executable can be updated while the browser is still running and then when the parent launches a new child process, it has a different build ID compared to the long running parent process.

That would happen very early in the RDD launch, correct? I ask because today I was playing with a particular mochitest that I hacked down to a manageable size and was able to cause the crash signature above at will. For my case, it is definitely a race: the RDD process crashes after the initial IPC setup has succeeded, but another media element is already asking for a decoder, so it reaches RemoteVideoDecoderChild::InitIPDL and the call to manager->SendPRemoteDecoderConstructor[1] before the RemoteDecoderManagerChild::ActorDestroy call arrives, which would set mCanSend to false[2].

[1] https://searchfox.org/mozilla-central/source/dom/media/ipc/RemoteVideoDecoder.cpp#129
[2] https://searchfox.org/mozilla-central/source/dom/media/ipc/RemoteDecoderManagerChild.cpp#131

Interestingly, after a conversation with Haik, I tried forcing KillHard in RDDProcessHost::InitAfterConnect, which also produces the case where we have open IPC channels, the process crashes, but the ActorDestroy call hasn't arrived before we service RemoteVideoDecoderChild::InitIPDL.

In particular, the place that could cause that to happen in the wild is for RDDChild::Init to return false, which only happens if there is something awry with linux sandboxing here[1].

[1] https://searchfox.org/mozilla-central/source/dom/media/ipc/RDDChild.cpp#43

(In reply to Michael Froman [:mjf] from comment #20)

In particular, the place that could cause that to happen in the wild is for RDDChild::Init to return false, which only happens if there is something awry with linux sandboxing here[1].

[1] https://searchfox.org/mozilla-central/source/dom/media/ipc/RDDChild.cpp#43

That should only happen in case of fd exhaustion or failure to create a thread (see the SandboxBroker constructor), which I'd expect to cause other problems as well.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #21)

(In reply to Michael Froman [:mjf] from comment #20)

In particular, the place that could cause that to happen in the wild is for RDDChild::Init to return false, which only happens if there is something awry with linux sandboxing here[1].

[1] https://searchfox.org/mozilla-central/source/dom/media/ipc/RDDChild.cpp#43

That should only happen in case of fd exhaustion or failure to create a thread (see the SandboxBroker constructor), which I'd expect to cause other problems as well.

I agree completely. The crashes have been low volume on Linux, so I think that likely supports your statement. I've never seen this crash happen in the wild myself. This may just be a canary for other problems.

Debian Testing, media.rdd-vorbis.enabled;true
I was affected by this crash.

bp-7c94ee63-de32-4f8b-a5a1-8f4eb0190407 07.04.19, 23:07 = bug 1547768 (also libc, but about vorbis)
bp-7bb6d3d9-49c0-4948-a827-06ba40190328 28.03.19, 12:22
bp-0b6da8ee-d585-431f-9e32-ad05d0190328 28.03.19, 12:22 <this bug>
bp-455c8b0e-dd14-4258-80f8-4a2620190327 27.03.19, 10:30 <this bug>
bp-c9847f5d-44ce-4773-b721-5e9fd0190327 27.03.19, 10:30
bp-c85c4cdd-ecd7-4b04-b3b1-6f0a90190325 25.03.19, 17:28 <this bug>
bp-e1f3a275-a31b-4ee6-b2fc-711710190325 25.03.19, 17:28

(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #23)

Debian Testing, media.rdd-vorbis.enabled;true
I was affected by this crash.

bp-7c94ee63-de32-4f8b-a5a1-8f4eb0190407 07.04.19, 23:07 = bug 1547768 (also libc, but about vorbis)
bp-7bb6d3d9-49c0-4948-a827-06ba40190328 28.03.19, 12:22
bp-0b6da8ee-d585-431f-9e32-ad05d0190328 28.03.19, 12:22 <this bug>
bp-455c8b0e-dd14-4258-80f8-4a2620190327 27.03.19, 10:30 <this bug>
bp-c9847f5d-44ce-4773-b721-5e9fd0190327 27.03.19, 10:30
bp-c85c4cdd-ecd7-4b04-b3b1-6f0a90190325 25.03.19, 17:28 <this bug>
bp-e1f3a275-a31b-4ee6-b2fc-711710190325 25.03.19, 17:28

You are correct - Vorbis isn't ready for RDD decoding yet. There are sandboxing issues that remain to be fixed for Vorbis, so we haven't enabled that pref by default. Thank you for the report!

(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #23)

bp-7c94ee63-de32-4f8b-a5a1-8f4eb0190407 07.04.19, 23:07 = bug 1547768 (also libc, but about vorbis)

Bug 1543858.

bp-7bb6d3d9-49c0-4948-a827-06ba40190328 28.03.19, 12:22
bp-c9847f5d-44ce-4773-b721-5e9fd0190327 27.03.19, 10:30
bp-e1f3a275-a31b-4ee6-b2fc-711710190325 25.03.19, 17:28

Probably bug 1536127.

As of 2019/05/16, all nightly crashes in the last 14 days have build dates of 2019/03/31 or earlier. I think the remaining 67 crashes we're seeing are most likely a result of dav1d crashes during init that have been fixed in 68, but were not approved for uplift. Comment 19 above gives details on this particular scenario.

Bug 1534882 may help with fixing this.

I think Bug 1550771 will fix the remaining low volume macOS crashes. In the last two weeks, there has been one macOS crash w/ a 4/22 build date:
https://crash-stats.mozilla.org/report/index/d7d9e0e0-f48a-487f-aeb2-cfc3d0190516

When I look at that crash report, there are 2 threads (Threads 2 and 43) both waiting in PortServerThread on WaitForMessage.

Otherwise, in the last 2 weeks, the latest build id I see for a Linux crash is 3/26, and no Win crashes.

This is now very low volume so I don't think we need to keep tracking it for 68.
