Open Bug 1535335 Opened 6 months ago Updated 3 months ago

Crash in [@ mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::PRemoteDecoderManagerChild::SendPRemoteDecoderConstructor]

Categories

(Core :: Audio/Video: Playback, defect, P1, critical)

Unspecified
macOS
defect

Tracking


Tracking Status
firefox66 --- unaffected
firefox67 --- fix-optional
firefox68 - fix-optional

People

(Reporter: marcia, Assigned: mjf)

References

Details

(Keywords: crash, regression)

Crash Data

This bug is for crash report bp-69fa5ed6-217f-46ad-9a83-0c1d90190304.

Seen while looking at 67 nightly crash stats: https://bit.ly/2T8PmGZ. These crashes seem to have started in build 20190222081112 and are predominantly on Mac, although there is one Windows crash.

The Moz Crash reason for all is: IPDL error: "constructor for actor failed". abort()ing as a result.

(100.0% in signature vs 00.26% overall) moz_crash_reason = IPDL error: "constructor for actor failed". abort()ing as a result.
(100.0% in signature vs 02.96% overall) Module "libplugin_child_interpose.dylib" = true
(100.0% in signature vs 03.17% overall) Module "libsandbox.1.dylib" = true
(100.0% in signature vs 03.48% overall) reason = EXC_BAD_ACCESS / KERN_INVALID_ADDRESS

Top 10 frames of crashing thread:

0 XUL MOZ_Crash mfbt/Assertions.h:314
1 XUL mozilla::ipc::FatalError ipc/glue/ProtocolUtils.cpp:264
2 XUL mozilla::ipc::IProtocol::HandleFatalError const ipc/glue/ProtocolUtils.cpp:440
3 XUL mozilla::PRemoteDecoderManagerChild::SendPRemoteDecoderConstructor ipc/glue/ProtocolUtils.cpp:431
4 XUL mozilla::RemoteVideoDecoderChild::InitIPDL dom/media/ipc/RemoteVideoDecoder.cpp:130
5 XUL mozilla::detail::RunnableFunction<mozilla::RemoteDecoderModule::CreateVideoDecoder dom/media/ipc/RemoteDecoderModule.cpp:112
6 XUL nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1162
7 XUL NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:474
8 XUL mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:303
9 XUL nsThread::ThreadFunc ipc/chromium/src/base/message_loop.cc:315

could these crashes be related to bug 1500596?

Flags: needinfo?(mfroman)

This crash signature looks a lot like the one in Bug 1529020, which is related to Quick Heal security software breaking our IPC connections. I wonder if we're seeing something similar on OS X as well?

Flags: needinfo?(mfroman)
Priority: -- → P1

Nils, can you please find an assignee for this P1 and also inform if this is on target for a fix in 67?

Flags: needinfo?(drno)

Nils is on PTO for the next week, in case we do not hear from him sooner. Michael, is this something you have familiarity with and/or know someone who could investigate?

Flags: needinfo?(mfroman)

I'll take a look to see if I can see anything that might be causing this.

Flags: needinfo?(mfroman)

I'm assigning myself.

Assignee: nobody → mfroman
Flags: needinfo?(drno)

[Tracking Requested - why for this release]: fairly frequent crash on OSX

Any update here Michael?

Flags: needinfo?(mfroman)

I'm trying, but since I can't repro here, I'm mostly taking guesses on fixes. I tried a fix in Bug 1539030 hoping it would help, but according to crash reports it did not. I'll continue poring over code hoping to find something to go on.

Flags: needinfo?(mfroman)
See Also: → 1540288

Michael, any updates on this?

Flags: needinfo?(mfroman)

No, not really. Jean-Yves and I talked last night, and he came up with one change I'll be making today or tomorrow in the hopes that it helps. If the proposed change doesn't fix the issue, I'm at a loss without STR.

I will add that it appears to be a very early RDD startup crash, or an IPC channel failure. All the crash reports have 'IPDL error: "constructor for actor failed"' in them. That message comes from 3 places, [1], [2], [3]. [2] and [3] would be errors from deserializing IPDL params, which seems unlikely. [1] results from an issue sending the message PRemoteDecoderManager::Msg_PRemoteDecoderConstructor. This seems most likely. It also points to an early startup crash, but I've had zero luck reproducing it locally on Win, macOS, or Linux.

[1] https://searchfox.org/mozilla-central/source/__GENERATED__/ipc/ipdl/PRemoteDecoderManagerChild.cpp#155
[2] https://searchfox.org/mozilla-central/source/__GENERATED__/ipc/ipdl/PRemoteDecoderManagerChild.cpp#167
[3] https://searchfox.org/mozilla-central/source/__GENERATED__/ipc/ipdl/PRemoteDecoderManagerChild.cpp#178

Flags: needinfo?(mfroman)

I can add some specific youtube URLs and have QA give it a try if that helps. One comment says "This bug occurs only with few Youtube URLs same behavior when refreshing... "

Strong correlation on MacOS to 10.11 - (93.64% in signature vs 00.95% overall) platform_pretty_version = OS X 10.11 [99.62% vs 25.23% if platform = Mac OS X].

(In reply to Marcia Knous [:marcia - needinfo? me] from comment #12)

I can add some specific youtube URLs and have QA give it a try if that helps. One comment says "This bug occurs only with few Youtube URLs same behavior when refreshing... "

Strong correlation on MacOS to 10.11 - (93.64% in signature vs 00.95% overall) platform_pretty_version = OS X 10.11 [99.62% vs 25.23% if platform = Mac OS X].

The recent spike is caused by bug 1525086 and bug 1540288 flipped the pref (security.sandbox.rdd.mac.earlyinit) to false to disable the feature for now. 1540288 landed 18 hours ago, so we should see that spike come back down quickly. Then we're back to the unknown cause.

It will almost certainly be a YouTube link for Nightly, since that is the most likely (maybe only) place to get AV1 content, which is the only time the RDD process is launched and used. If someone in QA can reliably reproduce, I can make a build with lots of startup logging on the RDD process to help us focus on the cause.

These crashes seem to have stopped after buildid 20190402083512. I'll continue to monitor for a week or so.

The spike was nightly-only and on beta the number of crashes after beta 6 went down to a trickle. Adjusting 67 flags accordingly.

(In reply to Michael Froman [:mjf] from comment #14)

These crashes seem to have stopped after buildid 20190402083512. I'll continue to monitor for a week or so.

Builds after 20190402083512 did not completely eliminate the crash in Nightly. I see a crash on Linux for 20190409095652 that happened yesterday. Frequency is way down, but unfortunately not gone.

I just hit a crash that looks similar and immediately after restarting the crashing tab, I got the page telling me I need to restart the browser. I wonder if this is similar to the crashes fixed by bug 1366808 and if we need a similar fix for the RDD process. The issue bug 1366808 fixed was that the child process executable can be updated while the browser is still running and then when the parent launches a new child process, it has a different build ID compared to the long running parent process.

(In reply to Haik Aftandilian [:haik] from comment #17)

I just hit a crash that looks similar and immediately after restarting the crashing tab, I got the page telling me I need to restart the browser. I wonder if this is similar to the crashes fixed by bug 1366808 and if we need a similar fix for the RDD process. The issue bug 1366808 fixed was that the child process executable can be updated while the browser is still running and then when the parent launches a new child process, it has a different build ID compared to the long running parent process.

I see we do have the call to channel->SendBuildIDsMatchMessage() in RDDParent::Init() to check for build ID mismatches. But this was the first time I've hit the RDD crash, and I also very rarely see the page prompting me to restart the browser due to build ID mismatches. So I wonder if the crashes relate to updates.

My crash (apparently with an erroneous top stack frame) is crash report d81ecd2c-0646-4dcc-9449-94f0f0190415

(In reply to Haik Aftandilian [:haik] from comment #17)

I just hit a crash that looks similar and immediately after restarting the crashing tab, I got the page telling me I need to restart the browser. I wonder if this is similar to the crashes fixed by bug 1366808 and if we need a similar fix for the RDD process. The issue bug 1366808 fixed was that the child process executable can be updated while the browser is still running and then when the parent launches a new child process, it has a different build ID compared to the long running parent process.

That would happen very early in the RDD launch, correct? I ask because today I was playing with a particular mochitest that I hacked down to a manageable size and was able to cause the crash signature above at will. For my case, it is definitely a race: the RDD process crashes after the initial IPC setup has succeeded, but another media element is already asking for a decoder, so it reaches RemoteVideoDecoderChild::InitIPDL and the call to manager->SendPRemoteDecoderConstructor[1] before the RemoteDecoderManagerChild::ActorDestroy call arrives, which would set mCanSend to false[2].

[1] https://searchfox.org/mozilla-central/source/dom/media/ipc/RemoteVideoDecoder.cpp#129
[2] https://searchfox.org/mozilla-central/source/dom/media/ipc/RemoteDecoderManagerChild.cpp#131

Interestingly, after a conversation with Haik, I tried forcing KillHard in RDDProcessHost::InitAfterConnect, which also produces the case where we have open IPC channels, the process crashes, but the ActorDestroy call hasn't arrived before we service RemoteVideoDecoderChild::InitIPDL.

In particular, the place that could cause that to happen in the wild is for RDDChild::Init to return false, which only happens if there is something awry with linux sandboxing here[1].

[1] https://searchfox.org/mozilla-central/source/dom/media/ipc/RDDChild.cpp#43

(In reply to Michael Froman [:mjf] from comment #20)

In particular, the place that could cause that to happen in the wild is for RDDChild::Init to return false, which only happens if there is something awry with linux sandboxing here[1].

[1] https://searchfox.org/mozilla-central/source/dom/media/ipc/RDDChild.cpp#43

That should only happen in case of fd exhaustion or failure to create a thread (see the SandboxBroker constructor), which I'd expect to cause other problems as well.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #21)

(In reply to Michael Froman [:mjf] from comment #20)

In particular, the place that could cause that to happen in the wild is for RDDChild::Init to return false, which only happens if there is something awry with linux sandboxing here[1].

[1] https://searchfox.org/mozilla-central/source/dom/media/ipc/RDDChild.cpp#43

That should only happen in case of fd exhaustion or failure to create a thread (see the SandboxBroker constructor), which I'd expect to cause other problems as well.

I agree completely. The crashes have been low volume on Linux, so I think that likely supports your statement. I've never seen this crash happen in the wild myself. This may just be a canary for other problems.

Debian Testing, media.rdd-vorbis.enabled;true
I was affected by this crash.

bp-7c94ee63-de32-4f8b-a5a1-8f4eb0190407 07.04.19, 23:07 = bug 1547768 (also libc, but about vorbis)
bp-7bb6d3d9-49c0-4948-a827-06ba40190328 28.03.19, 12:22
bp-0b6da8ee-d585-431f-9e32-ad05d0190328 28.03.19, 12:22 <this bug>
bp-455c8b0e-dd14-4258-80f8-4a2620190327 27.03.19, 10:30 <this bug>
bp-c9847f5d-44ce-4773-b721-5e9fd0190327 27.03.19, 10:30
bp-c85c4cdd-ecd7-4b04-b3b1-6f0a90190325 25.03.19, 17:28 <this bug>
bp-e1f3a275-a31b-4ee6-b2fc-711710190325 25.03.19, 17:28

(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #23)

Debian Testing, media.rdd-vorbis.enabled;true
I was affected by this crash.

bp-7c94ee63-de32-4f8b-a5a1-8f4eb0190407 07.04.19, 23:07 = bug 1547768 (also libc, but about vorbis)
bp-7bb6d3d9-49c0-4948-a827-06ba40190328 28.03.19, 12:22
bp-0b6da8ee-d585-431f-9e32-ad05d0190328 28.03.19, 12:22 <this bug>
bp-455c8b0e-dd14-4258-80f8-4a2620190327 27.03.19, 10:30 <this bug>
bp-c9847f5d-44ce-4773-b721-5e9fd0190327 27.03.19, 10:30
bp-c85c4cdd-ecd7-4b04-b3b1-6f0a90190325 25.03.19, 17:28 <this bug>
bp-e1f3a275-a31b-4ee6-b2fc-711710190325 25.03.19, 17:28

You are correct - Vorbis isn't ready for RDD decoding yet. There are sandboxing issues that remain to be fixed for Vorbis, so we haven't enabled that pref by default. Thank you for the report!

(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #23)

bp-7c94ee63-de32-4f8b-a5a1-8f4eb0190407 07.04.19, 23:07 = bug 1547768 (also libc, but about vorbis)

Bug 1543858.

bp-7bb6d3d9-49c0-4948-a827-06ba40190328 28.03.19, 12:22
bp-c9847f5d-44ce-4773-b721-5e9fd0190327 27.03.19, 10:30
bp-e1f3a275-a31b-4ee6-b2fc-711710190325 25.03.19, 17:28

Probably bug 1536127.

As of 2019/05/16, all nightly crashes in the last 14 days have build dates of 2019/03/31 or earlier. I think the remaining 67 crashes we're seeing are most likely a result of dav1d crashes during init that have been fixed in 68, but were not approved for uplift. Comment 19 above gives details on this particular scenario.

Bug 1534882 may help with fixing this.

I think Bug 1550771 will fix the remaining low volume macOS crashes. In the last two weeks, there has been one macOS crash w/ a 4/22 build date:
https://crash-stats.mozilla.org/report/index/d7d9e0e0-f48a-487f-aeb2-cfc3d0190516

When I look at that crash report, there are 2 threads (Threads 2 and 43) both waiting in PortServerThread on WaitForMessage.

Otherwise, in the last 2 weeks, the latest build id I see for a Linux crash is 3/26, and no Win crashes.

This is now very low volume so I don't think we need to keep tracking it for 68.
