Closed Bug 1574571 Opened 3 years ago Closed 2 years ago

Crash in [@ CrashReporter::AnnotateCrashReport]

Categories

(Core :: IPC, defect)

x86
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- unaffected
firefox68 --- unaffected
firefox69 --- unaffected
firefox70 + fixed

People

(Reporter: marcia, Assigned: jld)

References

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-128c6ccd-d8fc-4930-aa11-f9f6b0190816.

Windows nightly crash spike which started in 8-15 build: https://bit.ly/31JZgDy. 82 crashes/32 installs so far.

all crashes have MOZ_RELEASE_ASSERT(NS_IsMainThread())

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=a6ba020c9f7cd1abecd8eb2287020468ec1da6e8&tochange=b283a7ef186c216d765631f6cb1260a3fa2ee42c

Bug 1479960? ni on :jld

Top 10 frames of crashing thread:

0 xul.dll CrashReporter::AnnotateCrashReport toolkit/crashreporter/nsExceptionHandler.cpp:2105
1 xul.dll mozilla::ipc::FatalError ipc/glue/ProtocolUtils.cpp:161
2 xul.dll mozilla::ipc::IProtocol::HandleFatalError ipc/glue/ProtocolUtils.cpp:404
3 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::layers::SurfaceDescriptor>::Write ipc/ipdl/LayersSurfaces.cpp:3887
4 xul.dll static void mozilla::ipc::WriteIPDLParam<const mozilla::layers::SurfaceDescriptor&> ipc/glue/IPDLParamTraits.h:59
5 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::RemoteVideoDataIPDL>::Write ipc/ipdl/PRemoteDecoder.cpp:168
6 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::DecodedOutputIPDL>::Write ipc/ipdl/PRemoteDecoder.cpp:646
7 xul.dll mozilla::PRemoteDecoderParent::SendOutput ipc/ipdl/PRemoteDecoderParent.cpp:186
8 xul.dll class mozilla::MediaResult mozilla::RemoteVideoDecoderParent::ProcessDecodedData dom/media/ipc/RemoteVideoDecoder.cpp:382
9 xul.dll void mozilla::MozPromise<nsTArray<RefPtr<mozilla::MediaData> >, mozilla::MediaResult, 1>::ThenValue<`lambda at z:/task_1565861147/build/src/dom/media/ipc/RemoteDecoderParent.cpp:97:7', `lambda at z:/task_1565861147/build/src/dom/media/ipc/RemoteDecoderParent.cpp:108:7'>::DoResolveOrRejectInternal xpcom/threads/MozPromise.h:722

Flags: needinfo?(jld)

Two things going on here:

  1. Something in the RDD process trying to send a bad (default-constructed?) SurfaceDescriptor. There might already be a bug for this (maybe with the wrong signature…).
  2. The error reporting breaks because the CrashReporterClient doesn't exist.

I was going to suspect bug 1571711 for the second part of this, because it's in the regression range, but it was also backed out in that range and not relanded, so that can't be it. I need to look at this some more.

More things I've noticed:

  • All the failures seem to be on Windows 7, but they include both 64-bit and 32-bit.
  • That crash signature should pick up any case where the CrashReporterClient doesn't exist, but all the reports I looked at were RemoteVideoDecoderParentSurfaceDescriptor serialization fail
  • Two cases: the CrashReporterClient failed to be initialized, or it was already destroyed (i.e., if something is racing with RDD process shutdown).
  • CrashReporterClient::InitSingleton returns bool, but nothing uses the return type, so if it were failing then we wouldn't know until/unless something tried to add crash annotations off-main-thread.

I still don't understand how bug 1479960 could cause CrashReporterClient to fail allocate shared memory (or otherwise explain this) without also causing a lot more breakage than that.

There's one more thing that's suspicious: all the crashes have been sent from the infobar. If they were regular child process crashes they wouldn't have that particular annotation set, so these crashes all happened before the CrashReporterHost object was created in the main process (if it was created at all).

If we're failing the AllocShmem call in InitSingleton, we'd never get to the SendInitCrashReporter and the parent would never construct the CrashReporterHost.

One thing that could be done here is to land a patch to MOZ_DIAGNOSTIC_ASSERT that it succeeds and see what breaks. (It would be more helpful to try to report why it's failing….)

Assignee: nobody → jld
Flags: needinfo?(jld)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #4)

One thing that could be done here is to land a patch to MOZ_DIAGNOSTIC_ASSERT that it succeeds and see what breaks.

I pushed that to Try. The assertion fails for GMP processes, which makes sense because they don't normally create shared memory or send handles; and, unlike the situation with /dev/shm on Linux (ambient authority that's revoked later), this is probably something like being registered for DuplicateHandle brokering that has to be set up on launch. So this has probably always been broken for GMP, and it reproduces locally.

What I don't see there (and can't reproduce locally) are failures in the RDD process, which is where almost all of the crashes with this signature happen.

See Also: → 1575906

I have an idea for how bug 1479960 might be related: one its side effects was applying the changes from bug 1565744 to all shared memory. It doesn't seem to be necessary if the memory isn't meant to be frozen, but Chrome applies it to everything, and in theory it shouldn't hurt.

Concretely, shared memory now has an empty DACL instead of none (deny all instead of allow all), and on Win7 (but not 8.1 or 10) it gives them randomly generated names to work around a bug/feature where the security attributes would be ignored. So, there could be some side effect from those metadata changes, or maybe this is a failure to get entropy for the object name ​— I was originally going to crash for that, but I was advised during code review to just fail the allocation.

There's still something missing, because this doesn't reproduce on Win7 on Try. But I could try landing a patch to turn off the security changes for normal shared memory and see if the crash reports stop.

Jed, are you still planning to try that (Maybe best to do during nightly 70 while we have frequent builds?)

Flags: needinfo?(jld)

Yes. I'm going to hide this bug so I can use it for landing that change.

Group: core-security-release
Flags: needinfo?(jld)
Comment 6 is private: false

No crashes with this signature reported since build 20190828214452, which is the last one before the patch. So that's good. I'll file a followup to try to find out why this was happening, if it affects the freezeable case, and what we can do about it.

Group: core-security-release
You need to log in before you can comment on or make changes to this bug.