Crash in [@ logging::LogMessage::~LogMessage] in sandbox::PolicyBase::AddHandleToShare

RESOLVED FIXED in Firefox 68

Status

()

defect
P2
critical
RESOLVED FIXED
2 months ago
25 days ago

People

(Reporter: philipp, Assigned: mjf)

Tracking

(Regression, {crash, regression})

68 Branch
mozilla69
All
Windows
Points:
---

Firefox Tracking Flags

(firefox-esr60 unaffected, firefox67 unaffected, firefox67.0.1 unaffected, firefox68+ fixed, firefox69+ fixed)

Details

(Whiteboard: keep-open, crash signature)

Attachments

(1 attachment)

This bug is for crash report bp-69db24bc-3735-4ad3-b150-9d38f0190528.

Top 10 frames of crashing thread:

0 firefox.exe logging::LogMessage::~LogMessage security/sandbox/chromium-shim/base/logging.cpp:120
1 firefox.exe sandbox::ResultCode sandbox::InterceptionManager::PatchNtdll security/sandbox/chromium/sandbox/win/src/interception.cc:424
2 firefox.exe sandbox::InterceptionManager::InitializeInterceptions security/sandbox/chromium/sandbox/win/src/interception.cc:150
3 firefox.exe sandbox::ResultCode sandbox::PolicyBase::SetupAllInterceptions security/sandbox/chromium/sandbox/win/src/sandbox_policy_base.cc:676
4 firefox.exe sandbox::PolicyBase::AddTarget security/sandbox/chromium/sandbox/win/src/sandbox_policy_base.cc:543
5 firefox.exe sandbox::ResultCode sandbox::BrokerServicesBase::SpawnTarget security/sandbox/chromium/sandbox/win/src/broker_services.cc:474
6 xul.dll mozilla::SandboxBroker::LaunchApp security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp:244
7 xul.dll bool mozilla::ipc::GeckoChildProcessHost::PerformAsyncLaunch ipc/glue/GeckoChildProcessHost.cpp:1216
8 xul.dll static void mozilla::ipc::GeckoChildProcessHost::RunPerformAsyncLaunch::<unnamed-tag>::operator ipc/glue/GeckoChildProcessHost.cpp:601
9 xul.dll nsresult mozilla::detail::RunnableFunction<`lambda at z:/task_1558609157/build/src/ipc/glue/GeckoChildProcessHost.cpp:600:24'>::Run xpcom/threads/nsThreadUtils.h:562

this windows crash signature is spiking up during the 68 cycle - currently it's accounting for 0.6% of browser crashes in 68.0b. the first affected nightly build was 20190410100020.

Flags: needinfo?(bobowencode)

Sorry, was on PTO.

mjf: looks like this is due to the handles being closed before the process is launched and they are inherited.

Specifically here:
https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/dom/media/ipc/RDDProcessHost.cpp#59

The launch is done async below, so my guess is we have a race at some point, possibly just at timeout (although I'd assume that would be a fairly rare thing).

jld: you appear to have code to prevent a similar thing on the content process launch, could you help mjf on this?
https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/dom/ipc/ContentParent.cpp#2128-2132

I wonder if we should duplicate the handle in GeckoChildProcessHost::AddHandleToShare, so GeckoChildProcessHost has its own copy and can close it when it knows the process has launched or the launch has failed.

A similar thing has been added in Fx69 for many of the other process types, so they may well hit the same problem.

Flags: needinfo?(mfroman)
Flags: needinfo?(jld)
Flags: needinfo?(bobowencode)

Just checked and the first crashes like this on Fx68 Nightly were build id 20190410100020, which matches when bug 1539029 landed.

No longer blocks: 1445167
Regressed by: 1539029
Summary: Crash in [@ logging::LogMessage::~LogMessage] in sandbox::PolicyBase::AddTarget → Crash in [@ logging::LogMessage::~LogMessage] in sandbox::PolicyBase::AddHandleToShare

[Tracking Requested - why for this release]:
AV1 + RDD process startup crash issue.

Component: Security: Process Sandboxing → Audio/Video: Playback

(In reply to Bob Owen (:bobowen) from comment #1)

Sorry, was on PTO.

mjf: looks like this is due to the handles being closed before the process is launched and they are inherited.

Specifically here:
https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/dom/media/ipc/RDDProcessHost.cpp#59

The launch is done async below, so my guess is we have a race at some point, possibly just at timeout (although I'd assume that would be a fairly rare thing).

jld: you appear to have code to prevent a similar thing on the content process launch, could you help mjf on this?
https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/dom/ipc/ContentParent.cpp#2128-2132

I wonder if we should duplicate the handle in GeckoChildProcessHost::AddHandleToShare, so GeckoChildProcessHost has its own copy and can close it when it knows the process has launched or the launch has failed.

A similar thing has been added in Fx69 for many of the other process types, so they may well hit the same problem.

bobowen: you mention handles being closed, but I'm not closing them specifically anywhere. The handle is created here[1] and added to the RDDProcessHost (a GeckoChildProcessHost) immediately afterwards. The shared mem pointed to by that handle belongs to mPrefSerializer, correct? I thought that was the reason for keeping the SharedPreferenceSerializer around until after the process has launched, but possibly something else is going on that is destroying the shmem earlier than I expect. SharedPreferenceSerializer is only destroyed here[2] if AsyncOpen returns an error, or here[3] in InitAfterConnect.

[1] https://searchfox.org/mozilla-central/source/ipc/glue/ProcessUtils_common.cpp#67
[2] https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/dom/media/ipc/RDDProcessHost.cpp#70
[3] https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/dom/media/ipc/RDDProcessHost.cpp#150

Flags: needinfo?(mfroman) → needinfo?(bobowencode)

(In reply to Michael Froman [:mjf] from comment #4)

(In reply to Bob Owen (:bobowen) from comment #1)
...
bobowen: you mention handles being closed, but I'm not closing them specifically anywhere. The handle is created here[1] and added to the RDDProcessHost (a GeckoChildProcessHost) immediately afterwards. The shared mem pointed to by that handle belongs to mPrefSerializer, correct? I thought that was the reason for keeping the SharedPreferenceSerializer around until after the process has launched, but possibly something else is going on that is destroying the shmem earlier than I expect. SharedPreferenceSerializer is only destroyed here[2] if AsyncOpen returns an error, or here[3] in InitAfterConnect.

[1] https://searchfox.org/mozilla-central/source/ipc/glue/ProcessUtils_common.cpp#67
[2] https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/dom/media/ipc/RDDProcessHost.cpp#70
[3] https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/dom/media/ipc/RDDProcessHost.cpp#150

Yeah, I couldn't obviously see how this was happening other than [3], where the timeout could race with the launch of the process.

Flags: needinfo?(bobowencode)

Bob can I ask to be the owner for now, until we have an idea what exactly is causing this?

Assignee: nobody → bobowencode
Priority: -- → P2

I agree that the timeout case in WaitForLaunchInitAfterConnect is the only thing that seems to be capable of causing it. I'd suggest either not clearing mPrefSerializer in the error case, or maybe moving it into a callback added to the WhenProcessHandleReady() promise.

Flags: needinfo?(jld)

(In reply to Nils Ohlmeier [:drno] from comment #6)

Bob can I ask to be the owner for now, until we have an idea what exactly is causing this?

I think we're pretty clear on the issue here and that it was caused by bug 1539029.
There certainly appears to be a race in the timeout case, it looks like that's the only one.
jld has a suggestion to fix this.

I think longer term we should possibly look at passing the ownership of the HANDLE into the launching code or duplicating it.
However, I don't have time to look at this at the moment.

Assignee: bobowencode → nobody

I'll land a patch that avoids clearing mPrefSerializer in the error case to see if it helps.

Assignee: nobody → mfroman
Pushed by mfroman@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b2cb7ee93eef
avoid a potential timeout race on Windows RDD launch. r=jld
Status: NEW → RESOLVED
Closed: Last month
Resolution: --- → FIXED
Target Milestone: --- → mozilla69
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: keep-open

I want to leave this open until we see a definite change in the crash reports since this was a proposed fix from :jld.

I imagine this will need to be uplifted before we can be sure it fixes the problem.

Flags: needinfo?(mfroman)

We should see results in Nightly in the next day or so. I'll wait to see those results before I request uplift.

Flags: needinfo?(mfroman)

No crashes in nightly since the June 11 build.

Flags: needinfo?(mfroman)

Err.. not true. There are 3 crashes (from a single install) today.

(In reply to Julien Cristau [:jcristau] from comment #17)

Err.. not true. There are 3 crashes (from a single install) today.

Unfortunately, the errors that are terminal in ~LogMessage all get picked up by that search. You have to look in the proto signature.
Those 3 are for a different crash not the AddHandleToShare one.

(In reply to Bob Owen (:bobowen) from comment #18)

(In reply to Julien Cristau [:jcristau] from comment #17)

Err.. not true. There are 3 crashes (from a single install) today.

Unfortunately, the errors that are terminal in ~LogMessage all get picked up by that search. You have to look in the proto signature.
Those 3 are for a different crash not the AddHandleToShare one.

Bob, how do you see the proto signature in the crash reports?

Flags: needinfo?(mfroman) → needinfo?(bobowencode)

Or are you talking about looking at the previous call in the back trace?

We're still holding steady with no crashes showing with this signature starting with the 20190612 build. I'm going to mark this fixed and request uplift.

Status: REOPENED → RESOLVED
Closed: Last month26 days ago
Resolution: --- → FIXED

Comment on attachment 9070622 [details]
Bug 1555076 - avoid a potential timeout race on Windows RDD launch. r?jld!

Beta/Release Uplift Approval Request

  • User impact if declined: Without this fix, can crash RDD during timeout scenarios.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This is a very small change that only applies during times when the process launch timeout has occurred.
  • String changes made/needed: none
Attachment #9070622 - Flags: approval-mozilla-beta?

Comment on attachment 9070622 [details]
Bug 1555076 - avoid a potential timeout race on Windows RDD launch. r?jld!

approved for 68.0b13

Attachment #9070622 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.