Crash in SetContentProcessSandbox while stability testing

RESOLVED WORKSFORME

Status


defect
4 years ago
4 years ago

People

(Reporter: ggrisco, Assigned: jld)

Tracking

({crash})

unspecified

Firefox Tracking Flags

(blocking-b2g:2.2?)

Details

(Whiteboard: [b2g-crash][caf-crash 637][caf priority: p3][CR 846198])

Attachments

(4 attachments)

Reporter

Description

4 years ago
Crash in automated stability testing with the following signature:

[@ mozilla::SetContentProcessSandbox | mozilla::dom::ContentChild::RecvSetProcessSandbox | mozilla::dom::PContentChild::OnMessageReceived | mozilla::ipc::MessageChannel::DispatchAsyncMessage ]

This crash is intermittent, seen once on AU 154, once on AU 170, and now one time on AU 214.

cafbot will upload logs.
Reporter

Updated

4 years ago
blocking-b2g: --- → 2.2?
Whiteboard: [CR 846198] → [caf priority: p3][CR 846198]
Whiteboard: [caf priority: p3][CR 846198] → [b2g-crash][caf-crash 637][caf priority: p3][CR 846198]
Keywords: crash
07-23 12:42:44.000 27029 27029 E Sandbox : Thread 27033 unresponsive for 10 seconds.  Killing process.

I started to write a lot of text about this, assuming that the thread was actually unresponsive for 10s, but then I noticed that this is the 2.2 / 37 branch, which means it doesn't have the fix for bug 1176085.  I'd been thinking of that bug as a false negative for this assertion, because I discovered it in a case where the assertion should have fired and didn't (and looped forever instead)… but it can also produce a false positive.

So what actually happened here is that the thread didn't respond within 10 *milli*seconds (and also didn't exit), and a more or less random number in the range [0, 999999999] (the nanoseconds part of a clock reading) was less than the number of seconds since boot (i.e., the CLOCK_MONOTONIC time in seconds).  That is a relatively low probability, and the comparison isn't even evaluated if the thread handles the signal promptly, but it's not zero.
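To make the units mix-up concrete, here is a minimal sketch of the kind of comparison described above. It is not the Gecko code: the `Timespec` type and both function names are hypothetical, and the buggy variant only assumes that a seconds field ends up compared against a nanoseconds field.

```python
from typing import NamedTuple


class Timespec(NamedTuple):
    """Hypothetical stand-in for a CLOCK_MONOTONIC reading."""
    sec: int   # whole seconds since boot
    nsec: int  # nanoseconds within the second, 0..999_999_999


def deadline_passed_buggy(now: Timespec, deadline: Timespec) -> bool:
    # Assumed form of the mistake: the first test compares seconds
    # against nanoseconds, so the result depends on an essentially
    # random value instead of on elapsed time.
    return now.sec > deadline.nsec or (
        now.sec == deadline.sec and now.nsec > deadline.nsec
    )


def deadline_passed_correct(now: Timespec, deadline: Timespec) -> bool:
    # Compare seconds first, then nanoseconds, as one ordered pair.
    return (now.sec, now.nsec) > (deadline.sec, deadline.nsec)


# False positive: the deadline is still 10 s away, but its nanoseconds
# field happens to be smaller than the uptime in seconds.
early = Timespec(sec=175_920, nsec=500)
soon = Timespec(sec=175_930, nsec=100_000)

# False negative: the deadline passed 10 s ago, but its nanoseconds
# field is larger than the uptime in seconds.
late = Timespec(sec=175_940, nsec=0)
past = Timespec(sec=175_930, nsec=500_000_000)
```

With these values the buggy check reports a timeout for `early`/`soon` even though the deadline has not been reached, and reports no timeout for `late`/`past` even though it has, which matches the two failure modes described above (a spurious kill, or looping forever).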

Specifically, the log has this:

07-23 12:42:23.280   266   266 I Gecko   : Uptime: 2932m

If that's the host uptime, then the probability is roughly 1 in 5,700 (2932 minutes is about 176,000 seconds, out of 10^9 possible nanosecond values), on top of the probability that the timeout case happens at all.  But that chance applies to every non-main thread in the content process every time an app is started.  If that's a typical uptime, and if there are tens or hundreds of test devices, then this starts looking plausible.
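The estimate can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the nanoseconds value is treated as uniformly distributed over [0, 999999999], the 10 ms timeout path is assumed to have been reached already, and the function names and the `trials` parameter are hypothetical (the bug gives no count of threads, app launches, or devices).

```python
NSEC_PER_SEC = 1_000_000_000


def false_positive_probability(uptime_seconds: int) -> float:
    """Chance that a uniform nanoseconds value is below the uptime in seconds."""
    # There are `uptime_seconds` integers in [0, uptime_seconds), out of
    # 10^9 possible nanosecond values; cap at 1.0 for very long uptimes.
    return min(uptime_seconds, NSEC_PER_SEC) / NSEC_PER_SEC


def probability_of_any_hit(p: float, trials: int) -> float:
    """Chance of at least one hit in `trials` independent tries."""
    return 1.0 - (1.0 - p) ** trials


# "Uptime: 2932m" from the log, converted to seconds.
uptime_seconds = 2932 * 60
p = false_positive_probability(uptime_seconds)
one_in = 1.0 / p
```

Here `one_in` comes out a little under 5,700. `probability_of_any_hit` shows how that per-check chance accumulates over many independent checks, for whatever number of timed-out threads a test farm actually produces.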
Assignee: nobody → jld
For those not following bug 1176085: I could try to uplift it and (hopefully?) fix this bug, but I'd have to warn release management that it caused bug 1185118 to start manifesting as crashes instead of something else (probably hanging the content process indefinitely).  I expect that would be considered excessive risk (even though the code as-is is obviously wrong and causing *these* crashes).  Bug 1185118 seems to occur only on Flame devices, and I strongly suspect a kernel bug, but it's hard to get any farther than that with no STR and only the limited data available in Gecko minidumps.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME
"Closing issue which has not been seen since 07/15/15 17:25"