Crash in static void mozilla::ipc::IPDLParamTraits<T>::Write
Categories
(Core :: DOM: Service Workers, defect, P2)
Tracking
()
People
(Reporter: jseward, Assigned: ytausky)
References
(Depends on 1 open bug)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(2 files, 1 obsolete file)
47 bytes,
text/x-phabricator-request
|
lizzard
:
approval-mozilla-beta+
|
Details | Review |
13.60 KB,
text/plain
|
Details |
Reporter | ||
Updated•7 years ago
|
Comment 1•7 years ago
|
||
Comment 2•7 years ago
|
||
Assignee | ||
Updated•7 years ago
|
Comment 3•7 years ago
|
||
Updated•7 years ago
|
Comment 5•7 years ago
|
||
Updated•7 years ago
|
Assignee | ||
Comment 6•7 years ago
|
||
Comment 7•7 years ago
|
||
Assignee | ||
Comment 8•7 years ago
|
||
Comment 9•7 years ago
|
||
Assignee | ||
Comment 10•7 years ago
|
||
Assignee | ||
Comment 11•7 years ago
|
||
Assignee | ||
Updated•7 years ago
|
Comment 12•7 years ago
|
||
Comment 13•7 years ago
|
||
Assignee | ||
Comment 14•7 years ago
|
||
Assignee | ||
Comment 15•7 years ago
|
||
Comment 16•7 years ago
|
||
Comment 17•7 years ago
|
||
bugherder uplift |
Comment 18•7 years ago
|
||
bugherder |
Updated•7 years ago
|
Comment 19•7 years ago
|
||
![]() |
||
Comment 20•7 years ago
|
||
Assignee | ||
Comment 21•7 years ago
|
||
Updated•7 years ago
|
Assignee | ||
Comment 23•7 years ago
|
||
Comment 25•7 years ago
|
||
Updated•7 years ago
|
Comment 26•7 years ago
|
||
Comment 27•7 years ago
|
||
Comment 28•7 years ago
|
||
Comment 29•7 years ago
|
||
Comment 30•7 years ago
|
||
Andrew Sutherland told me he's going to be able to get back to this soon.
Updated•7 years ago
|
Comment 31•7 years ago
|
||
Actual details soon, but current speculation is that the fix here won't be something we can uplift to 66.
Comment 32•7 years ago
|
||
[Tracking Requested - why for this release]: Ongoing issue in 67.
Updated•7 years ago
|
Comment 33•7 years ago
|
||
At least one of the signatures, mozilla::ipc::IPDLParamTraits<T>::Write, is full of various Yahoo URLs: https://www.yahoo.com/entertainment/kim-kardashian-says-she-thinks-052732710.html is one example.
Comment 34•7 years ago
|
||
If there ends up being anything upliftable to mitigate this crash I'd be willing to try it in beta 66, because of the huge crash volume on 65 release.
Updated•7 years ago
|
Comment 36•6 years ago
|
||
From discussion with asuth, sounds like ytausky is going to dig into this more on Monday.
I'm not treating it as a release blocker for 66, since it looks no worse than it has been in 65 and 64.
But, I will keep it on the radar for possible dot release in 66 and hope we can improve the crash rate in 67.
Updated•6 years ago
|
Assignee | ||
Comment 38•6 years ago
|
||
I've had a discussion with :asuth about a possible solution to the problem, namely to allow issuing a StrongWorkerRef
to system code when the worker is in the Canceling
phase, and regarding stream serialization code as system code. The following is somewhat of a brain dump concerning this solution. Sorry for the wall of text.
The idea behind IPC code panicking (a term I'll use for an intentional crash) on failure is that IPC is meant to be a low level system that either "just works" or panics because it's impossible to recover in any meaningful way. This implies that under normal operating conditions, once the decision to use IPC was made by any part of the DOM, this operation should never fail, unless some truly unavoidable and unrecoverable situation is encountered. Under this regime, we would ideally like to understand all the conditions which could cause such a failure and convince ourselves that they really are unavoidable and unrecoverable, otherwise we'd get spurious crashes like this one that could have been avoided. In particular, I wouldn't include policy decisions under the unavoidable banner -- if something is possible but not allowed by our own policy, that's hardly unavoidable.
This is why I am somewhat opposed to issuing StrongWorkerRef
s in the bowels of the stream serialization code: long after DOM code made the decision to initiate a succeed-or-panic operation, we're suddenly asking foreign code to make a policy decision (as opposed to a mechanism operation that can fail), whether we're allowed to get a StrongWorkerRef
or not. Now we're coupling foreign code to the (hopefully) small subset of conditions that can cause a panic. True, one might argue that the underlying thread and its lifetime management is part of the IPC system, but that I don't find it convincing, seeing as the DOM code that starts the whole process has a reference to the object in question itself.
Another concern I have is testability. As this issue shows, we're somewhat hindered by not being able to test a lot system states that are not script-observable, and improving the situation requires conscious effort. The IPC interface is a natural place to draw a testing boundary, but that's not really possible if code beyond that boundary reaches back through a global variable into the system under test. (At the very least it makes it harder to test.) Ideally calling into IPC would be one testing boundary (where we can test DOM components), and pushing serialized structures to actors would be another (so we can test serialization code).
Having that said, since we first want to stop crashing, we could go for that solution. It just makes me uneasy due to these concerns, and I think we should see it as a band-aid rather than a solution.
Assignee | ||
Comment 39•6 years ago
|
||
By allowing the creation of StrongWorkerRefs in the Canceling state we
ensure that IPC will not fail and lead to crashes.
Comment 40•6 years ago
|
||
Re: comment 38
Agreed on this being a band-aid. In particular, there are a lot of ripple effects from our IPC system's thread invariants, especially as it relates to how the Cache API is implemented on workers. (Including all those non-owning references because so many things aren't refcounted.)
Comment 41•6 years ago
|
||
Comment 42•6 years ago
|
||
bugherder |
Comment 43•6 years ago
|
||
Is this upliftable to beta? I'd like to get it into the RC build on Monday.
Assignee | ||
Comment 44•6 years ago
|
||
It should be upliftable. Tomorrow is a holiday here, so I'm trying to find someone who could take care of it.
Assignee | ||
Comment 45•6 years ago
|
||
Comment on attachment 9048159 [details]
Bug 1484524: Allow creating a StrongWorkerRef for IPC in the Canceling state
Beta/Release Uplift Approval Request
- Feature/Bug causing the regression: Bug 1484524
- User impact if declined: Crashes under non-deterministic conditions
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): Bypassing the StrongWorkerRef creation policy could potentially lead to workers running longer than they should. I don't know it possible in practice. An alternative would be to do nothing and let the patch ride the trains.
- String changes made/needed:
Comment 46•6 years ago
|
||
Comment on attachment 9048159 [details]
Bug 1484524: Allow creating a StrongWorkerRef for IPC in the Canceling state
May be risky but probably not going to be worse than 30K crashes per week on release.
Let's uplift for the RC build.
Comment 47•6 years ago
|
||
Got a conflict when trying to uplift.
grafting 530955:5da51a386275 "Bug 1484524: Allow creating a StrongWorkerRef for IPC in the Canceling state r=asuth"
merging dom/workers/WorkerRef.cpp
warning: conflicts while merging dom/workers/WorkerRef.cpp! (edit, then use 'hg resolve --mark')
abort: unresolved conflicts, can't continue
Comment 48•6 years ago
|
||
I grafted locally, kdiff3 magically fixed it up, and the build seemed sufficiently okay locally, so I pushed to beta:
https://hg.mozilla.org/releases/mozilla-beta/rev/f83ccafddfa29dc1b510d0d0d0803aac105b03b3
Comment 49•6 years ago
|
||
While this bug is marked as fixed I saw a similar stack trace again these days, from an ASAN built from April 1st.
Shouldn't that already have the fix?
Comment 51•6 years ago
|
||
(In reply to Julien Cristau [:jcristau] from comment #50)
Hanno can you file a new bug for your trace?
Updated•4 years ago
|
Description
•