Closed Bug 1484524 Opened 3 years ago Closed 2 years ago
Crash in static void mozilla::ipc::IPDLParam
This bug was filed from the Socorro interface and is report bp-d2d5b55a-a483-4393-9314-a2a3e0180817.

=============================================================
This is topcrash #11 in the Windows nightly 20180816100035, with 8 crashes in 4 different instances.

Top 10 frames of crashing thread:

0 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::ipc::IPCRemoteStreamType>::Write ipc/ipdl/IPCStream.cpp:489
1 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::ipc::IPCRemoteStream>::Write ipc/ipdl/IPCStream.cpp:700
2 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::ipc::IPCStream>::Write ipc/ipdl/IPCStream.cpp:1052
3 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::ipc::OptionalIPCStream>::Write ipc/ipdl/IPCStream.cpp:1423
4 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::dom::cache::CacheReadStream>::Write ipc/ipdl/CacheTypes.cpp:184
5 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::dom::cache::CacheReadStreamOrVoid>::Write ipc/ipdl/CacheTypes.cpp:548
6 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::dom::cache::CacheResponse>::Write ipc/ipdl/CacheTypes.cpp:1416
7 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::dom::cache::CacheRequestResponse>::Write ipc/ipdl/CacheTypes.cpp:1960
8 xul.dll static void mozilla::ipc::IPDLParamTraits<nsTArray<mozilla::dom::cache::CacheRequestResponse> >::WriteInternal<const nsTArray<mozilla::dom::cache::CacheRequestResponse>&> ipc/glue/IPDLParamTraits.h:151
9 xul.dll static void mozilla::ipc::WriteIPDLParam<const mozilla::dom::cache::CachePutAllArgs&> ipc/glue/IPDLParamTraits.h:63
=============================================================
I'm not familiar with Cache implementation. Perhaps asuth is?
Flags: needinfo?(bugs) → needinfo?(bugmail)
This looks like it is some variation on bug 1476872, and in particular my analysis in https://bugzilla.mozilla.org/show_bug.cgi?id=1476872#c5. Unfortunately, these workers appear to be in steady state and not terminating, which is also what I was seeing at https://bugzilla.mozilla.org/show_bug.cgi?id=1476872#c14. They all do seem to be in microtask checkpoints, however, so it's possible there's a similar shutdown-related edge case.

However, many of these crashes are associated with gmail, so it's also possible that the code running in a microtask checkpoint is just correlation. Similarly, it's possible the reason the crashes stopped is that gmail changed its behavior. (Likely, because anything that would make us crash this way is also likely to result in errors to content if a crash doesn't occur.)
Component: IPC → DOM: Service Workers
Depends on: 1476872
This issue regressed in volume during the 63 nightly and beta cycle. The first affected nightly was 63.0a1 build 20180713100109, with the following changelog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=96c61b&tochange=e951f4

Given comment #2, could the changes in bug 1471189 be making these crash signatures worse? The volume is also spiking up again in beta builds today, where the signature accounts for 4.5% of all content crashes: https://crash-stats.mozilla.com/signature/?product=Firefox&version=63.0b&signature=mozilla%3A%3Aipc%3A%3AIPDLParamTraits<T>%3A%3AWrite#graphs
Crash Signature: [@ static void mozilla::ipc::IPDLParamTraits<T>::Write] → [@ static void mozilla::ipc::IPDLParamTraits<T>::Write] [@ mozilla::ipc::IPDLParamTraits<T>::Write] [@ static void mozilla::ipc::WriteIPDLParam<T>] [@ mozilla::ipc::WriteIPDLParam<T>] [@ static void mozilla::ipc::IPDLParamTraits<T>::Write]
OS: Windows 10 → All
Hardware: Unspecified → All
I crashed with the same stack as in the original comment using Firefox 64 (20181004224156). Is there any particular info that would help?
(In reply to [:philipp] from comment #3)
> due to comment #2 could it be the changes in bug 1471189 making these crash
> signatures worse?

Yes. Especially given that ServiceWorkerPrivate::TerminateWorker() changed on line https://searchfox.org/mozilla-central/rev/924e3d96d81a40d2f0eec1db5f74fc6594337128/dom/serviceworkers/ServiceWorkerPrivate.cpp#1943 as part of the fix.

And/or, I think gmail has started pushing ServiceWorkers to Firefox as well, so this could be just a change in usage of our ServiceWorkers. Gmail tabs will tend to be long-lived, likely have a non-trivial amount of background logic triggering, but also be idle from a user perspective. Also, a lot of people use gmail.

I think the most likely scenario that is occurring here is that the Cache API has unusual shutdown semantics that amount to "attempt to complete all outstanding requests issued prior to the shutdown of the (Service)Worker". This was a major issue in bug 1475627 in terms of being able to change the WorkerRefs implementation. This means it's conceivable that SW (Worker) shutdown is causing StrongWorkerRef-creation to fail as described at https://bugzilla.mozilla.org/show_bug.cgi?id=1476872#c5 and bad things happen.

Note that the previous is a little bit of hand-waving, as the stacks of the Worker state did not seem to obviously indicate that the worker was in shutdown... but Worker shutdown is still somewhat complicated, so I could be missing something.

:ytausky, I'm going to put this on your plate for additional investigation because of the involvement of the Worker life-cycle, but I would also suggest consulting with Tom and Eden, who both have experience with the Caches API side of things. Please don't hesitate to ping me if you have any questions/theories you want to bounce off me. (And this is for after the weekend, obviously.)
Assignee: nobody → ytausky
Status: NEW → ASSIGNED
QA Contact: mdaly
To me it seems like the first action to take would be to add a nullptr check in https://searchfox.org/mozilla-central/source/ipc/glue/IPCStreamUtils.cpp#179 and make sure the failure is propagated upwards. That should stop the crashes. Once that's done, we can discuss whether it's appropriate to discard the cache operation or rather extend the worker's lifetime.
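The proposed mitigation can be sketched as follows. This is a simplified, self-contained model, not the real Gecko code: the types (InputStream, SerializeResult) and function names are stand-ins for the actual nsIInputStream/IPC::Message machinery in IPCStreamUtils.cpp; only the check-and-propagate pattern is the point.

```cpp
#include <cstdint>
#include <vector>

// Illustrative stand-ins for the real IPC stream types.
struct InputStream { std::vector<uint8_t> data; };

enum class SerializeResult { Ok, NullStream };

// Sketch of the guard: instead of dereferencing a null stream pointer
// during serialization (the crash in the stack above), detect the null
// and report a failure that the caller must propagate upwards.
SerializeResult SerializeInputStream(const InputStream* aStream,
                                     std::vector<uint8_t>& aMessage) {
  if (!aStream) {
    // Previously this case fell through to a null dereference; the
    // mitigation reports it so the cache operation can be aborted.
    return SerializeResult::NullStream;
  }
  aMessage.insert(aMessage.end(), aStream->data.begin(), aStream->data.end());
  return SerializeResult::Ok;
}
```

The important part is that every caller checks the result, so the failure surfaces as a rejected cache operation rather than a content-process crash.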
(In reply to Yaron Tausky [:ytausky] from comment #6)
> To me it seems like the first action to take would be to add a nullptr check
> in https://searchfox.org/mozilla-central/source/ipc/glue/IPCStreamUtils.cpp#179
> and make sure the failure is propagated upwards. That should stop the
> crashes. Once that's done, we can discuss whether it's appropriate to
> discard the cache operation or rather extend the worker's lifetime.

Could we land the mitigation fix you suggested and uplift it to our last 63 beta this week?
That should be possible -- when is the deadline for the beta?
(In reply to Yaron Tausky [:ytausky] from comment #8)
> That should be possible -- when is deadline for the beta?

We go to build on Thursday early morning PST, so ideally land on mozilla-central today/tomorrow and request uplift in the process.
I'm testing a patch on try. Since I don't have a way to test that it really works (trying to concoct such a test will probably take a lot of time), the most I could say is that I hope it would stop the crashes and it probably wouldn't cause other problems.
DOM cache IPC code requires a StrongWorkerRef to its worker when invoked from a WorkerGlobalScope. Under certain conditions, e.g. when worker termination has been initiated, it's no longer possible to obtain such a reference, and said code fails silently by storing a nullptr in the CacheOpArgs object. This leads to a crash when that object gets serialized. This is a temporary workaround for this problem, until a more reasonable solution is implemented.
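The failure mode described above can be modeled in a few lines. Again this is an illustrative sketch, not the real WorkerRef implementation: WorkerStatus, WorkerPrivate, and PrepareCacheOp are simplified stand-ins, and the real StrongWorkerRef::Create has a different signature.

```cpp
#include <memory>

// Illustrative model of the worker life-cycle states involved.
enum class WorkerStatus { Running, Canceling, Killing };

struct WorkerPrivate { WorkerStatus status; };

// Simplified StrongWorkerRef: creation fails (returns null) once the
// worker has begun terminating, mirroring the behavior described above.
struct StrongWorkerRef {
  static std::shared_ptr<StrongWorkerRef> Create(WorkerPrivate* aWorker) {
    if (aWorker->status != WorkerStatus::Running) {
      return nullptr;  // this is the nullptr that ended up in CacheOpArgs
    }
    return std::make_shared<StrongWorkerRef>();
  }
};

// Sketch of the workaround's direction: fail the cache operation early
// instead of silently carrying a null ref into IPC serialization.
bool PrepareCacheOp(WorkerPrivate* aWorker) {
  auto ref = StrongWorkerRef::Create(aWorker);
  if (!ref) {
    return false;  // propagate the error rather than crash later
  }
  return true;
}
```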
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/autoland/rev/e18103445f9e Workaround for a nullptr-induced crash in cache IPC r=edenchuang
Yaron, could you request uplift so as that we have the patch in the last beta? Thanks
I'm trying to figure out what the process for that is, I've never done it before.
Comment on attachment 9015843 [details]
Bug 1484524: Workaround for a nullptr-induced crash in cache IPC

[Beta/Release Uplift Approval Request]
Feature/Bug causing the regression: Bug 1484524
User impact if declined: Browser crashes under certain non-deterministic conditions
Is this code covered by automated tests?: No
Has the fix been verified in Nightly?: No
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): I don't believe it's risky because it indicates an error if a nullptr is found in an IPC message. When that message is sent the browser crashes, so the worst it could do is fail to stop the crash.
String changes made/needed:
Attachment #9015843 - Flags: approval-mozilla-beta?
Comment on attachment 9015843 [details]
Bug 1484524: Workaround for a nullptr-induced crash in cache IPC

Mitigation for a crash with significant volume on 63 beta. CI is green; accepting the uplift to our last 63 beta, thanks!
Attachment #9015843 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Backout by email@example.com: https://hg.mozilla.org/integration/autoland/rev/e7995331d638 Backed out changeset e18103445f9e on request from pascalc
The naive attempt didn't work and had to be backed out. I'm working on a test that reproduces this problem, so we could actually test if the next patch fixes the problem or makes things worse.
Yaron: Status on this?
When I last touched this, the problem was that it's not currently possible to write a test to verify a potential fix, because this issue depends on a specific order of events that I don't know how to reproduce from a content script. I started writing a C++ test for this, but ran into some problems, since the IPC code is not mockable.
Andrew: Thoughts here?
The current summary is that the band-aid fix had side-effects due to existing invariants being violated. As I alluded to in comment 5, I think we need to address the life-cycle issues, and this is partially an issue of us better understanding the behavior of the Cache API in these cases.

I think a reasonable step towards that might be instrumenting the Cache API with MOZ_LOG so that we can get traces out (and a tool like https://github.com/mayhemer/logan can help us process them). I think it's the case in bug 1495275 that our investigation of the failure would be aided if we could MOZ_LOG cache alongside IndexedDB. Leaving NI on myself for additional investigation.
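For reference, MOZ_LOG instrumentation follows the shape below. In Gecko the macro comes from mozilla/Logging.h, is driven by a LazyLogModule plus the MOZ_LOG environment variable (e.g. MOZ_LOG=Cache:5), and writes to the log sink; the stand-in macro here only mimics the call shape so the pattern is self-contained, and the trace points are hypothetical examples, not actual Cache API code.

```cpp
#include <cstdio>
#include <string>

// Stand-in for MOZ_LOG(sCacheLog, LogLevel::Debug, (...)) from
// mozilla/Logging.h; it appends formatted lines to a string so the
// pattern is runnable outside the Gecko tree.
static std::string gLogOutput;

#define MOZ_LOG_STANDIN(module, fmt, ...)                         \
  do {                                                            \
    char buf[256];                                                \
    std::snprintf(buf, sizeof(buf), "[%s] " fmt "\n", (module),   \
                  __VA_ARGS__);                                   \
    gLogOutput += buf;                                            \
  } while (0)

// Hypothetical trace points over a cache operation's life cycle -- the
// kind of ordered breadcrumbs that would let a tool like logan
// reconstruct the sequence of events across processes.
void RunCacheOp(int aOpId) {
  MOZ_LOG_STANDIN("Cache", "op %d: start", aOpId);
  MOZ_LOG_STANDIN("Cache", "op %d: serialize args", aOpId);
  MOZ_LOG_STANDIN("Cache", "op %d: complete", aOpId);
}
```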
Assignee: ytausky → bugmail
Status: REOPENED → ASSIGNED
Too late to fix in Fx64 at this point. Still holding out hope for a fix in Fx65, though.
This signature was doubling in volume yesterday (it currently accounts for 15% of tab crashes on 64.0rc2). Many comments refer to gmail as the source of the crashes.
Interesting, I thought the gmail crashes had gone away. Maybe they had pulled back a rollout and are rolling it out again. I'll focus my repro investigation on gmail, thanks!
FWIW, I got this crash when I selected all unread mails in my inbox and pressed the button to mark them as read. After restoring the tab and trying the same thing again it worked though.
Assignee: bugmail → ytausky
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/autoland/rev/5da51a386275 Allow creating a StrongWorkerRef for IPC in the Canceling state r=asuth
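The landed fix relaxes the life-cycle check rather than papering over the null. A minimal model of that change, under the same simplifying assumptions as before (WorkerStatus, WorkerPrivate, and the Create signature are illustrative stand-ins, not the real WorkerRef API):

```cpp
#include <memory>

enum class WorkerStatus { Running, Canceling, Killing };
struct WorkerPrivate { WorkerStatus status; };

struct StrongWorkerRefForIPC {
  // Sketch of the relaxed check: IPC-related refs may now also be
  // created while the worker is Canceling, so in-flight cache
  // operations can finish serializing instead of hitting the
  // null-ref path that crashed. Refs are still refused once the
  // worker reaches Killing.
  static std::shared_ptr<StrongWorkerRefForIPC> Create(
      WorkerPrivate* aWorker) {
    if (aWorker->status != WorkerStatus::Running &&
        aWorker->status != WorkerStatus::Canceling) {
      return nullptr;
    }
    return std::make_shared<StrongWorkerRefForIPC>();
  }
};
```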