Crash in static void mozilla::ipc::IPDLParamTraits<T>::Write

RESOLVED FIXED in Firefox 66

Status

()

defect
P2
critical
RESOLVED FIXED
11 months ago
4 months ago

People

(Reporter: jseward, Assigned: ytausky)

Tracking

(Depends on 1 bug, {crash, regression})

unspecified
mozilla67
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr60 wontfix, firefox62 wontfix, firefox63+ wontfix, firefox64+ wontfix, firefox65+ wontfix, firefox66+ fixed, firefox67+ fixed)

Details

(crash signature)

Attachments

(3 attachments)

This bug was filed from the Socorro interface and is
report bp-d2d5b55a-a483-4393-9314-a2a3e0180817.
=============================================================

This is topcrash #11 in the Windows nightly 20180816100035, with
8 crashes in 4 different instances.

Top 10 frames of crashing thread:

0 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::ipc::IPCRemoteStreamType>::Write ipc/ipdl/IPCStream.cpp:489
1 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::ipc::IPCRemoteStream>::Write ipc/ipdl/IPCStream.cpp:700
2 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::ipc::IPCStream>::Write ipc/ipdl/IPCStream.cpp:1052
3 xul.dll mozilla::ipc::IPDLParamTraits<mozilla::ipc::OptionalIPCStream>::Write ipc/ipdl/IPCStream.cpp:1423
4 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::dom::cache::CacheReadStream>::Write ipc/ipdl/CacheTypes.cpp:184
5 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::dom::cache::CacheReadStreamOrVoid>::Write ipc/ipdl/CacheTypes.cpp:548
6 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::dom::cache::CacheResponse>::Write ipc/ipdl/CacheTypes.cpp:1416
7 xul.dll static void mozilla::ipc::IPDLParamTraits<mozilla::dom::cache::CacheRequestResponse>::Write ipc/ipdl/CacheTypes.cpp:1960
8 xul.dll static void mozilla::ipc::IPDLParamTraits<nsTArray<mozilla::dom::cache::CacheRequestResponse> >::WriteInternal<const nsTArray<mozilla::dom::cache::CacheRequestResponse>&> ipc/glue/IPDLParamTraits.h:151
9 xul.dll static void mozilla::ipc::WriteIPDLParam<const mozilla::dom::cache::CachePutAllArgs&> ipc/glue/IPDLParamTraits.h:63

=============================================================
Flags: needinfo?(bugs)
I'm not familiar with Cache implementation. Perhaps asuth is?
Flags: needinfo?(bugs) → needinfo?(bugmail)
This looks like it is some variation on bug 1476872 and in particular my analysis in https://bugzilla.mozilla.org/show_bug.cgi?id=1476872#c5.  Unfortunately, these workers appear to be in steady state and not terminating, which is also what I was seeing at https://bugzilla.mozilla.org/show_bug.cgi?id=1476872#c14.

They all do seem to be in microtask checkpoints, however, so it's possible there's a similar shutdown-related edge-case.  However, many of these crashes are associated with gmail, so it's also possible that things running in a microtask checkpoint is just correlation.  Similarly, it's possible the reasons the crashes stopped is gmail changed its behavior.  (Likely because anything that would make us crash this way is also likely to result in errors to content if a crash doesn't occur.)
Component: IPC → DOM: Service Workers
Depends on: 1476872
Flags: needinfo?(bugmail)
Priority: -- → P2
this issue regressed in volume during the 63 nightly and beta cycle. the first affected nightly was 63.0a1 build 20180713100109 with the following changelog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=96c61b&tochange=e951f4
due to comment #2 could it be the changes in bug 1471189 making these crash signatures worse?

the volume is also just spiking up again in beta builds today, where the signature is accounting for 4.5% of all content crashes: https://crash-stats.mozilla.com/signature/?product=Firefox&version=63.0b&signature=mozilla%3A%3Aipc%3A%3AIPDLParamTraits<T>%3A%3AWrite#graphs
Crash Signature: [@ static void mozilla::ipc::IPDLParamTraits<T>::Write] → [@ static void mozilla::ipc::IPDLParamTraits<T>::Write] [@ mozilla::ipc::IPDLParamTraits<T>::Write] [@ static void mozilla::ipc::WriteIPDLParam<T>] [@ mozilla::ipc::WriteIPDLParam<T>] [@ static void mozilla::ipc::IPDLParamTraits<T>::Write]
Flags: needinfo?(bugmail)
Flags: needinfo?(amarchesini)
Keywords: regression
OS: Windows 10 → All
Hardware: Unspecified → All
I crashed with the same stack as in the original comment using Firefox 64 (20181004224156).  Is there any particular info that would help?
(In reply to [:philipp] from comment #3)
> due to comment #2 could it be the changes in bug 1471189 making these crash
> signatures worse?

Yes.  Especially given that ServiceWorkerPrivate::TerminateWorker() changed on line https://searchfox.org/mozilla-central/rev/924e3d96d81a40d2f0eec1db5f74fc6594337128/dom/serviceworkers/ServiceWorkerPrivate.cpp#1943 as part of the fix.

And/or, I think gmail has started pushing ServiceWorkers to Firefox as well, so this could be just a change in usage of our ServiceWorkers.  Gmail tabs will tend to be long-lived, likely have a non-trivial amount of background logic triggering, but also be idle from a user perspective.  Also, a lot of people use gmail.

I think the most likely scenario that is occurring here is that the Cache API has unusual shutdown semantics that amount to "attempt to complete all outstanding requests issued prior to the shutdown of the (Service)Worker".  This was a major issue in bug 1475627 in terms of being able to change the WorkerRefs implementation.  This means it's conceivable that SW (Worker) shutdown is causing StrongWorkerRef-creation to fail as described at https://bugzilla.mozilla.org/show_bug.cgi?id=1476872#c5 and bad things happen.

Note that the previous is a little bit of hand-waving, as the stacks of the Worker state did not seem to obviously indicate that the worker was in shutdown... but Worker shutdown is still somewhat complicated so I could be missing something.

:ytausky, I'm going to put this on your plate for additional investigation because of the involvement of Worker life-cycle, but I would also suggest consulting with Tom and Eden who both have experience with the Caches API side of things.  Please don't hesitate to ping me if you have any questions/theories you want to bounce off me.  (And this is for after the weekend, obviously.)
Assignee: nobody → ytausky
Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)
Flags: needinfo?(amarchesini)
QA Contact: mdaly
QA Contact: mdaly
To me it seems like the first action to take would be to add a nullptr check in https://searchfox.org/mozilla-central/source/ipc/glue/IPCStreamUtils.cpp#179 and make sure the failure is propagated upwards. That should stop the crashes. Once that's done, we can discuss whether it's appropriate to discard the cache operation or rather extend the worker's lifetime.
(In reply to Yaron Tausky [:ytausky] from comment #6)
> To me it seems like the first action to take would be to add a nullptr check
> in
> https://searchfox.org/mozilla-central/source/ipc/glue/IPCStreamUtils.cpp#179
> and make sure the failure is propagated upwards. That should stop the
> crashes. Once that's done, we can discuss whether it's appropriate to
> discard the cache operation or rather extend the worker's lifetime.

Could we land the mitigation fix you suggetred and uplift to our last 63 beta this week?
Flags: needinfo?(ytausky)
That should be possible -- when is deadline for the beta?
(In reply to Yaron Tausky [:ytausky] from comment #8)
> That should be possible -- when is deadline for the beta?

We go to build on Thursday early morning PST, so ideally land on mozilla-central today/tomorrow and request for uplift in the process
I'm testing a patch on try. Since I don't have a way to test that it really works (trying to concoct such a test will probably take a lot of time), the most I could say is that I hope it would stop the crashes and it probably wouldn't cause other problems.
Flags: needinfo?(ytausky)
DOM cache IPC code requires a StrongWorkerRef to its worker when invoked
from a WorkerGlobalScope. Under certain condition, e.g. when worker
termination was initiated, it's no longer possible to obtain such a reference,
and said code fails silently by storing a nullptr in the CacheOpArgs object.
This leads to a crash when that object gets serialized.

This is a temporary workaround for this problem, until a more reasonable
solution is implemented.
Keywords: checkin-needed
Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/e18103445f9e
Workaround for a nullptr-induced crash in cache IPC r=edenchuang
Keywords: checkin-needed
Yaron, could you request uplift so as that we have the patch in the last beta? Thanks
Flags: needinfo?(ytausky)
I'm trying to figure out what the process for that is, I've never done it before.
Flags: needinfo?(ytausky)
Comment on attachment 9015843 [details]
Bug 1484524: Workaround for a nullptr-induced crash in cache IPC

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: Bug 1484524

User impact if declined: Browser crashes under certain non-deterministic conditions

Is this code covered by automated tests?: No

Has the fix been verified in Nightly?: No

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): I don't believe it's risky because it indicates an error if a nullptr is found in an IPC message. When that message is sent the browser crashes, so the worst it could do is fail to stop the crash.

String changes made/needed:
Attachment #9015843 - Flags: approval-mozilla-beta?
Comment on attachment 9015843 [details]
Bug 1484524: Workaround for a nullptr-induced crash in cache IPC

Mitigation for a crash with significant volume on 63 beta, CI is green, accepting the uplift to our last 63 beta, thanks!
Attachment #9015843 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/mozilla-central/rev/e18103445f9e
Status: ASSIGNED → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Depends on: 1498784
Backout by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/e7995331d638
Backed out changeset e18103445f9e on request from pascalc
The naive attempt didn't work and had to be backed out. I'm working on a test that reproduces this problem, so we could actually test if the next patch fixes the problem or makes things worse.
Flags: needinfo?(ytausky)
Depends on: 1500914
Yaron: Status on this?
Flags: needinfo?(ytausky)
When I last touched this the problem was that currently it's not possible to write a test to verify a potential fix because this issue depends on a specific order of events that I don't know how to reproduce from a content script. I started writing a C++ test for this, but ran into some problems since the IPC code is not mockable.
Flags: needinfo?(ytausky)
Andrew: Thoughts here?
Flags: needinfo?(bugmail)
The current summary is that the band-aid fix had side-effects due to existing invariants being violated.  As I alluded to in comment 5, I think we need to address the life-cycle issues, and this is partially an issue of us better understanding the behavior of the Cache API in these cases.  I think a reasonable step towards that might be instrumenting the Cache API with MOZ_LOG so that we can get traces out (and a tool like https://github.com/mayhemer/logan can help us process them).  I think it's the case in bug 1495275 that our investigation of the failure would be aided if we could MOZ_LOG cache along-side IndexedDB.

Leaving NI on myself for additional investigation.
Assignee: ytausky → bugmail
Status: REOPENED → ASSIGNED
Flags: needinfo?(bugmail)
Too late to fix in Fx64 at this point. Still holding out hope for a fix in Fx65, though.
this signature was doubling in volume yesterday (it's 15% of tab crashes on 64.0rc2 currently). many comments refer to gmail as source of the crashes.
Interesting, I thought the gmail crashes had gone away.  Maybe they had pulled back a rollout and are rolling it out again.  I'll focus my repro investigation on gmail, thanks!
FWIW, I got this crash when I selected all unread mails in my inbox and pressed the button to mark them as read. After restoring the tab and trying the same thing again it worked though.

Andrew Sutherland told me he's going to be able to get back to this soon.

Actual details soon, but current speculation is that the fix here won't be something we can uplift to 66.

[Tracking Requested - why for this release]: Ongoing issue in 67.

At least one of the signatures, mozilla::ipc::IPDLParamTraits<T>::Write, is full of various Yahoo URLs: https://www.yahoo.com/entertainment/kim-kardashian-says-she-thinks-052732710.html is one example.

If there ends up being anything upliftable to mitigate this crash I'd be willing to try it in beta 66, because of the huge crash volume on 65 release.

Duplicate of this bug: 1528167

From discussion with asuth, sounds like ytausky is going to dig into this more on Monday.
I'm not treating it as a release blocker for 66, since it looks no worse than it has been in 65 and 64.
But, I will keep it on the radar for possible dot release in 66 and hope we can improve the crash rate in 67.

I'm picking this up again.

Assignee: bugmail → ytausky

I've had a discussion with :asuth about a possible solution to the problem, namely to allow issuing a StrongWorkerRef to system code when the worker is in the Canceling phase, and regarding stream serialization code as system code. The following is somewhat of a brain dump concerning this solution. Sorry for the wall of text.

The idea behind IPC code panicking (a term I'll use for an intentional crash) on failure is that IPC is meant to be a low level system that either "just works" or panics because it's impossible to recover in any meaningful way. This implies that under normal operating conditions, once the decision to use IPC was made by any part of the DOM, this operation should never fail, unless some truly unavoidable and unrecoverable situation is encountered. Under this regime, we would ideally like to understand all the conditions which could cause such a failure and convince ourselves that they really are unavoidable and unrecoverable, otherwise we'd get spurious crashes like this one that could have been avoided. In particular, I wouldn't include policy decisions under the unavoidable banner -- if something is possible but not allowed by our own policy, that's hardly unavoidable.
This is why I am somewhat opposed to issuing StrongWorkerRefs in the bowels of the stream serialization code: long after DOM code made the decision to initiate a succeed-or-panic operation, we're suddenly asking foreign code to make a policy decision (as opposed to a mechanism operation that can fail), whether we're allowed to get a StrongWorkerRef or not. Now we're coupling foreign code to the (hopefully) small subset of conditions that can cause a panic. True, one might argue that the underlying thread and its lifetime management is part of the IPC system, but that I don't find it convincing, seeing as the DOM code that starts the whole process has a reference to the object in question itself.

Another concern I have is testability. As this issue shows, we're somewhat hindered by not being able to test a lot system states that are not script-observable, and improving the situation requires conscious effort. The IPC interface is a natural place to draw a testing boundary, but that's not really possible if code beyond that boundary reaches back through a global variable into the system under test. (At the very least it makes it harder to test.) Ideally calling into IPC would be one testing boundary (where we can test DOM components), and pushing serialized structures to actors would be another (so we can test serialization code).

Having that said, since we first want to stop crashing, we could go for that solution. It just makes me uneasy due to these concerns, and I think we should see it as a band-aid rather than a solution.

By allowing the creation of StrongWorkerRefs in the Canceling state we
ensure that IPC will not fail and lead to crashes.

Re: comment 38

Agreed on this being a band-aid. In particular, there are a lot of ripple effects from our IPC system's thread invariants, especially as it relates to how the Cache API is implemented on workers. (Including all those non-owning references because so many things aren't refcounted.)

Pushed by bugmail@asutherland.org:
https://hg.mozilla.org/integration/autoland/rev/5da51a386275
Allow creating a StrongWorkerRef for IPC in the Canceling state r=asuth
Status: ASSIGNED → RESOLVED
Closed: 9 months ago5 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67

Is this upliftable to beta? I'd like to get it into the RC build on Monday.

Flags: needinfo?(ytausky)

It should be upliftable. Tomorrow is a holiday here, so I'm trying to find someone who could take care of it.

Flags: needinfo?(ytausky)

Comment on attachment 9048159 [details]
Bug 1484524: Allow creating a StrongWorkerRef for IPC in the Canceling state

Beta/Release Uplift Approval Request

  • Feature/Bug causing the regression: Bug 1484524
  • User impact if declined: Crashes under non-deterministic conditions
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): Bypassing the StrongWorkerRef creation policy could potentially lead to workers running longer than they should. I don't know it possible in practice. An alternative would be to do nothing and let the patch ride the trains.
  • String changes made/needed:
Attachment #9048159 - Flags: approval-mozilla-beta?

Comment on attachment 9048159 [details]
Bug 1484524: Allow creating a StrongWorkerRef for IPC in the Canceling state

May be risky but probably not going to be worse than 30K crashes per week on release.
Let's uplift for the RC build.

Attachment #9048159 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Got a conflict when trying to uplift.

grafting 530955:5da51a386275 "Bug 1484524: Allow creating a StrongWorkerRef for IPC in the Canceling state r=asuth"
merging dom/workers/WorkerRef.cpp
warning: conflicts while merging dom/workers/WorkerRef.cpp! (edit, then use 'hg resolve --mark')
abort: unresolved conflicts, can't continue

Flags: needinfo?(ytausky)

I grafted locally, kdiff3 magically fixed it up, and the build seemed sufficiently okay locally, so I pushed to beta:

https://hg.mozilla.org/releases/mozilla-beta/rev/f83ccafddfa29dc1b510d0d0d0803aac105b03b3

Flags: needinfo?(ytausky)

While this bug is marked as fixed I saw a similar stack trace again these days, from an ASAN built from April 1st.

Shouldn't that already have the fix?

Hanno can you file a new bug for your trace?

Flags: needinfo?(hanno)

(In reply to Julien Cristau [:jcristau] from comment #50)

Hanno can you file a new bug for your trace?

Done: https://bugzilla.mozilla.org/show_bug.cgi?id=1542593

Flags: needinfo?(hanno)
You need to log in before you can comment on or make changes to this bug.