Closed Bug 1375704 Opened 7 years ago Closed 4 years ago

Thousands of content shutdown crashes caused by ScriptPreloader

Categories

(Core :: XPConnect, defect, P3)


Tracking


RESOLVED WORKSFORME
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 blocking fixed
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- ?

People

(Reporter: marco, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: regression)

Crash Data

Requesting tracking for 55 as the number of crashes is really high, even though this appears to be a content process shutdown crash.
Hi Mike, could you please help find an owner who can investigate this new content shutdown crash? This might block the rollout of 55 from 25% (current rate) to 100% of the beta population. Thanks!
Flags: needinfo?(miket)
These stacks don't make much sense, but they all pretty consistently contain at least:

mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&) 	mozglue/misc/ConditionVariable_windows.cpp:58
mozilla::CondVar::Wait(unsigned int) 	obj-firefox/dist/include/mozilla/CondVar.h:68
mozilla::ScriptPreloader::WaitForCachedScript(JSContext*, mozilla::ScriptPreloader::CachedScript*) 	js/xpconnect/loader/ScriptPreloader.cpp:771
mozilla::ScriptPreloader::GetCachedScript(JSContext*, nsCString const&) 	js/xpconnect/loader/ScriptPreloader.cpp:737
mozJSComponentLoader::ObjectForLocation(ComponentLoaderInfo&, nsIFile*, JS::MutableHandle<JSObject*>, JS::MutableHandle<JSScript*>, char**, bool, JS::MutableHandle<JS::Value>) 	js/xpconnect/loader/mozJSComponentLoader.cpp:626
nsACString::Replace(unsigned int, unsigned int, nsCSubstringTuple const&) 	xpcom/string/nsTSubstring.cpp:659
mozilla::dom::ScriptLoader::WaitForModuleFetch(mozilla::dom::ModuleLoadRequest*) 	dom/script/ScriptLoader.cpp:351
nsTHashtable<nsBaseHashtableET<nsCStringHashKey, nsAutoPtr<`anonymous namespace'::KeyedHistogram> > >::s_InitEntry(PLDHashEntryHdr*, void const*)

And all of the JS helper threads are idle.

If the stack above made any kind of sense, I'd guess that we'd started
shutting down the JS runtime, but then tried to load another module or JS
component, and wound up deadlocked.

But the stack unfortunately doesn't make any kind of sense.
<https://crash-stats.mozilla.com/search/?signature=%3DIPCError-browser%20%7C%20ShutDownKill&proto_signature=~ScriptPreloader&product=Firefox&date=%3E%3D2017-03-23T14%3A40%3A32.000Z&date=%3C2017-06-23T14%3A40%3A00.000Z&_sort=date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports>
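The blocking pattern in that stack looks roughly like the sketch below (plain standard C++ rather than Gecko's mozilla::CondVar; CachedScript and mReadyToExecute here are illustrative stand-ins, not the real ScriptPreloader internals): the main thread parks in a condition-variable wait until an off-main-thread decode signals completion, and if the signal never arrives the thread stays parked until the parent kills the process.

// Minimal sketch of the wait pattern visible in the stack above.
// Names are illustrative, not the actual Gecko internals.
#include <condition_variable>
#include <mutex>

struct CachedScript {
  bool mReadyToExecute = false;  // flipped by the off-main-thread decode task
};

class PreloaderSketch {
  std::mutex mMutex;
  std::condition_variable mCondVar;

 public:
  // Main thread: block until the background decode marks the script ready.
  void WaitForCachedScript(CachedScript& aScript) {
    std::unique_lock<std::mutex> lock(mMutex);
    // If NotifyScriptReady never runs (e.g. shutdown wins the race), this
    // wait never returns, and the parent eventually kills the process.
    mCondVar.wait(lock, [&] { return aScript.mReadyToExecute; });
  }

  // Decode thread: mark the script ready and wake any waiter.
  void NotifyScriptReady(CachedScript& aScript) {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      aScript.mReadyToExecute = true;
    }
    mCondVar.notify_all();
  }
};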

If I expand the search back further to include nightly, we get anywhere from 0 to ~20 crashes a day starting on 2017-05-11, but things pick up steam significantly on 2017-06-07, to around 70 - 80 crashes per day (I stopped counting after 2017-06-08...)
Kris, assuming that jump in crashes is somehow meaningful, does anything stick out to you in this range?

<https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?startdate=2017-06-07&enddate=2017-06-08>
Flags: needinfo?(miket) → needinfo?(kmaglione+bmo)
The only thing that immediately stands out in that range is bug 1366356, which does seem like a pretty plausible cause.
Flags: needinfo?(kmaglione+bmo)
baku, can you take a look please? Potentially Bug 1366356 is at play here, thanks.
Flags: needinfo?(amarchesini)
Can we back bug 1366356 out and see what happens with the number of crashes? (ni(smaug) because he reviewed it and knows the situation)
Flags: needinfo?(bugs)
Why would bug 1366356 change anything here? That just landed preliminary code, and the pref is even disabled, as far as I see.
Flags: needinfo?(bugs)
If it's not currently enabled, I agree it's probably not related. If it was, I wouldn't be surprised to see it cause a spike in content process shutdown timeouts.

I don't see anything else in that range that I'd expect to cause something like this, though, except perhaps for some GC/CC timing changes.

I'll look into this some more. A lot of the stacks don't make sense, but most of the ones that do seem to be coming from the message manager loading a process script, which I wouldn't expect to see during shutdown.
Crash Signature: IPCError-browser | ShutDownKill → [@ IPCError-browser | ShutDownKill ]
(In reply to Kris Maglione [:kmag] from comment #11)
> I'll look into this some more. A lot of the stacks don't make sense, but
> most of the ones that do seem to be coming from the message manager loading
> a process script, which I wouldn't expect to see during shutdown.

Is there a chance that it comes from the preallocated process manager? It should not launch a process during shutdown, but there could be an unlucky timing where the process already started and we're loading the process scripts for it when the shutdown starts or something...
Flags: needinfo?(kmaglione+bmo)
(In reply to Mike Taylor [:miketaylr] (55 Regression Engineering Owner) from comment #8)
> baku, can you take a look please? Potentially Bug 1366356 is at play here,
> thanks.

That code is not actually running because we don't have platform specific implementations yet.
Flags: needinfo?(amarchesini)
Hi :kmag,
Can you help shed some light here?
Crash Signature: [@ IPCError-browser | ShutDownKill ] → [@ IPCError-browser | ShutDownKill ] https://crash-stats.mozilla.com/signature/?product=Firefox&signature=IPCError-browser%20|%20ShutDownKill&proto_signature=~ScriptPreloader
Keywords: regression
Crash Signature: [@ IPCError-browser | ShutDownKill ] https://crash-stats.mozilla.com/signature/?product=Firefox&signature=IPCError-browser%20|%20ShutDownKill&proto_signature=~ScriptPreloader → [@ IPCError-browser | ShutDownKill ]
Crash Signature: [@ IPCError-browser | ShutDownKill ] → [@ IPCError-browser | ShutDownKill ] https://crash-stats.mozilla.com/signature/?product=Firefox&signature=IPCError-browser%20|%20ShutDownKill&proto_signature=~ScriptPreloader
This might be fixed in bug 1382329, which just landed this morning.
There are still ScriptPreloader shutdown crashes on nightly, but as far as I can tell none in 55.0b13 (where bug 1382329 landed), so I'm ready to call this fixed for 55.
This is marked fixed for 55, but there are still 16K crash reports with this signature in 56 RC builds. Maybe just the same signature as for bug 1399847? 

overholt - is this a separate issue? Or is that still not certain?
Flags: needinfo?(overholt)
Sadly, I think there's a family of issues here and while people have been heroically fixing individual issues, it seems we still have a big problem. I'm not sure what to do, though. Maybe Gabor has an idea of how to proceed?
Flags: needinfo?(overholt) → needinfo?(gkrizsanits)
Is the spike here related to the counting of the parent process killing the content process? We used to count that differently (or not at all) from actual content process crashes but did something change when 56 went to beta?
yes, the [@ IPCError-browser | ShutDownKill] signature is a catch-all signature for those killed content processes.
those only get submitted from beta users when we show the infobar to them in early beta builds (bug 1303067) - otherwise there's no ui to send reports for this kind of crash. in 56.0b that mechanism was turned on for longer than planned due to bug 1399796 and i'd attribute the spike only to that...
(In reply to [:philipp] from comment #20)
> yes, the [@ IPCError-browser | ShutDownKill] signature is a catch-all
> signature for those killed content processes.
> those only get submitted from beta users when we show the infobar to them in
> early beta builds (bug 1303067) - otherwise there's no ui to send reports
> for this kind of crash. in 56.0b that mechanism was turned on for longer
> than planned due to bug 1399796 and i'd attribute the spike only to
> that...

Thanks. Just so I've got this clear in my head, you think the spike and subsequent volume are due to reports automatically being submitted? Do we normally submit crash reports for these shutdownkill crashes? I see the volume appears to have decreased since ~the 22nd; do you think that trend will continue?

(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #11)
> If it was,
> I wouldn't be surprised to see it cause a spike in content process shutdown
> timeouts.

Kris, was the script preloader disabled/enabled with a pref? Is that flipped to enable it now?
Flags: needinfo?(madperson)
Does anyone know why we submit crash reports for these intentional content process kills (after a 5 s timeout, I believe)?
(In reply to Andrew Overholt [:overholt] from comment #22)
> Does anyone know why we submit crash reports for these intentional content
> process kills (after a 5 s timeout, I believe)?

It is bad if the content process is hanging for multiple seconds during shutdown, so we want some information about where it is happening.
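The mechanism being described works roughly like the sketch below (plain C++, not the actual ContentParent code; the function name and the timeout handling are illustrative): the parent asks the child to shut down, waits a bounded time for it to exit, and if it has not exited, force-kills it, which is what gets aggregated under the ShutDownKill signature.

// Illustrative sketch of the shutdown-watchdog behaviour described above
// (not the actual ContentParent code).
#include <chrono>
#include <future>
#include <iostream>

// Returns true if the child exited cleanly within the timeout.
bool WaitForChildShutdown(std::future<void>& aChildExited,
                          std::chrono::seconds aTimeout) {
  if (aChildExited.wait_for(aTimeout) == std::future_status::ready) {
    return true;  // child finished its own shutdown path in time
  }
  // Timed out: this is the point where the parent would force-kill the child,
  // and the event would be aggregated under the catch-all signature below.
  std::cerr << "IPCError-browser | ShutDownKill\n";
  return false;
}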
(In reply to Andrew Overholt [:overholt] from comment #21)
> Thanks. Just so I've got this clear in my head, you think the spike and
> subsequent volume are due to reports automatically being submitted?
yes, the spikes of the generic [@ IPCError-browser | ShutDownKill] signature correspond to the infobar mechanism being turned on during the early beta cycle.

> Do we normally submit crash reports for these shutdownkill crashes?
not for users on release and late beta builds since they won't notice this kind of killing of the content process and there is no reporting ui for such issues (other than manually going to about:crashes and clicking on unsubmitted crash report ids). telemetry data might capture the issue for those channels more steadily, as there's no submission bias.

>I see the volume appears to have decreased since ~the 22nd; do you think that trend
> will continue?
since we started rolling out 57.0b to the general beta audience yesterday, the shutdownkill signature will start rising again.

however this graph will just focus on the ScriptPreloader content kills that this bug is about:
https://crash-stats.mozilla.com/signature/?product=Firefox&proto_signature=~ScriptPreloader&release_channel=beta&signature=IPCError-browser%20%7C%20ShutDownKill&date=%3E%3D2017-06-01#graphs
the big spikes around the beginning of july and mid august were all from 55.0b users, so the issue has since declined and no longer seems as pressing as back when marco filed the bug.
Flags: needinfo?(madperson)
(In reply to Andrew Overholt [:overholt] from comment #21)
> (In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're
> blocked) from comment #11)
> > If it was, I wouldn't be surprised to see it cause a spike in content
> > process shutdown timeouts.
> 
> Kris, was the script preloader disabled/enabled with a pref? Is that flipped
> to enable it now?

No, it's always enabled. We could add a pref to disable it for the content
process, and see if disabling that decreases the crash rate (disabling it for
the parent process would cause a 300-500ms startup time regression, which I
don't think we can afford), but I think it's pretty clear from the crash stacks
that the process being killed within ScriptPreloader code is just a coincidence.

A lot of the stacks are during a mutex wait in WaitForCachedScript, and if I
thought a deadlock there was likely, we could annotate the crash report with the
current wait time. But lots of the other stacks are in other places, like
merging compartments, decoding XDR data, or loading JAR input streams.

Given that these crashes are almost all very near the start of the browser
session, the most likely explanation seems to be that we're trying to kill the
process during startup, and the script loading code just happens to be where we
spend the most time at that phase.
Flags: needinfo?(kmaglione+bmo)
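If the deadlock theory ever needed testing, the annotation idea could look roughly like the sketch below (plain C++; gWaitAnnotation and the helper name are hypothetical, and a real patch would go through the crash reporter's annotation API rather than a global string): wake periodically instead of blocking indefinitely, and record how long the thread has been waiting, so a ShutDownKill dump would show whether the wait was genuinely stuck.

// Hypothetical sketch of annotating a long condition-variable wait with the
// elapsed time; gWaitAnnotation stands in for a crash-report annotation.
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>

static std::string gWaitAnnotation;

template <typename Pred>
void WaitAndAnnotate(std::mutex& aMutex, std::condition_variable& aCondVar,
                     Pred aReady) {
  using namespace std::chrono;
  auto start = steady_clock::now();
  std::unique_lock<std::mutex> lock(aMutex);
  while (!aReady()) {
    // Wake once per second to refresh the annotation instead of blocking
    // forever with no record of how long we have been here.
    aCondVar.wait_for(lock, seconds(1));
    auto waited = duration_cast<seconds>(steady_clock::now() - start).count();
    gWaitAnnotation = "ScriptPreloaderWaitSeconds=" + std::to_string(waited);
  }
}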
(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #25)
> Given that these crashes are almost all very near the start of the browser
> session, the most likely explanation seems to be that we're trying to kill
> the
> process during startup, and the script loading code just happens to be where
> we
> spend the most time at that phase.

Maybe we want to start setting the "dom.ipc.keepProcessesAlive.web" pref to something >= 1, instead of its (default in the code) current value of 0?  I presume that would reduce our desire to be so process-termination-happy.
(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #25)
> Given that these crashes are almost all very near the start of the browser
> session, the most likely explanation seems to be that we're trying to kill
> the
> process during startup, and the script loading code just happens to be where
> we
> spend the most time at that phase.

During startup, and maybe in some other situations too, we tend to spawn short-lived processes. Maybe the shutdown takes too long because of various sync startup operations when the parent decides to shut down these processes. The ppm patch I'm landing should help here a bit as it's eager to reuse these processes instead of killing them. We'll see.

(In reply to Andrew Sutherland [:asuth] from comment #26)
> Maybe we want to start setting the "dom.ipc.keepProcessesAlive.web" pref to
> something >= 1, instead of its (default in the code) current value of 0?  I
> presume that would reduce our desire to be so process-termination-happy.

keepProcessesAlive is only for testing purposes. Keeping content processes alive indefinitely would not help memory fragmentation... I know that the current setup is not ideal, and we might keep them alive forever anyway, but keeping all the content processes alive intentionally seems like a step in the wrong direction.

(In reply to Andrew Overholt [:overholt] from comment #18)
> Sadly, I think there's a family of issues here and while people have been
> heroically fixing individual issues, it seems we still have a big problem.
> I'm not sure what to do, though. Maybe Gabor has an idea of how to proceed?

I'm not sure. Based on comment 24 from :philipp this is no longer a pressing issue. In general, content process shutdown kills have so far not been considered high-priority issues, since the user doesn't notice much from them. If we wait longer for content process shutdowns there will be fewer crashes (forced shutdowns), but that might be a worse user experience. Having more than one content process will of course increase these shutdowns. Is there a reason to put more resources on this issue in the future?
Flags: needinfo?(gkrizsanits)
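The reuse approach mentioned above amounts to something like the sketch below (illustrative plain C++; the class and the keep-alive limit are hypothetical, not the actual preallocated process manager): instead of killing a content process when its last tab goes away, keep up to N of them in an idle pool and hand them back out for the next load, avoiding both the spawn cost and the shutdown race discussed in this bug.

// Illustrative sketch of the "reuse instead of kill" idea mentioned above
// (hypothetical; not the actual preallocated process manager).
#include <cstddef>
#include <memory>
#include <vector>

struct ContentProcess {};  // stand-in for a real content process handle

class ProcessPoolSketch {
  std::vector<std::unique_ptr<ContentProcess>> mIdle;
  size_t mMaxIdle;

 public:
  explicit ProcessPoolSketch(size_t aMaxIdle) : mMaxIdle(aMaxIdle) {}

  // Called when a process loses its last tab: keep it if there is room,
  // otherwise let it shut down (which is where the ShutDownKill race lives).
  bool Recycle(std::unique_ptr<ContentProcess> aProcess) {
    if (mIdle.size() >= mMaxIdle) {
      return false;  // caller proceeds with a normal shutdown
    }
    mIdle.push_back(std::move(aProcess));
    return true;
  }

  // Called for the next page load: prefer an idle process over spawning one.
  std::unique_ptr<ContentProcess> Take() {
    if (mIdle.empty()) {
      return std::make_unique<ContentProcess>();  // spawn a fresh process
    }
    auto proc = std::move(mIdle.back());
    mIdle.pop_back();
    return proc;
  }
};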
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #27)
> If we wait longer for content process shutdowns there will be fewer
> crashes (forced shutdowns), but that might be a worse user experience.
> Having more than one content process will of course increase these
> shutdowns. Is there a reason to put more resources on this issue in the
> future?

Maybe longer-term but in the short term I'd prefer we not regress the user experience. I'm happy to be told that's silly logic, though :)
Priority: -- → P3

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME