Closed Bug 1375704 Opened 7 years ago Closed 4 years ago

Thousands of content shutdown crashes caused by ScriptPreloader

Categories

(Core :: XPConnect, defect, P3)


Tracking


RESOLVED WORKSFORME
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 blocking fixed
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- ?

People

(Reporter: marco, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: regression)

Crash Data

Requesting tracking for 55 as the number of crashes is really high, even though this appears to be a content process shutdown crash.
Hi Mike, could you please help find an owner who can investigate this new content shutdown crash? This might block the rollout of 55 from 25% (current rate) to 100% of the beta population. Thanks!
Flags: needinfo?(miket)
These stacks don't make much sense, but they all pretty consistently contain at least:

mozilla::detail::ConditionVariableImpl::wait(mozilla::detail::MutexImpl&) 	mozglue/misc/ConditionVariable_windows.cpp:58
mozilla::CondVar::Wait(unsigned int) 	obj-firefox/dist/include/mozilla/CondVar.h:68
mozilla::ScriptPreloader::WaitForCachedScript(JSContext*, mozilla::ScriptPreloader::CachedScript*) 	js/xpconnect/loader/ScriptPreloader.cpp:771
mozilla::ScriptPreloader::GetCachedScript(JSContext*, nsCString const&) 	js/xpconnect/loader/ScriptPreloader.cpp:737
mozJSComponentLoader::ObjectForLocation(ComponentLoaderInfo&, nsIFile*, JS::MutableHandle<JSObject*>, JS::MutableHandle<JSScript*>, char**, bool, JS::MutableHandle<JS::Value>) 	js/xpconnect/loader/mozJSComponentLoader.cpp:626
nsACString::Replace(unsigned int, unsigned int, nsCSubstringTuple const&) 	xpcom/string/nsTSubstring.cpp:659
mozilla::dom::ScriptLoader::WaitForModuleFetch(mozilla::dom::ModuleLoadRequest*) 	dom/script/ScriptLoader.cpp:351
nsTHashtable<nsBaseHashtableET<nsCStringHashKey, nsAutoPtr<`anonymous namespace'::KeyedHistogram> > >::s_InitEntry(PLDHashEntryHdr*, void const*)

And all of the JS helper threads are idle.

If the stack above made any kind of sense, I'd guess that we'd started
shutting down the JS runtime, but then tried to load another module or JS
component, and wound up deadlocked.

But the stack unfortunately doesn't make any kind of sense.
<https://crash-stats.mozilla.com/search/?signature=%3DIPCError-browser%20%7C%20ShutDownKill&proto_signature=~ScriptPreloader&product=Firefox&date=%3E%3D2017-03-23T14%3A40%3A32.000Z&date=%3C2017-06-23T14%3A40%3A00.000Z&_sort=date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports>
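The blocking pattern in that stack looks roughly like the sketch below (plain standard C++ rather than Gecko's mozilla::CondVar; CachedScript and mReadyToExecute here are illustrative stand-ins, not the real ScriptPreloader internals): the main thread parks in a condition-variable wait until an off-main-thread decode signals completion, and if the signal never arrives the thread stays parked until the parent kills the process.

// Minimal sketch of the wait pattern visible in the stack above.
// Names are illustrative, not the actual Gecko internals.
#include <condition_variable>
#include <mutex>

struct CachedScript {
  bool mReadyToExecute = false;  // flipped by the off-main-thread decode task
};

class PreloaderSketch {
  std::mutex mMutex;
  std::condition_variable mCondVar;

 public:
  // Main thread: block until the background decode marks the script ready.
  void WaitForCachedScript(CachedScript& aScript) {
    std::unique_lock<std::mutex> lock(mMutex);
    // If NotifyScriptReady never runs (e.g. shutdown wins the race), this
    // wait never returns, and the parent eventually kills the process.
    mCondVar.wait(lock, [&] { return aScript.mReadyToExecute; });
  }

  // Decode thread: mark the script ready and wake any waiter.
  void NotifyScriptReady(CachedScript& aScript) {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      aScript.mReadyToExecute = true;
    }
    mCondVar.notify_all();
  }
};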

If I expand the search back further to include nightly, we get anywhere from 0 to ~20 crashes a day starting on 2017-05-11, but things pick up steam significantly on 2017-06-07, to around 70 - 80 crashes per day (I stopped counting after 2017-06-08...)
Kris, assuming that jump in crashes is somehow meaningful, does anything stick out to you in this range?

<https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?startdate=2017-06-07&enddate=2017-06-08>
Flags: needinfo?(miket) → needinfo?(kmaglione+bmo)
The only thing that immediately stands out in that range is bug 1366356, which does seem like a pretty plausible cause.
Flags: needinfo?(kmaglione+bmo)
baku, can you take a look please? Potentially Bug 1366356 is at play here, thanks.
Flags: needinfo?(amarchesini)
Can we back bug 1366356 out and see what happens with the number of crashes? (ni(smaug) because he reviewed it and knows the situation)
Flags: needinfo?(bugs)
Why would bug 1366356 change anything here? That just landed preliminary code, and the pref is even disabled, as far as I see.
Flags: needinfo?(bugs)
If it's not currently enabled, I agree it's probably not related. If it was, I wouldn't be surprised to see it cause a spike in content process shutdown timeouts.

I don't see anything else in that range that I'd expect to cause something like this, though, except perhaps for some GC/CC timing changes.

I'll look into this some more. A lot of the stacks don't make sense, but most of the ones that do seem to be coming from the message manager loading a process script, which I wouldn't expect to see during shutdown.
Crash Signature: IPCError-browser | ShutDownKill → [@ IPCError-browser | ShutDownKill ]
(In reply to Kris Maglione [:kmag] from comment #11)
> I'll look into this some more. A lot of the stacks don't make sense, but
> most of the ones that do seem to be coming from the message manager loading
> a process script, which I wouldn't expect to see during shutdown.

Is there a chance that it comes from the preallocated process manager? It should not launch a process during shutdown, but there could be an unlucky timing where the process already started and we're loading the process scripts for it when the shutdown starts or something...
Flags: needinfo?(kmaglione+bmo)
(In reply to Mike Taylor [:miketaylr] (55 Regression Engineering Owner) from comment #8)
> baku, can you take a look please? Potentially Bug 1366356 is at play here,
> thanks.

That code is not actually running because we don't have platform specific implementations yet.
Flags: needinfo?(amarchesini)
Hi :kmag,
Can you help shed some light here?
Crash Signature: [@ IPCError-browser | ShutDownKill ] → [@ IPCError-browser | ShutDownKill ] https://crash-stats.mozilla.com/signature/?product=Firefox&signature=IPCError-browser%20|%20ShutDownKill&proto_signature=~ScriptPreloader
Keywords: regression
Crash Signature: [@ IPCError-browser | ShutDownKill ] https://crash-stats.mozilla.com/signature/?product=Firefox&signature=IPCError-browser%20|%20ShutDownKill&proto_signature=~ScriptPreloader → [@ IPCError-browser | ShutDownKill ]
Crash Signature: [@ IPCError-browser | ShutDownKill ] → [@ IPCError-browser | ShutDownKill ] https://crash-stats.mozilla.com/signature/?product=Firefox&signature=IPCError-browser%20|%20ShutDownKill&proto_signature=~ScriptPreloader
This might be fixed in bug 1382329, which just landed this morning.
There are still ScriptPreloader shutdown crashes on nightly, but as far as I can tell none in 55.0b13 (where bug 1382329 landed), so I'm ready to call this fixed for 55.
This is marked fixed for 55, but there are still 16K crash reports with this signature in 56 RC builds. Maybe just the same signature as for bug 1399847? 

overholt - is this a separate issue? Or is that still not certain?
Flags: needinfo?(overholt)
Sadly, I think there's a family of issues here and while people have been heroically fixing individual issues, it seems we still have a big problem. I'm not sure what to do, though. Maybe Gabor has an idea of how to proceed?
Flags: needinfo?(overholt) → needinfo?(gkrizsanits)
Is the spike here related to the counting of the parent process killing the content process? We used to count that differently (or not at all) from actual content process crashes but did something change when 56 went to beta?
yes, the [@ IPCError-browser | ShutDownKill] signature is a catch-all signature for those killed content processes.
those only get submitted from beta users when we show the infobar to them in early beta builds (bug 1303067) - otherwise there's no ui to send reports for this kind of crash. in 56.0b that mechanism was turned on for longer than planned due to bug 1399796 and i'd attribute the spike only to that...
(In reply to [:philipp] from comment #20)
> yes, the [@ IPCError-browser | ShutDownKill] signature is a catch-all
> signature for those killed content processes.
> those only get submitted from beta users when we show the infobar to them in
> early beta builds (bug 1303067) - otherwise there's no ui to send reports
> for this kind of crash. in 56.0b that mechanism was turned on for longer
> than planned due to bug 1399796 and i'd attribute the spike only to
> that...

Thanks. Just so I've got this clear in my head, you think the spike and subsequent volume are due to reports automatically being submitted? Do we normally submit crash reports for these shutdownkill crashes? I see the volume appears to have decreased since ~the 22nd; do you think that trend will continue?

(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #11)
> If it was,
> I wouldn't be surprised to see it cause a spike in content process shutdown
> timeouts.

Kris, was the script preloader disabled/enabled with a pref? Is that flipped to enable it now?
Flags: needinfo?(madperson)
Does anyone know why we submit crash reports for these intentional content process kills (after a 5 s timeout, I believe)?
(In reply to Andrew Overholt [:overholt] from comment #22)
> Does anyone know why we submit crash reports for these intentional content
> process kills (after a 5 s timeout, I believe)?

It is bad if the content process is hanging for multiple seconds during shutdown, so we want some information about where it is happening.
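The mechanism being described works roughly like the sketch below (plain C++, not the actual ContentParent code; the function name and the timeout handling are illustrative): the parent asks the child to shut down, waits a bounded time for it to exit, and if it has not exited, force-kills it, which is what gets aggregated under the ShutDownKill signature.

// Illustrative sketch of the shutdown-watchdog behaviour described above
// (not the actual ContentParent code).
#include <chrono>
#include <future>
#include <iostream>

// Returns true if the child exited cleanly within the timeout.
bool WaitForChildShutdown(std::future<void>& aChildExited,
                          std::chrono::seconds aTimeout) {
  if (aChildExited.wait_for(aTimeout) == std::future_status::ready) {
    return true;  // child finished its own shutdown path in time
  }
  // Timed out: this is the point where the parent would force-kill the child,
  // and the event would be aggregated under the catch-all signature below.
  std::cerr << "IPCError-browser | ShutDownKill\n";
  return false;
}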
(In reply to Andrew Overholt [:overholt] from comment #21)
> Thanks. Just so I've got this clear in my head, you think the spike and
> subsequent volume are due to reports automatically being submitted?
yes, the spikes of the generic [@ IPCError-browser | ShutDownKill] signature correspond to the infobar mechanism being turned on during the early beta cycle.

> Do we normally submit crash reports for these shutdownkill crashes?
not for users on release and late beta builds since they won't notice this kind of killing of the content process and there is no reporting ui for such issues (other than manually going to about:crashes and clicking on unsubmitted crash report ids). telemetry data might capture the issue for those channels more steadily, as there's no submission bias.

>I see the volume appears to have decreased since ~the 22nd; do you think that trend
> will continue?
since we started rolling out 57.0b to the general beta audience yesterday, the shutdownkill signature will start rising again.

however this graph will just focus on the ScriptPreloader content kills that this bug is about:
https://crash-stats.mozilla.com/signature/?product=Firefox&proto_signature=~ScriptPreloader&release_channel=beta&signature=IPCError-browser%20%7C%20ShutDownKill&date=%3E%3D2017-06-01#graphs
the big spikes around the beginning of july and mid august were all from 55.0b users, so the issue has since declined and no longer seems as pressing as back when marco filed the bug.
Flags: needinfo?(madperson)
(In reply to Andrew Overholt [:overholt] from comment #21)
> (In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're
> blocked) from comment #11)
> > If it was, I wouldn't be surprised to see it cause a spike in content
> > process shutdown timeouts.
> 
> Kris, was the script preloader disabled/enabled with a pref? Is that flipped
> to enable it now?

No, it's always enabled. We could add a pref to disable it for the content
process, and see if disabling that decreases the crash rate (disabling it for
the parent process would cause a 300-500ms startup time regression, which I
don't think we can afford), but I think it's pretty clear from the crash stacks
that the process being killed within ScriptPreloader code is just a coincidence.

A lot of the stacks are during a mutex wait in WaitForCachedScript, and if I
thought a deadlock there was likely, we could annotate the crash report with the
current wait time. But lots of the other stacks are in other places, like
merging compartments, decoding XDR data, or loading JAR input streams.

Given that these crashes are almost all very near the start of the browser
session, the most likely explanation seems to be that we're trying to kill the
process during startup, and the script loading code just happens to be where we
spend the most time at that phase.
Flags: needinfo?(kmaglione+bmo)
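If the deadlock theory ever needed testing, the annotation idea could look roughly like the sketch below (plain C++; gWaitAnnotation and the helper name are hypothetical, and a real patch would go through the crash reporter's annotation API rather than a global string): wake periodically instead of blocking indefinitely, and record how long the thread has been waiting, so a ShutDownKill dump would show whether the wait was genuinely stuck.

// Hypothetical sketch of annotating a long condition-variable wait with the
// elapsed time; gWaitAnnotation stands in for a crash-report annotation.
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <string>

static std::string gWaitAnnotation;

template <typename Pred>
void WaitAndAnnotate(std::mutex& aMutex, std::condition_variable& aCondVar,
                     Pred aReady) {
  using namespace std::chrono;
  auto start = steady_clock::now();
  std::unique_lock<std::mutex> lock(aMutex);
  while (!aReady()) {
    // Wake once per second to refresh the annotation instead of blocking
    // forever with no record of how long we have been here.
    aCondVar.wait_for(lock, seconds(1));
    auto waited = duration_cast<seconds>(steady_clock::now() - start).count();
    gWaitAnnotation = "ScriptPreloaderWaitSeconds=" + std::to_string(waited);
  }
}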
(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #25)
> Given that these crashes are almost all very near the start of the browser
> session, the most likely explanation seems to be that we're trying to kill
> the
> process during startup, and the script loading code just happens to be where
> we
> spend the most time at that phase.

Maybe we want to start setting the "dom.ipc.keepProcessesAlive.web" pref to something >= 1, instead of its (default in the code) current value of 0?  I presume that would reduce our desire to be so process-termination-happy.
(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #25)
> Given that these crashes are almost all very near the start of the browser
> session, the most likely explanation seems to be that we're trying to kill
> the
> process during startup, and the script loading code just happens to be where
> we
> spend the most time at that phase.

During startup, and maybe in some other situations too, we tend to spawn short-lived processes. Maybe the shutdown takes too long because of various sync startup operations when the parent decides to shut down these processes. The ppm patch I'm landing should help here a bit as it's eager to reuse these processes instead of killing them. We'll see.

(In reply to Andrew Sutherland [:asuth] from comment #26)
> Maybe we want to start setting the "dom.ipc.keepProcessesAlive.web" pref to
> something >= 1, instead of its (default in the code) current value of 0?  I
> presume that would reduce our desire to be so process-termination-happy.

keepProcessesAlive is only for testing purposes. Keeping content processes alive indefinitely would not help memory fragmentation... I know that the current setup is not ideal, and we might keep them alive forever anyway, but keeping all the content processes alive intentionally seems like a step in the wrong direction.

(In reply to Andrew Overholt [:overholt] from comment #18)
> Sadly, I think there's a family of issues here and while people have been
> heroically fixing individual issues, it seems we still have a big problem.
> I'm not sure what to do, though. Maybe Gabor has an idea of how to proceed?

I'm not sure. Based on comment 24 from :philipp this is no longer a pressing issue. In general, content process shutdown kills have so far not been considered high-priority issues, since the user doesn't notice much from them. If we wait longer for content process shutdowns there will be fewer crashes (forced shutdowns), but that might be a worse user experience. Having more than one content process will of course increase these shutdowns. Is there a reason to put more resources on this issue in the future?
Flags: needinfo?(gkrizsanits)
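The reuse approach mentioned above amounts to something like the sketch below (illustrative plain C++; the class and the keep-alive limit are hypothetical, not the actual preallocated process manager): instead of killing a content process when its last tab goes away, keep up to N of them in an idle pool and hand them back out for the next load, avoiding both the spawn cost and the shutdown race discussed in this bug.

// Illustrative sketch of the "reuse instead of kill" idea mentioned above
// (hypothetical; not the actual preallocated process manager).
#include <cstddef>
#include <memory>
#include <vector>

struct ContentProcess {};  // stand-in for a real content process handle

class ProcessPoolSketch {
  std::vector<std::unique_ptr<ContentProcess>> mIdle;
  size_t mMaxIdle;

 public:
  explicit ProcessPoolSketch(size_t aMaxIdle) : mMaxIdle(aMaxIdle) {}

  // Called when a process loses its last tab: keep it if there is room,
  // otherwise let it shut down (which is where the ShutDownKill race lives).
  bool Recycle(std::unique_ptr<ContentProcess> aProcess) {
    if (mIdle.size() >= mMaxIdle) {
      return false;  // caller proceeds with a normal shutdown
    }
    mIdle.push_back(std::move(aProcess));
    return true;
  }

  // Called for the next page load: prefer an idle process over spawning one.
  std::unique_ptr<ContentProcess> Take() {
    if (mIdle.empty()) {
      return std::make_unique<ContentProcess>();  // spawn a fresh process
    }
    auto proc = std::move(mIdle.back());
    mIdle.pop_back();
    return proc;
  }
};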
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #27)
> If we wait longer for content process shutdowns there will be fewer
> crashes (forced shutdowns), but that might be a worse user experience.
> Having more than one content process will of course increase these
> shutdowns. Is there a reason to put more resources on this issue in the
> future?

Maybe longer-term but in the short term I'd prefer we not regress the user experience. I'm happy to be told that's silly logic, though :)
Priority: -- → P3

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME