Closed Bug 1255977 Opened 8 years ago Closed 8 years ago

Async shutdown crash with SessionStore: flushing all windows

Categories

(Firefox :: Session Restore, defect)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
Firefox 52
Tracking Status
firefox48 --- affected
firefox49 --- affected
firefox-esr45 --- affected
firefox50 --- fixed
firefox51 --- fixed
firefox52 --- fixed

People

(Reporter: mak, Assigned: wiwang)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file, 2 obsolete files)

Async shutdown is sometimes aborting and crashing on session restore shutdown:

https://crash-stats.mozilla.com/report/index/a9699a83-bae6-425b-b585-a7d152160311

{"phase":"quit-application-granted","conditions":[{"name":"SessionStore: flushing all windows","state":{"total":1,"current":0},"filename":"resource:///modules/sessionstore/SessionStore.jsm","lineNumber":1443,"stack":["resource:///modules/sessionstore/SessionStore.jsm:ssi_onQuitApplicationGranted:1443","resource:///modules/sessionstore/SessionStore.jsm:ssi_observe:642","chrome://mozapps/content/extensions/extensions.js:gViewController.commands.cmd_restartApp.doCommand:844","chrome://mozapps/content/extensions/extensions.xml:restart:1522","about:addons:oncommand:1"]}]}
This now represents 30% of async shutdown crashes.

off-hand, looks like we send "SessionStore:flush" to each browser in that window, but we don't hear back from some. Maybe something already shutdown? Looks like "SessionStore:update" is what is expected to resolve those flushes, but it's complex and seems to be handling the fact the frameLoader could have been destroyed already, in such a case it won't resolve anything.
(In reply to Marco Bonardo [::mak] from comment #1)
> This now represents 30% of async shutdown crashes.
> 
> off-hand, looks like we send "SessionStore:flush" to each browser in that
> window, but we don't hear back from some. Maybe something already shutdown?
> Looks like "SessionStore:update" is what is expected to resolve those
> flushes, but it's complex and seems to be handling the fact the frameLoader
> could have been destroyed already, in such a case it won't resolve anything.

I believe we do handle the frameloader destroyed case - the SessionStore:update messages are special in that they're part of a small group of messages that we allow to be received from frameloaders that have gone away.

This is frustrating. I wish we could see what's in the Content Process Console at the time of shutdown, in case something is going wrong inside content-sessionStore.js.
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #2)
> I believe we do handle the frameloader destroyed case - the
> SessionStore:update messages are special in that they're part of a small
> group of messages that we allow to be received from frameloaders that have
> gone away.

yes but here we seem to early return if the frameloader has died
http://mxr.mozilla.org/mozilla-central/source/browser/components/sessionstore/SessionStore.jsm#707
just before the calls to TabStateFlusher

sorry, I really don't know this code, I'm only going by visual inspection.
The other possibility is that the flushes are still proceeding, but they're just taking their sweet time. Before the async flushing stuff landed, we were doing sync flushing when nsAppStartup::Quit closed each window here:

https://hg.mozilla.org/mozilla-central/file/d1d47ba19ce9/toolkit/components/startup/nsAppStartup.cpp#l415

I'm not 100% familiar with how AsyncShutdown works, but since CloseAllWindows here is occurring _outside_ of an AsyncShutdown blocker, it's not part of the termination timeout group. So maybe this is just the first time we've clamped down on the time it's taken to close and flush all of the windows, but that it always took a while.

Yoric - if non-AsyncShutdown blocker code is taking a while... like, for example, nsAppStartup::Quit, does AsyncShutdown detect that and kill if we take too long? Or does it only do the killing for AsyncShutdown tasks?
Flags: needinfo?(dteller)
(In reply to Marco Bonardo [::mak] from comment #3)
> (In reply to Mike Conley (:mconley) - Needinfo me! from comment #2)
> > I believe we do handle the frameloader destroyed case - the
> > SessionStore:update messages are special in that they're part of a small
> > group of messages that we allow to be received from frameloaders that have
> > gone away.
> 
> yes but here we seem to early return if the frameloader has died
> http://mxr.mozilla.org/mozilla-central/source/browser/components/
> sessionstore/SessionStore.jsm#707
> just before the calls to TabStateFlusher
> 

Only if the frameLoader doesn't for some reason match the one that we have on file for that particular browser. That can occur if we switch from remoteness state A to remoteness state B for a browser - we want to make sure we ignore update messages from state A.
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #4)
> The other possibility is that the flushes are still proceeding, but they're
> just taking their sweet time. Before the async flushing stuff landed, we
> were doing sync flushing when nsAppStartup::Quit closed each window here:

We have crashes (probably the most of them) with just 1 window.
And the timeout is 1 minute. I see that it may take a while, but that sounds too much to be plausible.

(In reply to Mike Conley (:mconley) - Needinfo me! from comment #5)
> Only if the frameLoader doesn't for some reason match the one that we have
> on file for that particular browser. That can occur if we switch from
> remoteness state A to remoteness state B for a browser - we want to make
> sure we ignore update messages from state A.

So is it not possible that one of aMessage.targetFrameLoader or frameLoader is null while the other one is still valid?
(In reply to Marco Bonardo [::mak] from comment #6)
> (In reply to Mike Conley (:mconley) - Needinfo me! from comment #4)
> > The other possibility is that the flushes are still proceeding, but they're
> > just taking their sweet time. Before the async flushing stuff landed, we
> > were doing sync flushing when nsAppStartup::Quit closed each window here:
> 
> We have crashes (probably the most of them) with just 1 window.
> And the timeout is 1 minute. I see that it may take a while, but that sounds
> too much to be plausible.
> 
> (In reply to Mike Conley (:mconley) - Needinfo me! from comment #5)
> > Only if the frameLoader doesn't for some reason match the one that we have
> > on file for that particular browser. That can occur if we switch from
> > remoteness state A to remoteness state B for a browser - we want to make
> > sure we ignore update messages from state A.
> 
> So is it not possible that one of aMessage.targetFrameLoader or frameLoader
> is null while the other one is still valid?

I don't think it's possible for frameLoader in that context to be null. We stash the frameloader for each browser.permanentKey in _lastKnownFrameLoader, which is a WeakMap keyed on the permanentKey. It's never deleted manually, so the only way it would have gone away is if it'd been overwritten with another browser.permanentKey (which is the scenario laid out in comment 5), or if the browser.permanentKey has been GC'd, which can't occur while we're in this SessionStore:update handler, since we have a reference to the browser.

Which leaves aMessage.targetFrameLoader. Poking around in the IPC gunk that defines it, it doesn't look like anything fatal will occur if targetFrameLoader is not set, but I can't think of a case where the frameloader would be null for a TabParent (which is what the targetFrameLoader gets set to when the message is received).

It might also be worth checking to see what proportion of this 30% of shutdown hangs have e10s enabled. Another possible explanation is that the content process is hung.
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] → [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ]
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #4)
> Yoric - if non-AsyncShutdown blocker code is taking a while... like, for
> example, nsAppStartup::Quit, does AsyncShutdown detect that and kill if we
> take too long? Or does it only do the killing for AsyncShutdown tasks?

It only kills for AsyncShutdown tasks.

Note that the Terminator will still kill the process, if AsyncShutdown doesn't.
Flags: needinfo?(dteller)
> It might also be worth checking to see what proportion of this 30% of shutdown hangs have e10s enabled. Another possible explanation is that the content process is hung.

I have seen very little e10s code that handles timeouts (including hung content processes) properly. So that might be related.
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] → [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ mozalloc_abort | xul.dll@0x77f718 ]
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ mozalloc_abort | xul.dll@0x77f718 ] → [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ mozalloc_abort | xul.dll@0x77f718 ] [@ Abort | mozalloc_abort | NS_DebugBreak | nsDebugImpl::A…
Keywords: crash
¡Hola Marco!

Just crashed like https://crash-stats.mozilla.com/report/index/bp-09b3a864-7d15-43e7-8ec7-7c3992160824 a couple minutes ago on Mozilla/5.0 (X11; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0 ID:20160823072522 CSet: 052656fc513c05da969590ac5934abd67271a897

Commenting as I happen to run Nightly from a terminal and it contained the lines below at the time of the crash. hope they're useful.

Am I seeing this bug or a different one?

¡Gracias!
Alex

(file-roller:13159): Gtk-WARNING **: Allocating size to GtkScrolledWindow 0x1ba62a0 without calling gtk_widget_get_preferred_width/height(). How does the code know the size to allocate?
1472056119523	addons.xpi	WARN	Add-on {c37bac34-849a-4d28-be41-549b2c76c64e} is missing bootstrap method shutdown
WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"SessionStore: flushing all windows","state":{"total":2,"current":0},"filename":"resource:///modules/sessionstore/SessionStore.jsm","lineNumber":1465,"stack":["resource:///modules/sessionstore/SessionStore.jsm:ssi_onQuitApplicationGranted:1465","resource:///modules/sessionstore/SessionStore.jsm:ssi_observe:648","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1","chrome://global/content/bindings/remote-browser.xml:permitUnload:352","chrome://browser/content/browser.js:CanCloseWindow:6170","chrome://browser/content/browser.js:canClose:5020","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1","chrome://global/content/bindings/remote-browser.xml:permitUnload:352","chrome://browser/content/browser.js:CanCloseWindow:6170","chrome://browser/content/browser.js:canClose:5020","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1"]}] Barrier: quit-application-granted
FATAL ERROR: AsyncShutdown timeout in quit-application-granted Conditions: [{"name":"SessionStore: flushing all windows","state":{"total":2,"current":0},"filename":"resource:///modules/sessionstore/SessionStore.jsm","lineNumber":1465,"stack":["resource:///modules/sessionstore/SessionStore.jsm:ssi_onQuitApplicationGranted:1465","resource:///modules/sessionstore/SessionStore.jsm:ssi_observe:648","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1","chrome://global/content/bindings/remote-browser.xml:permitUnload:352","chrome://browser/content/browser.js:CanCloseWindow:6170","chrome://browser/content/browser.js:canClose:5020","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1","chrome://global/content/bindings/remote-browser.xml:permitUnload:352","chrome://browser/content/browser.js:CanCloseWindow:6170","chrome://browser/content/browser.js:canClose:5020","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1"]}] At least one completion condition failed to complete within a reasonable amount of time. Causing a crash to ensure that we do not leave the user with an unresponsive process draining resources.
[Parent 4430] ###!!! ABORT: file resource:///modules/sessionstore/SessionStore.jsm, line 1465
[Parent 4430] ###!!! ABORT: file resource:///modules/sessionstore/SessionStore.jsm, line 1465
ExceptionHandler::GenerateDump cloned child 13632
ExceptionHandler::SendContinueSignalToChild sent continue signal to child
ExceptionHandler::WaitForContinueSignal waiting for continue signal...
Flags: needinfo?(mak77)
yes it's the same shutdown crash.
From the available data you had 2 windows open, the first one blocked while trying to flush(). The stack doesn't say more.
To me looks like there may be a race condition in the shutdown flush code... but I can't tell where off-hand. For sure doesn't seem useful to crash the user, maybe there could be a timeout for flushing, but then we could never notice a regression in it...
Mike is the current owner of Session Restore, and maybe he has some ideas about this, or could just add it to the current backlog.
Flags: needinfo?(mak77) → needinfo?(mdeboer)
In the metadata, I see:

{"phase":"quit-application-granted","conditions":[{"name":"SessionStore: flushing all windows","state":{"total":2,"current":0},"filename":"resource:///modules/sessionstore/SessionStore.jsm","lineNumber":1465,"stack":["resource:///modules/sessionstore/SessionStore.jsm:ssi_onQuitApplicationGranted:1465","resource:///modules/sessionstore/SessionStore.jsm:ssi_observe:648","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1","chrome://global/content/bindings/remote-browser.xml:permitUnload:352","chrome://browser/content/browser.js:CanCloseWindow:6170","chrome://browser/content/browser.js:canClose:5020","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1","chrome://global/content/bindings/remote-browser.xml:permitUnload:352","chrome://browser/content/browser.js:CanCloseWindow:6170","chrome://browser/content/browser.js:canClose:5020","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1"]}]}

Which says that alex was clicking on the "Restart to update" button in the About Firefox dialog. He had 2 windows opened, and neither of them finished "flushing".

Was the second window the "About Firefox" dialog? Is it possible that for Linux, we don't handle flushing that type of window properly? Just spitballing.
(In reply to Mike Conley (:mconley) - (Needinfo me!) from comment #13)
> Which says that alex was clicking on the "Restart to update" button in the
> About Firefox dialog. He had 2 windows opened, and neither of them finished
> "flushing".

The flushes happen in a strict loop:
http://searchfox.org/mozilla-central/rev/44f6964ba95b8ddd8ebf70c55b34cd2323afeef4/browser/components/sessionstore/SessionStore.jsm#1517
Looks like it's enough that one window doesn't resolve to lock all the next ones, in this case it's the first window, we don't know anything about the second one.
It may be nicer to start all the flushes at the same time, and wait later, provided there aren't technical reasons to start them one by one. (Basically, attach a then to each promise, start the flush in the loop but don't yield, later use Promise.all or such).
Increasing a "bogus" (since it only counts up to the first hang) counter for async shutdown progress log doesn't sound useful at all, so I'm not sure why this is a strict loop.

It would also be nice to annotate in the object progress something that allows us to identify the window, it may be nice to know the root (xul) document, now we could tell "Sure it's the About window!". But this may cross the privacy boundary so it needs some attention.
I have a simple way to reproduce a similar crash. The crasher.html attachment in bug 1288997 causes the content process to crash on Windows, but hang on Linux (example report: https://crash-stats.mozilla.com/report/index/a37a145e-6cce-40f8-9b5c-583852160830).
This is looking awfully similar to bug 1273770...
(In reply to Mike de Boer [:mikedeboer] from comment #16)
> This is looking awfully similar to bug 1273770...

I think it's the same and should probably be duped here. it depends on the AsyncShutdownTimeout metadata value, but looking on crash-stats I mostly see reports with "SessionStore: flushing all windows".
What I'd like to repeat here is what Marco said in https://bugzilla.mozilla.org/show_bug.cgi?id=1273770#c23:

I can probably try to answer, while I didn't write async shutdown code, I am using it in Places and looking into similar crashes.

when you register a promise with addBlocker, you basically sign a contract that such promise on shutdown won't take more than 1 minute to resolve. If you fail resolving the promise for whatever reason, like a bug causing an exception before you invoke resolve, Async Shutdown time outs and forces an application crash, whose stack is the one you see, that is sort of useless cause it's the same for any async shutdown consumer.
Though, attached to the crash report there is metadata, one of those is AsyncShutdownTimeout that is a "progress" object you pass along with your shutdown promise and you are expected to fill it with data useful to figure out the timeout reasons.
(In reply to Marco Castelluccio [:marco] from comment #15)
> I have a simple way to reproduce a similar crash. The crasher.html
> attachment in bug 1288997 causes the content process to crash on Windows,
> but hang on Linux (example report:
> https://crash-stats.mozilla.com/report/index/a37a145e-6cce-40f8-9b5c-
> 583852160830).

Brr, this produces a horrible crash on my MBP running OSX 10.11.4. There are a couple of interesting tidbits that I found might interest some ppl following along here:

First, my STR for OSX:
1. Open Nightly DEBUG with the '-jsconsole' option, mainly for seeing some fun red hues on your screen.
2. Open the following tabs in the first window:
   1st: 'about:home'
   2nd: 'https://html.spec.whatwg.org/'
   3rd: 'https://duckduckgo.com/?q=lala&t=ffab'
   4th: 'https://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.messages.doc/com.ibm.ws.ssl.channel.resources.sslchannelmessages.html'
3. Open a second window with the following tabs:
   1st: 'about:home'
   2nd: 'https://dxr.mozilla.org/mozilla-central/source/toolkit/components/places/PlacesUtils.jsm'
   3rd: 'https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Error'
   4th: 'https://www.yahoo.com/'
   5th: 'https://en.wikipedia.org/wiki/Firefox'
4. Open a blank tab and drag 'crasher.html' into it and hit CMD+Q right after that.
5. Wait for it... BOOM!

JavaScript error: resource:///modules/ContentCrashHandlers.jsm, line 75: TypeError: browser.permanentKey is not a non-null object

^-- That one is for mconley. permanentKey may be `null`! So we might be missing out on useful info in our crash reports!

The main cause of the crash is of course bug 1288997, BUT that's just one way to invoke it. The real meat is in the effects it triggers:
1. It nukes the messageManager out of orbit in this case, but there are more ways to think of a content process crash, OOM or a script stalling the process, causing a hang.
2. The SessionStore TabStateFlusher is relying on the content process (frame)script to do the heavy lifting and the promises are only resolved when the ACK comes back to the main process via an IPC message. Only then will the AsyncShutdown blocker be released. But these messages never arrive. Ever.

AsyncShutdown is doing the right thing here, because hanging forever is not acceptable UX.

What SessionStore _could_ do is keep track of its own timer to notice flush timeouts and move along when nothing's responding properly. That way it may be able to restore more than it does now, which is zip.
Additionally, I'm wondering if `messageManager.sendAsyncMessage()` is infallible? If so, why? It clearly isn't possible to send messages anymore, but there's no way to currently detect that. Or is there?
Flags: needinfo?(mdeboer) → needinfo?(mconley)
See Also: → 1288997
What SessionStore/ TabStateFlusher could also do is observe 'ipc:content-shutdown' and 'oop-frameloader-crashed', like ContentCrashHandlers.jsm, which will trigger to resolve all promises.
(In reply to Mike de Boer [:mikedeboer] from comment #20)
> What SessionStore _could_ do is keep track of its own timer to notice flush
> timeouts and move along when nothing's responding properly. That way it may
> be able to restore more than it does now, which is zip.

The only downside, is that if you do your own promise timeout, you give up the async shutdown advantage of knowing that something broke during shutdown. If we'd have done that from the beginning, we'd not have this bug and thus we'd not have any analysis of issues that this bug found.
I had similar ideas, but then figured I needed an alternative way to know if something went wrong in the shutdown path (maybe telemetry?)
Generally, I believe that message managers should offer a way to post a message with a timeout. But yes, we also need some Telemetry.
Crash Signature: nsDebugImpl::Abort | NS_InvokeByIndex ] → nsDebugImpl::Abort | NS_InvokeByIndex ] [@ AsyncShutdownTimeout | quit-application-granted | SessionStore: flushing all windows ]
¡Hola!

FWIW just got this crash on Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 ID:20160906030431 CSet: 8c9c4e816e86f903c1d820f3f29715dc070a5a4a too:

Report ID 	Date Submitted
bp-7d249b70-ed71-421b-9c13-0811c2160906
	9/6/2016	09:29

1293 crashes in the past week at https://crash-stats.mozilla.com/signature/?product=Firefox&signature=AsyncShutdownTimeout%20%7C%20quit-application-granted%20%7C%20SessionStore%3A%20flushing%20all%20windows , setting flags accordingly.

¡Gracias!
Alex

Crashing Thread (0)
Frame	Module	Signature	Source
0	mozglue.dll	mozalloc_abort(char const* const)	memory/mozalloc/mozalloc_abort.cpp:33
1	xul.dll	NS_DebugBreak	xpcom/base/nsDebugImpl.cpp:436
2	xul.dll	nsDebugImpl::Abort(char const*, int)	xpcom/base/nsDebugImpl.cpp:146
3	xul.dll	NS_InvokeByIndex	xpcom/reflect/xptcall/md/win32/xptcinvoke_asm_x86_msvc.asm:54
4	xul.dll	XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*)	js/xpconnect/src/XPCWrappedNativeJSOps.cpp:1143
5	xul.dll	js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct)	js/src/vm/Interpreter.cpp:454
6	xul.dll	InternalCall	js/src/vm/Interpreter.cpp:499
7	xul.dll	Interpret	js/src/vm/Interpreter.cpp:2916
8	xul.dll	js::RunScript(JSContext*, js::RunState&)	js/src/vm/Interpreter.cpp:400
9	xul.dll	js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct)	js/src/vm/Interpreter.cpp:472
10	xul.dll	InternalCall	js/src/vm/Interpreter.cpp:499
11	xul.dll	js::fun_call(JSContext*, unsigned int, JS::Value*)	js/src/jsfun.cpp:1249
12	xul.dll	js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct)	js/src/vm/Interpreter.cpp:454
13	xul.dll	InternalCall	js/src/vm/Interpreter.cpp:499
14	xul.dll	js::CrossCompartmentWrapper::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&)	js/src/proxy/CrossCompartmentWrapper.cpp:333
15	xul.dll	js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct)	js/src/vm/Interpreter.cpp:442
16	xul.dll	InternalCall	js/src/vm/Interpreter.cpp:499
17	xul.dll	Interpret	js/src/vm/Interpreter.cpp:2916
18	xul.dll	js::RunScript(JSContext*, js::RunState&)	js/src/vm/Interpreter.cpp:400
19	xul.dll	js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct)	js/src/vm/Interpreter.cpp:472
20	xul.dll	InternalCall	js/src/vm/Interpreter.cpp:499
21	xul.dll	js::jit::DoCallFallback	js/src/jit/BaselineIC.cpp:5998
22	xul.dll	js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class const*, js::TaggedProto, JSObject*)	js/src/vm/ObjectGroup.cpp:531
(In reply to Mike de Boer [:mikedeboer] from comment #20)
> JavaScript error: resource:///modules/ContentCrashHandlers.jsm, line 75:
> TypeError: browser.permanentKey is not a non-null object
> 
> ^-- That one is for mconley. permanentKey may be `null`! So we might be
> missing out on useful info in our crash reports!
> 

If permanentKey is null, that probably means that the <xul:browser> has been removed from the DOM, since the XBL binding has been removed (the XBL binding is what supplies the permanentKey).

The observer in ContentCrashHandlers is hoping for the permanentKey so that it can map attempts to submit crash reports from about:tabcrashed to reports that have been collected. I suppose we have a case where the browser may have been removed from the DOM before that observer fires.

The notification that shows up to submit unsubmitted crash reports should still be catching these unsubmitted crash reports though, so we have another opportunity to get them.

I don't think there's a whole lot we can do if the <xul:browser> has been removed from the DOM before the oop-frameloader-crashed notification fires. The best course of action in that case might be to detect if the permanent key does not exist, and just bail out properly instead of throwing an exception.
Flags: needinfo?(mconley)
(In reply to Mike de Boer [:mikedeboer] from comment #20)
> 
> What SessionStore _could_ do is keep track of its own timer to notice flush
> timeouts and move along when nothing's responding properly. That way it may
> be able to restore more than it does now, which is zip.

This sounds like a reasonable course of action, yes.

> Additionally, I'm wondering if `messageManager.sendAsyncMessage()` is
> infallible? If so, why? It clearly isn't possible to send messages anymore,
> but there's no way to currently detect that. Or is there?

It's not infallible - here's where it can fail: http://searchfox.org/mozilla-central/rev/95b0ecf4b59a01e0524ca02f6c96ecaabe38f4d5/dom/base/nsFrameMessageManager.cpp#829

If mCallback is nullptr, it throws NS_ERROR_NOT_INITIALIZED. It looks like mCallback can be set to nullptr here: http://searchfox.org/mozilla-central/rev/95b0ecf4b59a01e0524ca02f6c96ecaabe38f4d5/dom/base/nsFrameMessageManager.cpp#1401, and here: http://searchfox.org/mozilla-central/rev/95b0ecf4b59a01e0524ca02f6c96ecaabe38f4d5/dom/base/nsFrameMessageManager.cpp#1416
Hi Andrew, this is now the topcrash on Aurora50. It jumped up 83 spots! Can you please help find an owner? Thanks!
Flags: needinfo?(overholt)
Kan-Ru, I hear you're interested in session restore. Can you take a look?
Flags: needinfo?(overholt) → needinfo?(kchen)
(In reply to Andrew Overholt [:overholt] from comment #28)
> Kan-Ru, I hear you're interested in session restore. Can you take a look?

You're right. I'm mentoring a few people to work on session restore. I'm also trying to reduce shutdown crashes. So, looks like a perfect match! Will Wang will look at this together with me.
Assignee: nobody → wiwang
Flags: needinfo?(kchen)
A few interesting observations, overall is Firefox 48:
(73.32% in signature vs 04.53% overall) reason = EXC_BAD_ACCESS / KERN_INVALID_ADDRESS
(73.41% in signature vs 07.08% overall) platform = Mac OS X
(89.40% in signature vs 25.44% overall) dom_ipc_enabled = 1
(73.58% in signature vs 12.77% overall) address = 0x0
(47.98% in signature vs 91.63% overall) build_id = 20160823121617
(49.42% in signature vs 05.89% overall) build_id = 20160817112116
(44.83% in signature vs 02.21% overall) platform_pretty_version = OS X 10.11
(23.27% in signature vs 61.63% overall) Addon "Mozilla Firefox hotfix" = true
(23.27% in signature vs 01.15% overall) platform_version = 10.11.6 15G1004
(16.08% in signature vs 00.98% overall) platform_pretty_version = OS X 10.10
See Also: → 1258350
Hi Kan-Ru, Will, this remains a top crasher on Aurora (at 5th place now). Have you had any luck investigating the problem so far? I hope we can get this one fixed soon as 50 hits Beta channel this week.
Flags: needinfo?(wiwang)
Flags: needinfo?(kchen)
Hi!

Based on the rich information above, allow me to propose following actions: (please correct me if any :) )
1. Add our(session restore) own promise timeout as first step to solve this top crasher. This timeout will be needed anyway since there are many ways which can cause a hang, as comment 20 indicated.
2. In another bug, add telemetry and/or sort of tracking mechanism to ensure we can still be aware of something wrong, as comment 22 and 23 suggested.


As for adding our own promise timeout, I'd like to have the promise resolved/rejected under 2 conditions:
1. before hitting the AsyncShutdown's 1 minute limit.
2. once 'ipc:content-shutdown' and 'oop-frameloader-crashed' are observed, especially the latter.

Any suggestion would be greatly appreciated!

====

Hi Ritu,

I'd like to confirm with you that if I try to make patch landed before 10/24 (two weeks before firefox 50 release), does this meet your expectation?
Thanks! :)
Flags: needinfo?(wiwang)
Flags: needinfo?(rkothari)
Flags: needinfo?(kchen)
(In reply to Will Wang [:WillWang] from comment #32)
> 
> Hi Ritu,
> 
> I'd like to confirm with you that if I try to make patch landed before 10/24
> (two weeks before firefox 50 release), does this meet your expectation?
> Thanks! :)

Hi Will, it depends on the risk associated with the patch. If we have good verification data from pre-beta channels that the crash goes away with the fix and if it's deemed low risk fix, I would be happy to uplift to Beta50. Usually post-Beta6, we only take fixes for severe crashes, hangs, sec-crits/highs and recent regressions. Hope that helps! The sooner we get a fix in the better.
Flags: needinfo?(rkothari)
Attached patch 1255977.patch (obsolete) — Splinter Review
Hi Yoric,

Would you mind giving your thought for this patch? Thanks!

As comment 32, this patch:
1. Add our(session restore) own promise timeout, which is set to 50s for now to avoid hitting 60s limit of async shutdown.
2. Add two observers to resolve promise for oop-frameloader-crashed and ipc:content-shutdown.
3. Add a looseTimer() which is derived from the one(Thanks for your works! :p ) in AsyncShutdown.jsm and refactor as DOM promise version.

If patch is basically ok, I would like to add comments and some details then send it to review.

Thank you!
Attachment #8802068 - Flags: feedback?(dteller)
Comment on attachment 8802068 [details] [diff] [review]
1255977.patch

Review of attachment 8802068 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, that looks like a valid strategy.

You'll need to document against what you're defending and what's the worst that can happen in case of timeout.
Attachment #8802068 - Flags: feedback?(dteller) → feedback+
¡Hola!

FWIW, this is also a thing on Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0 ID:20161017030209 CSet: 94b0fddf96b43942bdd851a3275042909ea37e09

Report ID 	Date Submitted
bp-bd77ff9d-73d7-478b-8be7-7d87e2161019
	19/10/2016	11:54 a.m.

¡Gracias!
Alex

STR:
- Come back to a locked Dell Latitude E6420 running Windows 7 Enterprise.
- Notice there's a green badge with a white up arrow requesting an update to Nightly.
- Click the "Restart Nightly to apply updates" option in the "Hamburger menu"
- Notice there's another window on the back with "Restart later" and "Restart now" buttons.
- Click "Restart now"

Result:
Nightly crashes

Expected result:
Nightly updates and starts with no crashes.
Attached patch bug-1255977.patch (obsolete) — Splinter Review
Hi Yoric,

Could you help to review this patch?

This patch is based on the one in comment 34 with several improvements:
1. add more comments
2. sort all Cu.import()
3. add timer.cancel() to timer function and move the timer function into SessionStoreInternal object.

Thank you for your help :)
Attachment #8802068 - Attachment is obsolete: true
Attachment #8802861 - Flags: review?(dteller)
Comment on attachment 8802861 [details] [diff] [review]
bug-1255977.patch

Review of attachment 8802861 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with a little more doc.

::: browser/components/sessionstore/SessionStore.jsm
@@ +1464,5 @@
> +      // To prevent blocker from breaking the 60 sec limit(which will cause a
> +      // crash) of async shutdown during flushing all windows, we resolve the
> +      // promise passed to blocker once:
> +      // 1. the flushing exceed 50 sec, or
> +      // 2. 'oop-frameloader-crashed' or 'ipc:content-shutdown' is observed.

Could you add a little something explaining what the resulting session will be on next startup if we have had a oop-frameloader-crahsed or a ipc:content-shutdown?

I *think* that we're going to reuse the state we saved last, which sounds ok to me, but this needs documenting.
Attachment #8802861 - Flags: review?(dteller) → review+
Yoric, thanks for your review!
I add a line of comment and carry r+ from you.

(Please correct me if any, thanks!)
Attachment #8802861 - Attachment is obsolete: true
Attachment #8803277 - Flags: review+
Keywords: checkin-needed
Pushed by kchen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/fbd5239a7ed6
Resolve promise once flushing timeout or crash observed. r=yoric
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/fbd5239a7ed6
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 52
Comment on attachment 8803277 [details] [diff] [review]
Patch(v2): Resolve promise once flushing timeout or crash observed.

Approval Request Comment


[Feature/regressing bug #]:
    None. (Session Restore is not a new feature, and there is no obvious regressing bug)

[User impact if declined]:
    If declined, Firefox will crash during the shutdown if content process is unable to response(e.g., hanged).

[Describe test coverage new/current, TreeHerder]:
    Current test passed at both local and treeherder[1].

[Risks and why]:
    Low risk since the patched file don't affect other critical components.

[String/UUID change made/needed]:
    None.


====

[1] https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=fbd5239a7ed6a37beec87b0ea1a5aea1ac8723d8
Attachment #8803277 - Flags: approval-mozilla-beta?
Attachment #8803277 - Flags: approval-mozilla-aurora?
Comment on attachment 8803277 [details] [diff] [review]
Patch(v2): Resolve promise once flushing timeout or crash observed.

I verified that the last signature which has the highest volume seems gone from 52 after the fix landed, hoping this helps on 51 and 50 as well. Aurora51+, Beta50+
Attachment #8803277 - Flags: approval-mozilla-beta?
Attachment #8803277 - Flags: approval-mozilla-beta+
Attachment #8803277 - Flags: approval-mozilla-aurora?
Attachment #8803277 - Flags: approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: