1255977 - Async shutdown crash with SessionStore: flushing all windows

Reporter

Description

•

9 years ago

Async shutdown is sometimes aborting and crashing on session restore shutdown: https://crash-stats.mozilla.com/report/index/a9699a83-bae6-425b-b585-a7d152160311 {"phase":"quit-application-granted","conditions":[{"name":"SessionStore: flushing all windows","state":{"total":1,"current":0},"filename":"resource:///modules/sessionstore/SessionStore.jsm","lineNumber":1443,"stack":["resource:///modules/sessionstore/SessionStore.jsm:ssi_onQuitApplicationGranted:1443","resource:///modules/sessionstore/SessionStore.jsm:ssi_observe:642","chrome://mozapps/content/extensions/extensions.js:gViewController.commands.cmd_restartApp.doCommand:844","chrome://mozapps/content/extensions/extensions.xml:restart:1522","about:addons:oncommand:1"]}]}

Marco Bonardo [:mak]

Reporter

Comment 1

•

9 years ago

This now represents 30% of async shutdown crashes. off-hand, looks like we send "SessionStore:flush" to each browser in that window, but we don't hear back from some. Maybe something already shutdown? Looks like "SessionStore:update" is what is expected to resolve those flushes, but it's complex and seems to be handling the fact the frameLoader could have been destroyed already, in such a case it won't resolve anything.

Mike Conley (:mconley) (:⚙️)

Comment 2

•

9 years ago

(In reply to Marco Bonardo [::mak] from comment #1) > This now represents 30% of async shutdown crashes. > > off-hand, looks like we send "SessionStore:flush" to each browser in that > window, but we don't hear back from some. Maybe something already shutdown? > Looks like "SessionStore:update" is what is expected to resolve those > flushes, but it's complex and seems to be handling the fact the frameLoader > could have been destroyed already, in such a case it won't resolve anything. I believe we do handle the frameloader destroyed case - the SessionStore:update messages are special in that they're part of a small group of messages that we allow to be received from frameloaders that have gone away. This is frustrating. I wish we could see what's in the Content Process Console at the time of shutdown, in case something is going wrong inside content-sessionStore.js.

Marco Bonardo [:mak]

Reporter

Comment 3

•

9 years ago

(In reply to Mike Conley (:mconley) - Needinfo me! from comment #2) > I believe we do handle the frameloader destroyed case - the > SessionStore:update messages are special in that they're part of a small > group of messages that we allow to be received from frameloaders that have > gone away. yes but here we seem to early return if the frameloader has died http://mxr.mozilla.org/mozilla-central/source/browser/components/sessionstore/SessionStore.jsm#707 just before the calls to TabStateFlusher sorry, I really don't know this code, I'm only going by visual inspection.

Mike Conley (:mconley) (:⚙️)

Comment 4

•

9 years ago

The other possibility is that the flushes are still proceeding, but they're just taking their sweet time. Before the async flushing stuff landed, we were doing sync flushing when nsAppStartup::Quit closed each window here: https://hg.mozilla.org/mozilla-central/file/d1d47ba19ce9/toolkit/components/startup/nsAppStartup.cpp#l415 I'm not 100% familiar with how AsyncShutdown works, but since CloseAllWindows here is occurring _outside_ of an AsyncShutdown blocker, it's not part of the termination timeout group. So maybe this is just the first time we've clamped down on the time it's taken to close and flush all of the windows, but that it always took a while. Yoric - if non-AsyncShutdown blocker code is taking a while... like, for example, nsAppStartup::Quit, does AsyncShutdown detect that and kill if we take too long? Or does it only do the killing for AsyncShutdown tasks?

Flags: needinfo?(dteller)

Mike Conley (:mconley) (:⚙️)

Comment 5

•

9 years ago

(In reply to Marco Bonardo [::mak] from comment #3) > (In reply to Mike Conley (:mconley) - Needinfo me! from comment #2) > > I believe we do handle the frameloader destroyed case - the > > SessionStore:update messages are special in that they're part of a small > > group of messages that we allow to be received from frameloaders that have > > gone away. > > yes but here we seem to early return if the frameloader has died > http://mxr.mozilla.org/mozilla-central/source/browser/components/ > sessionstore/SessionStore.jsm#707 > just before the calls to TabStateFlusher > Only if the frameLoader doesn't for some reason match the one that we have on file for that particular browser. That can occur if we switch from remoteness state A to remoteness state B for a browser - we want to make sure we ignore update messages from state A.

Marco Bonardo [:mak]

Reporter

Comment 6

•

9 years ago

(In reply to Mike Conley (:mconley) - Needinfo me! from comment #4) > The other possibility is that the flushes are still proceeding, but they're > just taking their sweet time. Before the async flushing stuff landed, we > were doing sync flushing when nsAppStartup::Quit closed each window here: We have crashes (probably the most of them) with just 1 window. And the timeout is 1 minute. I see that it may take a while, but that sounds too much to be plausible. (In reply to Mike Conley (:mconley) - Needinfo me! from comment #5) > Only if the frameLoader doesn't for some reason match the one that we have > on file for that particular browser. That can occur if we switch from > remoteness state A to remoteness state B for a browser - we want to make > sure we ignore update messages from state A. So is it not possible that one of aMessage.targetFrameLoader or frameLoader is null while the other one is still valid?

Mike Conley (:mconley) (:⚙️)

Comment 7

•

9 years ago

(In reply to Marco Bonardo [::mak] from comment #6) > (In reply to Mike Conley (:mconley) - Needinfo me! from comment #4) > > The other possibility is that the flushes are still proceeding, but they're > > just taking their sweet time. Before the async flushing stuff landed, we > > were doing sync flushing when nsAppStartup::Quit closed each window here: > > We have crashes (probably the most of them) with just 1 window. > And the timeout is 1 minute. I see that it may take a while, but that sounds > too much to be plausible. > > (In reply to Mike Conley (:mconley) - Needinfo me! from comment #5) > > Only if the frameLoader doesn't for some reason match the one that we have > > on file for that particular browser. That can occur if we switch from > > remoteness state A to remoteness state B for a browser - we want to make > > sure we ignore update messages from state A. > > So is it not possible that one of aMessage.targetFrameLoader or frameLoader > is null while the other one is still valid? I don't think it's possible for frameLoader in that context to be null. We stash the frameloader for each browser.permanentKey in _lastKnownFrameLoader, which is a WeakMap keyed on the permanentKey. It's never deleted manually, so the only way it would have gone away is if it'd been overwritten with another browser.permanentKey (which is the scenario laid out in comment 5), or if the browser.permanentKey has been GC'd, which can't occur while we're in this SessionStore:update handler, since we have a reference to the browser. Which leaves aMessage.targetFrameLoader. Poking around in the IPC gunk that defines it, it doesn't look like anything fatal will occur if targetFrameLoader is not set, but I can't think of a case where the frameloader would be null for a TabParent (which is what the targetFrameLoader gets set to when the message is received). It might also be worth checking to see what proportion of this 30% of shutdown hangs have e10s enabled. Another possible explanation is that the content process is hung.

[:philipp]

Updated

•

9 years ago

David Teller [:Yoric] - still alive but not very active

Comment 8

•

9 years ago

(In reply to Mike Conley (:mconley) - Needinfo me! from comment #4) > Yoric - if non-AsyncShutdown blocker code is taking a while... like, for > example, nsAppStartup::Quit, does AsyncShutdown detect that and kill if we > take too long? Or does it only do the killing for AsyncShutdown tasks? It only kills for AsyncShutdown tasks. Note that the Terminator will still kill the process, if AsyncShutdown doesn't.

Flags: needinfo?(dteller)

David Teller [:Yoric] - still alive but not very active

Comment 9

•

9 years ago

> It might also be worth checking to see what proportion of this 30% of shutdown hangs have e10s enabled. Another possible explanation is that the content process is hung. I have seen very little e10s code that handles timeouts (including hung content processes) properly. So that might be related.

Kan-Ru Chen [:kanru] (UTC+9)

Updated

•

9 years ago

[:philipp]

Updated

•

9 years ago

Wayne Mery (:wsmwk)

Updated

•

9 years ago

Keywords: crash

alex_mayorga

Comment 11

•

8 years ago

¡Hola Marco! Just crashed like https://crash-stats.mozilla.com/report/index/bp-09b3a864-7d15-43e7-8ec7-7c3992160824 a couple minutes ago on Mozilla/5.0 (X11; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0 ID:20160823072522 CSet: 052656fc513c05da969590ac5934abd67271a897 Commenting as I happen to run Nightly from a terminal and it contained the lines below at the time of the crash. hope they're useful. Am I seeing this bug or a different one? ¡Gracias! Alex (file-roller:13159): Gtk-WARNING **: Allocating size to GtkScrolledWindow 0x1ba62a0 without calling gtk_widget_get_preferred_width/height(). How does the code know the size to allocate? 1472056119523 addons.xpi WARN Add-on {c37bac34-849a-4d28-be41-549b2c76c64e} is missing bootstrap method shutdown WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"SessionStore: flushing all windows","state":{"total":2,"current":0},"filename":"resource:///modules/sessionstore/SessionStore.jsm","lineNumber":1465,"stack":["resource:///modules/sessionstore/SessionStore.jsm:ssi_onQuitApplicationGranted:1465","resource:///modules/sessionstore/SessionStore.jsm:ssi_observe:648","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1","chrome://global/content/bindings/remote-browser.xml:permitUnload:352","chrome://browser/content/browser.js:CanCloseWindow:6170","chrome://browser/content/browser.js:canClose:5020","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1","chrome://global/content/bindings/remote-browser.xml:permitUnload:352","chrome://browser/content/browser.js:CanCloseWindow:6170","chrome://browser/content/browser.js:canClose:5020","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1"]}] Barrier: quit-application-granted FATAL ERROR: AsyncShutdown timeout in quit-application-granted Conditions: [{"name":"SessionStore: flushing all windows","state":{"total":2,"current":0},"filename":"resource:///modules/sessionstore/SessionStore.jsm","lineNumber":1465,"stack":["resource:///modules/sessionstore/SessionStore.jsm:ssi_onQuitApplicationGranted:1465","resource:///modules/sessionstore/SessionStore.jsm:ssi_observe:648","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1","chrome://global/content/bindings/remote-browser.xml:permitUnload:352","chrome://browser/content/browser.js:CanCloseWindow:6170","chrome://browser/content/browser.js:canClose:5020","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1","chrome://global/content/bindings/remote-browser.xml:permitUnload:352","chrome://browser/content/browser.js:CanCloseWindow:6170","chrome://browser/content/browser.js:canClose:5020","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1"]}] At least one completion condition failed to complete within a reasonable amount of time. Causing a crash to ensure that we do not leave the user with an unresponsive process draining resources. [Parent 4430] ###!!! ABORT: file resource:///modules/sessionstore/SessionStore.jsm, line 1465 [Parent 4430] ###!!! ABORT: file resource:///modules/sessionstore/SessionStore.jsm, line 1465 ExceptionHandler::GenerateDump cloned child 13632 ExceptionHandler::SendContinueSignalToChild sent continue signal to child ExceptionHandler::WaitForContinueSignal waiting for continue signal...

Flags: needinfo?(mak77)

Marco Bonardo [:mak]

Reporter

Comment 12

•

8 years ago

yes it's the same shutdown crash. From the available data you had 2 windows open, the first one blocked while trying to flush(). The stack doesn't say more. To me looks like there may be a race condition in the shutdown flush code... but I can't tell where off-hand. For sure doesn't seem useful to crash the user, maybe there could be a timeout for flushing, but then we could never notice a regression in it... Mike is the current owner of Session Restore, and maybe he has some ideas about this, or could just add it to the current backlog.

Flags: needinfo?(mak77) → needinfo?(mdeboer)

Mike Conley (:mconley) (:⚙️)

Comment 13

•

8 years ago

In the metadata, I see: {"phase":"quit-application-granted","conditions":[{"name":"SessionStore: flushing all windows","state":{"total":2,"current":0},"filename":"resource:///modules/sessionstore/SessionStore.jsm","lineNumber":1465,"stack":["resource:///modules/sessionstore/SessionStore.jsm:ssi_onQuitApplicationGranted:1465","resource:///modules/sessionstore/SessionStore.jsm:ssi_observe:648","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1","chrome://global/content/bindings/remote-browser.xml:permitUnload:352","chrome://browser/content/browser.js:CanCloseWindow:6170","chrome://browser/content/browser.js:canClose:5020","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1","chrome://global/content/bindings/remote-browser.xml:permitUnload:352","chrome://browser/content/browser.js:CanCloseWindow:6170","chrome://browser/content/browser.js:canClose:5020","chrome://browser/content/browser.js:gMenuButtonUpdateBadge.onMenuPanelCommand:2654","chrome://browser/content/browser.xul:oncommand:1"]}]} Which says that alex was clicking on the "Restart to update" button in the About Firefox dialog. He had 2 windows opened, and neither of them finished "flushing". Was the second window the "About Firefox" dialog? Is it possible that for Linux, we don't handle flushing that type of window properly? Just spitballing.

Marco Bonardo [:mak]

Reporter

Comment 14

•

8 years ago

(In reply to Mike Conley (:mconley) - (Needinfo me!) from comment #13) > Which says that alex was clicking on the "Restart to update" button in the > About Firefox dialog. He had 2 windows opened, and neither of them finished > "flushing". The flushes happen in a strict loop: http://searchfox.org/mozilla-central/rev/44f6964ba95b8ddd8ebf70c55b34cd2323afeef4/browser/components/sessionstore/SessionStore.jsm#1517 Looks like it's enough that one window doesn't resolve to lock all the next ones, in this case it's the first window, we don't know anything about the second one. It may be nicer to start all the flushes at the same time, and wait later, provided there aren't technical reasons to start them one by one. (Basically, attach a then to each promise, start the flush in the loop but don't yield, later use Promise.all or such). Increasing a "bogus" (since it only counts up to the first hang) counter for async shutdown progress log doesn't sound useful at all, so I'm not sure why this is a strict loop. It would also be nice to annotate in the object progress something that allows us to identify the window, it may be nice to know the root (xul) document, now we could tell "Sure it's the About window!". But this may cross the privacy boundary so it needs some attention.

Marco Castelluccio [:marco]

Comment 15

•

8 years ago

I have a simple way to reproduce a similar crash. The crasher.html attachment in bug 1288997 causes the content process to crash on Windows, but hang on Linux (example report: https://crash-stats.mozilla.com/report/index/a37a145e-6cce-40f8-9b5c-583852160830).

Mike de Boer [:mikedeboer]

Comment 16

•

8 years ago

This is looking awfully similar to bug 1273770...

Marco Bonardo [:mak]

Reporter

Comment 17

•

8 years ago

(In reply to Mike de Boer [:mikedeboer] from comment #16) > This is looking awfully similar to bug 1273770... I think it's the same and should probably be duped here. it depends on the AsyncShutdownTimeout metadata value, but looking on crash-stats I mostly see reports with "SessionStore: flushing all windows".

Mike de Boer [:mikedeboer]

Comment 19

•

8 years ago

What I'd like to repeat here is what Marco said in https://bugzilla.mozilla.org/show_bug.cgi?id=1273770#c23: I can probably try to answer, while I didn't write async shutdown code, I am using it in Places and looking into similar crashes. when you register a promise with addBlocker, you basically sign a contract that such promise on shutdown won't take more than 1 minute to resolve. If you fail resolving the promise for whatever reason, like a bug causing an exception before you invoke resolve, Async Shutdown time outs and forces an application crash, whose stack is the one you see, that is sort of useless cause it's the same for any async shutdown consumer. Though, attached to the crash report there is metadata, one of those is AsyncShutdownTimeout that is a "progress" object you pass along with your shutdown promise and you are expected to fill it with data useful to figure out the timeout reasons.

Mike de Boer [:mikedeboer]

Comment 20

•

8 years ago

str

(In reply to Marco Castelluccio [:marco] from comment #15) > I have a simple way to reproduce a similar crash. The crasher.html > attachment in bug 1288997 causes the content process to crash on Windows, > but hang on Linux (example report: > https://crash-stats.mozilla.com/report/index/a37a145e-6cce-40f8-9b5c- > 583852160830). Brr, this produces a horrible crash on my MBP running OSX 10.11.4. There are a couple of interesting tidbits that I found might interest some ppl following along here: First, my STR for OSX: 1. Open Nightly DEBUG with the '-jsconsole' option, mainly for seeing some fun red hues on your screen. 2. Open the following tabs in the first window: 1st: 'about:home' 2nd: 'https://html.spec.whatwg.org/' 3rd: 'https://duckduckgo.com/?q=lala&t=ffab' 4th: 'https://www.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.messages.doc/com.ibm.ws.ssl.channel.resources.sslchannelmessages.html' 3. Open a second window with the following tabs: 1st: 'about:home' 2nd: 'https://dxr.mozilla.org/mozilla-central/source/toolkit/components/places/PlacesUtils.jsm' 3rd: 'https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Error' 4th: 'https://www.yahoo.com/' 5th: 'https://en.wikipedia.org/wiki/Firefox' 4. Open a blank tab and drag 'crasher.html' into it and hit CMD+Q right after that. 5. Wait for it... BOOM! JavaScript error: resource:///modules/ContentCrashHandlers.jsm, line 75: TypeError: browser.permanentKey is not a non-null object ^-- That one is for mconley. permanentKey may be `null`! So we might be missing out on useful info in our crash reports! The main cause of the crash is of course bug 1288997, BUT that's just one way to invoke it. The real meat is in the effects it triggers: 1. It nukes the messageManager out of orbit in this case, but there are more ways to think of a content process crash, OOM or a script stalling the process, causing a hang. 2. The SessionStore TabStateFlusher is relying on the content process (frame)script to do the heavy lifting and the promises are only resolved when the ACK comes back to the main process via an IPC message. Only then will the AsyncShutdown blocker be released. But these messages never arrive. Ever. AsyncShutdown is doing the right thing here, because hanging forever is not acceptable UX. What SessionStore _could_ do is keep track of its own timer to notice flush timeouts and move along when nothing's responding properly. That way it may be able to restore more than it does now, which is zip. Additionally, I'm wondering if `messageManager.sendAsyncMessage()` is infallible? If so, why? It clearly isn't possible to send messages anymore, but there's no way to currently detect that. Or is there?

Flags: needinfo?(mdeboer) → needinfo?(mconley)

Marco Castelluccio [:marco]

Updated

•

8 years ago

Comment 21

•

8 years ago

What SessionStore/ TabStateFlusher could also do is observe 'ipc:content-shutdown' and 'oop-frameloader-crashed', like ContentCrashHandlers.jsm, which will trigger to resolve all promises.

Marco Bonardo [:mak]

Reporter

Comment 22

•

8 years ago

(In reply to Mike de Boer [:mikedeboer] from comment #20) > What SessionStore _could_ do is keep track of its own timer to notice flush > timeouts and move along when nothing's responding properly. That way it may > be able to restore more than it does now, which is zip. The only downside, is that if you do your own promise timeout, you give up the async shutdown advantage of knowing that something broke during shutdown. If we'd have done that from the beginning, we'd not have this bug and thus we'd not have any analysis of issues that this bug found. I had similar ideas, but then figured I needed an alternative way to know if something went wrong in the shutdown path (maybe telemetry?)

David Teller [:Yoric] - still alive but not very active

Comment 23

•

8 years ago

Generally, I believe that message managers should offer a way to post a message with a timeout. But yes, we also need some Telemetry.

[:philipp]

Updated

•

8 years ago

Crash Signature: nsDebugImpl::Abort | NS_InvokeByIndex ] → nsDebugImpl::Abort | NS_InvokeByIndex ] [@ AsyncShutdownTimeout | quit-application-granted | SessionStore: flushing all windows ]

alex_mayorga

Comment 24

•

8 years ago

¡Hola! FWIW just got this crash on Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0 ID:20160906030431 CSet: 8c9c4e816e86f903c1d820f3f29715dc070a5a4a too: Report ID Date Submitted bp-7d249b70-ed71-421b-9c13-0811c2160906 9/6/2016 09:29 1293 crashes in the past week at https://crash-stats.mozilla.com/signature/?product=Firefox&signature=AsyncShutdownTimeout%20%7C%20quit-application-granted%20%7C%20SessionStore%3A%20flushing%20all%20windows , setting flags accordingly. ¡Gracias! Alex Crashing Thread (0) Frame Module Signature Source 0 mozglue.dll mozalloc_abort(char const* const) memory/mozalloc/mozalloc_abort.cpp:33 1 xul.dll NS_DebugBreak xpcom/base/nsDebugImpl.cpp:436 2 xul.dll nsDebugImpl::Abort(char const*, int) xpcom/base/nsDebugImpl.cpp:146 3 xul.dll NS_InvokeByIndex xpcom/reflect/xptcall/md/win32/xptcinvoke_asm_x86_msvc.asm:54 4 xul.dll XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) js/xpconnect/src/XPCWrappedNativeJSOps.cpp:1143 5 xul.dll js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp:454 6 xul.dll InternalCall js/src/vm/Interpreter.cpp:499 7 xul.dll Interpret js/src/vm/Interpreter.cpp:2916 8 xul.dll js::RunScript(JSContext*, js::RunState&) js/src/vm/Interpreter.cpp:400 9 xul.dll js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp:472 10 xul.dll InternalCall js/src/vm/Interpreter.cpp:499 11 xul.dll js::fun_call(JSContext*, unsigned int, JS::Value*) js/src/jsfun.cpp:1249 12 xul.dll js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp:454 13 xul.dll InternalCall js/src/vm/Interpreter.cpp:499 14 xul.dll js::CrossCompartmentWrapper::call(JSContext*, JS::Handle<JSObject*>, JS::CallArgs const&) js/src/proxy/CrossCompartmentWrapper.cpp:333 15 xul.dll js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp:442 16 xul.dll InternalCall js/src/vm/Interpreter.cpp:499 17 xul.dll Interpret js/src/vm/Interpreter.cpp:2916 18 xul.dll js::RunScript(JSContext*, js::RunState&) js/src/vm/Interpreter.cpp:400 19 xul.dll js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) js/src/vm/Interpreter.cpp:472 20 xul.dll InternalCall js/src/vm/Interpreter.cpp:499 21 xul.dll js::jit::DoCallFallback js/src/jit/BaselineIC.cpp:5998 22 xul.dll js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class const*, js::TaggedProto, JSObject*) js/src/vm/ObjectGroup.cpp:531

status-firefox48: --- → affected

status-firefox49: --- → affected

status-firefox50: --- → affected

status-firefox51: --- → affected

status-firefox-esr45: --- → affected

Mike Conley (:mconley) (:⚙️)

Comment 25

•

8 years ago

(In reply to Mike de Boer [:mikedeboer] from comment #20) > JavaScript error: resource:///modules/ContentCrashHandlers.jsm, line 75: > TypeError: browser.permanentKey is not a non-null object > > ^-- That one is for mconley. permanentKey may be `null`! So we might be > missing out on useful info in our crash reports! > If permanentKey is null, that probably means that the <xul:browser> has been removed from the DOM, since the XBL binding has been removed (the XBL binding is what supplies the permanentKey). The observer in ContentCrashHandlers is hoping for the permanentKey so that it can map attempts to submit crash reports from about:tabcrashed to reports that have been collected. I suppose we have a case where the browser may have been removed from the DOM before that observer fires. The notification that shows up to submit unsubmitted crash reports should still be catching these unsubmitted crash reports though, so we have another opportunity to get them. I don't think there's a whole lot we can do if the <xul:browser> has been removed from the DOM before the oop-frameloader-crashed notification fires. The best course of action in that case might be to detect if the permanent key does not exist, and just bail out properly instead of throwing an exception.

Flags: needinfo?(mconley)

Mike Conley (:mconley) (:⚙️)

Comment 26

•

8 years ago

(In reply to Mike de Boer [:mikedeboer] from comment #20) > > What SessionStore _could_ do is keep track of its own timer to notice flush > timeouts and move along when nothing's responding properly. That way it may > be able to restore more than it does now, which is zip. This sounds like a reasonable course of action, yes. > Additionally, I'm wondering if `messageManager.sendAsyncMessage()` is > infallible? If so, why? It clearly isn't possible to send messages anymore, > but there's no way to currently detect that. Or is there? It's not infallible - here's where it can fail: http://searchfox.org/mozilla-central/rev/95b0ecf4b59a01e0524ca02f6c96ecaabe38f4d5/dom/base/nsFrameMessageManager.cpp#829 If mCallback is nullptr, it throws NS_ERROR_NOT_INITIALIZED. It looks like mCallback can be set to nullptr here: http://searchfox.org/mozilla-central/rev/95b0ecf4b59a01e0524ca02f6c96ecaabe38f4d5/dom/base/nsFrameMessageManager.cpp#1401, and here: http://searchfox.org/mozilla-central/rev/95b0ecf4b59a01e0524ca02f6c96ecaabe38f4d5/dom/base/nsFrameMessageManager.cpp#1416

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 27

•

8 years ago

Hi Andrew, this is now the topcrash on Aurora50. It jumped up 83 spots! Can you please help find an owner? Thanks!

Flags: needinfo?(overholt)

Andrew Overholt [:overholt]

Comment 28

•

8 years ago

Kan-Ru, I hear you're interested in session restore. Can you take a look?

Flags: needinfo?(overholt) → needinfo?(kchen)

Kan-Ru Chen [:kanru] (UTC+9)

Comment 29

•

8 years ago

(In reply to Andrew Overholt [:overholt] from comment #28) > Kan-Ru, I hear you're interested in session restore. Can you take a look? You're right. I'm mentoring a few people to work on session restore. I'm also trying to reduce shutdown crashes. So, looks like a perfect match! Will Wang will look at this together with me.

Assignee: nobody → wiwang

Flags: needinfo?(kchen)

Marco Castelluccio [:marco]

Comment 30

•

8 years ago

A few interesting observations, overall is Firefox 48: (73.32% in signature vs 04.53% overall) reason = EXC_BAD_ACCESS / KERN_INVALID_ADDRESS (73.41% in signature vs 07.08% overall) platform = Mac OS X (89.40% in signature vs 25.44% overall) dom_ipc_enabled = 1 (73.58% in signature vs 12.77% overall) address = 0x0 (47.98% in signature vs 91.63% overall) build_id = 20160823121617 (49.42% in signature vs 05.89% overall) build_id = 20160817112116 (44.83% in signature vs 02.21% overall) platform_pretty_version = OS X 10.11 (23.27% in signature vs 61.63% overall) Addon "Mozilla Firefox hotfix" = true (23.27% in signature vs 01.15% overall) platform_version = 10.11.6 15G1004 (16.08% in signature vs 00.98% overall) platform_pretty_version = OS X 10.10

Marco Castelluccio [:marco]

Updated

•

8 years ago

Comment 31

•

8 years ago

Hi Kan-Ru, Will, this remains a top crasher on Aurora (at 5th place now). Have you had any luck investigating the problem so far? I hope we can get this one fixed soon as 50 hits Beta channel this week.

Flags: needinfo?(wiwang)

Flags: needinfo?(kchen)

Will Wang [:WillWang]

Assignee

Comment 32

•

8 years ago

Hi! Based on the rich information above, allow me to propose following actions: (please correct me if any :) ) 1. Add our(session restore) own promise timeout as first step to solve this top crasher. This timeout will be needed anyway since there are many ways which can cause a hang, as comment 20 indicated. 2. In another bug, add telemetry and/or sort of tracking mechanism to ensure we can still be aware of something wrong, as comment 22 and 23 suggested. As for adding our own promise timeout, I'd like to have the promise resolved/rejected under 2 conditions: 1. before hitting the AsyncShutdown's 1 minute limit. 2. once 'ipc:content-shutdown' and 'oop-frameloader-crashed' are observed, especially the latter. Any suggestion would be greatly appreciated! ==== Hi Ritu, I'd like to confirm with you that if I try to make patch landed before 10/24 (two weeks before firefox 50 release), does this meet your expectation? Thanks! :)

Flags: needinfo?(wiwang)

Flags: needinfo?(rkothari)

Flags: needinfo?(kchen)

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 33

•

8 years ago

(In reply to Will Wang [:WillWang] from comment #32) > > Hi Ritu, > > I'd like to confirm with you that if I try to make patch landed before 10/24 > (two weeks before firefox 50 release), does this meet your expectation? > Thanks! :) Hi Will, it depends on the risk associated with the patch. If we have good verification data from pre-beta channels that the crash goes away with the fix and if it's deemed low risk fix, I would be happy to uplift to Beta50. Usually post-Beta6, we only take fixes for severe crashes, hangs, sec-crits/highs and recent regressions. Hope that helps! The sooner we get a fix in the better.

Flags: needinfo?(rkothari)

Will Wang [:WillWang]

Assignee

Comment 34

•

8 years ago

Attached patch 1255977.patch (obsolete) — Details — Splinter Review

Hi Yoric, Would you mind giving your thought for this patch? Thanks! As comment 32, this patch: 1. Add our(session restore) own promise timeout, which is set to 50s for now to avoid hitting 60s limit of async shutdown. 2. Add two observers to resolve promise for oop-frameloader-crashed and ipc:content-shutdown. 3. Add a looseTimer() which is derived from the one(Thanks for your works! :p ) in AsyncShutdown.jsm and refactor as DOM promise version. If patch is basically ok, I would like to add comments and some details then send it to review. Thank you!

Attachment #8802068 - Flags: feedback?(dteller)

David Teller [:Yoric] - still alive but not very active

Comment 35

•

8 years ago

Comment on attachment 8802068 [details] [diff] [review] 1255977.patch Review of attachment 8802068 [details] [diff] [review]: ----------------------------------------------------------------- Yeah, that looks like a valid strategy. You'll need to document against what you're defending and what's the worst that can happen in case of timeout.

Attachment #8802068 - Flags: feedback?(dteller) → feedback+

alex_mayorga

Comment 36

•

8 years ago

¡Hola! FWIW, this is also a thing on Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0 ID:20161017030209 CSet: 94b0fddf96b43942bdd851a3275042909ea37e09 Report ID Date Submitted bp-bd77ff9d-73d7-478b-8be7-7d87e2161019 19/10/2016 11:54 a.m. ¡Gracias! Alex STR: - Come back to a locked Dell Latitude E6420 running Windows 7 Enterprise. - Notice there's a green badge with a white up arrow requesting an update to Nightly. - Click the "Restart Nightly to apply updates" option in the "Hamburger menu" - Notice there's another window on the back with "Restart later" and "Restart now" buttons. - Click "Restart now" Result: Nightly crashes Expected result: Nightly updates and starts with no crashes.

status-firefox52: --- → affected

Will Wang [:WillWang]

Assignee

Comment 37

•

8 years ago

Attached patch bug-1255977.patch (obsolete) — Details — Splinter Review

Hi Yoric, Could you help to review this patch? This patch is based on the one in comment 34 with several improvements: 1. add more comments 2. sort all Cu.import() 3. add timer.cancel() to timer function and move the timer function into SessionStoreInternal object. Thank you for your help :)

Attachment #8802068 - Attachment is obsolete: true

Attachment #8802861 - Flags: review?(dteller)

David Teller [:Yoric] - still alive but not very active

Comment 38

•

8 years ago

Comment on attachment 8802861 [details] [diff] [review] bug-1255977.patch Review of attachment 8802861 [details] [diff] [review]: ----------------------------------------------------------------- r=me with a little more doc. ::: browser/components/sessionstore/SessionStore.jsm @@ +1464,5 @@ > + // To prevent blocker from breaking the 60 sec limit(which will cause a > + // crash) of async shutdown during flushing all windows, we resolve the > + // promise passed to blocker once: > + // 1. the flushing exceed 50 sec, or > + // 2. 'oop-frameloader-crashed' or 'ipc:content-shutdown' is observed. Could you add a little something explaining what the resulting session will be on next startup if we have had a oop-frameloader-crahsed or a ipc:content-shutdown? I *think* that we're going to reuse the state we saved last, which sounds ok to me, but this needs documenting.

Attachment #8802861 - Flags: review?(dteller) → review+

Will Wang [:WillWang]

Assignee

Comment 39

•

8 years ago

Attached patch Patch(v2): Resolve promise once flushing timeout or crash observed. — Details — Splinter Review

Yoric, thanks for your review! I add a line of comment and carry r+ from you. (Please correct me if any, thanks!)

Attachment #8802861 - Attachment is obsolete: true

Attachment #8803277 - Flags: review+

Will Wang [:WillWang]

Assignee

Updated

•

8 years ago

Keywords: checkin-needed

Pulsebot

Comment 40

•

8 years ago

Pushed by kchen@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/fbd5239a7ed6 Resolve promise once flushing timeout or crash observed. r=yoric

Keywords: checkin-needed

Ryan VanderMeulen [:RyanVM]

Comment 41

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/fbd5239a7ed6

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox52: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → Firefox 52

Will Wang [:WillWang]

Assignee

Comment 42

•

8 years ago

Comment on attachment 8803277 [details] [diff] [review] Patch(v2): Resolve promise once flushing timeout or crash observed. Approval Request Comment [Feature/regressing bug #]: None. (Session Restore is not a new feature, and there is no obvious regressing bug) [User impact if declined]: If declined, Firefox will crash during the shutdown if content process is unable to response(e.g., hanged). [Describe test coverage new/current, TreeHerder]: Current test passed at both local and treeherder[1]. [Risks and why]: Low risk since the patched file don't affect other critical components. [String/UUID change made/needed]: None. ==== [1] https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=fbd5239a7ed6a37beec87b0ea1a5aea1ac8723d8

Attachment #8803277 - Flags: approval-mozilla-beta?

Attachment #8803277 - Flags: approval-mozilla-aurora?

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 43

•

8 years ago

Comment on attachment 8803277 [details] [diff] [review] Patch(v2): Resolve promise once flushing timeout or crash observed. I verified that the last signature which has the highest volume seems gone from 52 after the fix landed, hoping this helps on 51 and 50 as well. Aurora51+, Beta50+

Attachment #8803277 - Flags: approval-mozilla-beta?

Attachment #8803277 - Flags: approval-mozilla-beta+

Attachment #8803277 - Flags: approval-mozilla-aurora?

Attachment #8803277 - Flags: approval-mozilla-aurora+

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 44

•

8 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/6648f996b7024cd155e1a32757d83dfe63939041 https://hg.mozilla.org/releases/mozilla-beta/rev/0b0bf76318dc9c5eaf2b8f86bb4132a1d01f7d93

status-firefox50: affected → fixed

status-firefox51: affected → fixed

1255977.patch 8 years ago Will Wang [:WillWang] 3.51 KB, patch	Yoric : feedback+	Details \| Diff \| Splinter Review
bug-1255977.patch 8 years ago Will Wang [:WillWang] 5.38 KB, patch	Yoric : review+	Details \| Diff \| Splinter Review
Patch(v2): Resolve promise once flushing timeout or crash observed. 8 years ago Will Wang [:WillWang] 5.46 KB, patch	wiwang : review+ ritu : approval-mozilla-aurora+ ritu : approval-mozilla-beta+	Details \| Diff \| Splinter Review