Async shutdown crash with "Sanitize.js: Sanitize on shutdown" and "state: (none)" or "Places Clients shutdown" and "conditions":[]
Categories
(Firefox :: General, defect)
Tracking
()
People
(Reporter: mak, Unassigned)
References
Details
(Keywords: topcrash)
Crash Data
Attachments
(1 file)
11.50 KB,
text/plain
|
Details |
https://bugzilla.mozilla.org/show_bug.cgi?id=1250424#c26 > after the fixes have landed in 46.0b2 there are still two common occurrences > of the signature with sanitize.js: > > 36% of the crashes in mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | > NS_InvokeByIndex contain: > https://crash-stats.mozilla.com/search/?product=Firefox&version=46. > 0b2&process_type=browser&signature=%3Dmozalloc_abort+%7C+NS_DebugBreak+%7C+ns > DebugImpl%3A%3AAbort+%7C+NS_InvokeByIndex&async_shutdown_timeout=~lineNumber% > 22%3A0&_facets=signature&_facets=user_comments&_facets=platform&_facets=versi > on&_facets=adapter_vendor_id&_facets=async_shutdown_timeout&_columns=date&_co > lumns=signature&_columns=product&_columns=version&_columns=build_id&_columns= > platform#facet-async_shutdown_timeout > > 11% contain: > https://crash-stats.mozilla.com/search/?product=Firefox&version=46. > 0b2&process_type=browser&signature=%3Dmozalloc_abort+%7C+NS_DebugBreak+%7C+ns > DebugImpl%3A%3AAbort+%7C+NS_InvokeByIndex&async_shutdown_timeout=~lineNumber% > 22%3A839&_facets=signature&_facets=user_comments&_facets=platform&_facets=ver > sion&_facets=adapter_vendor_id&_facets=async_shutdown_timeout&_columns=date&_ > columns=signature&_columns=product&_columns=version&_columns=build_id&_column > s=platform#facet-async_shutdown_timeout
Updated•8 years ago
|
Comment 1•8 years ago
|
||
[Tracking Requested - why for this release]: Looking at https://crash-stats.mozilla.com/search/?product=Firefox&version=46.0b7&process_type=browser&signature=%3Dmozalloc_abort+%7C+NS_DebugBreak+%7C+nsDebugImpl%3A%3AAbort+%7C+NS_InvokeByIndex&_facets=signature&_facets=user_comments&_facets=platform&_facets=version&_facets=adapter_vendor_id&_facets=async_shutdown_timeout&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-async_shutdown_timeout Slightly over 2/3 of all our AsyncShutdown crashes in 46.0b7 are those issues, and overall almost 2% of the crashes in that version are AsyncShutdown, so I'd hope we can get something moving here. Requesting tracking because it looks like ~1.3% of 46.0b7 crashes are these issues.
Comment 2•8 years ago
|
||
Even worse in 46.0b8 - about 2.5% of the overall crashes are AsynCShutdown, and 83% of those are one of the two sanitize variants, so it's 2.1% of overall 46.0b8 crashes.
Reporter | ||
Comment 3•8 years ago
|
||
doesn't look like we have enough time to test a fix in Nightly, check numbers and upload. And the collected crash metadata sucks atm (yes, I should finish bug 1258354 to fix that). Once bug 1258354 is fixed, we need to collect some of the crashes to be able to tell what's up. So at a maximum we could workaound the crash for 46, if you think it's worth it. That could cause other issues though, like incomplete sanifications.
Comment 4•8 years ago
|
||
I think we would have time to test a fix in nightly and uplift to 46 even now. This is a bad topcrash for 46 late betas, so we do need to try a workaround. We could land a fix in beta 11 or even in RC.
Reporter | ||
Comment 6•8 years ago
|
||
I need bug 1258354 (pending review) to be able to know WHAT to look at, cause the current metadata has no information at all (state: none). without that info the only thing I could do is trying to timeout before the crash, but as I said that may cause privacy issues. So I hope to get that review quick, get some annotations with proper metadata, then we can check what are our options.
Reporter | ||
Comment 7•8 years ago
|
||
I asked for approval in bug 1258354, so far, after that change, I didn't see nightly reports for this crash. Could be we need to increase population, so uplifting that would help.
Reporter | ||
Comment 8•8 years ago
|
||
After the fix for bug 1258354, I cannot find anymore crashes with sanitize in the metadata. Robert, do you see anything I'm missing?
Comment 9•8 years ago
|
||
kairo is away this week, so he probably cannot answer. just by looking at 46.0b crash data it seems that starting in rc2 after bug 1258354 has landed, crashes with sanitize.js in the metadata have shifted to a non-descript: '{"phase":"Places Clients shutdown","conditions":[]}' async shutdown crashes for rc1: https://crash-stats.mozilla.com/search/?product=Firefox&version=46.0b99&process_type=browser&build_id=20160418114253&signature=%3Dmozalloc_abort+|+NS_DebugBreak+|+nsDebugImpl%3A%3AAbort+|+NS_InvokeByIndex&_facets=signature&_facets=build_id&_facets=async_shutdown_timeout&_columns=date&_columns=signature#facet-async_shutdown_timeout async shutdown crashes for rc2&3: https://crash-stats.mozilla.com/search/?product=Firefox&version=46.0b99&process_type=browser&build_id=20160421124000&build_id=20160420024437&signature=%3Dmozalloc_abort+|+NS_DebugBreak+|+nsDebugImpl%3A%3AAbort+|+NS_InvokeByIndex&_facets=signature&_facets=user_comments&_facets=async_shutdown_timeout&_columns=date&_columns=signature#facet-async_shutdown_timeout
Reporter | ||
Comment 10•8 years ago
|
||
(In reply to philipp from comment #9) > kairo is away this week, so he probably cannot answer. > just by looking at 46.0b crash data it seems that starting in rc2 after bug > 1258354 has landed, crashes with sanitize.js in the metadata have shifted to > a non-descript: '{"phase":"Places Clients shutdown","conditions":[]}' Ah, that sucks... but overall looks like there may have been a small reduction in the number of crashes? Now we should figure out why that metadata is empty.
Comment 11•8 years ago
|
||
In release, the crash is spiking since 2016-05-05 (increased by ~96% until 2016-05-08) and is #4 in top-crashes for 46.0.1, #4 for 47.0b3 and #45 for 48.0a1.
Reporter | ||
Comment 12•8 years ago
|
||
I'll see if I can figure out something more. Any help is clearly welcome if anyone notices wrong behavior in async shutdown or the sanitize implementation of it. So far I have only noticed a typo that could make us wrongly report the metadata (will file a bug shortly). And there is clearly a bug in the way we report the blocker state, it should not really be empty, but I haven't yet identified the reason. I think that needs try reproducing an hang at different points in the shutdown phase and see what gets reported, maybe I can figure out why we get an empty state.
Reporter | ||
Updated•8 years ago
|
Reporter | ||
Comment 13•8 years ago
|
||
Hm, now I maybe see why it's empty, we don't assign to the outparam. So, I have 2 bugs I can file and fix for the metadata reporting.
Updated•8 years ago
|
Updated•8 years ago
|
Comment hidden (spam) |
Comment 15•8 years ago
|
||
Marco is there more to do here? (See bug 1255745 comment 33)?
Reporter | ||
Comment 16•8 years ago
|
||
(In reply to David Bolter [:davidb] from comment #15) > Marco is there more to do here? (See bug 1255745 comment 33)? Yes, so far I fixed all the bogus code I could find, but the crash still happens. That means either: 1. there are other hang-on-shutdown causes I didn't find yet 2. those users have an effectively large/corrupt history database that we can't sanitize in an acceptable time and we have no way to manage that apart from improving general perf (that we are also doing) 3. there's a bug in async shutdown code Surely there's a bug in the code that collects information about blockers since they read like {"phase":"Places Clients shutdown","conditions":[]} while conditions should contain something, that could help us understand what blocks... Unfortunately it's hard to reproduce that locally, but surely it would be great to get a proper log. I also have another shutdown patch in bug 1275878 that I was hoping to land in the hope to simplify some of this code, but unfortunately it leaks in even more mysterious ways in a single unrelated private browsing browser-chrome test.
Updated•8 years ago
|
Comment 17•8 years ago
|
||
In my case, Firefox usually becomes unstable before crash. It begins to hang for tens of seconds from time to time and/or to show visual artifacts. And usually it crashes on restart in such cases. I suspect that in most my cases Firefox consumes almost all virtual memory (e.g., 3.7 from 4 GiB, see for example https://crash-stats.mozilla.com/report/index/b9954095-7ef2-44f9-bad5-cb05a2160901), and it tries to free some memory via aggressive garbage collection (what causes a hangups for tens seconds). Perhaps this process leads to memory corruption and, as consequence, to timeout on restart.
Reporter | ||
Comment 18•8 years ago
|
||
(In reply to Ruvim Pinka from comment #17) > In my case, Firefox usually becomes unstable before crash. It begins to hang > for tens of seconds from time to time and/or to show visual artifacts. And > usually it crashes on restart in such cases. > > I suspect that in most my cases Firefox consumes almost all virtual memory > (e.g., 3.7 from 4 GiB, see for example > https://crash-stats.mozilla.com/report/index/b9954095-7ef2-44f9-bad5- > cb05a2160901), and it tries to free some memory via aggressive garbage > collection (what causes a hangups for tens seconds). Perhaps this process > leads to memory corruption and, as consequence, to timeout on restart. Hi, could you please check your database with https://addons.mozilla.org/en-US/firefox/addon/places-maintenance/ and tell if it improves anything or if it reports corruption at the end?
Comment 19•8 years ago
|
||
(In reply to Marco Bonardo [::mak] from comment #18) I have checked my database with places-maintenance add-on. It didn't find any issue or clean up anything. The initial statistic is byte-to-byte the same as statistic in the maintenance result. > Statistics Database size is 81920 KiB user_version is 30 page_size is 32768 cache_size is -2048 journal_mode is wal synchronous is 1 History can store a maximum of 104858 unique pages [per table details are skipped] It seems that the negative number in cache_size value is subject of a bug with signed/unsigned numbers in the add-on.
Reporter | ||
Comment 20•8 years ago
|
||
the negative value is the actual value, it's just a magic value. Could you please attach a log from about:support?
Comment 21•8 years ago
|
||
Reporter | ||
Comment 22•8 years ago
|
||
ok, nothing interesting there, though you have a lot of add-ons and that could help explaining some of the hangs. Not the crash on shutdown though. But you pointed out an interesting data, virtual memory, so I did some additional aggregates: 1. doesn't seem to happen on 64bit arch (aggregate on cpu arch), only x86. It's true x64 population is smaller though. 2. aggregating on total virtual memory, it happens a lot on systems with 2GB, a bit less on systems with 4GB, no reports with more memory (64bit). 3. aggregating on available virtual memory, points out most of the crashes happen when it's small (Some hundreds MB), but still there are some crashes with 1GB available, so it may not be the whole story. 4. aggregating on total physical memory, seems to only happen on systems with ram less or equal to 4GB. Another interesting fact, by looking at users comments, is that in many cases the browser was unresponsive/hanging on some page (very often facebook), and they tried to restart the browser, and we couldn't shutdown properly and crashed, likely cause something was keeping the main-thread blocked. Sounds like a good part of the async shutdown crashes are not really due to async shutdown problems, but to an hung main-thread that can't properly process async shutdown promises... Benjamin, is there a way we can tell if the main-thread is hanging from some time, and don't report an async shutdown crash in those cases? Or even just annotate the crash somehow so we can distinguish in crash-stats real async shutdown crashes from the ones caused by an hung main-thread. Also, looks like async shtudown crashig after one minute may conflict with hang detection?
Comment 23•8 years ago
|
||
Re add-ons — only half of them are turned on. And the hangs occur after long uptime only, usually with following crash on restart (I restart Firefox after several long hangs). Re total physical memory (point 4) — it is 6 GiB in my case (i.e., more than 4 GiB).i I don't see the total physical memory in the crash-reports, but they include Available Physical Memory and System Memory Use Percentage — so it seems that the total size can be calculated (depends on what "System Memory" means).
Reporter | ||
Comment 24•8 years ago
|
||
(In reply to Ruvim Pinka from comment #23) > Re total physical memory (point 4) — it is 6 GiB in my case (i.e., more than > 4 GiB).i Yes, I think your case is the latter part of comment 22 and memory could just be a red herring (or related to the hang, more than to the crash). Your browser is hanging for whatever reason, you try to restart it, but since it's hung we can't properly shutdown, and we crash on shutdown.
Comment 25•8 years ago
|
||
> Another interesting fact, by looking at users comments, is that in many > cases the browser was unresponsive/hanging on some page (very often > facebook), and they tried to restart the browser, and we couldn't shutdown > properly and crashed, likely cause something was keeping the main-thread > blocked. > > Sounds like a good part of the async shutdown crashes are not really due to > async shutdown problems, but to an hung main-thread that can't properly > process async shutdown promises... The asyncshutdown code can't kill Firefox except on the main thread. So by definition we're able to process messages on the main thread at the time of the crash. It's possible that the thread could have been really hung up sometime during the process, but we don't know that. > Benjamin, is there a way we can tell if the main-thread is hanging from some > time, and don't report an async shutdown crash in those cases? We *want* to report an async shutdown in this case. It's not supposed to take this long. > Also, looks like async shtudown crashig after one minute may conflict with > hang detection? I don't understand what this means.
Reporter | ||
Comment 26•8 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #25) > The asyncshutdown code can't kill Firefox except on the main thread. So by > definition we're able to process messages on the main thread at the time of > the crash. It's possible that the thread could have been really hung up > sometime during the process, but we don't know that. The fact is, a lot of users comments point out that the browser was hung BEFORE they even tried to shutdown, they were browsing the web, the browser hung and THEN they tried to restart it. At this point somehow async shutdown wasn't unable to complete and there may be a relation with the fact the browser was already hung. Something fishy was already happening on the main-thread and we don't know what. This makes hard to distinguish between a real shutdown timeout due to the shutdown action itself, and a shutdown timeout due to something else that was already ongoing. That makes fixing these shutdown crashes very very hard, regardless of any change we could make to the shutdown code, and we don't have a stack of the previous hang. > > Also, looks like async shtudown crashig after one minute may conflict with > > hang detection? > > I don't understand what this means. Sorry, I mean that if the browser was already hung, reporting an async shutdown crash is less useful cause it covers the real cause of the hang. I seemed to recall we had hang detection code that reports hangs after N minutes, so I was guessing if async shutdown was crashing before the hang reporter could do its job.
Comment 27•8 years ago
|
||
If a user is able to quit the browser, it may have been very slow or janky, but it wasn't completely "hung". Because otherwise it wouldn't have processed the mouse clicks or keypresses which trigger quitting. Note that the crash signatures here are not intended to ascribe *cause*: this is a shutdown hang because shutting down took more than 60 seconds. The signature shows the state at the time we decided to crash. It may be worth adding additional metrics (or even a profile?) to these reports, so that we could break time down by whether we're spending time in content-JS, chrome-JS, GC, or other things. I'd happily support filing a bug for that.
Comment 28•8 years ago
|
||
And the hang detection code exists but was never enabled beyond nightly because of false positives.
Comment 29•8 years ago
|
||
Recently my Firefox has crashed again. The symptoms before the crash was as usual — hangs and visual artifacts, and actual crash after pressing Exit Firefox button. But the crash Signature is another: "_VEC_memzero | _VEC_memzero" (see https://crash-stats.mozilla.com/report/index/63236a27-7f18-47cf-b5ca-9b1572160907 ). Also, using Performance Monitor, I have noticed that during hangs Firefox consumes CPU Privileged Time up to 99%. I can attach the diagram here, or create separate bug report.
Comment hidden (spam) |
Reporter | ||
Comment 31•8 years ago
|
||
(In reply to Yorgos from comment #30) > https://crash-stats.mozilla.com/report/index/9f9a1118-f08b-41e1-9d7e- > 37e0f2160912 Can you tell something about what you were doing when you saw this crash, if you recall?
Comment 33•8 years ago
|
||
Just a quick comment. I know someone else says scrolling Facebook News will reproducibly create OOM crashes on at least 4 different Computers and on differing Windows versions. Others however have so far failed to reproduce this, and one of the Crash Signatures the bug was filed under is not at all common. See Bug 1291294
Comment hidden (spam) |
Reporter | ||
Comment 35•8 years ago
|
||
You can stop reporting crashes here, it's not useful atm (we already track those on crash-stats and can query from there). Moreover I'm not sure I have a plan of action for this specific bug, since if the hang starts earlier than shutdown, it's unlikely I can fix it in Places shutdown code. We should rather try to figure out why facebook hangs, that is quite a different challenge.
Comment 36•8 years ago
|
||
Thanks a lot for helping.
Comment 37•8 years ago
|
||
~91% of crashes with this signature (compared to ~8% overall) have contains_memory_report=1, which might mean that this often happens in low memory conditions. (In reply to Marco Bonardo [::mak] from comment #22) > Another interesting fact, by looking at users comments, is that in many > cases the browser was unresponsive/hanging on some page (very often > facebook), and they tried to restart the browser, and we couldn't shutdown > properly and crashed, likely cause something was keeping the main-thread > blocked. > > Sounds like a good part of the async shutdown crashes are not really due to > async shutdown problems, but to an hung main-thread that can't properly > process async shutdown promises... This sounds similar to what happens with bug 1288997. In that case, on Windows we crash with 'mozilla::ipc::ProcessLink::SendMessageW', on Linux and Mac the browser hangs. See bug 1255977 comment 20.
Reporter | ||
Comment 38•8 years ago
|
||
And if it's not a real "hang", it may be the system is so slow (swapping?) that events are processed very slowly, and we can't complete async shutdown in a timely manner.
Comment 39•8 years ago
|
||
Some other interesting data points: (91.41% in signature vs 08.39% overall) contains_memory_report = 1 (49.69% in signature vs 06.52% overall) GFX_ERROR "Failed 2 buffer db=" (55.11% in signature vs 16.16% overall) available_virtual_memory < 512 MB (40.87% in signature vs 10.18% overall) domain = www.facebook.com (100.0% in signature vs 75.08% overall) dom_ipc_enabled = False (65.48% in signature vs 44.86% overall) platform_pretty_version = Windows 7 (55.61% in signature vs 38.08% overall) platform_version = 6.1.7601 Service Pack 1 (35.39% in signature vs 05.31% overall) GFX_ERROR "Failed 2 buffer db=" ∧ available_virtual_memory < 512 MB So the crash is non-e10s only, this means it can't be the same problem as bug 1255977 comment 20.
Reporter | ||
Comment 40•8 years ago
|
||
I'm not actively working on this atm, and I have exhausted all the ideas I had, there's still the fact this crash can happen if the system or the main-thread are particularly busy and the message about closing the db takes too long to be handled. In those cases there's nothing I can do in Places itself to prevent the problem. Many comments indeed report things like "Firefox was not responding so I closed and reopened. Nothing happened and this box popped up". I still think it's a very horrible solution to show a crash dialog to the user if shutdown times out. If we could report these without a dialog making us look so unstable, it would be great. On the other side, looks like the volume of these crashes has reduced greatly in 50/51/52? I may re-examine if the situation gets worse.
Comment 41•5 years ago
|
||
Showing results from 7 days ago
Signature report for AsyncShutdownTimeout | profile-before-change | Sqlite.jsm shutdown blocker
2,908 Results
Signature report for AsyncShutdownTimeout | profile-before-change | Sqlite.jsm shutdown blocker
Processor Notes contains "Abort | mozalloc_abort | Abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex"
124 Results
Signature report for AsyncShutdownTimeout | profile-before-change | Sqlite.jsm shutdown blocker
Processor Notes contains "Abort | mozalloc_abort | Abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex"
41 Results
Comment 42•5 years ago
|
||
Change the status for beta to have the same as nightly and release.
For more information, please visit auto_nag documentation.
Comment 43•3 years ago
|
||
Marking this as Resolved > Worksforme since no other crashes with this signature have been reported in the last 6 months.
Description
•