Closed Bug 1258350 Opened 8 years ago Closed 3 years ago

Async shutdown crash with "Sanitize.js: Sanitize on shutdown" and "state: (none)" or "Places Clients shutdown" and "conditions":[]

Categories

(Firefox :: General, defect)

defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox46 + wontfix
firefox47 --- affected
firefox48 --- affected
firefox49 --- affected
firefox-esr60 --- affected
firefox67 --- affected
firefox68 --- affected
firefox69 --- affected

People

(Reporter: mak, Unassigned)

References

Details

(Keywords: topcrash)

Crash Data

Attachments

(1 file)

https://bugzilla.mozilla.org/show_bug.cgi?id=1250424#c26
> after the fixes have landed in 46.0b2 there are still two common occurrences
> of the signature with sanitize.js:
> 
> 36% of the crashes in mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort |
> NS_InvokeByIndex contain:
> https://crash-stats.mozilla.com/search/?product=Firefox&version=46.
> 0b2&process_type=browser&signature=%3Dmozalloc_abort+%7C+NS_DebugBreak+%7C+ns
> DebugImpl%3A%3AAbort+%7C+NS_InvokeByIndex&async_shutdown_timeout=~lineNumber%
> 22%3A0&_facets=signature&_facets=user_comments&_facets=platform&_facets=versi
> on&_facets=adapter_vendor_id&_facets=async_shutdown_timeout&_columns=date&_co
> lumns=signature&_columns=product&_columns=version&_columns=build_id&_columns=
> platform#facet-async_shutdown_timeout
> 
> 11% contain:
> https://crash-stats.mozilla.com/search/?product=Firefox&version=46.
> 0b2&process_type=browser&signature=%3Dmozalloc_abort+%7C+NS_DebugBreak+%7C+ns
> DebugImpl%3A%3AAbort+%7C+NS_InvokeByIndex&async_shutdown_timeout=~lineNumber%
> 22%3A839&_facets=signature&_facets=user_comments&_facets=platform&_facets=ver
> sion&_facets=adapter_vendor_id&_facets=async_shutdown_timeout&_columns=date&_
> columns=signature&_columns=product&_columns=version&_columns=build_id&_column
> s=platform#facet-async_shutdown_timeout
Depends on: 1258354
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] → [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ]
[Tracking Requested - why for this release]:

Looking at https://crash-stats.mozilla.com/search/?product=Firefox&version=46.0b7&process_type=browser&signature=%3Dmozalloc_abort+%7C+NS_DebugBreak+%7C+nsDebugImpl%3A%3AAbort+%7C+NS_InvokeByIndex&_facets=signature&_facets=user_comments&_facets=platform&_facets=version&_facets=adapter_vendor_id&_facets=async_shutdown_timeout&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-async_shutdown_timeout

Slightly over 2/3 of all our AsyncShutdown crashes in 46.0b7 are those issues, and overall almost 2% of the crashes in that version are AsyncShutdown, so I'd hope we can get something moving here.

Requesting tracking because it looks like ~1.3% of 46.0b7 crashes are these issues.
Even worse in 46.0b8 - about 2.5% of the overall crashes are AsynCShutdown, and 83% of those are one of the two sanitize variants, so it's 2.1% of overall 46.0b8 crashes.
doesn't look like we have enough time to test a fix in Nightly, check numbers and upload. And the collected crash metadata sucks atm (yes, I should finish bug 1258354 to fix that).
Once bug 1258354 is fixed, we need to collect some of the crashes to be able to tell what's up.

So at a maximum we could workaound the crash for 46, if you think it's worth it.
That could cause other issues though, like incomplete sanifications.
I think we would have time to test a fix in nightly and uplift to 46 even now. This is a bad topcrash for 46 late betas, so we do need to try a workaround.   We could land a fix in beta 11 or even in RC.
Flags: needinfo?(mak77)
In short we want to track this and very much want a workaround.
I need bug 1258354 (pending review) to be able to know WHAT to look at, cause the current metadata has no information at all (state: none).
without that info the only thing I could do is trying to timeout before the crash, but as I said that may cause privacy issues.
So I hope to get that review quick, get some annotations with proper metadata, then we can check what are our options.
I asked for approval in bug 1258354, so far, after that change, I didn't see nightly reports for this crash. Could be we need to increase population, so uplifting that would help.
Flags: needinfo?(mak77)
After the fix for bug 1258354, I cannot find anymore crashes with sanitize in the metadata.
Robert, do you see anything I'm missing?
Flags: needinfo?(kairo)
(In reply to philipp from comment #9)
> kairo is away this week, so he probably cannot answer. 
> just by looking at 46.0b crash data it seems that starting in rc2 after bug
> 1258354 has landed, crashes with sanitize.js in the metadata have shifted to
> a non-descript: '{"phase":"Places Clients shutdown","conditions":[]}'

Ah, that sucks... but overall looks like there may have been a small reduction in the number of crashes?
Now we should figure out why that metadata is empty.
In release, the crash is spiking since 2016-05-05 (increased by ~96% until 2016-05-08) and is #4 in top-crashes for 46.0.1, #4 for 47.0b3 and #45 for 48.0a1.
I'll see if I can figure out something more. Any help is clearly welcome if anyone notices wrong behavior in async shutdown or the sanitize implementation of it.
So far I have only noticed a typo that could make us wrongly report the metadata (will file a bug shortly).
And there is clearly a bug in the way we report the blocker state, it should not really be empty, but I haven't yet identified the reason. I think that needs try reproducing an hang at different points in the shutdown phase and see what gets reported, maybe I can figure out why we get an empty state.
Assignee: nobody → mak77
Summary: Async shutdown crash with "Sanitize.js: Sanitize on shutdown" and "state: (none)" → Async shutdown crash with "Sanitize.js: Sanitize on shutdown" and "state: (none)" or "Places Clients shutdown" and "conditions":[]
Hm, now I maybe see why it's empty, we don't assign to the outparam. So, I have 2 bugs I can file and fix for the metadata reporting.
Depends on: 1271280
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] → [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ Abort | mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ]
Marco is there more to do here? (See bug 1255745 comment 33)?
Flags: needinfo?(mak77)
(In reply to David Bolter [:davidb] from comment #15)
> Marco is there more to do here? (See bug 1255745 comment 33)?

Yes, so far I fixed all the bogus code I could find, but the crash still happens.
That means either:
1. there are other hang-on-shutdown causes I didn't find yet
2. those users have an effectively large/corrupt history database that we can't sanitize in an acceptable time and we have no way to manage that apart from improving general perf (that we are also doing)
3. there's a bug in async shutdown code

Surely there's a bug in the code that collects information about blockers since they read like {"phase":"Places Clients shutdown","conditions":[]} while conditions should contain something, that could help us understand what blocks... Unfortunately it's hard to reproduce that locally, but surely it would be great to get a proper log.

I also have another shutdown patch in bug 1275878 that I was hoping to land in the hope to simplify some of this code, but unfortunately it leaks in even more mysterious ways in a single unrelated private browsing browser-chrome test.
Flags: needinfo?(mak77)
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ Abort | mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] → [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex ] [@ mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ Abort | mozalloc_abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex ] [@ AsyncShutd…
In my case, Firefox usually becomes unstable before crash. It begins to hang for tens of seconds from time to time and/or to show visual artifacts. And usually it crashes on restart in such cases.

I suspect that in most my cases Firefox consumes almost all virtual memory (e.g., 3.7 from 4 GiB, see for example https://crash-stats.mozilla.com/report/index/b9954095-7ef2-44f9-bad5-cb05a2160901), and it tries to free some memory via aggressive garbage collection (what causes a hangups for tens seconds). Perhaps this process leads to memory corruption and, as consequence, to timeout on restart.
(In reply to Ruvim Pinka from comment #17)
> In my case, Firefox usually becomes unstable before crash. It begins to hang
> for tens of seconds from time to time and/or to show visual artifacts. And
> usually it crashes on restart in such cases.
> 
> I suspect that in most my cases Firefox consumes almost all virtual memory
> (e.g., 3.7 from 4 GiB, see for example
> https://crash-stats.mozilla.com/report/index/b9954095-7ef2-44f9-bad5-
> cb05a2160901), and it tries to free some memory via aggressive garbage
> collection (what causes a hangups for tens seconds). Perhaps this process
> leads to memory corruption and, as consequence, to timeout on restart.

Hi, could you please check your database with https://addons.mozilla.org/en-US/firefox/addon/places-maintenance/ and tell if it improves anything or if it reports corruption at the end?
(In reply to Marco Bonardo [::mak] from comment #18)
I have checked my database with places-maintenance add-on. It didn't find any issue or clean up anything. The initial statistic is byte-to-byte the same as statistic in the maintenance result.

> Statistics
Database size is 81920 KiB
user_version is 30
page_size is 32768
cache_size is -2048
journal_mode is wal
synchronous is 1
History can store a maximum of 104858 unique pages
[per table details are skipped]

It seems that the negative number in cache_size value is subject of a bug with signed/unsigned numbers in the add-on.
the negative value is the actual value, it's just a magic value.

Could you please attach a log from about:support?
ok, nothing interesting there, though you have a lot of add-ons and that could help explaining some of the hangs. Not the crash on shutdown though.

But you pointed out an interesting data, virtual memory, so I did some additional aggregates:
1. doesn't seem to happen on 64bit arch (aggregate on cpu arch), only x86. It's true x64 population is smaller though.
2. aggregating on total virtual memory, it happens a lot on systems with 2GB, a bit less on systems with 4GB, no reports with more memory (64bit).
3. aggregating on available virtual memory, points out most of the crashes happen when it's small (Some hundreds MB), but still there are some crashes with 1GB available, so it may not be the whole story.
4. aggregating on total physical memory, seems to only happen on systems with ram less or equal to 4GB.

Another interesting fact, by looking at users comments, is that in many cases the browser was unresponsive/hanging on some page (very often facebook), and they tried to restart the browser, and we couldn't shutdown properly and crashed, likely cause something was keeping the main-thread blocked.

Sounds like a good part of the async shutdown crashes are not really due to async shutdown problems, but to an hung main-thread that can't properly process async shutdown promises...

Benjamin, is there a way we can tell if the main-thread is hanging from some time, and don't report an async shutdown crash in those cases? Or even just annotate the crash somehow so we can distinguish in crash-stats real async shutdown crashes from the ones caused by an hung main-thread.

Also, looks like async shtudown crashig after one minute may conflict with hang detection?
Flags: needinfo?(benjamin)
Re add-ons — only half of them are turned on. And the hangs occur after long uptime only, usually with following crash on restart (I restart Firefox after several long hangs).

Re total physical memory (point 4) — it is 6 GiB in my case (i.e., more than 4 GiB).i

I don't see the total physical memory in the crash-reports, but they include Available Physical Memory and System Memory Use Percentage — so it seems that the total size can be calculated (depends on what "System Memory" means).
(In reply to Ruvim Pinka from comment #23)
> Re total physical memory (point 4) — it is 6 GiB in my case (i.e., more than
> 4 GiB).i

Yes, I think your case is the latter part of comment 22 and memory could just be a red herring (or related to the hang, more than to the crash).
Your browser is hanging for whatever reason, you try to restart it, but since it's hung we can't properly shutdown, and we crash on shutdown.
> Another interesting fact, by looking at users comments, is that in many
> cases the browser was unresponsive/hanging on some page (very often
> facebook), and they tried to restart the browser, and we couldn't shutdown
> properly and crashed, likely cause something was keeping the main-thread
> blocked.
> 
> Sounds like a good part of the async shutdown crashes are not really due to
> async shutdown problems, but to an hung main-thread that can't properly
> process async shutdown promises...

The asyncshutdown code can't kill Firefox except on the main thread. So by definition we're able to process messages on the main thread at the time of the crash. It's possible that the thread could have been really hung up sometime during the process, but we don't know that.

> Benjamin, is there a way we can tell if the main-thread is hanging from some
> time, and don't report an async shutdown crash in those cases?

We *want* to report an async shutdown in this case. It's not supposed to take this long.

> Also, looks like async shtudown crashig after one minute may conflict with
> hang detection?

I don't understand what this means.
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #25)
> The asyncshutdown code can't kill Firefox except on the main thread. So by
> definition we're able to process messages on the main thread at the time of
> the crash. It's possible that the thread could have been really hung up
> sometime during the process, but we don't know that.

The fact is, a lot of users comments point out that the browser was hung BEFORE they even tried to shutdown, they were browsing the web, the browser hung and THEN they tried to restart it. At this point somehow async shutdown wasn't unable to complete and there may be a relation with the fact the browser was already hung. Something fishy was already happening on the main-thread and we don't know what.

This makes hard to distinguish between a real shutdown timeout due to the shutdown action itself, and a shutdown timeout due to something else that was already ongoing. That makes fixing these shutdown crashes very very hard, regardless of any change we could make to the shutdown code, and we don't have a stack of the previous hang.

> > Also, looks like async shtudown crashig after one minute may conflict with
> > hang detection?
> 
> I don't understand what this means.

Sorry, I mean that if the browser was already hung, reporting an async shutdown crash is less useful cause it covers the real cause of the hang. I seemed to recall we had hang detection code that reports hangs after N minutes, so I was guessing if async shutdown was crashing before the hang reporter could do its job.
Flags: needinfo?(benjamin)
If a user is able to quit the browser, it may have been very slow or janky, but it wasn't completely "hung". Because otherwise it wouldn't have processed the mouse clicks or keypresses which trigger quitting.

Note that the crash signatures here are not intended to ascribe *cause*: this is a shutdown hang because shutting down took more than 60 seconds. The signature shows the state at the time we decided to crash.

It may be worth adding additional metrics (or even a profile?) to these reports, so that we could break time down by whether we're spending time in content-JS, chrome-JS, GC, or other things. I'd happily support filing a bug for that.
Flags: needinfo?(benjamin)
And the hang detection code exists but was never enabled beyond nightly because of false positives.
Recently my Firefox has crashed again. The symptoms before the crash was as usual — hangs and visual artifacts, and actual crash after pressing Exit Firefox button. But the crash Signature is another: "_VEC_memzero | _VEC_memzero" (see https://crash-stats.mozilla.com/report/index/63236a27-7f18-47cf-b5ca-9b1572160907 ).

Also, using Performance Monitor, I have noticed that during hangs Firefox consumes CPU Privileged Time up to 99%. I can attach the diagram here, or create  separate bug report.
(In reply to Yorgos from comment #30)
> https://crash-stats.mozilla.com/report/index/9f9a1118-f08b-41e1-9d7e-
> 37e0f2160912

Can you tell something about what you were doing when you saw this crash, if you recall?
Flags: needinfo?(condacum)
Scrolling a big Facebook album.
Flags: needinfo?(condacum)
Just a quick comment. I know someone else says scrolling Facebook News will reproducibly create OOM crashes on at least 4 different Computers and on differing Windows versions. Others however have so far failed to reproduce this, and one of the Crash Signatures the bug was filed under is not at all common. See Bug 1291294
You can stop reporting crashes here, it's not useful atm (we already track those on crash-stats and can query from there). Moreover I'm not sure I have a plan of action for this specific bug, since if the hang starts earlier than shutdown, it's unlikely I can fix it in Places shutdown code. We should rather try to figure out why facebook hangs, that is quite a different challenge.
Thanks a lot for helping.
See Also: → 1255977
~91% of crashes with this signature (compared to ~8% overall) have
contains_memory_report=1, which might mean that this often happens
in low memory conditions.

(In reply to Marco Bonardo [::mak] from comment #22)
> Another interesting fact, by looking at users comments, is that in many
> cases the browser was unresponsive/hanging on some page (very often
> facebook), and they tried to restart the browser, and we couldn't shutdown
> properly and crashed, likely cause something was keeping the main-thread
> blocked.
> 
> Sounds like a good part of the async shutdown crashes are not really due to
> async shutdown problems, but to an hung main-thread that can't properly
> process async shutdown promises...

This sounds similar to what happens with bug 1288997.
In that case, on Windows we crash with 'mozilla::ipc::ProcessLink::SendMessageW',
on Linux and Mac the browser hangs. See bug 1255977 comment 20.
And if it's not a real "hang", it may be the system is so slow (swapping?) that events are processed very slowly, and we can't complete async shutdown in a timely manner.
Some other interesting data points:
(91.41% in signature vs 08.39% overall) contains_memory_report = 1
(49.69% in signature vs 06.52% overall) GFX_ERROR "Failed 2 buffer db="
(55.11% in signature vs 16.16% overall) available_virtual_memory < 512 MB
(40.87% in signature vs 10.18% overall) domain = www.facebook.com
(100.0% in signature vs 75.08% overall) dom_ipc_enabled = False
(65.48% in signature vs 44.86% overall) platform_pretty_version = Windows 7
(55.61% in signature vs 38.08% overall) platform_version = 6.1.7601 Service Pack 1

(35.39% in signature vs 05.31% overall) GFX_ERROR "Failed 2 buffer db=" ∧ available_virtual_memory < 512 MB

So the crash is non-e10s only, this means it can't be the same problem as bug 1255977 comment 20.
I'm not actively working on this atm, and I have exhausted all the ideas I had, there's still the fact this crash can happen if the system or the main-thread are particularly busy and the message about closing the db takes too long to be handled. In those cases there's nothing I can do in Places itself to prevent the problem. Many comments indeed report things like "Firefox was not responding so I closed and reopened. Nothing happened and this box popped up".

I still think it's a very horrible solution to show a crash dialog to the user if shutdown times out. If we could report these without a dialog making us look so unstable, it would be great.

On the other side, looks like the volume of these crashes has reduced greatly in 50/51/52?
I may re-examine if the situation gets worse.
Assignee: mak77 → nobody

Showing results from 7 days ago

Signature report for AsyncShutdownTimeout | profile-before-change | Sqlite.jsm shutdown blocker
2,908 Results

Signature report for AsyncShutdownTimeout | profile-before-change | Sqlite.jsm shutdown blocker
Processor Notes contains "Abort | mozalloc_abort | Abort | NS_DebugBreak | nsDebugImpl::Abort | XPTC__InvokebyIndex"
124 Results

Signature report for AsyncShutdownTimeout | profile-before-change | Sqlite.jsm shutdown blocker
Processor Notes contains "Abort | mozalloc_abort | Abort | NS_DebugBreak | nsDebugImpl::Abort | NS_InvokeByIndex"
41 Results

https://crash-stats.mozilla.com/signature/?product=Firefox&signature=AsyncShutdownTimeout%20%7C%20profile-before-change%20%7C%20Sqlite.jsm%20shutdown%20blocker&date=%3E%3D2019-06-08T10%3A38%3A00.000Z&date=%3C2019-06-15T10%3A38%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=-date&page=1#summary

https://crash-stats.mozilla.com/signature/?product=Firefox&processor_notes=~Abort%20%7C%20mozalloc_abort%20%7C%20Abort%20%7C%20NS_DebugBreak%20%7C%20nsDebugImpl%3A%3AAbort%20%7C%20XPTC__InvokebyIndex&signature=AsyncShutdownTimeout%20%7C%20profile-before-change%20%7C%20Sqlite.jsm%20shutdown%20blocker&date=%3E%3D2019-06-08T10%3A38%3A00.000Z&date=%3C2019-06-15T10%3A38%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=-date&page=1#summary

https://crash-stats.mozilla.com/signature/?product=Firefox&processor_notes=~Abort%20%7C%20mozalloc_abort%20%7C%20Abort%20%7C%20NS_DebugBreak%20%7C%20nsDebugImpl%3A%3AAbort%20%7C%20NS_InvokeByIndex&signature=AsyncShutdownTimeout%20%7C%20profile-before-change%20%7C%20Sqlite.jsm%20shutdown%20blocker&date=%3E%3D2019-06-08T10%3A38%3A00.000Z&date=%3C2019-06-15T10%3A38%3A00.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=startup_crash&_sort=-date&page=1#summary

Crash Signature: AsyncShutdownTimeout | Places Clients shutdown | (none)] → AsyncShutdownTimeout | Places Clients shutdown | (none)] [@ AsyncShutdownTimeout | profile-before-change | Sqlite.jsm shutdown blocker ]

Change the status for beta to have the same as nightly and release.
For more information, please visit auto_nag documentation.

See Also: → 1597915

Marking this as Resolved > Worksforme since no other crashes with this signature have been reported in the last 6 months.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: