Open Bug 1395542 Opened 7 years ago Updated 4 months ago

Crash in AsyncShutdownTimeout | profile-before-change | Places Connection shutdown

Categories

(Toolkit :: Places, defect, P3)

55 Branch
x86
Windows 7
defect

Tracking

()

Tracking Status
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox59 --- affected
firefox60 --- affected
firefox61 --- affected

People

(Reporter: marcia, Unassigned)

References

Details

(Keywords: crash, Whiteboard: [fxsearch])

Crash Data

This bug was filed from the Socorro interface and is 
report bp-d5246b0c-a246-4853-b4ce-497330170831.
=============================================================

Seen while looking at 55 crash stats: http://bit.ly/2eHfmrw. Currently in 55.0.3 it isn't very high ranking (#45), but it didn't have a bug associated with it.

Also affects 56 and 57.
Marco, I saw profile-before-change | Places Connection shutdown in the signature. I guess this is for you!
Flags: needinfo?(mak77)
I've already seen it (I keep an eye on crash stats usually), but it's not actionable as-is.
It means Places started closing the connection, but something didn't allow that, after 1 minute async shutdown just forcibly crashed Firefox.
Unfortunately all the information in the metadata is "state":{"progress":{}}. Not really useful. There may be a bug in how we pass it, or in async shutdown itself, but I didn't find it so far and didn't have time to dig deeper.

One thing we should do, imo, is to add a circular buffer to async shutdown keeping the last N log messages, and push those along the async shutdown metadata. Async shutdown logging may help us identify what happened just before the crash.
Flags: needinfo?(mak77)
Priority: -- → P5
Marco, why did you mark it as p5?

This is impacting a lot of users on release (~100k install at least).
Hm, it should have been a P3 :(
The problem here is that there's no clear cause nor solution to this problem for now, it's really cryptic.
Flags: needinfo?(mak77)
Whiteboard: [fxsearch]
Priority: P5 → P3

Maybe Fission related. (DOMFissionEnabled=1)

Crash report: https://crash-stats.mozilla.org/report/index/45153595-2f48-4bc8-b76f-81fc40201116

MOZ_CRASH Reason: MOZ_CRASH()

Top 10 frames of crashing thread:

0 mozglue.dll mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:33
1 xul.dll NS_DebugBreak xpcom/base/nsDebugImpl.cpp:435
2 xul.dll nsDebugImpl::Abort xpcom/base/nsDebugImpl.cpp:134
3 xul.dll XPTC__InvokebyIndex 
4  @0x1788a2056ef 
5 xul.dll trunc 
6 xul.dll trunc 
7 xul.dll static XPCWrappedNative::CallMethod js/xpconnect/src/XPCWrappedNative.cpp:1142
8 xul.dll XPC_WN_CallMethod js/xpconnect/src/XPCWrappedNativeJSOps.cpp:925
9 xul.dll js::InternalCallOrConstruct js/src/vm/Interpreter.cpp:599

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on release

For more information, please visit auto_nag documentation.

Keywords: topcrash
Severity: critical → S2

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on release

For more information, please visit auto_nag documentation.

Keywords: topcrash

We will continue to monitor the number of crashes, and based on changes, we'll determine if we have enough/more details that could help pinpoint the location of crashes.

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

Most of the crashes here are due to exporting bookmarks to html on shutdown:
https://searchfox.org/mozilla-central/rev/b6d0ad6ba24b3d075e357e52160cd8fc7f911ba9/browser/components/BrowserGlue.jsm#3273-3275
Likely these users have many bookmarks...
The browser.bookmarks.autoExportHTML pref is not enabled by default luckily, so it only affects users who manually flipped it on.

Not sure what we could do here, the requirement is to export the most up to date version of bookmarks, since html format is used to sync bookmarks with third party apps...
Likely most of the time is spent in PlacesUtils.promiseFaviconData(), we could maybe make this part of the initial database fetch, instead of having one query per bookmark.

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on release

For more information, please visit auto_nag documentation.

Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on release

For more information, please visit auto_nag documentation.

Keywords: topcrash

Looking at various reports, it looks like many of these crashes are due to DeleteFileW taking too long when Sqlite tries to remove the -shm and -wal files. There is an option in Sqlite to persist those files, https://www.sqlite.org/c3ref/c_fcntl_begin_atomic_write.html#sqlitefcntlpersistwal and we could evaluate enabling it in Nightly/Early beta and check the effect.

when I'm running Automation test using Selenium and Firefox Web driver, I'm getting the next output can you please take a look on that:
WARNING: A blocker encountered an error while we were waiting.
Blocker: Waiting for ping task
Phase: TelemetryController: Waiting for pending ping activity
State: (none)
WARNING: Error: Phase "profile-before-change" is finished, it is too late to register completion condition "OS.File: flush I/O queued before profileBeforeChange"
WARNING: addBlocker@resource://gre/modules/AsyncShutdown.sys.mjs:727:15
addBlocker@resource://gre/modules/AsyncShutdown.sys.mjs:523:26
addBlocker@resource://gre/modules/AsyncShutdown.sys.mjs:458:15
setupShutdown@resource://gre/modules/osfile/osfile_async_front.jsm:1548:28
@resource://gre/modules/osfile/osfile_async_front.jsm:1568:16
@resource://gre/modules/osfile.jsm:12:30
@resource://gre/modules/TelemetryStorage.sys.mjs:10:28
_checkPendingPings@resource://gre/modules/TelemetrySend.sys.mjs:859:17
setup@resource://gre/modules/TelemetrySend.sys.mjs:799:18
setup@resource://gre/modules/TelemetrySend.sys.mjs:237:30
setupTelemetry/this._delayedInitTask<@resource://gre/modules/TelemetryControllerParent.sys.mjs:824:36
observe@resource://gre/modules/AsyncShutdown.sys.mjs:576:16

After i take a look on the code i can see that this Warning errors arrives when im using driver.get().quit() method

Flags: needinfo?(mak)

(In reply to Avi Lugassi from comment #16)

when I'm running Automation test using Selenium and Firefox Web driver, I'm getting the next output can you please take a look on that:
WARNING: A blocker encountered an error while we were waiting.
Blocker: Waiting for ping task
Phase: TelemetryController: Waiting for pending ping activity
State: (none)
WARNING: Error: Phase "profile-before-change" is finished, it is too late to register completion condition "OS.File: flush I/O queued before profileBeforeChange"
WARNING: addBlocker@resource://gre/modules/AsyncShutdown.sys.mjs:727:15
addBlocker@resource://gre/modules/AsyncShutdown.sys.mjs:523:26
addBlocker@resource://gre/modules/AsyncShutdown.sys.mjs:458:15
setupShutdown@resource://gre/modules/osfile/osfile_async_front.jsm:1548:28
@resource://gre/modules/osfile/osfile_async_front.jsm:1568:16
@resource://gre/modules/osfile.jsm:12:30
@resource://gre/modules/TelemetryStorage.sys.mjs:10:28
_checkPendingPings@resource://gre/modules/TelemetrySend.sys.mjs:859:17
setup@resource://gre/modules/TelemetrySend.sys.mjs:799:18
setup@resource://gre/modules/TelemetrySend.sys.mjs:237:30
setupTelemetry/this._delayedInitTask<@resource://gre/modules/TelemetryControllerParent.sys.mjs:824:36
observe@resource://gre/modules/AsyncShutdown.sys.mjs:576:16

After I take a look on the code, I can see that those messages arrive when call the driver.quit() method.

Flags: needinfo?(mak)

Most of the crashes here are now related to:

  1. operations when closing the database. Bug 1834043 will replace a file delete with a truncate, that ideally should be cheaper, though apparently some clients are still crashing on the truncate. I must assume it's really slow mechanical disks at this point.
  2. wal journal syncing that is causing an fsync, and again the system IO in these cases seems particularly slow. There's also crashes in LocalStorage and IndexedDB, not just Places.

I'm lowering severity, because I'm not sure we can do much when the disk is just painfully slow.

Severity: S2 → S3

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash
You need to log in before you can comment on or make changes to this bug.