Open Bug 771138 Opened 12 years ago Updated 5 months ago

Thunderbird crash in walIndexTryHdr due to failures at the operating system level to read the page in

Categories

(Toolkit :: Storage, defect, P5)

Tracking

Tracking Status
firefox47 --- wontfix
firefox48 --- wontfix
firefox49 --- wontfix
firefox-esr45 --- wontfix
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix
firefox-esr52 --- wontfix
firefox-esr60 --- wontfix
firefox53 --- wontfix
firefox-esr115 --- affected
firefox54 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- verified
firefox67 --- wontfix
firefox119 --- wontfix
firefox120 --- wontfix
firefox121 --- wontfix

People

(Reporter: wsmwk, Unassigned)

Details

(Keywords: crash, Whiteboard: [tbird crash][TB12 regression][wontfix?])

Crash Data

This bug was filed from the Socorro interface and is 
report bp-ca4b63fa-0a3b-4954-987b-3b4272120705 .
============================================================= 

Frame	Module	Signature	Source
0	mozsqlite3.dll	walIndexTryHdr	db/sqlite3/src/sqlite3.c:47107
1	mozsqlite3.dll	walIndexReadHdr	db/sqlite3/src/sqlite3.c:47165
2	mozsqlite3.dll	walTryBeginRead	db/sqlite3/src/sqlite3.c:47299
3	mozsqlite3.dll	pagerBeginReadTransaction	db/sqlite3/src/sqlite3.c:41335
4	mozsqlite3.dll	sqlite3PagerSharedLock	db/sqlite3/src/sqlite3.c:43209
5	mozsqlite3.dll	lockBtree	db/sqlite3/src/sqlite3.c:51524
6	mozsqlite3.dll	sqlite3BtreeBeginTrans	db/sqlite3/src/sqlite3.c:51824
7	mozsqlite3.dll	sqlite3VdbeExec	db/sqlite3/src/sqlite3.c:67638
8	xul.dll	mozilla::PerformanceCounter	xpcom/ds/TimeStamp_windows.cpp:427
9	mozsqlite3.dll	sqlite3Step	db/sqlite3/src/sqlite3.c:63043
10	mozsqlite3.dll	sqlite3_step	db/sqlite3/src/sqlite3.c:63118
11	xul.dll	mozilla::storage::Connection::stepStatement	storage/src/mozStorageConnection.cpp:893
12	xul.dll	mozilla::storage::AsyncExecuteStatements::executeStatement	storage/src/mozStorageAsyncStatementExecution.cpp:400
13	xul.dll	mozilla::storage::AsyncExecuteStatements::executeAndProcessStatement	storage/src/mozStorageAsyncStatementExecution.cpp:325
14	xul.dll	mozilla::storage::AsyncExecuteStatements::bindExecuteAndProcessStatement	storage/src/mozStorageAsyncStatementExecution.cpp:307
15	xul.dll	mozilla::storage::AsyncExecuteStatements::Run	storage/src/mozStorageAsyncStatementExecution.cpp:647 


The crash appears to have started in version 12. It ranks #22 for TB 13.0.1.
There are almost no Firefox crashes.

stack variation: bp-119f3dcc-873e-44ce-a67e-986762120705
firefox example bp-11008807-7480-4be6-8d92-2e17d2120614
Component: General → Database
Product: Thunderbird → MailNews Core
Component: Database → Storage
Product: MailNews Core → Toolkit
It's the #36 top crasher in TB 17.0.
Keywords: topcrash
Whiteboard: [tbird topcrash][TB12 regression] → [tbird crash][TB12 regression]
Crash volume for signature 'walIndexTryHdr':
 - nightly(version 50):0 crashes from 2016-06-06.
 - aurora (version 49):4 crashes from 2016-06-07.
 - beta   (version 48):51 crashes from 2016-06-06.
 - release(version 47):315 crashes from 2016-05-31.
 - esr    (version 45):1095 crashes from 2016-04-07.

Crash volume on the last weeks:
            W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - nightly       0       0       0       0       0       0       0
 - aurora        0       1       0       2       0       1       0
 - beta          8       4      12       9       5      11       2
 - release      56      51      47      36      38      48      30
 - esr         102     148     123     114      91     112     107

Affected platforms: Windows, Mac OS X, Linux
walIndexTryHdr is trying to read a memory-mapped page from the Write-Ahead-Log.  All of these crashes are due to failures at the operating system level to read the page in.  This is one of the downsides of memory-mapped I/O; I/O errors become fatal if we don't explicitly handle the page faults and transform them into something non-fatal.

Having said that, although these don't need to be fatal, it's quite likely that if we're encountering them, then most profile I/O going forward is going to be broken too, so trading the crash for everything breaking isn't likely a major improvement.

As such I don't think there's much to do about the bug.
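
To illustrate the distinction drawn above: with explicit read calls, an I/O failure comes back as an error the caller can handle, whereas a memory-mapped access has no return path and the failure arrives as a fault at the point of access. The following is a minimal Python sketch, not SQLite's code. `read_header`, `demo`, and `HEADER_SIZE` are hypothetical names, and a closed file descriptor merely stands in for a device that has gone away.

```python
import os
import tempfile

HEADER_SIZE = 48  # illustrative size only, not taken from SQLite


def read_header(fd, size=HEADER_SIZE):
    """Read `size` bytes from the start of `fd`, or return None on I/O error.

    With explicit reads the OS reports failure as an error code, which Python
    raises as OSError, so the caller can downgrade it to a soft failure.
    A read through a memory mapping has no such return value: the failure is
    delivered as a fault (SIGBUS on POSIX, EXCEPTION_IN_PAGE_ERROR on Windows).
    """
    try:
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, size)
    except OSError:
        return None


def demo():
    """Return (result of a healthy read, result of a failed read)."""
    with tempfile.TemporaryFile() as f:
        f.write(b"\x00" * HEADER_SIZE)
        f.flush()
        ok = read_header(f.fileno())
        fd = os.dup(f.fileno())
    os.close(fd)               # stand-in for the backing store disappearing
    failed = read_header(fd)   # the read fails, but only with an OSError
    return ok, failed
```

Here the failed read yields `None` instead of terminating the process. As noted above, whether that is an improvement in practice is doubtful if the rest of the profile I/O is failing too.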

== Details:

If we aggregate on "Reason" for >2% we get:
1 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IN_PAGE_ERROR 	358 	52.19 %
2 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_CONNECTION_DISCONNECTED 	101 	14.72 %
3 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_OBJECT_NAME_NOT_FOUND 	60 	8.75 %
4 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_INVALID_PARAMETER 	50 	7.29 %
5 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_VOLUME_DISMOUNTED 	22 	3.21 %
6 	EXCEPTION_IN_PAGE_ERROR_READ / STATUS_FILE_INVALID 	15 	2.19 %
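
For reference, the aggregation above can be reproduced with a few lines of Python. This is only a sketch: `reasons_above` is a hypothetical helper, and the total of 686 reports is not stated in this comment but inferred from the listed counts and percentages (358 / 0.5219 ≈ 686).

```python
# Counts copied from the table above.
COUNTS = {
    "EXCEPTION_IN_PAGE_ERROR_READ / STATUS_IN_PAGE_ERROR": 358,
    "EXCEPTION_IN_PAGE_ERROR_READ / STATUS_CONNECTION_DISCONNECTED": 101,
    "EXCEPTION_IN_PAGE_ERROR_READ / STATUS_OBJECT_NAME_NOT_FOUND": 60,
    "EXCEPTION_IN_PAGE_ERROR_READ / STATUS_INVALID_PARAMETER": 50,
    "EXCEPTION_IN_PAGE_ERROR_READ / STATUS_VOLUME_DISMOUNTED": 22,
    "EXCEPTION_IN_PAGE_ERROR_READ / STATUS_FILE_INVALID": 15,
}
TOTAL = 686  # ASSUMPTION: inferred from the percentages, not given in the bug


def reasons_above(counts, total, threshold_pct=2.0):
    """Return (reason, count, percent) rows whose share exceeds threshold_pct,
    ordered by descending count."""
    rows = []
    for reason, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        pct = 100.0 * n / total
        if pct > threshold_pct:
            rows.append((reason, n, round(pct, 2)))
    return rows


for rank, (reason, n, pct) in enumerate(reasons_above(COUNTS, TOTAL), start=1):
    print(f"{rank}\t{reason}\t{n}\t{pct:.2f} %")
```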

The first part, "EXCEPTION_IN_PAGE_ERROR_READ" specifically means there was an I/O error paging things in.  The latter code is extracted from the exception record if available.
* STATUS_IN_PAGE_ERROR: This is the same actual code as EXCEPTION_IN_PAGE_ERROR.  I'm not sure if this is an inability to be more specific or some layering scenario like if loopback devices were involved.
* STATUS_CONNECTION_DISCONNECTED: Presumably the file was network mounted and we lost the mount.
* STATUS_OBJECT_NAME_NOT_FOUND: Seems similar?  Machine/drive no longer around to service the UNC path or whatever?
* STATUS_INVALID_PARAMETER: This is a very generic error like it sounds; likely a cascading error from some other I/O error, possibly involving a disconnect?
* STATUS_VOLUME_DISMOUNTED: Explicitly that the volume was dismounted
* STATUS_FILE_INVALID: This generic-seeming error is actually really specific that someone externally messed with the opened file and it's no longer valid.

The others that didn't make the cut seem similarly of the form "the file system has betrayed us".
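
As a rough illustration of how a two-part reason string like the ones above can be derived from a Windows exception record, here is a Python sketch. It relies on the documented layout of `ExceptionInformation` for `EXCEPTION_IN_PAGE_ERROR` (element 0 is the access type, element 1 the faulting address, element 2 the underlying NTSTATUS). `describe_in_page_error` is a hypothetical helper, not the crash reporter's actual code, and the numeric values are the ones I believe `ntstatus.h` defines, so treat them as assumptions to verify.

```python
EXCEPTION_IN_PAGE_ERROR = 0xC0000006

# ASSUMPTION: values as defined in ntstatus.h; check against the header.
NTSTATUS_NAMES = {
    0xC0000006: "STATUS_IN_PAGE_ERROR",
    0xC000020C: "STATUS_CONNECTION_DISCONNECTED",
    0xC0000034: "STATUS_OBJECT_NAME_NOT_FOUND",
    0xC000000D: "STATUS_INVALID_PARAMETER",
    0xC000026E: "STATUS_VOLUME_DISMOUNTED",
    0xC0000098: "STATUS_FILE_INVALID",
}

# First element of ExceptionInformation: the kind of access that faulted.
ACCESS_KINDS = {0: "READ", 1: "WRITE", 8: "EXEC"}


def describe_in_page_error(exception_code, exception_information):
    """Build a reason string for an in-page error, or None for other codes."""
    if exception_code != EXCEPTION_IN_PAGE_ERROR:
        return None
    info = list(exception_information)
    label = "EXCEPTION_IN_PAGE_ERROR"
    if info and info[0] in ACCESS_KINDS:
        label += "_" + ACCESS_KINDS[info[0]]
    if len(info) >= 3:
        # The third element carries the NTSTATUS of the failed I/O.
        status = info[2] & 0xFFFFFFFF
        label += " / " + NTSTATUS_NAMES.get(status, "0x%08X" % status)
    return label
```

For example, `describe_in_page_error(0xC0000006, [0, 0x1000, 0xC000020C])` yields the second reason in the table above.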
Priority: -- → P5
Whiteboard: [tbird crash][TB12 regression] → [tbird crash][TB12 regression][wontfix?]
Crash volume for signature 'walIndexTryHdr':
 - nightly (version 54): 3 crashes from 2017-01-23.
 - aurora  (version 53): 0 crashes from 2017-01-23.
 - beta    (version 52): 15 crashes from 2017-01-23.
 - release (version 51): 87 crashes from 2017-01-16.
 - esr     (version 45): 4528 crashes from 2016-08-10.

Crash volume on the last weeks (Week N is from 02-06 to 02-12):
            W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - nightly       2       1
 - aurora        0       0
 - beta         11       1
 - release      57      14       0
 - esr         216     262     311     213     153      54     141

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
           Browser   Content   Plugin
 - nightly #415
 - aurora
 - beta    #1254
 - release #698
 - esr     #78
A P5 critical bug seems like a contradiction in terms.
Gonna remove regression keyword since we've been shipping this for so long.
Keywords: regression (removed)
(In reply to Mike Taylor [:miketaylr] from comment #7)
> Gonna remove regression keyword since we've been shipping this for so long.

I don't understand how time impacts whether this is a regression or not.
(In reply to Wayne Mery (:wsmwk, NI for questions) from comment #8)
> (In reply to Mike Taylor [:miketaylr] from comment #7)
> > Gonna remove regression keyword since we've been shipping this for so long.
> 
> I don't understand how time impacts whether this is a regression or not.

Fair point. We use that keyword to help in regression triage, where we look for recent regressions (ideally to prevent shipping them to release). It can still be considered a regression without this keyword.
this is a fairly frequent crash on esr builds. modules from sophos security software are commonly showing up in the app_init_dll field of crash reports and correlations.
Crash Signature: [@ walIndexTryHdr] → [@ walIndexTryHdr] [@ sqlite3WalFindFrame]
Component: Storage → Other
OS: All → Windows
Product: Toolkit → External Software Affecting Firefox
Hardware: x86 → All
Version: Trunk → unspecified
Whiteboard: [tbird crash][TB12 regression][wontfix?] → [tbird topcrash][TB12 regression][wontfix?]
This is higher volume on ESR than on release. 
Adam, can you try contacting Sophos?
Opened a support case with them.
https://secure2.sophos.com/en-us/support/open-a-support-case.aspx
Flags: needinfo?(astevenson)
To help out anyone looking from Sophos, here are some links to relevant info in crash-stats:

Click through on the crash signatures from this bug, and then add an extra column into the resulting report. So, the first crash sig takes you to:

https://crash-stats.mozilla.com/signature/?signature=walIndexTryHdr
There is a little tab that says "summary" there and you want to look at the "reports" tab instead.

I added the column for "app init dlls" and then clicked on the column heading to sort by that content.
That shows me a bunch of reports with sophos dlls. 

This link will take you to the summary page - it adds the column for dlls but you will still have to click through on the Reports tab and then sort on the column header. 

https://crash-stats.mozilla.com/signature/?signature=walIndexTryHdr&date=%3E%3D2018-12-05T16%3A04%3A25.000Z&date=%3C2018-12-12T16%3A04%3A25.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_columns=app_init_dlls&_sort=-app_init_dlls&_sort=-date&page=1

Hope that helps.
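
If it is useful to regenerate that link for a different signature or date range, the query string can be assembled from its parts. A small Python sketch follows; `signature_report_url` is a hypothetical helper, and the parameter names are simply the ones visible in the URL above, not a documented API.

```python
from urllib.parse import urlencode

BASE = "https://crash-stats.mozilla.com/signature/"


def signature_report_url(signature, start, end, columns, sort):
    """Build a signature report URL with extra columns and a sort order.

    Repeated keys (date, _columns, _sort) are passed as separate pairs so
    they appear once per value, matching the shape of the link above.
    """
    params = [
        ("signature", signature),
        ("date", ">=" + start),
        ("date", "<" + end),
    ]
    params += [("_columns", c) for c in columns]
    params += [("_sort", s) for s in sort]
    params.append(("page", "1"))
    return BASE + "?" + urlencode(params)


url = signature_report_url(
    "walIndexTryHdr",
    "2018-12-05T16:04:25.000Z",
    "2018-12-12T16:04:25.000Z",
    columns=["date", "product", "version", "build_id", "platform",
             "reason", "address", "install_time", "app_init_dlls"],
    sort=["-app_init_dlls", "-date"],
)
print(url)
```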

(In reply to [:philipp] from comment #10)
> this is a fairly frequent crash on esr builds. modules from sophos security
> software are commonly showing up in the app_init_dll field of crash reports
> and correlations.

Do you find that sophos correlates in many Thunderbird crashes?

In a spot check of 10 Thunderbird crashes I found only two potentially AV related (though I could be wrong). If Thunderbird crashes don't correlate well to AV then we should move this back to the other component.
bp-783e9d0b-c52b-43cb-9fca-2c9ef0181212 TrendMicro
bp-b090cfb1-e316-4e40-bcf0-c96990181212 Sophos

Still a Thunderbird topcrash

Flags: needinfo?(madperson)
Summary: crash in walIndexTryHdr → Thunderbird crash in walIndexTryHdr due to failures at the operating system level to read the page in

yes, currently only 10% of those crashes on thunderbird show involvement of sophos, so i'm following your suggestion and moving the bug to its original component.

Component: Other → Storage
Flags: needinfo?(madperson)
OS: Windows → All
Product: External Software Affecting Firefox → Toolkit
Version: unspecified → Trunk

Crash rate of Thunderbird is up 20-30% compared to spring. Not possible to say yet whether that correlates to version 78 uptake.

Still topcrash - Ranks ~25 for version Thunderbird 78.12.0, combined count of sqlite3WalFindFrame and walIndexTryHdr

In Thunderbird 102.0.2, the rank drops to #52

Whiteboard: [tbird topcrash][TB12 regression][wontfix?] → [tbird crash][TB12 regression][wontfix?]
Severity: critical → S2