Closed Bug 1341373 Opened 7 years ago Closed 6 years ago

Crash in mozilla::storage::Service::getSingleton after SQLite version update

Categories

(Toolkit :: Application Update, defect, P1)

52 Branch
x86
Windows 7
defect

Tracking


RESOLVED WORKSFORME
Tracking Status
firefox51 --- wontfix
firefox52 + wontfix
firefox53 + wontfix
firefox54 + wontfix
firefox55 - affected

People

(Reporter: marcia, Unassigned)

References

Details

(Keywords: crash, stale-bug)

Crash Data

This bug was filed from the Socorro interface and is report bp-0563174b-58cc-449d-8f81-617d82170221.
=============================================================

Spike seen in Beta 6 in this signature: http://bit.ly/2ljr2o1. Between B5 and B6 we went from 92 crashes to 1476.
from correlations for beta:
(100.0% in signature vs 01.43% overall) moz_crash_reason = MOZ_CRASH(SQLite Version Error)
User comments in crash reports also indicate that they are seeing an "SQLite is too old" error message after trying to apply a Firefox update.

on the beta channel this crash signature started taking off on 2017-02-17 - the same day that 52.0b7 was released with a sqlite version update (bug 1339321). all the crash reports are annotated with 52.0b6 or earlier though.

the same thing happened on the aurora channel: bug 1339321 landed in build 20170217004020 and afterwards the crash signature started occurring in the older builds 20170216004023 & 20170215004022.
same pattern on nightly:
* upgrade to SQLite 3.15.0 (bug 1310361) was pushed on 2016-10-21, afterwards there were crashes with build 20161020030211
* upgrade to SQLite 3.15.2 (bug 1321233) was pushed on 2016-12-01, afterwards there were crashes with builds 20161130030206 & 20161129030222
* upgrade to SQLite 3.16.2 (bug 1328161) was pushed on 2017-01-10, afterwards there were crashes with build 20170110030221
* upgrade to SQLite 3.17.0 (bug 1339321) was pushed on 2017-02-14, afterwards there were crashes with build 20170214030231

https://crash-stats.mozilla.com/search/?signature=%3Dmozilla%3A%3Astorage%3A%3AService%3A%3AgetSingleton&release_channel=nightly&date=%3E%3D2016-08-21T18%3A52%3A22.000Z&_sort=-date&_facets=signature&_facets=version&_facets=user_comments&_facets=build_id&_facets=install_time&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-build_id
Summary: Crash in mozilla::storage::Service::getSingleton → Crash in mozilla::storage::Service::getSingleton after SQLite version update
I would have expected this to happen on Linux at most; on Windows and macOS it's puzzling.
The only way for this to happen is if nss3.dll is older than xul.dll, and the only thing that could cause this, afaict, is a broken update (xul.dll has been updated, but not nss3.dll).

(In reply to [:philipp] from comment #1)
> on the beta channel this crash signature started taking off on 2017-02-17 -
> the same day that 52.0b7 was released with a sqlite version update (bug
> 1339321). all the crash reports are annotated with 52.0b6 or earlier though.

Interesting, so the reported version is old; this again seems to point to some update issue, to me.

Ehsan, IIRC you worked on the background update, any idea how we could end up running a libxul newer than nss3.dll, or an executable that uses new libxul while the other libraries have not been updated yet?
Flags: needinfo?(ehsan)
Priority: -- → P1
302 rstrong.
Flags: needinfo?(ehsan) → needinfo?(robert.strong.bugs)
We should probably also change the message here: http://searchfox.org/mozilla-central/rev/b1044cf7c2000c3e75e8181e893236a940c8b6d2/storage/mozStorageService.cpp#212 and suggest a Firefox reinstall to the user. At least we'd give them a way out of this.
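(For illustration, a minimal sketch of the kind of check that lives at that line and how its message could be made more actionable. This is an approximation, not the exact tree code, and the suggested wording is only a proposal; bug 1342441 tracks the actual change.)

#include <cstring>
#include "sqlite3.h"
#include "mozilla/Assertions.h"  // MOZ_CRASH

// Sketch only: getSingleton compares the SQLite headers xul.dll was compiled
// against with the sqlite3 symbols nss3.dll actually exports at runtime.
// A mismatch means the two DLLs are out of sync, e.g. after a partially
// applied update.
static void CheckSQLiteVersion() {
  if (SQLITE_VERSION_NUMBER != ::sqlite3_libversion_number() ||
      ::strcmp(SQLITE_VERSION, ::sqlite3_libversion()) != 0) {
    // Proposed wording: point the user at a recovery path instead of the
    // bare "SQLite Version Error".
    MOZ_CRASH("SQLite Version Error: the SQLite library loaded at runtime "
              "does not match the one Firefox was built with. Reinstalling "
              "Firefox should repair the mismatched files.");
  }
}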
The only cases I can think of where that could happen due to app update are:
a) the system has multiple instances of the same install running, with one of the instances performing an update.
b) the DLLs not being unloaded when an instance is started while an update is being applied.

Can you check if the clients are getting multiple crashes?

For Windows it would also be good to check the DLL versions.
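(A hedged sketch of one way to read a DLL's file version on Windows; GetDllVersion is a hypothetical helper, not existing Firefox code. The Win32 version APIs require linking against version.lib.)

#include <windows.h>
#include <winver.h>
#include <string>
#include <vector>

// Reads the four-part file version (e.g. 52.0.0.6253) of a DLL such as
// nss3.dll or xul.dll, so a crash annotation could record whether they match.
static bool GetDllVersion(const wchar_t* aPath, std::wstring& aOut) {
  DWORD handle = 0;
  DWORD size = ::GetFileVersionInfoSizeW(aPath, &handle);
  if (size == 0) {
    return false;
  }
  std::vector<BYTE> buffer(size);
  if (!::GetFileVersionInfoW(aPath, 0, size, buffer.data())) {
    return false;
  }
  VS_FIXEDFILEINFO* info = nullptr;
  UINT len = 0;
  if (!::VerQueryValueW(buffer.data(), L"\\",
                        reinterpret_cast<void**>(&info), &len) || !info) {
    return false;
  }
  aOut = std::to_wstring(HIWORD(info->dwFileVersionMS)) + L"." +
         std::to_wstring(LOWORD(info->dwFileVersionMS)) + L"." +
         std::to_wstring(HIWORD(info->dwFileVersionLS)) + L"." +
         std::to_wstring(LOWORD(info->dwFileVersionLS));
  return true;
}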
Flags: needinfo?(robert.strong.bugs)
yes, individual installations are hit repeatedly by this crash - you can check the "install time" facet: http://bit.ly/2l2g4Qq
Tracking since this looks fairly high volume in beta (~1700 crashes in 230 installs in a week)
We are building the last beta today though (then the RC next week) so if there's anything you think we should do for 52, it should be soon.
I can see it but I'm at a loss as to how this could be happening.

Is there any way to get the update logs for an affected system? If they have reinstalled I think the logs will be copied to the temp dir.
The only idea I have so far is to improve the error message. That can be done even very late since it has no risk at all. I can make a patch for that early tomorrow.
Storage-side I don't have further ideas atm. We're being handed an outdated library; at most we could lift the MOZ_CRASH until we hit an SQLite version with clear db format incompatibilities, but that wouldn't be a definitive solution. Plus it also means we are running an old nss3.dll that may contain security vulnerabilities and may still crash, just differently, where the publicly exposed methods differ.
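(Sketch of the "lift the MOZ_CRASH" idea, under the stated caveat that it is not a real fix. kMinCompatibleSQLiteVersion and RuntimeSQLiteIsUsable are hypothetical names; the threshold value is only an example.)

#include "sqlite3.h"
#include "nsDebug.h"  // NS_WARNING

// Hypothetical threshold: the oldest SQLite whose on-disk database format is
// still compatible with what this build writes.
static const int kMinCompatibleSQLiteVersion = 3015000;  // 3.15.0

static bool RuntimeSQLiteIsUsable() {
  int loaded = ::sqlite3_libversion_number();
  if (loaded < kMinCompatibleSQLiteVersion) {
    return false;  // genuinely incompatible: keep crashing early
  }
  if (loaded != SQLITE_VERSION_NUMBER) {
    // Out of sync but probably still usable: warn instead of MOZ_CRASH.
    NS_WARNING("Runtime SQLite differs from the build-time version; "
               "likely a partially applied update.");
  }
  return true;
}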
Looking at the modules of some of these crashes (https://crash-stats.mozilla.com/report/index/b908a38e-c3cc-4b1c-ba6e-a30722170223#tab-modules https://crash-stats.mozilla.com/report/index/181ac6ae-aa0d-4d1b-bb64-f3daa2170223#tab-modules) and some others, it looks like everything but xul.dll is behind. Am I reading that right? (Noting that some crashes have different version numbers, but there's always the same xul.dll skew.) For example:

52.0.0.6253: firefox.exe, nss3.dll, lgpllibs.dll, mozglue.dll
52.0.0.6256: xul.dll
Depends on: 1342441
I filed bug 1342441 to improve the Storage message.
I'm moving this bug to Application Update, since I'm not really sure what we can do on our side; as comment 12 pointed out, there is a clear discrepancy between xul.dll and the other libraries, and afaict the only thing that could cause such a thing is the update process.

It's possible something starts using SQLite too early during the update process, I don't know; that's something that should probably also be verified in nsPermissionManager and NotificationTelemetryServiceConstructor, which is what is trying to init Storage here.
Cc-ing some people that worked on that code.

...
7 	xul.dll 	nsPermissionManager::InitDB(bool) 	extensions/cookie/nsPermissionManager.cpp:865
8 	xul.dll 	nsPermissionManager::Init() 	extensions/cookie/nsPermissionManager.cpp:808
9 	xul.dll 	nsPermissionManager::GetXPCOMSingleton() 	extensions/cookie/nsPermissionManager.cpp:778
10 	xul.dll 	nsIPermissionManagerConstructor 	extensions/cookie/nsCookieModule.cpp:17
11 	xul.dll 	nsComponentManagerImpl::CreateInstanceByContractID(char const*, nsISupports*, nsID const&, void**) 	xpcom/components/nsComponentManager.cpp:1170
12 	xul.dll 	nsComponentManagerImpl::GetServiceByContractID(char const*, nsID const&, void**) 	xpcom/components/nsComponentManager.cpp:1526
13 	xul.dll 	nsCOMPtr_base::assign_from_gs_contractid(nsGetServiceByContractID, nsID const&) 	xpcom/glue/nsCOMPtr.cpp:95
14 	xul.dll 	nsCOMPtr<nsIPermissionManager>::nsCOMPtr<nsIPermissionManager>(nsGetServiceByContractID) 	obj-firefox/dist/include/nsCOMPtr.h:555
15 	xul.dll 	mozilla::services::GetPermissionManager() 	xpcom/build/ServiceList.h:26
16 	xul.dll 	mozilla::dom::NotificationTelemetryService::RecordPermissions() 	dom/notification/Notification.cpp:724
17 	xul.dll 	NotificationTelemetryServiceConstructor

NotificationTelemetryServiceConstructor was added at the end of 2015 in bug 1219030.
Component: Storage → Application Update
Is it possible we get a web notification while Firefox is updating, and that causes this code to run?
Flags: needinfo?(MattN+bmo)
Flags: needinfo?(kit)
It shouldn't be possible except under the conditions I specified in comment #6.

Background: there are two ways updates are applied which are staged updates (the typical method used) and regular updates.

A staged update makes a copy of the installation files into a new directory, applies the update, and then when Firefox is restarted replaces the installation dir with this new dir where the update has been applied.

A regular update applies the update to the installation dir on Firefox restart.

There were several issues identified many years ago where the process wouldn't restore the original files on a failed update. The analysis for these issues can be seen in bug 635834 (there are other bugs I can track down if you would like me to). I also performed the analysis periodically over time to check for regressions, and there were none.
`NotificationTelemetryService` initializes at `profile-after-change`, before we show any notifications. We could try moving it to a later startup phase, but I wonder if that would then move the crash to the second caller that tries to initialize storage. If we are initializing too early, it's really bizarre that it took seven release cycles for the issue to arise.
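(A rough sketch of the "later startup phase" idea, hypothetical and not an actual patch: register an observer for a later notification and only touch Storage once it fires. The topic used here is just an example, and DeferStorageWork is a made-up helper.)

#include "nsCOMPtr.h"
#include "nsIObserver.h"
#include "nsIObserverService.h"
#include "mozilla/Services.h"

static void DeferStorageWork(nsIObserver* aObserver) {
  nsCOMPtr<nsIObserverService> obs = mozilla::services::GetObserverService();
  if (obs) {
    // "sessionstore-windows-restored" fires well after profile-after-change;
    // aObserver->Observe() would do the permission/telemetry work there
    // instead of during early startup.
    obs->AddObserver(aObserver, "sessionstore-windows-restored", false);
  }
}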
Flags: needinfo?(kit)
I don't have anything to add.
Flags: needinfo?(MattN+bmo)
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #15)
> It shouldn't be possible except under the conditions I specified in comment
> #6.
> 
> Background: there are two ways updates are applied which are staged updates
> (the typical method used) and regular updates.
> 
> A staged update makes a copy of the installation files into a new directory,
> applies the update, and then when Firefox is restarted replaces the
> installation dir with this new dir where the update has been applied.
> 
> A regular update applies the update to the installation dir on Firefox
> restart.
> 
> There were several issues identified many years ago where the process
> wouldn't restore the original files on a failed update. The analysis for
> these issues can be seen in bug 635834 (there are other bugs I can track
> down if you would like me to). I also performed the analysis periodically
> over a time to check if there were any regressions and there were none.
Note: the main analysis is in Bug 635161.
Interestingly this barely affected release, whereas it looked pretty bad when it hit beta52.  wontfix for 51/52 at this stage.
Given that the crash volume for this signature has been in the single digits for the past 7 days, this is a low-volume crash. We are entering RC week on the coming Monday, so this is now a wontfix for 53.
Given that there are no crashes for 54 in the past 2 weeks, marking 54 wontfix for now.
No crashes in 53, only one in 54.0a1 in the last week. If this continues with no crashes in current versions we can close the bug as WORKSFORME.
This is happening again in 55.0a1 after bumping SQLite in bug 1367892.
This is a P1 bug without an assignee. 

P1 are bugs which are being worked on for the current release cycle/iteration/sprint. 

If the bug is not assigned by Monday, 28 August, the bug's priority will be reset to '--'.
Keywords: stale-bug
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME