Crash in mozilla::places::Database::ForceCrashAndReplaceDatabase

RESOLVED FIXED in Firefox -esr52

Status

()

defect
P1
critical
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: calixte, Assigned: mak)

Tracking

(Blocks 1 bug, {crash})

55 Branch
mozilla56
Unspecified
Windows 10
Points:
---
Dependency tree / graph
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox-esr5255+ fixed, firefox54 wontfix, firefox55+ fixed, firefox56+ fixed)

Details

(Whiteboard: [clouseau][tbird crash][fxsearch], crash signature)

Attachments

(2 attachments, 1 obsolete attachment)

Reporter

Description

2 years ago
This bug was filed from the Socorro interface and is 
report bp-46dae67d-1cd7-440f-a69d-2b50e2170415.
=============================================================

There are 5 crashes (from 1 installation) in nightly 55 with buildid 20170414030225. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1355414.

[1] https://hg.mozilla.org/mozilla-central/rev?node=a121d68c6eec8c0d033581d8afceaba48b8ad8c5
Flags: needinfo?(mak77)
Assignee

Comment 1

2 years ago
(In reply to Calixte Denizet (:calixte) from comment #0)
> There are 5 crashes (from 1 installation) in nightly 55 with buildid
> 20170414030225. In analyzing the backtrace, the regression may have been
> introduced by patch [1] to fix bug 1355414.

This is not a regression, I added an intentional crash that will report us users whose database can't be migrated.
I'm working on bug 1355561 to reduce the number of databases that can't be fixed automatically.
Depends on: 1355561
Flags: needinfo?(mak77)
Assignee

Comment 2

2 years ago
Also, until bug 1355561 is fixed, the crash is what allows the user to go back to a working browser, since otherwise we would keep trying to use a broken places.sqlite.
Removing regression keyword based on Comment 1.
Keywords: regression
Assignee

Comment 4

2 years ago
note that before bug 1357366 we could enter a crash loop, so we could have multiple reports for the same client.

Updated

2 years ago
Whiteboard: [clouseau] → [clouseau][tbird crash]
Assignee

Updated

2 years ago
Priority: -- → P3

Comment 5

2 years ago
(In reply to Marco Bonardo [::mak] from comment #4)
> note that before bug 1357366 we could enter a crash loop, so we could have
> multiple reports for the same client.

current crash data suggests that single installations are still crashing multiple times with this signature
Assignee

Comment 6

2 years ago
(In reply to [:philipp] from comment #5)
> (In reply to Marco Bonardo [::mak] from comment #4)
> > note that before bug 1357366 we could enter a crash loop, so we could have
> > multiple reports for the same client.
> 
> current crash data suggests that single installations are still crashing
> multiple times with this signature

How do I see that on crash-stats?
The number of reports doesn't seem critical atm, but please let me know if I'm misinterpreting the numbers.

It's possible a system is just bad, and every time we create a database it just corrupts itself, or the user tries to restore a corrupt database that we removed.
Work on bug 1355561 is proceeding, but likely it won't be safe enough for an uplift to beta.

Comment 7

2 years ago
(In reply to Marco Bonardo [::mak] from comment #6)
> How do I see that on crash-stats?

on https://crash-stats.mozilla.com/signature/?signature=mozilla%3A%3Aplaces%3A%3ADatabase%3A%3AForceCrashAndReplaceDatabase in the product section it lists 15 crashes for 5 installations for 55.0b1 and 15 crashes for a single installation on 56.0a1 currently.

at https://crash-stats.mozilla.com/signature/?version=55.0b1&signature=mozilla%3A%3Aplaces%3A%3ADatabase%3A%3AForceCrashAndReplaceDatabase#reports if you sort by install time you'll get the crashes coming from the same machines shown next to each other.

> The number of reports doesn't seem critical atm, but please let me know if
> I'm misinterpreting the numbers.

there's not much crash data for the 55.0b1 cycle to make any claims with confidence yet since we're in the staged rollout phase where users on beta don't get automatically updated to 55 - but currently the 
[@ mozilla::places::Database::ForceCrashAndReplaceDatabase ] signature is the #2 crash in the browser process: https://crash-stats.mozilla.com/topcrashers/?product=Firefox&version=55.0b1&days=7
Assignee

Comment 8

2 years ago
If it's a single crash it should be fine, we are moving from a forever-broken browser experience, to a browser that crashes once and then it works.
The only worrisome part is the multiple crashes per user, that are unexpected. It's all cases where for some reason the places.sqlite file cannot be removed, like it's being kept open by antivirus software or other. indeed the moz_crash reason is always "Unable to remove the corrupt database file."

I think I could make an upliftable patch where we crash only if we fail to remove the db after a successful close.
If we fail to remove the file when the db was no even opened, the system is doomed by something else that is not us, so crashing may not bring to anything useful.
Assignee

Comment 9

2 years ago
Though, on a second thought, that would not yet be enough, since at the next startup we'd try again to open it, find it's corrupt and loop again since we will try to close and remove again.
Maybe we should just not crash in 55, and wait for the proper fix in 56.
Let's make this a P1 for re-evaluation.
Priority: P3 → P1
Whiteboard: [clouseau][tbird crash] → [clouseau][tbird crash][fxsearch]

Comment 10

2 years ago
bug 1355414 got uplifted to 54 and esr too...
Assignee

Comment 11

2 years ago
(In reply to [:philipp] from comment #10)
> bug 1355414 got uplifted to 54 and esr too...

right, it's what is allowing Thunderbird to open links again.
The real problem started earlier, it was probably a poor decision on my side to use a crash here, it would have been better to use opt-out telemetry.

Comment 12

2 years ago
i wasn't aware originally that it's on release as well when i commented earlier today - the impact there doesn't look too critical at the moment (32 crashes from 21 installations).
[Tracking Requested - why for this release]: Topcrash in early Beta55 results
Assignee

Updated

2 years ago
Assignee: nobody → mak77
Status: NEW → ASSIGNED
Comment hidden (mozreview-request)
Assignee

Comment 15

2 years ago
benjamin, what is the policy to uplift telemetry histograms? Is that something we can do?
Otherwise I can just make an uplift patch without the telemetry part.
Comment on attachment 8880979 [details]
Bug 1356812 - Use telemetry to report unfixable corrupt Places databases.

https://reviewboard.mozilla.org/r/152340/#review158014

Looks good, I love that we can actually test this!

::: toolkit/components/places/tests/unit/test_corrupt_telemetry.js:8
(Diff revision 1)
> +
> +// Tests that history initialization correctly handles a request to forcibly
> +// replace the current database.
> +
> +add_task(async function() {
> +  let profileDBPath =await OS.Path.join(OS.Constants.Path.profileDir, "places.sqlite");

Nit: space before await.
Attachment #8880979 - Flags: review?(past) → review+
Assignee

Comment 17

2 years ago
Comment on attachment 8880979 [details]
Bug 1356812 - Use telemetry to report unfixable corrupt Places databases.

Let's try with Rebecca.
Attachment #8880979 - Flags: review?(benjamin) → review?(rweiss)
Assignee

Comment 18

2 years ago
Also, please see the question in comment 15.

Comment 19

2 years ago
mozreview-review
Comment on attachment 8880979 [details]
Bug 1356812 - Use telemetry to report unfixable corrupt Places databases.

https://reviewboard.mozilla.org/r/152340/#review159938

data-r=me
I apologize for the delay: I didn't keep my review routine during the workweek and then both Rebecca and I were on PTO.

Please come talk to me about permanent monitoring of this since it seems like an important quality metric that we should have as part of mission control.
Attachment #8880979 - Flags: review+

Updated

2 years ago
Attachment #8880979 - Flags: review?(rweiss)
Comment hidden (mozreview-request)

Comment 21

2 years ago
Pushed by mak77@bonardo.net:
https://hg.mozilla.org/integration/autoland/rev/06ee61ade381
Use telemetry to report unfixable corrupt Places databases. r=bsmedberg,past

Comment 22

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/06ee61ade381
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Hi Marco, would you consider this fix a good candidate for uplift to 55? I came upon this bug as it was tracked for 55 and a top crasher in early beta builds. The change doesn't look so trivial. I noticed that there aren't many reports on 56. Perhaps the fix worked.
Flags: needinfo?(mak77)
Would be a good ESR52 candidate too if the risk isn't too high.
Assignee

Comment 25

2 years ago
Yes, we can uplift this, but I have to make an unbitrotted patch, since bug 1355561 exists only in 56 and I'm not going to uplift Storage fixes, those need the whole testing timeframe.

I'll try to unbitrot the patch against beta, and request uplift on that.
Assignee

Comment 26

2 years ago
Posted patch Patch for uplift (obsolete) — Splinter Review
Approval Request Comment
[Feature/Bug causing the regression]: bug 1355414
[User impact if declined]: When a places.sqlite is corrupt, and we cannot fix it in place, we currently crash (on purpose). On a second thought this is not a great user experience, so we changed the behavior to use opt-out telemetry instead. The difference is that replacing the database will require the user to restart the browser manually.
[Is this code covered by automated tests?]: new telemetry is, testing replace-on-restart is not.
[Has the fix been verified in Nightly?]: nope, but it's not trivially verifiable. There are no crashes reported after landing. Still waiting for telemetry aggregates.
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: no
[Why is the change risky/not risky?]: the code only runs in the already rare case of a db corruption, plus it's only setting a pref and reporting telemetry.
[String changes made/needed]: none
Flags: needinfo?(mak77)
Attachment #8885787 - Flags: approval-mozilla-esr52?
Attachment #8885787 - Flags: approval-mozilla-beta?
Assignee

Comment 27

2 years ago
Sigh, I forgot to amend the patch.
See comment 26 for the approvals.
Attachment #8885787 - Attachment is obsolete: true
Attachment #8885787 - Flags: approval-mozilla-esr52?
Attachment #8885787 - Flags: approval-mozilla-beta?
Attachment #8885854 - Flags: approval-mozilla-release?
Attachment #8885854 - Flags: approval-mozilla-beta?
Comment on attachment 8885854 [details] [diff] [review]
Patch for uplift

The volume of crashes in 54 looks low and there is no plan to have dot release for 54. Release54-.
Attachment #8885854 - Flags: approval-mozilla-release? → approval-mozilla-release-
Assignee

Comment 29

2 years ago
Comment on attachment 8885854 [details] [diff] [review]
Patch for uplift

Oops looks like I clicked release instead of esr52
Attachment #8885854 - Flags: approval-mozilla-esr52?
Comment on attachment 8885854 [details] [diff] [review]
Patch for uplift

use telemetry instead of crashing when facing a corrupt places.sqlite, beta55+

Should be in 55.0b9
Attachment #8885854 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Marco Bonardo [::mak] from comment #26)
> [Is this code covered by automated tests?]: new telemetry is, testing
> replace-on-restart is not.
> [Has the fix been verified in Nightly?]: nope, but it's not trivially
> verifiable. There are no crashes reported after landing. Still waiting for
> telemetry aggregates.
> [Needs manual test from QE? If yes, steps to reproduce]: no

Setting qe-verify- based on Marco's assessment on manual testing needs.
Flags: qe-verify-
Comment on attachment 8885854 [details] [diff] [review]
Patch for uplift

Looks like there are no crashes after 55.0b9. Let's uplift this to ESR52.3.
Attachment #8885854 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52+
Interestingly, the builds failed on my original push to ESR52 because the new histogram didn't have a bug_numbers field, but Trunk/Beta were perfectly fine with that. I've fixed it up, but I'm wondering if we need to add it to those other branches too.
Assignee

Comment 35

2 years ago
I don't know if bug_number is mandatory, and why it would only be on 52.
Flags: needinfo?(alessio.placitelli)
(In reply to Marco Bonardo [::mak] from comment #35)
> I don't know if bug_number is mandatory, and why it would only be on 52.

The |bug_numbers| field is mandatory, unless the probe is "whitelisted". If the field is not there and the probe is not whitelisted (i.e. in the histogram_whitelist.json file), then building should fail with an error. See https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/collection/histograms.html#bug-numbers

If it doesn't, it's a regression :( I've filed and taken bug 1382556.
Flags: needinfo?(alessio.placitelli)
I've attached a patch to bug 1382556 that fixes this regression and the definition of all the histograms that landed while it was broken. I'll uplift that to beta as well.
You need to log in before you can comment on or make changes to this bug.