Closed Bug 1539640 Opened 7 months ago Closed 6 months ago

Crash in [@ walCleanupHash]

Categories

(Toolkit :: Storage, defect, P2, critical)

x86
Windows 7
defect
Points:
1

Tracking

()

RESOLVED FIXED
Iteration:
68.4 - Apr 29 - May 12
Tracking Status
firefox-esr60 --- wontfix
firefox66 --- wontfix
firefox67 --- fixed
firefox68 --- fixed

People

(Reporter: marcia, Assigned: mak)

References

Details

(Keywords: crash, regression, regressionwindow-wanted)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-51db773c-52a7-464c-892a-821550190327.

Seen while looking at beta crash data, but present in nightly as well: https://bit.ly/2TB2VPn. In looking back 6 months, it seems crashes started accumulating more frequently in 67 nightly. Signature is Windows only and the majority of the crashes are in Win 7.

No useful comments. The moz crash reason for all crashes appears to be EXCEPTION_ACCESS_VIOLATION_READ.

Top 10 frames of crashing thread:

0 nss3.dll static void walCleanupHash db/sqlite3/src/sqlite3.c:59199
1 nss3.dll static int sqlite3WalUndo db/sqlite3/src/sqlite3.c:61266
2 nss3.dll static int pagerRollbackWal db/sqlite3/src/sqlite3.c:53631
3 nss3.dll static int pagerPlaybackSavepoint db/sqlite3/src/sqlite3.c:53902
4 nss3.dll static int sqlite3PagerRollback db/sqlite3/src/sqlite3.c:57143
5 nss3.dll static int sqlite3BtreeRollback db/sqlite3/src/sqlite3.c:67206
6 nss3.dll static void sqlite3RollbackAll db/sqlite3/src/sqlite3.c:154993
7 nss3.dll static int sqlite3VdbeHalt db/sqlite3/src/sqlite3.c:79039
8 nss3.dll static int sqlite3VdbeExec db/sqlite3/src/sqlite3.c:86545
9 nss3.dll sqlite3_step db/sqlite3/src/sqlite3.c:81781

This is another bug where having bug 1274628 and bug 1289666 and friends addressed in crash-stats/related infra would be useful because it's hard to tell what's bad hardware/memory corruption without spending a lot of manual investigation time.

In the meantime, we are still working on various IndexedDB correctness improvements that will help ensure this isn't just a wacky use-after-free, so it's unlikely to make sense to look at this from an IDB perspective until we have more confidence about that.

Marking 66 as affected as there are crash signatures for 66 too.

(In reply to Neha Kochar [:neha] from comment #2)

Marking 66 as affected as there are crash signatures for 66 too.

Just a couple though, whereas 67 is way worse.

Marco FYI this crash signature is spiking in 67 beta.

Flags: needinfo?(mak77)

Andrew, could you please point out some bugs that may help with this?
Since this starts from indexedDB, and that code changed a lot recently, I'm a little bit out of my water to what are our options here.

In the meanwhile I notified the Sqlite team about this signature, in case they have any insight for the Sqlite side of the stack. but I think we need someone with indexedDB knowledge to provide a possible path to solve this here, even if it's just about incoming improvements that may reduce the likelihood of this.

Flags: needinfo?(mak77) → needinfo?(bugmail)

We haven't seen anything like this before.

An observation: The sLoc.aHash[] array that is being accessed when the crash occurs is memory mapped into the *-shm file.

From what we read, on both unix and windows, an external thread or process can truncate or delete the memory-mapped file and that should not make the memory go away. Everything should still work. But maybe if something inside of indexedDB is closing the file descriptor out from under SQLite, then problems might arise, maybe? I don't know and I'm not making an accusation - I'm just brainstorming ideas...

Would y'all be willing to push out a nightly that contains SQLite 3.27.2 with a small patch? (https://www.sqlite.org/src/info/1109942a50395583). This patch against 3.27.2 checks for a condition that we believe is impossible to reach but if it were reached would result in the behavior your are seeing. If you could run with 3.27.2+patch and observe that your crash spike goes away, that would be an important clue to us that we need to dig further into finding some way of reaching the unusual state that the branch detects.

If you are willing to do this, there are "Tarball" and "ZIP archive" links on the page above that will give you the complete source code for 3.27.2 plus the patch. Just download and "./configure; make sqlite3.c".

Some details on the code: The NEVER() macro will be a no-op for your builds. It will be defined thusly:

#define NEVER(X)  X

The purpose of the NEVER() macro is to indicate that we believe the branch can never be taken, and hence we do not have a test case for that branch.

The patch will be in the forthcoming 3.28.0 release of SQLite regardless. The check-in linked above is a cherrypick of the patch from trunk onto the 3.27 branch.

Thanks for the insight, much appreciated. I'll let Andrew answer regarding the indexedDB behavior with file descriptors, I don't know about it.

I'd have no problem temporarily testing a patch in Nightly, but the crash spike is only visible in Beta, Nightly has a few reports, but not many, that means either some timing difference is covering the crash, something changed in indexedDB, or just that the population is too small to hit this case often enough.
My fear is that we couldn't observe a meaningful shift in the number of crashes with the Nightly population, to be able to extrapolate reasoning out of it.

Testing your patch in Beta is a question for release drivers, I can't make such a decision by myself. The patch looks unharmful and from a technical point of view it should have no effect on any of our measurements.
Julien, what do you think, would release drivers allow this small diagnostic patch (see the link in comment 7) in Beta for a week?

Flags: needinfo?(jcristau)

(In reply to Marco Bonardo [::mak] from comment #8)

Testing your patch in Beta is a question for release drivers, I can't make such a decision by myself. The patch looks unharmful and from a technical point of view it should have no effect on any of our measurements.
Julien, what do you think, would release drivers allow this small diagnostic patch (see the link in comment 7) in Beta for a week?

Answering for Julien as he does 68 and I am the release manager for 67. Yes I would be OK testing this patched SQLite for a week on beta.

Flags: needinfo?(jcristau)

Ok, I can prepare that patch.

Add a bailout on error to check whether this is the cause for recent spike in Sqlite related crashes

Comment on attachment 9055763 [details]
Bug 1539640 - Diagnostic patch for Sqlite crash. r=asuth

Beta/Release Uplift Approval Request

  • Feature/Bug causing the regression: None
  • User impact if declined: We may not have useful data to debug this crash
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): error check bailout, simple one-liner
  • String changes made/needed: none
Attachment #9055763 - Flags: approval-mozilla-beta?
Priority: -- → P2

IndexedDB is a fairly straightforward consumer of our mozStorage bindings that wrap SQLite (apart from the refcount-updating custom function and the triggers). Everything happens in the parent process and the file descriptors are never touched. The VFS used by all mozStorage consumers (https://searchfox.org/mozilla-central/source/storage/TelemetryVFS.cpp) hasn't meaningfully changed for quite some time.

That said, Firefox's entire IPC infrastructure does depend on shipping fd's all over the place, so it is conceivable that a bug in the IPC/sandbox layer could result in a problem. (The process could also maybe run out of fd's?)

Some Partial Analysis

So, a lot of the crash reports are somehow missing "xul.dll" as a module, which is a bit disturbing, and presumably then why xul.dll frames can't be symbolicated. Maybe that's a race between the auto-updater doing things and the crash reporter? Anyways, here are some good crashes that do have the xul.dll module and symbols:

Stacks clearly indicate the browser is shutting down with QuotaManager waiting on all its clients to finish up following an attempt to abort everything in process.

The direct sequence of events leading to the CommitOp::Run that's calling into SQLite would seem to be:
On the "IPDL Background" thread:

  • mozilla::dom::quota::QuotaManager::Shutdown calls Client::ShutdownWorkerThreads.
  • mozilla::dom::indexedDB::QuotaClient::ShutdownWorkerThreads() calls
  • QuotaClient::AbortOperations() calls
  • Database::Invalidate() calls
  • inner class Helper::InvalidateTransactions() calls
  • TransactionBase::Invalidate() calls
  • TransactionBase::Abort(, aForce=false) calls
  • TransactionBase::MaybeCommitOrAbort() where mForceAborted=false, calls
  • TransactionBase::CommitOrAbort() creates a CommitOp and passes it to a call to:
  • ConnectionPool::Finish creates a FinishCallbackWrapper passing the CommitOp as aCallback in its call to:
  • ConnectionPool::Dispatch which dispatches the FinishCallbackWrapper to the database connection thread which is the crashing thread.

On the crashing connection thread:

  • FinishCallbackWrapper::Run is run by the thread's event loop, it calls:
  • (TransactionBase::CommitOp::Run which
    • does check if the connection exists and bails if it doesn't... and I think our close hygiene is pretty good, so, hm, maybe not a use-after-free on this.
Flags: needinfo?(bugmail)

Comment on attachment 9055763 [details]
Bug 1539640 - Diagnostic patch for Sqlite crash. r=asuth

Diagnostic patch to investigate a crash, uplift approved for 67 beta 9, thanks.

Attachment #9055763 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

the crash signature seems to have disappeared in 67.0b9

(In reply to D. Richard Hipp from comment #7)

Would y'all be willing to push out a nightly that contains SQLite 3.27.2 with a small patch? (https://www.sqlite.org/src/info/1109942a50395583). This patch against 3.27.2 checks for a condition that we believe is impossible to reach but if it were reached would result in the behavior your are seeing. If you could run with 3.27.2+patch and observe that your crash spike goes away, that would be an important clue to us that we need to dig further into finding some way of reaching the unusual state that the branch detects.

Richard, it looks like we are indeed hitting that condition, considered the patch basically resolved the crashes with this stack. I hope this helps your investigation on Sqlite side.

Flags: needinfo?(drh)

Yes. We will be digging deeper in trying to recreate the problem.

Note that the problem is solved in the patch you have, and in the latest trunk version of SQLite (that will become 3.28.0). The issue for us is that we do not yet have a test case for this problem. But now that you have confirmed that it does occur, we'll spend more time trying to get an actual test case.

The branch at https://www.sqlite.org/src/artifact/8bf87820896453ee?ln=1005 is what detects the error condition. As you can see, that branch is currently enclosed in a NEVER() macro, indicating that we do not have a test case to exercise that branch, and so that branch never occurs during testing. The NEVER() macro is a no-op for your builds. But for debugging builds, NEVER() causes an assertion fault if the condition is true. The task before us SQLite developers is to develop a test case that causes the condition inside of the NEVER() macro to be true.

Thanks for testing the patch.

Flags: needinfo?(drh)

Thank you, much appreciated.
We'll pick 3.28.0 as soon as it's available, and because the current patch is already in upstream, we can keep it without risks (we don't patch our copy of Sqlite, unless the proposed patch made upstream).

We can keep this bug open until we have a bug for the new Sqlite version, and re-evaluate the situation once the 68 merge approaches in the middle of May.

Depends on: SQLite3.28.0

Sqlite has been updated to 3.28.0

Assignee: nobody → mak77
Status: NEW → RESOLVED
Closed: 6 months ago
Resolution: --- → FIXED
Points: --- → 1
Iteration: --- → 68.4 - Apr 29 - May 12
You need to log in before you can comment on or make changes to this bug.