Closed Bug 1528690 Opened 5 years ago Closed 5 years ago

Crash in [@ mozilla::dom::indexedDB::`anonymous namespace'::DatabaseConnection::AutoSavepoint::Start(const class mozilla::dom::indexedDB::`anonymous namespace'::TransactionBase*)]

Categories

(Core :: Storage: IndexedDB, defect, P1)

Unspecified
All
defect

Tracking

()

RESOLVED FIXED
mozilla68
Tracking Status
firefox-esr60 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 blocking fixed
firefox68 --- fixed

People

(Reporter: gsvelto, Assigned: tt)

References

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(4 files, 1 obsolete file)

This bug is for crash report bp-dde28edf-0b5b-4a68-8d14-4e40d0190217.

Top 10 frames of crashing thread:

0 xul.dll nsresult mozilla::dom::indexedDB::`anonymous namespace'::DatabaseConnection::AutoSavepoint::Start dom/indexedDB/ActorsParent.cpp:10021
1 xul.dll nsresult mozilla::dom::indexedDB::`anonymous namespace'::ObjectStoreAddOrPutRequestOp::DoDatabaseWork dom/indexedDB/ActorsParent.cpp:23588
2 xul.dll void mozilla::dom::indexedDB::`anonymous namespace'::TransactionDatabaseOperationBase::RunOnConnectionThread dom/indexedDB/ActorsParent.cpp:21249
3 xul.dll nsresult mozilla::dom::indexedDB::`anonymous namespace'::TransactionDatabaseOperationBase::Run dom/indexedDB/ActorsParent.cpp:21399
4 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1162
5 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:474
6 xul.dll nsresult mozilla::dom::indexedDB::`anonymous namespace'::ConnectionPool::ThreadRunnable::Run dom/indexedDB/ActorsParent.cpp:11738
7 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1162
8 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:474
9 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:333

Found this in nightly but it doesn't seem like a new crash since it's present in various older versions.

Windows-specific but happens on all supported versions.

OS: Windows 10 → Windows

crashes are also occurring on non-windows platforms with a slightly different signature.

Crash Signature: [@ mozilla::dom::indexedDB::`anonymous namespace'::DatabaseConnection::AutoSavepoint::Start(const class mozilla::dom::indexedDB::`anonymous namespace'::TransactionBase*)] → [@ mozilla::dom::indexedDB::`anonymous namespace'::DatabaseConnection::AutoSavepoint::Start(const class mozilla::dom::indexedDB::`anonymous namespace'::TransactionBase*)] [@ mozilla::dom::indexedDB::(anonymous namespace)::DatabaseConnection::AutoSavepoin…
OS: Windows → All

It's a crash so mark it as P1.

From the report [1], it crashes in [2]. It shows that the connection object cannot be got from the 'Database'. I guess that means somehow we haven't called EnsureConnection() before accessing the connection.

[1] https://crash-stats.mozilla.com/report/index/dde28edf-0b5b-4a68-8d14-4e40d0190217
[2] https://searchfox.org/mozilla-central/rev/93905b660fc99a5d52b683690dd26471daca08c8/dom/indexedDB/ActorsParent.cpp#10021

Priority: -- → P1

#11 top browser crash on nightly 67.

27 crashes for 67.0b1 (Devedition) released 2 days which is more crashes than we had for 66 betas while there is at least an order of magnitude difference in userbase. It seems that the situation is getting worse on 67 than on 66. Andrew, can we get somebody to investigate please? Thanks!

Flags: needinfo?(overholt)

Tom, are you able to take another look here?

Flags: needinfo?(overholt) → needinfo?(shes050117)

Sure.

Assignee: nobody → shes050117
Status: NEW → ASSIGNED
Flags: needinfo?(shes050117)

After looking into more, I reckon it's because the ObjectStoreAddOrRequestOp/TransactionDatabaseOperationBase is destroyed so that the connection is a dangling pointer. With the ActorDestory() stuff which Jan found, I think it's likely to happen.

The reason is:

  • connection is held by the database and the database is held by the transaction. The transaction is held by TransactionDatabaseOperationBase. [1]
  • The transaction only is null'd in Cleanup() and the function shouldn't be called before the RunOnConnectionThread()
  • EnsureConnection() is called slightly before the crash happened [2].

I'm working a patch to make this happen manually to verify my guess.

[1] https://hg.mozilla.org/mozilla-central/annotate/529782921812aaf0558fbb22384b3051174657d3/dom/indexedDB/ActorsParent.cpp#l10029
[2] https://hg.mozilla.org/mozilla-central/annotate/529782921812aaf0558fbb22384b3051174657d3/dom/indexedDB/ActorsParent.cpp#l21506

So, after testing it with the minidump, it actually crashes with the reason:
connection->mUpdateRefcountFunction.mRawPtr was nullptr.

It seems it's [1].

[1] https://searchfox.org/mozilla-central/rev/ddd1679c0534f7ddf36cafddd17b710c4fefe3c4/dom/indexedDB/ActorsParent.cpp#9487

What I found on the minidump:

  • Frame 0

    • connection isn't a nullptr
      • mInReadTransaction is true
      • mInWriteTransaction is false
      • mRefCnt is 2
    • mUpdateRefcountFunction is a nullptr
  • Frame 2 TransactionDatabaseOperationBase::RunOnConnectionThread

  • In the successful case of EnsureConnection()

The only place where the mUpdateRefcountFunction is set:

  • DatabaseConnection::BeginWriteTransaction()

https://searchfox.org/mozilla-central/rev/2738efcf98d746c8810819e7c0634b2c57212a8d/dom/indexedDB/ActorsParent.cpp#9356

It seems the problem is that:
If a new connection is set in EnsureConnection() is created, its (the connection object) mUpdateRefcountFunction is never set before the crash point.

I have no idea how do I create a test to close the connection while adding or putting an object store. So, I might right a patch to force it close and get a new one in EnsureConnection()

Attached patch Patch to crashSplinter Review

My guess is that the connection is closed for some reasons before the EnsureConnection. Although EnsureConnection generate a new connection for the database/transaction, the attribute of the connection wasn't set properly.

With this patch, I can reproduce the crash by running the test_objectStore_inline_autoincrement_key_added_on_put.htm

The only strange point is that CloseDatabase() is called only when the database is recognized as an idle database. It seems that there might have some issues.

Setting as a blocker for 67, with 200 crashes per day on beta, this is very likely to explode in release if we don't have a fix or a mitigation in place.

So, my understanding of this issue is that:
ConnectionPool::CloseDatabase() [1] is called before the PBackgroundIDBRequestParent* TransactionBase::AllocRequest() [2] is called. For a read transaction that should be okay, but it becomes a problem in the case of a write transaction. The reason is that in the Database::EnsureConnection() [3], it only creates a read transaction when it can't find out a connection for the given database. Consequently, for instance, it would crash when it starts the autosave point in ObjectStoreAddorPutRequestOp::DoDatabaseWork().

My first impression for this is that I should check whether we can prevent the connection from being closed in this case and also check if we can either abort the operation or update the read transaction to the write transaction.

For the "prevent the connection from being closed in this case":
I check all the caller of the CloseDatabase(). And, all of them are executed because the idle timer (30s) is expired. It seems they are reasonable.

For the "abort the operation or update the read transaction to the write transaction"

  • "Abort the operation"
    I hit the assertion for lacking mStoreageConnection [GetCachedStatement in 4] for the Transaction while the Transaction is commiting if I tried to abort the operation in TransactionDatabaseOperationBase::RunOnConnectionThread() (before ensuring the connection [5]).

  • "Update the read transaction to the write transaction"
    I hit the assertion for duplicating the key [6] while executing the statement if I tried to do that by calling BeginWriteTransaction() in ObjectStoreAddorPutRequestOp::DoDatabaseWork() [7]. It's a failure case for getting a constraint error of foreign key. It seems that part of the cached statement was rollbacked while closing the previous connection.

[1] https://searchfox.org/mozilla-central/rev/8d78f219702286c873860f39f9ed78bad1a6d062/dom/indexedDB/ActorsParent.cpp#11510
[2] https://searchfox.org/mozilla-central/rev/6db0a6a56653355fcbb25c4fa79c6e7ffc6f88e9/dom/indexedDB/ActorsParent.cpp#13797
[3] https://searchfox.org/mozilla-central/rev/6db0a6a56653355fcbb25c4fa79c6e7ffc6f88e9/dom/indexedDB/ActorsParent.cpp#12503
[4] https://searchfox.org/mozilla-central/rev/8d78f219702286c873860f39f9ed78bad1a6d062/dom/indexedDB/ActorsParent.cpp#21886
[5] https://searchfox.org/mozilla-central/rev/8d78f219702286c873860f39f9ed78bad1a6d062/dom/indexedDB/ActorsParent.cpp#21564-21572
[6] https://searchfox.org/mozilla-central/rev/8d78f219702286c873860f39f9ed78bad1a6d062/dom/indexedDB/ActorsParent.cpp#24214
[7] https://searchfox.org/mozilla-central/rev/8d78f219702286c873860f39f9ed78bad1a6d062/dom/indexedDB/ActorsParent.cpp#23946

I'll take a look too.

Flags: needinfo?(jvarga)

(In reply to Tom Tung [:tt, :ttung] from comment #13)

  • "Update the read transaction to the write transaction"
    I hit the assertion for duplicating the key [6] while executing the statement if I tried to do that by calling BeginWriteTransaction() in ObjectStoreAddorPutRequestOp::DoDatabaseWork() [7]. It's a failure case for getting a constraint error of foreign key. It seems that part of the cached statement was rollbacked while closing the previous connection.

After adding more log,
Under a VersionChangeTransaction

  1. CreateObjectStoreOp::DoDatabaseWork() -> Insert (id, auto_increament, name, key_path) into table object_store
  2. ObjectStoreAddOrPutRequestOp::DoDatabaseWork() -> Insert (object_store_id, ...) into table object_data

However, if I manually closed the connection of the transaction to mimic idle timer expired between 1 and 2 (in fact, I closed it at the beginning of initializing ObjectStoreAddOrPutRequestOp), I found out that the inserted statment could be find out (by select the id from table object_store) before the connection is closing (at the SendingToConnectionPool of ObjectStoreAddOrPutRequestOp). The problem is that after the connection is closed, I couldn't find out the inserted statment if I wanted to select the id from the table object_store. Consequectly, it leads a foreign key contraint error when inserting a object_store_id to table object_data.

The problem for me on this is that, I assemed the statment would be written into a SQLite database after the Op completes. However, it seems I was wrong.

It seems that the only way I can do for this is to abort the transaction for the write connection if the connection is about to be closed. And, the remaining problem of this is that it seems there is no easy way to get a transaction from the ConnectionPool. I need to figure a way out of this.

(In reply to Jan Varga [:janv] from comment #14)

I'll take a look too.

I have to re-evaluate this, we have only 3 weeks for stabilizing LSNG, so I'd like to focus on that.
Not sure who else could help here. Everyone seems to be busy right now.

Flags: needinfo?(jvarga)

This is a very high crash volume on beta and would block the 67 release, so it should be a high priority.
Andrew, is there anyone else who can help out here?

Flags: needinfo?(overholt)

We're working on a plan here.

Flags: needinfo?(overholt)

(In reply to Tom Tung [:tt, :ttung] from comment #13)

  • "Abort the operation"
    I hit the assertion for lacking mStoreageConnection [GetCachedStatement in 4] for the Transaction while the Transaction is commiting if I tried to abort the operation in TransactionDatabaseOperationBase::RunOnConnectionThread() (before ensuring the connection [5]).

I'm thinking to rollback all the statement when the connection for a write transaction is eventually closed. Then, I will need to notify the Transaction and set some flag there. So that when CommitOp::Run is executing, we can bypass these GetCachedConnection()s and also the assertions.

I will check if it's really safe to do that (the code is really complicated, though) before the discussion tomorrow.

We discussed this again and Tom has a plan to add a null check. He'd also like to convert an assertion to a diagnostic assertion and asuth will coordinate with sheriffs (tomorrow) so we don't have an explosion of crashes due to that.

The real problem is because the connection was closed for some reason. This
patch won't resolve the issue by only let the operation and the transaction fail
to escape from the crash. Note that it also changing an assertion when the
transaction is either commiting or aborting. This is prevent the problem for
closing connection is worser than we expected.

Pushed by bugmail@asutherland.org:
https://hg.mozilla.org/integration/autoland/rev/6520a5a9e6f3
P1 - Mitigate the issue by checking the pointer before accessing it; r=asuth
Pushed by bugmail@asutherland.org:
https://hg.mozilla.org/integration/autoland/rev/7b127d669026
P2 - Change some assertions to diagnostic assertion to get more information on Nightly; r=asuth
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68

Comment on attachment 9059381 [details]
Bug 1528690 - P1 - Mitigate the issue by checking the pointer before accessing it;

Beta/Release Uplift Approval Request

  • User impact if declined: Crash
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It's a nullptr check before the place where users crashed. If a user runs into that check, it would return an error so that the operation would be aborted.
  • String changes made/needed:
Attachment #9059381 - Flags: approval-mozilla-beta?

Comment on attachment 9059381 [details]
Bug 1528690 - P1 - Mitigate the issue by checking the pointer before accessing it;

Fixes a top crash, approved for 67.0b13.

Attachment #9059381 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Volume on ESR60 is low enough that I don't think we need to worry about backporting there. Feel free to nominate if you feel strongly otherwise, however.

Pushed by shes050117@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/2942b016522b
P3 - Change more assertions to diagnostic assertion to get more information on Nightly; r=asuth

(In reply to Pulsebot from comment #32)

Pushed by shes050117@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/2942b016522b
P3 - Change more assertions to diagnostic assertion to get more information
on Nightly; r=asuth

This is a patch to diagnosis the issue so that it is expected to make some users crash at DatabaseConnection::~DatabaseConnection() or DatabaseConnection::Close().

I'll wait two more days to see if P3 can catch the issues on Nightly. If it's not, I'll change or add more assertions and make the assertions in P3 into release assertion and uplift that to Beta to catch the issue as soon as possible.

Depends on: 1546892

Comment on attachment 9063568 [details]
Backed out changeset 1a7c349c2c60 for causing Bug 1528690;

Revision D30387 was moved to bug 1546892. Setting attachment 9063568 [details] to obsolete.

Attachment #9063568 - Attachment is obsolete: true

Please specify a root cause for this bug. See :tmaity for more information.

Root Cause: --- → ?
Root Cause: ? → Coding: Unhandled Exceptions
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: