Consider removing the temporary fix for bug 1528690

RESOLVED FIXED

Status

()

enhancement
P2
normal
RESOLVED FIXED
3 months ago
2 months ago

People

(Reporter: tt, Assigned: tt)

Tracking

(Blocks 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(crash signature)

Attachments

(3 attachments, 1 obsolete attachment)

The patch Bug 1481164 might also fix the bug 1528690 because that corrects threading for closing connection in the case of missing a connection.

No longer blocks: 1481164
Depends on: 1481164

We haven't collected any crash from the assertions in bug 1528690. So, that
means there is no a closing connection and its mInWriteTransaction flag is true
or its transaction count is non-zero.

Bug 1481164 corrects dispatching to the wrong thread for closing idle databases.
That might be related to the mUpdateRefcountFunction issue because that make it
possible to access dbInfo, transactionInfo on a Connection thread.

Therefore, this patch intends to remove the pointer check on Nightly and enable
an assertion to see if the problem is still exsit.

Pushed by shes050117@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/1a7c349c2c60
P1 - Removing the temporary fix for bug 1528690; r=asuth

We should track if there are crashes after the patch goes to m-c

Keywords: leave-open
Crash Signature: [@ mozilla::dom::indexedDB::`anonymous namespace'::DatabaseConnection::AutoSavepoint::Start(const class mozilla::dom::indexedDB::`anonymous namespace'::TransactionBase*) ] [@ mozilla::dom::indexedDB::(anonymous namespace)::DatabaseConnection::AutoSavepoi…
Attachment #9063569 - Attachment description: Backed out changeset 1a7c349c2c60 for causing Bug 1528690; → Bug 1546892 - Backed out changeset 1a7c349c2c60 for causing Bug 1528690;
Backout by apavel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c61887431ba8
Backed out changeset 1a7c349c2c60 for causing Bug 1528690 on a CLOSED TREE
Flags: needinfo?(shes050117)

Thanks! So, bug 1481164 doesn't fix the crash because missing mUpdateRefcountFunction from a connection of ObjectStoreAddOrPutRequestOp. I need to figure out the root cause of that crash.

Flags: needinfo?(shes050117)
Attachment #9063569 - Attachment is obsolete: true

Not sure if it's related to the root cause yet, but in the minidump of a crash report. I saw the mResultCode of the aTransaction is NS_ERROR_STORAGE_IOERR(2153971714) [1].

[1] https://searchfox.org/mozilla-central/rev/11cfa0462a6b5d8c5e2111b8cfddcf78098f0141/dom/indexedDB/ActorsParent.cpp#10053

(In reply to Tom Tung [:tt, :ttung] from comment #8)

Not sure if it's related to the root cause yet, but in the minidump of a crash report. I saw the mResultCode of the aTransaction is NS_ERROR_STORAGE_IOERR(2153971714) [1].

And the mInvalidate of it is false

When the TransactionDatabaseOperationBase is constructing, the const member variable mTransactionIsAbort is initialized to be aTransaction->IsAbort(). At the third of the stack(TransactionDatabaseOperationBase::RunOnConnectionThread()), the mTransactionIsAbort is false. So, it means a failure for the transacion happen between the time when TransactionDatabaseOperationBase is constructing and the time when it crashes.

mCommitOrAbortReceived is false so it's not TransactionBase::RecvAbort before the EnsureConnection of something like that.
I wonder if it's this [1]

[1] https://searchfox.org/mozilla-central/rev/11cfa0462a6b5d8c5e2111b8cfddcf78098f0141/dom/indexedDB/ActorsParent.cpp#20724

(In reply to Tom Tung [:tt, :ttung] from comment #11)

mCommitOrAbortReceived is false so it's not TransactionBase::RecvAbort before the EnsureConnection of something like that.
I wonder if it's this [1]

[1] https://searchfox.org/mozilla-central/rev/11cfa0462a6b5d8c5e2111b8cfddcf78098f0141/dom/indexedDB/ActorsParent.cpp#20724

There is an element called __vfptr and it shows that the mTransaction is implemented by a NormalTransaction.
Combine with all the information above, there is only one possible place:

https://searchfox.org/mozilla-central/rev/11cfa0462a6b5d8c5e2111b8cfddcf78098f0141/dom/indexedDB/ActorsParent.cpp#21772

Some other TransactionDatabaseOperationBase set that result and the mTransaction is aborted right after the executing ObjectStoreAddOrPutRequestOp jumped from background thread to the connection thread.

It seems that's a timing issue. I'm going to try to figure out why the connection is closed by that "other" TransactionDatabaseOperationBase, but it appears that there is some storage error. I guess I could consider aborting the transaction earlier if it's possible.

I feel like I find out a possible way to run into this issue.

So there is a Database::StartTransactionOp and a ObjectStoreAddOrPutRequestOp. The StartTransactionOp executes first and then the ObjectStoreAddOrPutRequestOp executes right after it.

  1. TheStartTransactionOp fails to transfer the connection into a write connection on the connection thread [1] while ObjectStoreAddOrPutRequestOp is initializing on the background thread.
  2. The StartTransactionOp jumps back to the background thread and then abort the transaction while ObjectStoreAddOrPutRequestOp jumps from the background thread to the connection thread.
  3. Although the transaction is aborted, the ObjectStoreAddOrPutRequestOp isn't able to know the connection was aborted and keep using that connection (doesn't have a mUpdateRefcountFunction).
  4. Hit the issue.

I need to verify if it's possible to have the story above.

[1] https://searchfox.org/mozilla-central/rev/11cfa0462a6b5d8c5e2111b8cfddcf78098f0141/dom/indexedDB/ActorsParent.cpp#13142

It turns out that the temporary fix for bug 1528690 might be the right way to
do. Based on the recent crash reports, the mResultCode for the Transaction
object wasn't an NS_OK and the crashing operation hasn't updated the failure to
that yet. So, the right way to do here is that we should abort the operation on
the connection thread.

(In reply to Tom Tung [:tt, :ttung] from comment #14)

Created attachment 9066412 [details]
Bug 1546892 - Update the comment for the fix and add an assertion to ensure all cases have been caught;

It turns out that the temporary fix for bug 1528690 might be the right way to
do. Based on the recent crash reports, the mResultCode for the Transaction
object wasn't an NS_OK and the crashing operation hasn't updated the failure to
that yet. So, the right way to do here is that we should abort the operation on
the connection thread.

The follow-up or another thing which needs to do is to answer the question "Why does the number of NS_ERROR_STORAGE_IOERR for doing BeginWriteTransaction() increase dramatically on Windows in these a few months?"

I should also try to make a test to verify it. Something like removes the SQLite database while doing IDB stuff, or utilize functions in TelemetryVFS to create the error.

(In reply to Tom Tung [:tt, :ttung] from comment #16)

I should also try to make a test to verify it. Something like removes the SQLite database while doing IDB stuff, or utilize functions in TelemetryVFS to create the error.

I'm trying that by other methods, the problem I have now is that the test often hit some MOZ_ASSERT in advance. I will figure out how to make it easier or more straightforward (crashing on the same point for the crash report).

I can reproduce the issue by manually making BeginWriteTransacation() fails when StartTransactionOp calls that function and forcing all existing connections being idle.

I will try to make the BeginWriteTransacation() fails without changing any c++ code.

Posted patch test.patchSplinter Review

With this patch, I can constantly hit this issue on both my Mac and Ubuntu and that verifies my guess. Note, only on opt builds, because debug builds would run into many MOZ_ASSERTs before reaching the crashing place.

The next step is to verify if removing sqlite file can make BeginWriteTransaction() fail so that we can remove the forcing failure code on the c++.

(In reply to Tom Tung [:tt, :ttung] from comment #19)

The next step is to verify if removing sqlite file can make BeginWriteTransaction() fail so that we can remove the forcing failure code on the c++.

I set a breakpoint at the beginning of that function and remove the sqlite files, but I didn't get an error in that function...

(In reply to Tom Tung [:tt, :ttung] from comment #20)

(In reply to Tom Tung [:tt, :ttung] from comment #19)

The next step is to verify if removing sqlite file can make BeginWriteTransaction() fail so that we can remove the forcing failure code on the c++.

I set a breakpoint at the beginning of that function and remove the sqlite files, but I didn't get an error in that function...

That result was on Mac. I should and will test it on Windows. I got issues while getting "ProfD" on the test. I guess that means I need to have a chrome privilege. Thus, I'm going to transfer the mochitest test to a browser mochitest test.

Blocks: 1555931
No longer depends on: 1481164
Pushed by shes050117@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/7a8bb795a4c9
Update the comment for the fix and add an assertion to ensure all cases have been caught; r=asuth
Status: ASSIGNED → RESOLVED
Type: task → enhancement
Closed: 2 months ago
Keywords: leave-open
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.