Intermittent dom/indexedDB/test/test_file_sharing.html | application crashed [@ RefPtr<mozilla::dom::BlobImpl>::operator->][@ ObjectStoreAddOrPutRequestOp::DoDatabaseWork]

RESOLVED FIXED in Firefox 52

Status

()

Core
DOM: IndexedDB
RESOLVED FIXED
a year ago
a year ago

People

(Reporter: Treeherder Bug Filer, Assigned: asuth)

Tracking

({assertion, crash, intermittent-failure})

unspecified
mozilla54
assertion, crash, intermittent-failure
Points:
---

Firefox Tracking Flags

(firefox52 fixed, firefox-esr52 fixed, firefox53 fixed, firefox54 fixed)

Details

Attachments

(2 attachments, 1 obsolete attachment)

The timing of this fits nicely with bug 1319531. Can you please take a look, asuth?
Blocks: 1319531
status-firefox52: --- → affected
status-firefox53: --- → affected
status-firefox54: --- → affected
Component: DOM → DOM: IndexedDB
Flags: needinfo?(bugmail)
Keywords: assertion, crash
Summary: Intermittent dom/indexedDB/test/test_file_sharing.html | application crashed [@ RefPtr<mozilla::dom::BlobImpl>::operator->] → Intermittent dom/indexedDB/test/test_file_sharing.html | application crashed [@ RefPtr<mozilla::dom::BlobImpl>::operator->][@ ObjectStoreAddOrPutRequestOp::DoDatabaseWork]
(Assignee)

Updated

a year ago
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
(Assignee)

Comment 2

a year ago
DatabaseFile::GetBlockingInputStream() is in the midst of executing on the Connection Thread when ObjectStoreAddOrPutRequestOp::Cleanup() interrupts it on the Background Thread and nulls mBlobImpl right out from under it.

My fix in bug 1319531 introduced this race.  Previously, mBlobImpl was only ever accessed on the background thread.  The lifecycle looked like: mBlobImpl initialized, mBlobImpl used to produce an nsIInputStream 0+ times, mBlobImpl cleared by ActorDestroy and/or ClearInputStream().  The ownership rule was only the background thread touched mBlobImpl.

With the fix introducing BlobImpl::GetSize(), file I/O became possible and so mBlobImpl started being accessed on the connection thread as well.  The bug/race is due to:
- The effective mBlobImpl life-cycle rule was initialization and clearing on the background thread, but read multiple times on the connection thread.
- ObjectStoreAddOrPutRequestOp::Init() errs on the side of scheduling a file to be written multiple times until it's sure the file has already been written to disk.  This can potentially result in a file being written multiple times.  I think this might better explain FileHelper::CreateFileFromStream's original undocumented confusing behavior where it checked the file size and skipped the write if it was already correct.  (Which I raised in https://bugzilla.mozilla.org/show_bug.cgi?id=1312808#c16 review hunk 4.)  I think the "exists" check CreateFileFromStream ended up with still can make sense, however.
- test_file_sharing.html schedules the same blob to be written in 2 consecutive add() operations within the upgradeneeded transaction.  This produces the race we are witnessing here where the first add() operation has successfully written the blob to disk and set mCopiedSuccessfully and its Cleanup() is clearing it at the same time DoDatabaseWork() is trying to perform a redundant write.

I'm going to bypass options like "use a monitor" and "save a copy of mBlobImpl off" in favor of eliminating the StoredFileInfo::mCopiedSuccessfully indirection and instead directly clearing mBlobImpl on the connection thread after a successful CreateFileFromStream().  mCopiedSuccessfully was appropriate when mBlobImpl was only accessed on the background, thread.  But with our new control flow, we can clear it on the connection thread and also thereby eliminate the potential duplicate write problem.

The new mBlobImpl lifecycle becomes:
- Initialized on background thread at DatabaseFile creation time.  Safe because by definition no other thread has a reference during initialization.
- Cleared only on the connection thread.  (Or by the DatabaseFile destructor via its threadsafe refcount hitting zero.  But by definition that means no other thread has a reference to access.)  Note that we already removed the ActorDestroy clearing in the fix.
- Accessed on both threads.  But the background thread case is only to call bool-returning HasBlobImpl().  The connection-thread DoDatabaseWork can handle a previous op's DoDatabaseWork having cleared mBlobImpl out on the same thread, which is how we handle the attempted duplicate write case test_file_sharing.html and others can create.  (So even if HasBlobImpl() on the background thread always lied and returned true, it would not matter.)

Patch and try run shortly.
(Assignee)

Comment 4

a year ago
Created attachment 8833238 [details] [diff] [review]
Stop defering mBlobImpl clear via mCopiedSuccessfully, clear immediately. v1
Flags: needinfo?(bugmail)
Attachment #8833238 - Flags: review?(jvarga)

Comment 5

a year ago
Comment on attachment 8833238 [details] [diff] [review]
Stop defering mBlobImpl clear via mCopiedSuccessfully, clear immediately. v1

Review of attachment 8833238 [details] [diff] [review]:
-----------------------------------------------------------------

Great analysis as usual :)

I can only say that it seems we can get rid of HasBlobImpl() now by:
- removing checks for HasBlobImpl() in ObjectStoreAddOrPutRequestOp::Init()
- removing mFileManager
- adding mDatabase and initialize it in ObjectStoreAddOrPutRequestOp constructor
- calling mDatabase->GetFileManager() in ObjectStoreAddOrPutRequestOp::DoDatabaseWork() to get fileManager when it's needed

I think this makes sense given that HasBlobImpl() is not reliable as you outlined in your comment and the lifecycle will be even cleaner/clearer.

All the comments related to mBlobImpl and stuff will need to be updated too.

Anyway, great analysis, comments and fix.
Thanks!
Attachment #8833238 - Flags: review?(jvarga) → review+

Comment 6

a year ago
(In reply to Andrew Sutherland [:asuth] from comment #2)
> multiple times.  I think this might better explain
> FileHelper::CreateFileFromStream's original undocumented confusing behavior
> where it checked the file size and skipped the write if it was already
> correct.  (Which I raised in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1312808#c16 review hunk 4.)  I
> think the "exists" check CreateFileFromStream ended up with still can make
> sense, however.

Ah, so this was the reason for having the original check. I thought it was for the corner case described at https://dxr.mozilla.org/mozilla-central/source/dom/indexedDB/ActorsParent.cpp#29386
Anyway, as you said, it still makes sense for the corner case.

Comment 7

a year ago
24 failures in 733 pushes (0.033 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* autoland: 12
* mozilla-inbound: 9
* try: 1
* mozilla-central: 1
* mozilla-beta: 1

Platform breakdown:
* linux64: 12
* linux32: 11
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1335054&startday=2017-01-30&endday=2017-02-05&tree=all
(Assignee)

Comment 9

a year ago
Created attachment 8834276 [details] [diff] [review]
Interdiff: remove HasBlobImpl, mFileManager.

These are the changes as requested.  The one deviation was that since ObjectStoreAddOrPutRequestOp subclasses TransactionDatabaseOperationBase (via NormalTransactionOp), ObjectStoreAddOrPutRequestOp has access to TransactionDatabaseOperationBase's mTransaction via public Transaction(), so we're able to pull the FileManager out via `Transaction()->GetDatabase()->GetFileManager();` without having to save off mDatabase.  (We know this to be safe because TransactionDatabaseOperationBase::RunOnConnectionThread which calls DoDatabaseWork() explicitly asserts on mTransaction and mTransaction->GetDatabase() and the Database class never clears its mFileManager.)

Alternately we could have exposed DatabaseConnection's mFileManager by adding a new public accessor, since the signature is DoDatabaseWork(DatabaseConnection* aConnection).

I'll land with the patches folded together after sufficiently green try run.

Comment 10

a year ago
Comment on attachment 8834276 [details] [diff] [review]
Interdiff: remove HasBlobImpl, mFileManager.

Review of attachment 8834276 [details] [diff] [review]:
-----------------------------------------------------------------

Perfect!

Comment 12

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/8eb49b4026a7
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
status-firefox54: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Duplicate of this bug: 1336742
OrangeFactor looks good! Please request Aurora/Beta approval on this when you get a chance.
Flags: needinfo?(bugmail)
(Assignee)

Comment 15

a year ago
Created attachment 8834990 [details] [diff] [review]
Stop defering mBlobImpl clear via mCopiedSuccessfully, clear immediately. v2

(This is the r=janv patch that landed.  I expect uplift to graft anyways, but for easy history, I'm updating the attachment.)  The landed https://hg.mozilla.org/mozilla-central/rev/8eb49b4026a7 grafts onto beta without problem unlike the original patch.

Approval Request Comment
[Feature/Bug causing the regression]:
Bug 1319531 that was uplifted to alpha and beta already.

[User impact if declined]:
See comment 2 for a detailed analysis, but in-brief: there is a crashing race in IndexedDB Blob/File writes.  It occurs when a Blob that was not previously stored in IDB is referenced in back-to-back add/put operations that don't wait for completion/success notifications.  (An add/put that references the Blob multiple times will not cause the race.)

[Is this code covered by automated tests?]:
Yes, and they have given us this bug ;).  Creating a test that would more reliably reproduce the race is not really feasible; this test is the most efficient reproduction possible (without massive new test infra).

[Has the fix been verified in Nightly?]:
Yes, and :RyanVM has helpfully confirmed that orange factor saw an improvement.

[Needs manual test from QE? If yes, steps to reproduce]:
No, this can't be manually tested.  The appropriate verification is making sure that the test crashes go away like :RyanVM did for nightly.

[List of other uplifts needed for the feature/fix]:
None.

[Is the change risky?]:
No.

[Why is the change risky/not risky?]:
With wisdom and guidance from :janv, the usage pattern that resulted in a race was completely eliminated.

[String changes made/needed]:
None.
Attachment #8833238 - Attachment is obsolete: true
Flags: needinfo?(bugmail)
Attachment #8834990 - Flags: review+
Attachment #8834990 - Flags: approval-mozilla-beta?
Attachment #8834990 - Flags: approval-mozilla-aurora?
Attachment #8834990 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8834990 [details] [diff] [review]
Stop defering mBlobImpl clear via mCopiedSuccessfully, clear immediately. v2

Fix a intermittent-failure. Aurora53+.
Comment on attachment 8834990 [details] [diff] [review]
Stop defering mBlobImpl clear via mCopiedSuccessfully, clear immediately. v2

fix race condition in indexeddb, beta52+
Attachment #8834990 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment 18

a year ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/0ab74d80ea74
status-firefox53: affected → fixed

Comment 19

a year ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-beta/rev/ef757ca3fbfe
status-firefox52: affected → fixed
6 failures in 836 pushes (0.007 failures/push) were associated with this bug in the last 7 days.  
Repository breakdown:
* autoland: 3
* mozilla-beta: 2
* mozilla-inbound: 1

Platform breakdown:
* linux64: 5
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1335054&startday=2017-02-06&endday=2017-02-12&tree=all

Comment 21

a year ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-esr52/rev/ef757ca3fbfe
status-firefox-esr52: --- → fixed
You need to log in before you can comment on or make changes to this bug.