Closed Bug 1413501 Opened 2 years ago Closed 2 years ago

Crash in nsCOMPtr<T>::nsCOMPtr<T> | mozilla::storage::Service::unregisterConnection

Categories

(Toolkit :: Storage, defect, P1, critical)

58 Branch
Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
firefox-esr52 --- unaffected
firefox56 --- unaffected
firefox57 --- unaffected
firefox58 --- wontfix
firefox59 --- fixed

People

(Reporter: calixte, Assigned: asuth)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [clouseau])

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-3775d422-f68f-480e-9a55-ec3830171101.
=============================================================

There are 46 crashes in nightly 58 starting with buildid 20171020100426. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1388998.

[1] https://hg.mozilla.org/mozilla-central/rev/2f78917dbcfd
Flags: needinfo?(bugmail)
Okay, so the problem is that SpinningSynchronousClose creates a new AsyncCloseConnection() which has a strong RefPtr<Connection> which puts our refcount up above 1 again which means that when AsyncCloseConnection::~AsyncCloseConnection does NS_ReleaseOnMainThreadSystemGroup(mConnection.forget()) we can end up in Service::unregisterConnection() for the same connection a second time, which causes us to end up with the case where !forgettingRef.

Since Connection claims to implement a threadsafe ISUPPORTS but it's just the mRefCnt that's safe and not our Release method, and adding additional mutexes into the custom Release method is asking for more headaches, I think the simplest thing is just to change Service::unregisterConnection to be okay with the case where forgettingRef is null.  We just want to ensure we don't leak Connections and we don't leak their underlying resources.

As long as we invoke unregisterConnection() at least once, we will drop our book-keeping reference.  If we invoke it twice, that's fine as long as we don't crash.  The "resurrecting" AddRef is unfortunate because it prevents us from having a nice invariant about the refcount never going back above one, but as long as we can't end up with mozStorage-using code ending up with a Connection we don't have in our Connection list, it's fine.  I think that holds true.

It's possible there's some other higher level cleanup we can do here, but we're sort of converging to an ugly evolved system that no longer crashes.


The triggering case here seems to be invoking executeAsync and getting GC'ed on the main thread so that when the AsyncStatementExecution completes, it's the last (non-connections-list) strong refcount.  This should be easy enough to trigger from JS, so I'll provide a test with the fix.
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Flags: needinfo?(bugmail)
Priority: -- → P1
Crash Signature: [@ nsCOMPtr<T>::nsCOMPtr<T> | mozilla::storage::Service::unregisterConnection] → [@ nsCOMPtr<T>::nsCOMPtr<T> | mozilla::storage::Service::unregisterConnection] [@ RefPtr<T>::RefPtr<T> | mozilla::storage::Service::unregisterConnection]
Attached patch Part 1: FixSplinter Review
My prior analysis was a little too quick.  Since the Close() is happening in the destructor, it's not exactly a great idea for us to let a new strong reference be added at that point since that implies redundant destruction or UAF.

So I've moved our failsafe close from the destructor into our too-clever Release() method.  I've added comments there, but perhaps restating more simply:
- We invoke Close() when our refcount hits 1.  This may cause a transient AddRef/Release pair due to AsyncCloseConnection, where the Release() may or may not occur by the time Close() returns (based on varying notions of event targets).
- So I have us directly manipulate mRefCnt to go to 2 before we call Close() so we can avoid a 2->1 transition.
- Because of that potential delayed release, we have 2 paths to decide when we do the strong Release() via unregisterConnection and dropping our synthetic mRefCnt manipulation so that we avoid a "strong" 2->1 transition.

This is more than a little ugly.  I'm open to alternative fixes or suggestions for fixes.  I'm at TPAC right now and returning tomorrow, which is why this fix got delayed in the first place, but should be available to try alternatives on Friday/the weekend.

Uploading test next.
Attachment #8926530 - Flags: review?(mak77)
Attached patch Part 2: TestSplinter Review
I verified that this test, if applied without the fix, results in a crash of the xpcshell test and therefore a failure.
Attachment #8926531 - Flags: review?(mak77)
Comment on attachment 8926530 [details] [diff] [review]
Part 1: Fix

Review of attachment 8926530 [details] [diff] [review]:
-----------------------------------------------------------------

ugh! Hopefull yone day we can do better than manual bookeeping :(

::: storage/mozStorageConnection.cpp
@@ +587,5 @@
> +    // Okay, now our refcount is 2, we trigger Close().
> +    Unused << Close();
> +    // Now our refcount should either be at 2 (because nothing happened, or the
> +    // addref and release pair happened due to SpinningSynchronousClose)  or
> +    // 3 (because SpinningSynchronousClose hapened but didn't release yet).

typo "hapened"
Attachment #8926530 - Flags: review?(mak77) → review+
Comment on attachment 8926531 [details] [diff] [review]
Part 2: Test

Review of attachment 8926531 [details] [diff] [review]:
-----------------------------------------------------------------

::: storage/test/unit/test_connection_failsafe_close.js
@@ +29,5 @@
> +
> +  // now we need to wait for that callback to have completed.
> +  await callbackInvoked;
> +
> +  ok(true, "if we shutdown cleanly and do not crash, then we succeeded");

nit: please use the complete form Assert.ok, per common convention.
Attachment #8926531 - Flags: review?(mak77) → review+
Pushed by bugmail@asutherland.org:
https://hg.mozilla.org/integration/mozilla-inbound/rev/47f845a860d8
Fix for SpinningSynchronousClose unregisterConnection edge-case. r=mak
https://hg.mozilla.org/integration/mozilla-inbound/rev/4b34fc8bfe40
Test for SpinningSynchronousClose unregisterConnection edge-case. r=mak
I made both the requested changes.  I also added a few more comment lines about how the fix will not explode even if the memory reporters or shrinkMemory happen during the spun nested event loop.
Please request Beta approval on this when you get a chance.
Flags: needinfo?(bugmail)
Flags: in-testsuite+
(In reply to Ryan VanderMeulen [:RyanVM] from comment #11)
> Please request Beta approval on this when you get a chance.

Status update: this bug has moved some crashes around, or at least given us intermittent assertion failures, giving us bug 1417326 and bug 1422327 that I will finish investigating and address in the short term.  Once those are fixed and baked a little, we can consider uplift.  Those all currently appear a bit more edge-casey because they are a function of cross-thread races.
Depends on: 1417326, 1422327
Too late for Beta58. Mark 58 won't fix.
Flags: needinfo?(bugmail)
You need to log in before you can comment on or make changes to this bug.