Closed Bug 1413501 Opened 8 years ago Closed 8 years ago

Crash in nsCOMPtr<T>::nsCOMPtr<T> | mozilla::storage::Service::unregisterConnection

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla59

Tracking Flags:

Tracking

Status

firefox-esr52

---

unaffected

firefox56

---

unaffected

firefox57

---

unaffected

firefox58

---

wontfix

firefox59

---

fixed

People

(Reporter: calixte, Assigned: asuth)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [clouseau])

Crash Data

Attachments

(2 files)

Part 1: Fix 8 years ago Andrew Sutherland [:asuth] (he/him) 4.00 KB, patch	mak : review+	Details \| Diff \| Splinter Review
Part 2: Test 8 years ago Andrew Sutherland [:asuth] (he/him) 2.64 KB, patch	mak : review+	Details \| Diff \| Splinter Review

Calixte Denizet (:calixte)

Reporter

Description

•

8 years ago

This bug was filed from the Socorro interface and is report bp-3775d422-f68f-480e-9a55-ec3830171101. ============================================================= There are 46 crashes in nightly 58 starting with buildid 20171020100426. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1388998. [1] https://hg.mozilla.org/mozilla-central/rev/2f78917dbcfd

Flags: needinfo?(bugmail)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 1

•

8 years ago

Okay, so the problem is that SpinningSynchronousClose creates a new AsyncCloseConnection() which has a strong RefPtr<Connection> which puts our refcount up above 1 again which means that when AsyncCloseConnection::~AsyncCloseConnection does NS_ReleaseOnMainThreadSystemGroup(mConnection.forget()) we can end up in Service::unregisterConnection() for the same connection a second time, which causes us to end up with the case where !forgettingRef. Since Connection claims to implement a threadsafe ISUPPORTS but it's just the mRefCnt that's safe and not our Release method, and adding additional mutexes into the custom Release method is asking for more headaches, I think the simplest thing is just to change Service::unregisterConnection to be okay with the case where forgettingRef is null. We just want to ensure we don't leak Connections and we don't leak their underlying resources. As long as we invoke unregisterConnection() at least once, we will drop our book-keeping reference. If we invoke it twice, that's fine as long as we don't crash. The "resurrecting" AddRef is unfortunate because it prevents us from having a nice invariant about the refcount never going back above one, but as long as we can't end up with mozStorage-using code ending up with a Connection we don't have in our Connection list, it's fine. I think that holds true. It's possible there's some other higher level cleanup we can do here, but we're sort of converging to an ugly evolved system that no longer crashes. The triggering case here seems to be invoking executeAsync and getting GC'ed on the main thread so that when the AsyncStatementExecution completes, it's the last (non-connections-list) strong refcount. This should be easy enough to trigger from JS, so I'll provide a test with the fix.

Assignee: nobody → bugmail

Status: NEW → ASSIGNED

Flags: needinfo?(bugmail)

Marco Bonardo [:mak]

Updated

•

8 years ago

Priority: -- → P1

Marcia Knous [:marcia]

Updated

•

8 years ago

Crash Signature: [@ nsCOMPtr<T>::nsCOMPtr<T> | mozilla::storage::Service::unregisterConnection] → [@ nsCOMPtr<T>::nsCOMPtr<T> | mozilla::storage::Service::unregisterConnection] [@ RefPtr<T>::RefPtr<T> | mozilla::storage::Service::unregisterConnection]

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 2

•

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6a19e2370017e3224325a3dfca0f7c443d43de9a

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 3

•

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf3bc49b61fa90f0672dd49807ebe0d5fb9847c0

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 4

•

8 years ago

Attached patch Part 1: Fix — Details — Splinter Review

My prior analysis was a little too quick. Since the Close() is happening in the destructor, it's not exactly a great idea for us to let a new strong reference be added at that point since that implies redundant destruction or UAF. So I've moved our failsafe close from the destructor into our too-clever Release() method. I've added comments there, but perhaps restating more simply: - We invoke Close() when our refcount hits 1. This may cause a transient AddRef/Release pair due to AsyncCloseConnection, where the Release() may or may not occur by the time Close() returns (based on varying notions of event targets). - So I have us directly manipulate mRefCnt to go to 2 before we call Close() so we can avoid a 2->1 transition. - Because of that potential delayed release, we have 2 paths to decide when we do the strong Release() via unregisterConnection and dropping our synthetic mRefCnt manipulation so that we avoid a "strong" 2->1 transition. This is more than a little ugly. I'm open to alternative fixes or suggestions for fixes. I'm at TPAC right now and returning tomorrow, which is why this fix got delayed in the first place, but should be available to try alternatives on Friday/the weekend. Uploading test next.

Attachment #8926530 - Flags: review?(mak77)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 5

•

8 years ago

Attached patch Part 2: Test — Details — Splinter Review

I verified that this test, if applied without the fix, results in a crash of the xpcshell test and therefore a failure.

Attachment #8926531 - Flags: review?(mak77)

Marco Bonardo [:mak]

Comment 6

•

8 years ago

Comment on attachment 8926530 [details] [diff] [review] Part 1: Fix Review of attachment 8926530 [details] [diff] [review]: ----------------------------------------------------------------- ugh! Hopefull yone day we can do better than manual bookeeping :( ::: storage/mozStorageConnection.cpp @@ +587,5 @@ > + // Okay, now our refcount is 2, we trigger Close(). > + Unused << Close(); > + // Now our refcount should either be at 2 (because nothing happened, or the > + // addref and release pair happened due to SpinningSynchronousClose) or > + // 3 (because SpinningSynchronousClose hapened but didn't release yet). typo "hapened"

Attachment #8926530 - Flags: review?(mak77) → review+

Marco Bonardo [:mak]

Comment 7

•

8 years ago

Comment on attachment 8926531 [details] [diff] [review] Part 2: Test Review of attachment 8926531 [details] [diff] [review]: ----------------------------------------------------------------- ::: storage/test/unit/test_connection_failsafe_close.js @@ +29,5 @@ > + > + // now we need to wait for that callback to have completed. > + await callbackInvoked; > + > + ok(true, "if we shutdown cleanly and do not crash, then we succeeded"); nit: please use the complete form Assert.ok, per common convention.

Attachment #8926531 - Flags: review?(mak77) → review+

Pulsebot

Comment 8

•

8 years ago

Pushed by bugmail@asutherland.org: https://hg.mozilla.org/integration/mozilla-inbound/rev/47f845a860d8 Fix for SpinningSynchronousClose unregisterConnection edge-case. r=mak https://hg.mozilla.org/integration/mozilla-inbound/rev/4b34fc8bfe40 Test for SpinningSynchronousClose unregisterConnection edge-case. r=mak

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 9

•

8 years ago

I made both the requested changes. I also added a few more comment lines about how the fix will not explode even if the memory reporters or shrinkMemory happen during the spun nested event loop.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 10

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/47f845a860d8 https://hg.mozilla.org/mozilla-central/rev/4b34fc8bfe40

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

status-firefox59: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla59

Ryan VanderMeulen [:RyanVM]

Comment 11

•

8 years ago

Please request Beta approval on this when you get a chance.

Flags: needinfo?(bugmail)

Flags: in-testsuite+

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 12

•

8 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #11) > Please request Beta approval on this when you get a chance. Status update: this bug has moved some crashes around, or at least given us intermittent assertion failures, giving us bug 1417326 and bug 1422327 that I will finish investigating and address in the short term. Once those are fixed and baked a little, we can consider uplift. Those all currently appear a bit more edge-casey because they are a function of cross-thread races.

Depends on: 1417326, 1422327

Gerry Chang [:gchang]

Comment 13

•

8 years ago

Too late for Beta58. Mark 58 won't fix.

status-firefox58: affected → wontfix

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

Flags: needinfo?(bugmail)

BMO Automation

Updated

•

1 year ago

Product: Toolkit → Core

You need to log in before you can comment on or make changes to this bug.