Closed Bug 635482 Opened 13 years ago Closed 13 years ago

Crash [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast ] (was: [@ mozalloc_abort(char const* const) | NS_DebugBreak_P]) from nsCycleCollectingAutoRefCnt::decr with mozilla::storage::BindingParams::Release

Categories

(Toolkit :: Places, defect)

Platform: x86 Windows 7
Type: defect
Priority: Not set
Severity: critical

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla2.0
Tracking Status:
blocking2.0 --- final+

People

(Reporter: benjamin, Assigned: sdwilsh)

Details

(Whiteboard: [hardblocker])

Attachments

(1 file, 2 obsolete files)

This crash signature may have been grouped/confused with bug 633445. 

https://crash-stats.mozilla.com/report/index/81efd85f-8d9b-4373-a94d-3f4ff2110219

0 	mozalloc.dll 	mozalloc_abort 	memory/mozalloc/mozalloc_abort.cpp:77
1 	xul.dll 	NS_DebugBreak_P 	xpcom/base/nsDebugImpl.cpp:350
2 	xul.dll 	AbortIfOffMainThreadIfCheckFast 	xpcom/base/nsCycleCollector.cpp:1195
3 	xul.dll 	nsCycleCollectingAutoRefCnt::decr 	obj-firefox/dist/include/nsISupportsImpl.h:196
4 	xul.dll 	XPCVariant::Release 	js/src/xpconnect/src/xpcvariant.cpp:58
5 	xul.dll 	ReleaseObjects 	obj-firefox/xpcom/build/nsCOMArray.cpp:148
6 	xul.dll 	mozilla::storage::BindingParams::Release 	storage/src/mozStorageBindingParams.cpp:243
7 	xul.dll 	nsRunnable::Release 	obj-firefox/xpcom/build/nsThreadUtils.cpp:55
8 	xul.dll 	nsRefPtr<mozilla::dom::Element>::~nsRefPtr<mozilla::dom::Element> 	obj-firefox/xpcom/build/nsCOMPtr.cpp:81
9 	xul.dll 	mozilla::storage::BindingParamsArray::Release 	storage/src/mozStorageBindingParamsArray.cpp:78
10 	xul.dll 	mozilla::storage::AsyncExecuteStatements::Release 	storage/src/mozStorageAsyncStatementExecution.cpp:514
11 	xul.dll 	nsRefPtr<mozilla::dom::Element>::~nsRefPtr<mozilla::dom::Element> 	obj-firefox/xpcom/build/nsCOMPtr.cpp:81

I don't know how a dom::Element ended up in a non-mainthread statement, or whether this is caused by a bad extension, but it bears looking into.
Severity: normal → critical
Version: unspecified → Trunk
I suppose this was a link and the async statement was the isVisited check?
Summary: Crash [@ mozalloc_abort(char const* const) | NS_DebugBreak_P] from nsCycleCollectingAutoRefCnt::decr with mozilla::storage::BindingParams::Release → Crash [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast ] (was: [@ mozalloc_abort(char const* const) | NS_DebugBreak_P]) from nsCycleCollectingAutoRefCnt::decr with mozilla::storage::BindingParams::Release
This shows up very high on nightly builds, now that we have mozalloc_abort broken down better in crash-stats. I think this should be marked hardblocker at least until we have crash stats from beta12, in order to see whether it's actually going to end up in the top-5 topcrashers.
blocking2.0: --- → ?
Assigning to Marco, though shawn could just as likely take it. We're marking this a hardblocker provisionally, per benjamin in comment 2.

Can we get an ETA for fix immediately?
Assignee: nobody → mak77
blocking2.0: ? → final+
Whiteboard: [hardblocker]
(In reply to comment #0)
> I don't know how a dom::Element ended up in a non-mainthread statement, or
> whether this is caused by a bad extension, but it bears looking into.

I don't think it did.  I think every nsRefPtr::~nsRefPtr is unified by the linker, so we randomly get the symbol name for one type that's used as a template parameter.  Just read it as nsRefPtr<T>::~nsRefPtr<T>.
COMPtrs are unified very often. RefPtrs are unified much less often. But yeah, don't assume that this is a dom::Element without verifying somehow. It's possible that this is caused by an extension, do we have correlation data?
I see what's happening here:
> 4     xul.dll     XPCVariant::Release    
> js/src/xpconnect/src/xpcvariant.cpp:58
> 5     xul.dll     ReleaseObjects     obj-firefox/xpcom/build/nsCOMArray.cpp:148
> 6     xul.dll     mozilla::storage::BindingParams::Release    
> storage/src/mozStorageBindingParams.cpp:243

BindingParams is getting released on the background thread, which happens to hold an xpcvariant, which is cycle collected.  We just started getting unlucky as to which thread held onto the last reference it looks like.  This has been this way since the initial implementation.
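The invariant being violated can be sketched in isolation. The following is a hypothetical stand-in (not Gecko code) for a cycle-collected object like XPCVariant: the final Release must happen on the owning thread, which is what AbortIfOffMainThreadIfCheckFast enforces in the real nsCycleCollectingAutoRefCnt::decr; here the violation is recorded instead of aborting.

```cpp
#include <atomic>
#include <thread>

// Hypothetical stand-in for a cycle-collected object such as XPCVariant.
// In Gecko, nsCycleCollectingAutoRefCnt::decr aborts (via
// AbortIfOffMainThreadIfCheckFast) when the final Release runs off the
// owning thread; here we just record the violation instead of aborting.
class CycleCollectedThing {
public:
  explicit CycleCollectedThing(std::thread::id aOwner)
      : mOwner(aOwner), mRefCnt(1) {}

  void AddRef() { mRefCnt.fetch_add(1, std::memory_order_relaxed); }

  // Returns true when this call destroyed the object; *aOffThread reports
  // whether that final release happened on the wrong thread.
  bool Release(bool* aOffThread) {
    if (mRefCnt.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      *aOffThread = std::this_thread::get_id() != mOwner;
      delete this;
      return true;
    }
    return false;
  }

private:
  std::thread::id mOwner;
  std::atomic<int> mRefCnt;
};
```

Releasing such an object from a background thread, as happens when BindingParams dies on the async thread while holding a variant, is exactly the case that trips the abort.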

(In reply to comment #5)
> COMPtrs are unified very often. RefPtrs are unified much less often. But yeah,
> don't assume that this is a dom::Element without verifying somehow. It's
> possible that this is caused by an extension, do we have correlation data?
I'm pretty sure the nsRefPtr just got merged here.
Assignee: mak77 → sdwilsh
Status: NEW → ASSIGNED
Whiteboard: [hardblocker] → [hardblocker][eta 2/25/2011]
Some of the nsRunnable notifiers in mozStorageAsyncStatementExecution.cpp that hold an nsRefPtr<AsyncExecuteStatements> could be trying to release it on a thread different from the one it was created on. But that doesn't seem to match the stack.
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
(In reply to comment #7)
> Some of the nsRunnable notifiers in mozStorageAsyncStatementExecution.cpp that
> hold an nsRefPtr<AsyncExecuteStatements> could be trying to release it on a
> thread different from the one it was created on. But that doesn't seem to
> match the stack.

The completion notifier dispatches back to the calling thread; I think the callstacks agree with this if we assume that the crash-stats backtrace generation is not super-perfect.  I tried to re-analyze the dump on my Windows box with windbg, but either did something wrong or my 64-bit box is a major stumbling point.

The situation which would make sense for this would be:
1) Someone uses XPConnect to bind a variant in using one of the variant forms.
2) They somehow dispatch that request from a background thread.

I'm assuming the new, stricter rules about JS and threads mean that actually accomplishing this feat requires coordination between JS code and a C++ component...

Does Places do anything sketchy where JS code creates binding parameters (say, populating a statement cache), but C++ code actually executes it?

nb: Although crash stats seems to claim the crashing user had no extensions, if you look at the raw json dump, they clearly had a boatload, including SQLiteManager@mrinalkant.blogspot.com:0.6.8...
(In reply to comment #8)
> The completion notifier dispatches back to the calling thread; I think the
> callstacks agree with this if we assume that the crash-stats backtrace
> generation is not super-perfect.  I tried to re-analyze the dump on my Windows
> box with windbg, but either did something wrong or my 64-bit box is a major
> stumbling point.
While it is true that the CompletionNotifier (I think that's the class you are talking about, at least) runs on the calling thread, and it holds a strong reference to AsyncExecuteStatements, it's still possible that we get released on the wrong thread.  For example:
On async thread, dispatch CompletionNotifier to main thread.
On main thread, complete work in CompletionNotifier.
On main thread, do final release on CompletionNotifier that was just run.
On async thread, finish our Run method.
On async thread, do release, which ends up being the final one because nobody held onto our handle (as a mozIStoragePendingStatement), and the CompletionNotifier no longer holds a reference to us (AsyncExecuteStatements).

Note that the issue here is where AsyncExecuteStatements is released, since it ends up releasing the StatementData, which holds a reference to a BindingParamsArray, which holds a reference to a BindingParams, which holds the reference to our troublesome xpcvariant (try saying that ten times fast).
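That ordering can be simulated single-threaded, with shared_ptr use counts standing in for the nsRefPtr refcount (a sketch with hypothetical names, not Gecko code):

```cpp
#include <memory>

// Returns the number of remaining references the async thread sees at the
// moment it drops its own reference, i.e. who performs the final release.
long simulateOrdering() {
  auto stmts = std::make_shared<int>(42);    // AsyncExecuteStatements
  std::shared_ptr<int> notifierRef = stmts;  // 1) CompletionNotifier takes
                                             //    a strong ref, dispatched
                                             //    to the main thread
  notifierRef.reset();                       // 2)/3) main thread runs the
                                             //    notifier, then does its
                                             //    final release
  return stmts.use_count();                  // 4)/5) async thread finishes
                                             //    Run(); its drop is about
                                             //    to be the last one
}
```

A count of 1 at that point means the async thread's drop is final, so the destruction (and the release of BindingParams and its variants) happens on the wrong thread.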

> I'm assuming the new, stricter rules about JS and threads means that actually
> accomplishing this feat requires coordination between JS code and a C++
> component...
Correct.

> Does Places do anything sketchy where JS code creates binding parameters (say,
> populating a statement cache), but C++ code actually executes it?
Not as far as I know.

> nb: Although crash stats seems to claim the crashing user had no extensions, if
> you look at the raw json dump, they clearly had a boatload, including
> SQLiteManager@mrinalkant.blogspot.com:0.6.8...
That may be related, but I'm pretty sure we have a legit bug here too.
As an aside, I wonder if this is related to bug 629120.
Ah, yeah, the async thread needs to lose its reference in the process, not create a new one.  Silly me.
Attached patch v1.0 (obsolete) — Splinter Review
Always proxy our release of these objects to the proper thread.
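A sketch of that idea, with a toy event queue standing in for the owning thread's event loop and shared_ptr for nsRefPtr (Gecko's real helper for this pattern is NS_ProxyRelease; the names here are hypothetical):

```cpp
#include <functional>
#include <memory>
#include <queue>

// Toy stand-in for another thread's event loop.
using EventQueue = std::queue<std::function<void()>>;

// Instead of letting our strong reference die on the current (async)
// thread, move it into a runnable that the owning thread will execute;
// the refcount drop then happens over there.
template <typename T>
void ProxyRelease(EventQueue& aOwnerQueue, std::shared_ptr<T>&& aRef) {
  aOwnerQueue.push([ref = std::move(aRef)]() mutable { ref.reset(); });
}
```

The cost is one heap-allocated runnable plus one dispatch per proxied release, paid on every statement regardless of which thread actually holds the last reference.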
Attachment #515227 - Flags: review?(bugmail)
(In reply to comment #11)
> Ah, yeah, the async thread needs to lose its reference in the process, not
> create a new one.  Silly me.
I mean, I missed this little detail in review too ;)

It's subtle, to say the least.
Whiteboard: [hardblocker][eta 2/25/2011] → [hardblocker][needs review asuth]
Comment on attachment 515227 [details] [diff] [review]
v1.0

Looks good.  The nsCOMPtr code should handle providing the assertions for the byzantine failure where someone writes some code that tries to use getSqliteStatement or the sqlite3 * operator after finalization, which I think should cover all the bases.
Attachment #515227 - Flags: review?(bugmail) → review+
Attached patch v1.0 for checkin (obsolete) — Splinter Review
Same as before, but with a commit message.
Keywords: checkin-needed
Whiteboard: [hardblocker][needs review asuth] → [hardblocker][has patch][can land]
http://hg.mozilla.org/mozilla-central/rev/f9e075d85552
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla2.0
Does this patch fix bug 629120?
(In reply to comment #17)
> Does this patch fix bug 629120?
I haven't been able to reproduce that, but it's possible that it was a related issue.
"3.55"% tp4 regression on 10.5 (and a probable, but unmeasured due to the noise, regression on 10.6) - http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/8e55e9f1cb8321fa#
Are there any implementations of an aggregating proxy-release mechanism in-tree?  Alternatively, we could basically back out this patch and change things so that instead of AsyncExecuteStatements being directly run by the thread, a helper gets run which proxies the release of AES back to the main thread after it completes execution so we're reducing the runnable count and lock contention that way instead.  (And to eliminate having any net gain in runnables, smoosh things together so either the CompletionNotifier is responsible for that, or the re-targeted AsyncExecuteStatements just subsumes the duties of the CompletionNotifier.)
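An aggregating variant along those lines might batch every pending reference into a single runnable (a sketch with a toy queue and hypothetical names, not an in-tree API):

```cpp
#include <functional>
#include <memory>
#include <queue>
#include <vector>

// Toy stand-in for another thread's event loop.
using EventQueue = std::queue<std::function<void()>>;

// One runnable releases a whole batch of references on the owning thread,
// so N releases cost one allocation and one dispatch instead of N each.
template <typename T>
void ProxyReleaseAll(EventQueue& aOwnerQueue,
                     std::vector<std::shared_ptr<T>>&& aRefs) {
  aOwnerQueue.push([batch = std::move(aRefs)]() mutable { batch.clear(); });
}
```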
(In reply to comment #19)
> "3.55"% tp4 regression on 10.5 (and a probable, but unmeasured due to the
> noise, regression on 10.6) -
> http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/8e55e9f1cb8321fa#
I'm highly suspicious that this patch would have caused that.  Did anything land around it?
I can't imagine any way to blame anything around it - the previous push was more than 4 hours before, the next push was 3.5 hours later, and I retriggered multiple runs on this push and the push before, which all agreed on the less-noisy 10.5, and agreed on the more-noisy 10.6 except one obvious outlier.

Backed out in http://hg.mozilla.org/mozilla-central/rev/78f7521c8e7f, and I'll trigger a fistful again.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
What did the numbers look like on other platforms?
Where "before" is the push before this landed, "after" is the push when this landed, and "two after" is the push which followed this push,

Linux: before, 382.98; after, 383.8; two after, 383.03
Linux64: before, 334.16; after, 338.51; two after, 335.6
Win7: before, 321.95; after, 328.87; two after, 311.08
WinXP: before, 296.27; after, 296.34; two after, 299.39

And from the tree-management thread, the numbers for 10.5 were:

before: 373.07, 372.97, 371.12, 371.08, 373.84
after:  387.58, 388.3, 389.16, 388.0, 384.56

and for 10.6 were:

before: 361.51, 359.25, 366.61, 443.35
after: 386.05, 376.54, 384.6, 374.38
asuth: if you're using a 32bit version of firefox/thunderbird, be sure to use the 32bit version of windbg. things should just work.
And where "before" is before I backed it out, and "after" is the backout push:

10.5 before: 388.38, 383.87, 389.75, 389.38, 390.23, 387.18
10.5 after: 372.43, 374.74, 373.85, 371.02, 374.94, 371.49

10.6 before: 373.31, 376.31, 379.13, 399.66, 383.7, 383.82
10.6 after: 368.18, 365.0, 365.41, 359.64, 400.86, 366.53
(In reply to comment #21)
> I'm highly suspicious that this patch would have caused that.  Did anything
> land around it?

The patch causes us to proxy release (2 * number of statements) to the main thread, which means 2N nsProxyRelease objects get created (do we have allocator contention on OS X? we don't use jemalloc, right?), scheduled (with potential lock contention), and then have to run on the main thread.  This definitely seems like it could cause a measurable impact.

This cost is paid all the time, even when we would not have previously assert-sploded.  I don't know what the Places DB traffic looks like... hopefully it is using multiply bound statements all the time so the impact is not so bad?
(In reply to comment #27)
> This cost is paid all the time, even when we would not have previously
> assert-sploded.  I don't know what the Places DB traffic looks like...
> hopefully it is using multiply bound statements all the time so the impact is
> not so bad?
Most of the traffic is read statements, so we can't really bind multiple sets of parameters per statement (unless we start delaying link lookups even more, but I'm not sure that's worthwhile).
We know and have control over the types for mStatementOwner and mParamsArray, right? Should those objects always be destroyed on the main thread? If so, why don't we proxy the object destruction (after the refcount hits zero), instead of proxying each release call? In the common case where we're already on the correct thread, the cost will be near-zero.
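A sketch of that suggestion (toy queue, hypothetical names; this is not the patch that eventually landed): Release stays cheap and thread-agnostic, and only the destruction is dispatched, and only when the final drop happens off the owning thread.

```cpp
#include <atomic>
#include <functional>
#include <queue>
#include <thread>

// Toy stand-in for another thread's event loop.
using EventQueue = std::queue<std::function<void()>>;

// Proxy the *destruction*, not each Release: when the count hits zero on
// the owning thread (the common case), deletion is immediate; otherwise a
// single runnable carries the deletion back to the owner.
class ProxyDestroyed {
public:
  ProxyDestroyed(std::thread::id aOwner, EventQueue& aOwnerQueue,
                 bool* aDestroyedFlag)
      : mOwner(aOwner), mOwnerQueue(aOwnerQueue),
        mDestroyedFlag(aDestroyedFlag), mRefCnt(1) {}

  ~ProxyDestroyed() { *mDestroyedFlag = true; }

  void AddRef() { mRefCnt.fetch_add(1, std::memory_order_relaxed); }

  void Release() {
    if (mRefCnt.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      if (std::this_thread::get_id() == mOwner) {
        delete this;                                // common case: no event
      } else {
        mOwnerQueue.push([this] { delete this; });  // rare off-thread case
      }
    }
  }

private:
  std::thread::id mOwner;
  EventQueue& mOwnerQueue;
  bool* mDestroyedFlag;
  std::atomic<int> mRefCnt;
};
```

In the common on-thread case this costs nothing beyond the existing refcount decrement, which is the appeal over proxying every release.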
(In reply to comment #29)
> We know and have control over the types for mStatementOwner and mParamsArray,
> right? Should those objects always be destroyed on the main thread? If so, why
> don't we proxy the object destruction (after the refcount hits zero), instead
> of proxying each release call? In the common case where we're already on the
> correct thread, the cost will be near-zero.
That's a fair bit more work than another approach I'd like to try (I've been thinking about this over the weekend).  I think I can piggy-back on an existing event we already dispatch to the calling thread, but I need to check a few more invariants first.  I'll have a new patch up later today.
Whiteboard: [hardblocker][has patch][can land] → [hardblocker][has patch][backed out][new patch ETA 2-28]
Whiteboard: [hardblocker][has patch][backed out][new patch ETA 2-28] → [hardblocker][backed out][new patch ETA 2-28]
Attached patch v2.0Splinter Review
This is a better fix that results in no additional events being created.  We originally held onto the AsyncExecuteStatements in the CompletionNotifier to prevent this type of thing from happening, but we clearly missed this one case.

Now, we give the StatementDataArray to the CompletionNotifier, and it ensures that we release it on the calling thread (by explicitly calling clear).
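The shape of that fix, sketched with shared_ptr standing in for the refcounted StatementData (hypothetical names, not the actual patch code): the already-dispatched completion runnable takes ownership of the array and clears it in Run(), so the references die on the calling thread with no extra events.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Each entry transitively holds the BindingParamsArray -> BindingParams
// -> variant chain described earlier in this bug.
using StatementData = std::shared_ptr<int>;

// The CompletionNotifier is already dispatched back to the calling
// thread, so by handing it the StatementDataArray and clearing it in
// Run(), the final releases happen on the right thread for free.
class CompletionNotifier {
public:
  explicit CompletionNotifier(std::vector<StatementData>&& aStatements)
      : mStatements(std::move(aStatements)) {}

  // Runs on the calling (main) thread.
  void Run() {
    // ... invoke the completion callback ...
    mStatements.clear();  // drop the references here, not on the async thread
  }

private:
  std::vector<StatementData> mStatements;
};
```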
Attachment #515227 - Attachment is obsolete: true
Attachment #515234 - Attachment is obsolete: true
Attachment #515659 - Flags: review?(bugmail)
Whiteboard: [hardblocker][backed out][new patch ETA 2-28] → [hardblocker][has patch][needs review asuth]\
Whiteboard: [hardblocker][has patch][needs review asuth]\ → [hardblocker][has patch][needs review asuth]
Comment on attachment 515659 [details] [diff] [review]
v2.0

The comments are definitely appreciated!
Attachment #515659 - Flags: review?(bugmail) → review+
Shawn, is this [can land] now?
(In reply to comment #33)
> Shawn, is this [can land] now?
Yes, if you stop midairing my landing comment :P

http://hg.mozilla.org/mozilla-central/rev/0e4b5530151d
Status: REOPENED → RESOLVED
Closed: 13 years ago
Flags: in-testsuite-
Resolution: --- → FIXED
Whiteboard: [hardblocker][has patch][needs review asuth] → [hardblocker]