Closed
Bug 635482
Opened 14 years ago
Closed 14 years ago
Crash [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast ] (was: [@ mozalloc_abort(char const* const) | NS_DebugBreak_P]) from nsCycleCollectingAutoRefCnt::decr with mozilla::storage::BindingParams::Release
Categories
(Toolkit :: Places, defect)
Tracking
()
RESOLVED
FIXED
mozilla2.0
Tracking | Status | |
---|---|---|
blocking2.0 | --- | final+ |
People
(Reporter: benjamin, Assigned: sdwilsh)
Details
(Whiteboard: [hardblocker])
Attachments
(1 file, 2 obsolete files)
10.11 KB,
patch
|
asuth
:
review+
|
Details | Diff | Splinter Review |
This crash signature may have been grouped/confused with bug 633445.
https://crash-stats.mozilla.com/report/index/81efd85f-8d9b-4373-a94d-3f4ff2110219
0 mozalloc.dll mozalloc_abort memory/mozalloc/mozalloc_abort.cpp:77
1 xul.dll NS_DebugBreak_P xpcom/base/nsDebugImpl.cpp:350
2 xul.dll AbortIfOffMainThreadIfCheckFast xpcom/base/nsCycleCollector.cpp:1195
3 xul.dll nsCycleCollectingAutoRefCnt::decr obj-firefox/dist/include/nsISupportsImpl.h:196
4 xul.dll XPCVariant::Release js/src/xpconnect/src/xpcvariant.cpp:58
5 xul.dll ReleaseObjects obj-firefox/xpcom/build/nsCOMArray.cpp:148
6 xul.dll mozilla::storage::BindingParams::Release storage/src/mozStorageBindingParams.cpp:243
7 xul.dll nsRunnable::Release obj-firefox/xpcom/build/nsThreadUtils.cpp:55
8 xul.dll nsRefPtr<mozilla::dom::Element>::~nsRefPtr<mozilla::dom::Element> obj-firefox/xpcom/build/nsCOMPtr.cpp:81
9 xul.dll mozilla::storage::BindingParamsArray::Release storage/src/mozStorageBindingParamsArray.cpp:78
10 xul.dll mozilla::storage::AsyncExecuteStatements::Release storage/src/mozStorageAsyncStatementExecution.cpp:514
11 xul.dll nsRefPtr<mozilla::dom::Element>::~nsRefPtr<mozilla::dom::Element> obj-firefox/xpcom/build/nsCOMPtr.cpp:81
I don't know how a dom::Element ended up in a non-mainthread statement, or whether this is caused by a bad extension, but it bears looking into.
Updated•14 years ago
|
Severity: normal → critical
Version: unspecified → Trunk
Comment 1•14 years ago
|
||
I suppose this was a link and the async statement was the isVisited check?
Updated•14 years ago
|
Summary: Crash [@ mozalloc_abort(char const* const) | NS_DebugBreak_P] from nsCycleCollectingAutoRefCnt::decr with mozilla::storage::BindingParams::Release → Crash [@ mozalloc_abort(char const* const) | NS_DebugBreak_P | AbortIfOffMainThreadIfCheckFast ] (was: [@ mozalloc_abort(char const* const) | NS_DebugBreak_P]) from nsCycleCollectingAutoRefCnt::decr with mozilla::storage::BindingParams::Release
Reporter | ||
Comment 2•14 years ago
|
||
This shows up very high on nightly builds, now that we have mozalloc_abort broken down better in crash-stats. I think this should be marked hardblocker at least until we have crash stats from beta12, in order to see whether it's actually going to end up in the top-5 topcrashers.
blocking2.0: --- → ?
Comment 3•14 years ago
|
||
Assigning to Marco, though shawn could just as likely take it. We're marking this a hardblocker provisionally, per benjamin in comment 2.
Can we get an ETA for fix immediately?
Assignee: nobody → mak77
blocking2.0: ? → final+
Whiteboard: [hardblocker]
(In reply to comment #0)
> I don't know how a dom::Element ended up in a non-mainthread statement, or
> whether this is caused by a bad extension, but it bears looking into.
I don't think it did. I think every nsRefPtr::~nsRefPtr is unified by the linker, so we randomly get the symbol name for one type that's used as a template parameter. Just look at it s nsRefPtr<T>::~nsRefPtr<T>.
Reporter | ||
Comment 5•14 years ago
|
||
COMPtrs are unified very often. RefPtrs are unified much less often. But yeah, don't assume that this is a dom::Element without verifying somehow. It's possible that this is caused by an extension, do we have correlation data?
Assignee | ||
Comment 6•14 years ago
|
||
I see what's happening here:
> 4 xul.dll XPCVariant::Release
> js/src/xpconnect/src/xpcvariant.cpp:58
> 5 xul.dll ReleaseObjects obj-firefox/xpcom/build/nsCOMArray.cpp:148
> 6 xul.dll mozilla::storage::BindingParams::Release
> storage/src/mozStorageBindingParams.cpp:243
BindingParams is getting released on the background thread, which happens to hold an xpcvariant, which is cycle collected. We just started getting unlucky as to which thread held onto the last reference it looks like. This has been this way since the initial implementation.
(In reply to comment #5)
> COMPtrs are unified very often. RefPtrs are unified much less often. But yeah,
> don't assume that this is a dom::Element without verifying somehow. It's
> possible that this is caused by an extension, do we have correlation data?
I'm pretty sure the nsRefPtr just got merged here.
Assignee: mak77 → sdwilsh
Status: NEW → ASSIGNED
Whiteboard: [hardblocker] → [hardblocker][eta 2/25/2011]
Comment 7•14 years ago
|
||
Some of the nsRunnable notifiers in mozStorageAsyncStatementExecution.cpp that hold a nsRefPtr<AsyncExecuteStatements> seems that could be trying to release it on a thread different from the creation one. But doesn't seem to reflect the stack.
Status: ASSIGNED → NEW
Updated•14 years ago
|
Status: NEW → ASSIGNED
Comment 8•14 years ago
|
||
(In reply to comment #7)
> Some of the nsRunnable notifiers in mozStorageAsyncStatementExecution.cpp that
> hold a nsRefPtr<AsyncExecuteStatements> seems that could be trying to release
> it on a thread different from the creation one. But doesn't seem to reflect the
> stack.
The completion notifier dispatches back to the calling thread; I think the callstacks agree with this if we assume that the crash-stats backtrace generation is not super-perfect. I tried to re-analyze the dump on my Windows box with windbg, but either did something wrong or my 64-bit box is a major stumbling point.
The situation which would make sense for this would be:
1) Someone uses XPConnect to bind a variant in using one of the variant forms.
2) They somehow dispatch that request from a background thread.
I'm assuming the new, stricter rules about JS and threads means that actually accomplishing this feat requires coordination between JS code and a C++ component...
Does Places do anything sketchy where JS code creates binding parameters (say, populating a statement cache), but C++ code actually executes it?
nb: Although crash stats seems to claim the crashing user had no extensions, if you look at the raw json dump, they clearly had a boatload, including SQLiteManager@mrinalkant.blogspot.com:0.6.8...
Assignee | ||
Comment 9•14 years ago
|
||
(In reply to comment #8)
> The completion notifier dispatches back to the calling thread; I think the
> callstacks agree with this if we assume that the crash-stats backtrace
> generation is not super-perfect. I tried to re-analyze the dump on my Windows
> box with windbg, but either did something wrong or my 64-bit box is a major
> stumbling point.
While that is true that the CompletionNotifier (I think that's the class you are talking about at least) runs on the calling thread, and it holds a strong reference to AsyncExecuteStatements, it's still possible that we get released on the wrong thread. For example:
On async thread, dispatch CompletionNotifier to main thread.
On main thread, complete work in CompletionNotifier.
On main thread, do final release on CompletionNotifier that was just run.
On async thread, finish our Run method.
On async thread, do release, which ends up being the final one because nobody held onto our handle (as a mozIStoragePendingStatement), and the CompletionNotifier no longer hold a reference to us (AsyncExecuteStatements).
Note that the issue here is where AsyncExecuteStatements is released, since it ends up releasing the StatementData, which holds a references to a BindingParamsArray, which holds a reference to a BindingParams, which holds the reference to our troublesome xpcvariant (try saying that ten times fast).
> I'm assuming the new, stricter rules about JS and threads means that actually
> accomplishing this feat requires coordination between JS code and a C++
> component...
Correct.
> Does Places do anything sketchy where JS code creates binding parameters (say,
> populating a statement cache), but C++ code actually executes it?
Not as far as I know.
> nb: Although crash stats seems to claim the crashing user had no extensions, if
> you look at the raw json dump, they clearly had a boatload, including
> SQLiteManager@mrinalkant.blogspot.com:0.6.8...
That may be related, but I'm pretty sure we have a legit bug here too.
Assignee | ||
Comment 10•14 years ago
|
||
As an aside, I wonder if this is related to bug 629120.
Comment 11•14 years ago
|
||
Ah, yeah, the async thread needs to lose its reference in the process, not create a new one. Silly me.
Assignee | ||
Comment 12•14 years ago
|
||
Always proxy our release of these objects to the proper thread.
Attachment #515227 -
Flags: review?(bugmail)
Assignee | ||
Comment 13•14 years ago
|
||
(In reply to comment #11)
> Ah, yeah, the async thread needs to lose its reference in the process, not
> create a new one. Silly me.
I mean, I missed this little detail in review too ;)
It's subtle, to say the least.
Assignee | ||
Updated•14 years ago
|
Whiteboard: [hardblocker][eta 2/25/2011] → [hardblocker][needs review asuth]
Comment 14•14 years ago
|
||
Comment on attachment 515227 [details] [diff] [review]
v1.0
Looks good. The nsCOMPtr code should handle providing the assertions for the byzantine failure where someone writes some code that tries to use getSqliteStatement or the sqlite3 * operator after finalization, which I think should cover all the bases.
Attachment #515227 -
Flags: review?(bugmail) → review+
Assignee | ||
Comment 15•14 years ago
|
||
Same as before, but with a commit message.
Assignee | ||
Updated•14 years ago
|
Keywords: checkin-needed
Whiteboard: [hardblocker][needs review asuth] → [hardblocker][has patch][can land]
Comment 16•14 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla2.0
Comment 17•14 years ago
|
||
Does this patch fix bug 629120?
Assignee | ||
Comment 18•14 years ago
|
||
(In reply to comment #17)
> Does this patch fix bug 629120?
I haven't been able to reproduce that, but it's possible that it was a related issue.
Comment 19•14 years ago
|
||
"3.55"% tp4 regression on 10.5 (and a probable, but unmeasured due to the noise, regression on 10.6) - http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/8e55e9f1cb8321fa#
Comment 20•14 years ago
|
||
Are there any implementations of an aggregating proxy-release mechanism in-tree? Alternatively, we could basically back out this patch and change things so that instead of AsyncExecuteStatements being directly run by the thread, a helper gets run which proxies the release of AES back to the main thread after it completes execution so we're reducing the runnable count and lock contention that way instead. (And to eliminate having any net gain in runnables, smoosh things together so either the CompletionNotifier is responsible for that, or the re-targeted AsyncExecuteStatements just subsumes the duties of the CompletionNotifier.)
Assignee | ||
Comment 21•14 years ago
|
||
(In reply to comment #19)
> "3.55"% tp4 regression on 10.5 (and a probable, but unmeasured due to the
> noise, regression on 10.6) -
> http://groups.google.com/group/mozilla.dev.tree-management/browse_thread/thread/8e55e9f1cb8321fa#
I'm highly suspicious that this patch would have caused that. Did anything land around it?
Comment 22•14 years ago
|
||
I can't imagine any way to blame anything around it - the previous push was more than 4 hours before, the next push was 3.5 hours later, and I retriggered multiple runs on this push and the push before, which all agreed on the less-noisy 10.5, and agreed on the more-noisy 10.6 except one obvious outlier.
Backed out in http://hg.mozilla.org/mozilla-central/rev/78f7521c8e7f, and I'll trigger a fistful again.
Updated•14 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
What did the numbers look like on other platforms?
Comment 24•14 years ago
|
||
Where "before" is the push before this landed, "after" is the push when this landed, and "two after" is the push which followed this push,
Linux: before, 382.98; after, 383.8; two after, 383.03
Linux64: before, 334.16; after, 338.51; two after, 335.6
Win7: before, 321.95; after, 328.87; two after, 311.08
WinXP: before, 296.27; after, 296.34; two after, 299.39
And from the tree-management thread, the numbers for 10.5 were:
before: 373.07, 372.97, 371.12, 371.08, 373.84
after: 387.58, 388.3, 389.16, 388.0, 384.56
and for 10.6 were:
before: 361.51, 359.25, 366.61, 443.35
after: 386.05, 376.54, 384.6, 374.38
Comment 25•14 years ago
|
||
asuth: if you're using a 32bit version of firefox/thunderbird, be sure to use the 32bit version of windbg. things should just work.
Comment 26•14 years ago
|
||
And where "before" is before I backed it out, and "after" is the backout push:
10.5 before: 388.38, 383.87, 389.75, 389.38, 390.23, 387.18
10.5 after: 372.43, 374.74, 373.85, 371.02, 374.94, 371.49
10.6 before: 373.31, 376.31, 379.13, 399.66, 383.7, 383.82
10.6 after: 368.18, 365.0, 365.41, 359.64, 400.86, 366.53
Comment 27•14 years ago
|
||
(In reply to comment #21)
> I'm highly suspicious that this patch would have caused that. Did anything
> land around it?
The patch causes us to proxy release (2 * number of statements) to the main thread which means 2N nsProxyRelease objects get created (do we have allocator contention on OS X? we don't use jemalloc, right?), scheduled (with potential lock contention, and then have to run on the main thread. This definitely seems like it could cause a measurable impact.
This cost is paid all the time, even when we would not have previously assert-sploded. I don't know what the Places DB traffic looks like... hopefully it is using multiply bound statements all the time so the impact is not so bad?
Assignee | ||
Comment 28•14 years ago
|
||
(In reply to comment #27)
> This cost is paid all the time, even when we would not have previously
> assert-sploded. I don't know what the Places DB traffic looks like...
> hopefully it is using multiply bound statements all the time so the impact is
> not so bad?
Most of the traffic is read statements, so we can't really multi-bound (unless we start delaying link lookup times even more, but I'm not sure that's valuable).
Reporter | ||
Comment 29•14 years ago
|
||
We know and have control over the types for mStatementOwner and mParamsArray, right? Should those objects always be destroyed on the main thread? If so, why don't we proxy the object destruction (after the refcount hits zero), instead of proxying each release call? In the common case where we're already on the correct thread, the cost will be near-zero.
Assignee | ||
Comment 30•14 years ago
|
||
(In reply to comment #29)
> We know and have control over the types for mStatementOwner and mParamsArray,
> right? Should those objects always be destroyed on the main thread? If so, why
> don't we proxy the object destruction (after the refcount hits zero), instead
> of proxying each release call? In the common case where we're already on the
> correct thread, the cost will be near-zero.
That's a fair bit more work than another approach I'd like to try (been thinking about this this weekend). I think I can piggy-back of an existing event we already dispatch to the calling thread, but I need to check a few more invariants first. I'll have a new patch up later today.
Updated•14 years ago
|
Whiteboard: [hardblocker][has patch][can land] → [hardblocker][has patch][backed out][new patch ETA 2-28]
Updated•14 years ago
|
Whiteboard: [hardblocker][has patch][backed out][new patch ETA 2-28] → [hardblocker][backed out][new patch ETA 2-28]
Assignee | ||
Comment 31•14 years ago
|
||
This is a better fix that results in no additional events being created. We originally held onto the AsyncExecuteStatements in the CompletionNotifier to prevent this type of thing from happening, but we clearly missed this one case.
Now, we give the StatementDataArray to the CompletionNotifier, and it ensures that we release it on the calling thread (by explicitly calling clear).
Attachment #515227 -
Attachment is obsolete: true
Attachment #515234 -
Attachment is obsolete: true
Attachment #515659 -
Flags: review?(bugmail)
Assignee | ||
Updated•14 years ago
|
Whiteboard: [hardblocker][backed out][new patch ETA 2-28] → [hardblocker][has patch][needs review asuth]\
Assignee | ||
Updated•14 years ago
|
Whiteboard: [hardblocker][has patch][needs review asuth]\ → [hardblocker][has patch][needs review asuth]
Comment 32•14 years ago
|
||
Comment on attachment 515659 [details] [diff] [review]
v2.0
The comments are definitely appreciated!
Attachment #515659 -
Flags: review?(bugmail) → review+
Comment 33•14 years ago
|
||
Shawn, is this [can land] now?
Assignee | ||
Comment 34•14 years ago
|
||
(In reply to comment #33)
> Shawn, is this [can land] now?
Yes, if you stop midairing my landing comment :P
http://hg.mozilla.org/mozilla-central/rev/0e4b5530151d
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Flags: in-testsuite-
Resolution: --- → FIXED
Whiteboard: [hardblocker][has patch][needs review asuth] → [hardblocker]
You need to log in
before you can comment on or make changes to this bug.
Description
•