Crash in [@ EdgePool::Iterator::operator*]
Categories
(Core :: Cycle Collector, defect)
Tracking
()
People
(Reporter: gsvelto, Assigned: mccr8)
References
Details
(Keywords: crash, stalled, topcrash)
Crash Data
+++ This bug was initially created as a clone of Bug #500105 +++
This is a separate bug from bug 500105 to investigate the @ EdgePool::Iterator::operator*
crash signature. Here's an example crash report: https://crash-stats.mozilla.org/report/index/db01c693-7bf2-402b-a03f-d0cbd0231212
Reason: EXCEPTION_ACCESS_VIOLATION_READ
Top 10 frames of crashing thread:
0 xul.dll EdgePool::Iterator::operator* const xpcom/base/nsCycleCollector.cpp:472
0 xul.dll GraphWalker<ScanBlackVisitor>::DoWalk xpcom/base/nsCycleCollector.cpp:1336
1 xul.dll GraphWalker<ScanBlackVisitor>::Walk xpcom/base/nsCycleCollector.cpp:1307
2 xul.dll FloodBlackNode xpcom/base/nsCycleCollector.cpp:2760
2 xul.dll nsCycleCollector::ScanBlackNodes xpcom/base/nsCycleCollector.cpp:2993
2 xul.dll nsCycleCollector::ScanRoots xpcom/base/nsCycleCollector.cpp:3021
2 xul.dll nsCycleCollector::Collect xpcom/base/nsCycleCollector.cpp:3506
3 xul.dll nsCycleCollector_collectSlice xpcom/base/nsCycleCollector.cpp:4008
4 xul.dll nsJSContext::RunCycleCollectorSlice dom/base/nsJSEnvironment.cpp:1470
4 xul.dll mozilla::CCGCScheduler::CCRunnerFired dom/base/nsJSEnvironment.cpp:1626
Not all crashes under this signature are the same, there's a few that are clearly the result of hardware with flaky memory; they're a minority though (less than 10%). To only see the crashes that match the one described above use this query.
I won't copy all the comments so check the discussion starting in bug 500105 comment 106. At its core it appears like we're finding two NULL pointers one after the other in an EdgePool
array. This shouldn't happen, as a NULL pointer in that array is used a sentinel to instruct the code to use the next block from the following pointer.
Comment 1•5 months ago
|
||
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 10 content process crashes on beta
For more information, please visit BugBot documentation.
One other thing that's worth looking at in addition to (and very similar to) the analysis in bug 500105 comment 109: examine the alignment of the addresses of the edges (not the nodes that they point to). Since the edges are allocated in large blocks (see EdgePool and the classes nested within it), you should expect the null sentinel that precedes the pointer to the next block to have a particular alignment (depending on the allocator's behavior for that allocation size, which I suspect you know, but I don't), since it should be at the end of the block.
It's possible it may be worth doing the same analysis for the nodes if you can find the address of the node whose edges you're examining at this point. (See NodePool, which is similar except that it uses an enumerator pattern rather than iterators because it needs to be enumerated while being constructed.)
-> S3 because bug 500105 is set so.
Reporter | ||
Comment 4•5 months ago
|
||
Thanks a lot David, I'll try. Given this is happening with a certain frequency I might add some nightly-only checks to see if we can catch this condition when it happens rather than when we crash.
Assignee | ||
Comment 5•5 months ago
|
||
Thanks for looking into this Gabriele, and thanks for the comments David. This is one of those areas of that code that I think has mostly been untouched for a long time. I understood what was going on at some point, but I'll have to look over it again.
Assignee | ||
Comment 6•5 months ago
|
||
It does look like I have some CC_GRAPH_ASSERT in there that check some conditions. Maybe those could be defined to be diagnostic asserts in all builds for investigating this.
Comment 8•5 months ago
|
||
If I may, i am not sure we should carry the severity of the previous bug automatically.
Especially as it is a top crash here. The volume is quite high too.
(In reply to Sylvestre Ledru [:Sylvestre] from comment #8)
If I may, i am not sure we should carry the severity of the previous bug automatically.
Especially as it is a top crash here. The volume is quite high too.
Ah, I misunderstand "Not all crashes under this signature are the same" in comment 0 as it's a few which is handled in this bug. Yeah, S2 should be better due to a top crash (or S1?).
Updated•5 months ago
|
Comment 10•4 months ago
|
||
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 20 desktop browser crashes on release (startup)
For more information, please visit BugBot documentation.
Comment 11•4 months ago
|
||
Is comment 6 the first step we can take for this S2?
Comment 12•4 months ago
|
||
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit BugBot documentation.
Assignee | ||
Comment 13•4 months ago
|
||
There are some actionable ideas here, but I think at this point it is all guesses for investigation rather than specific concrete fixes. I haven't looked at this data structure in a decade so it'll take me a while to be able to do something about it again, and it looks like it is only barely in the top 20 for crashes, so I haven't put a high priority on this.
Assignee | ||
Updated•4 months ago
|
Assignee | ||
Updated•4 months ago
|
Assignee | ||
Comment 14•3 months ago
|
||
I spent some time look over this today. I'm not sure what could be going wrong. We do set the last two values in a block to null initially so maybe something isn't getting updated there? On the other hand, it doesn't really make any sense for 0x2 to be in one of these Edge blocks, so maybe there's some kind of wild pointer issue here.
I tried setting the EdgePool size to 4 in the hope that would better test edge cases around the end of the edge blocks but that didn't turn up anything.
I did find a few minor things about edge pools that aren't exactly right, but they shouldn't cause issues.
I'm not sure what exactly could cause this kind of weirdness. We run the cycle collector very frequently, so there must be some particular combination of unusual events. Nodes with edges near the end of an edge block have to deal with some odd conditions. The final node has some special handling. With incremental cycle collection, we have to deal with objects dying, but I think we handle that correctly here.
Assignee | ||
Updated•2 months ago
|
Assignee | ||
Comment 15•2 months ago
|
||
I'll mark this stalled for now. It is kind of at the edge of qualifying for a topcrash, so it doesn't seem super bad to me. Hopefully I'll get a chance to try to get my head around the kinds of invariants we could test in release builds.
Updated•2 months ago
|
Assignee | ||
Updated•1 month ago
|
Description
•