Open Bug 1869503 Opened 5 months ago Updated 1 month ago

Crash in [@ EdgePool::Iterator::operator*]

Categories

(Core :: Cycle Collector, defect)

defect

Tracking

()

Tracking Status
firefox-esr115 --- affected
firefox121 --- wontfix
firefox122 --- wontfix
firefox123 --- wontfix

People

(Reporter: gsvelto, Assigned: mccr8)

References

Details

(Keywords: crash, stalled, topcrash)

Crash Data

+++ This bug was initially created as a clone of Bug #500105 +++

This is a separate bug from bug 500105 to investigate the @ EdgePool::Iterator::operator* crash signature. Here's an example crash report: https://crash-stats.mozilla.org/report/index/db01c693-7bf2-402b-a03f-d0cbd0231212

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0  xul.dll  EdgePool::Iterator::operator* const  xpcom/base/nsCycleCollector.cpp:472
0  xul.dll  GraphWalker<ScanBlackVisitor>::DoWalk  xpcom/base/nsCycleCollector.cpp:1336
1  xul.dll  GraphWalker<ScanBlackVisitor>::Walk  xpcom/base/nsCycleCollector.cpp:1307
2  xul.dll  FloodBlackNode  xpcom/base/nsCycleCollector.cpp:2760
2  xul.dll  nsCycleCollector::ScanBlackNodes  xpcom/base/nsCycleCollector.cpp:2993
2  xul.dll  nsCycleCollector::ScanRoots  xpcom/base/nsCycleCollector.cpp:3021
2  xul.dll  nsCycleCollector::Collect  xpcom/base/nsCycleCollector.cpp:3506
3  xul.dll  nsCycleCollector_collectSlice  xpcom/base/nsCycleCollector.cpp:4008
4  xul.dll  nsJSContext::RunCycleCollectorSlice  dom/base/nsJSEnvironment.cpp:1470
4  xul.dll  mozilla::CCGCScheduler::CCRunnerFired  dom/base/nsJSEnvironment.cpp:1626

Not all crashes under this signature are the same, there's a few that are clearly the result of hardware with flaky memory; they're a minority though (less than 10%). To only see the crashes that match the one described above use this query.

I won't copy all the comments so check the discussion starting in bug 500105 comment 106. At its core it appears like we're finding two NULL pointers one after the other in an EdgePool array. This shouldn't happen, as a NULL pointer in that array is used a sentinel to instruct the code to use the next block from the following pointer.

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 content process crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

One other thing that's worth looking at in addition to (and very similar to) the analysis in bug 500105 comment 109: examine the alignment of the addresses of the edges (not the nodes that they point to). Since the edges are allocated in large blocks (see EdgePool and the classes nested within it), you should expect the null sentinel that precedes the pointer to the next block to have a particular alignment (depending on the allocator's behavior for that allocation size, which I suspect you know, but I don't), since it should be at the end of the block.

It's possible it may be worth doing the same analysis for the nodes if you can find the address of the node whose edges you're examining at this point. (See NodePool, which is similar except that it uses an enumerator pattern rather than iterators because it needs to be enumerated while being constructed.)

-> S3 because bug 500105 is set so.

Severity: -- → S3

Thanks a lot David, I'll try. Given this is happening with a certain frequency I might add some nightly-only checks to see if we can catch this condition when it happens rather than when we crash.

See Also: → 500105
See Also: → 1869170

Thanks for looking into this Gabriele, and thanks for the comments David. This is one of those areas of that code that I think has mostly been untouched for a long time. I understood what was going on at some point, but I'll have to look over it again.

It does look like I have some CC_GRAPH_ASSERT in there that check some conditions. Maybe those could be defined to be diagnostic asserts in all builds for investigating this.

Duplicate of this bug: 1869170
See Also: 1869170

If I may, i am not sure we should carry the severity of the previous bug automatically.
Especially as it is a top crash here. The volume is quite high too.

Flags: needinfo?(masayuki)

(In reply to Sylvestre Ledru [:Sylvestre] from comment #8)

If I may, i am not sure we should carry the severity of the previous bug automatically.
Especially as it is a top crash here. The volume is quite high too.

Ah, I misunderstand "Not all crashes under this signature are the same" in comment 0 as it's a few which is handled in this bug. Yeah, S2 should be better due to a top crash (or S1?).

Severity: S3 → S2
Flags: needinfo?(masayuki)

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on release (startup)

For more information, please visit BugBot documentation.

Is comment 6 the first step we can take for this S2?

Flags: needinfo?(continuation)

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

There are some actionable ideas here, but I think at this point it is all guesses for investigation rather than specific concrete fixes. I haven't looked at this data structure in a decade so it'll take me a while to be able to do something about it again, and it looks like it is only barely in the top 20 for crashes, so I haven't put a high priority on this.

Flags: needinfo?(continuation)
Flags: needinfo?(continuation)

I spent some time look over this today. I'm not sure what could be going wrong. We do set the last two values in a block to null initially so maybe something isn't getting updated there? On the other hand, it doesn't really make any sense for 0x2 to be in one of these Edge blocks, so maybe there's some kind of wild pointer issue here.

I tried setting the EdgePool size to 4 in the hope that would better test edge cases around the end of the edge blocks but that didn't turn up anything.

I did find a few minor things about edge pools that aren't exactly right, but they shouldn't cause issues.

I'm not sure what exactly could cause this kind of weirdness. We run the cycle collector very frequently, so there must be some particular combination of unusual events. Nodes with edges near the end of an edge block have to deal with some odd conditions. The final node has some special handling. With incremental cycle collection, we have to deal with objects dying, but I think we handle that correctly here.

Assignee: nobody → continuation

I'll mark this stalled for now. It is kind of at the edge of qualifying for a topcrash, so it doesn't seem super bad to me. Hopefully I'll get a chance to try to get my head around the kinds of invariants we could test in release builds.

Keywords: stalled
Flags: needinfo?(continuation)
You need to log in before you can comment on or make changes to this bug.