1869503 - Crash in [@ EdgePool::Iterator::operator*]

Reporter

Description

•

5 months ago

+++ This bug was initially created as a clone of Bug #500105 +++

This is a separate bug from bug 500105 to investigate the @ EdgePool::Iterator::operator* crash signature. Here's an example crash report: https://crash-stats.mozilla.org/report/index/db01c693-7bf2-402b-a03f-d0cbd0231212

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0  xul.dll  EdgePool::Iterator::operator* const  xpcom/base/nsCycleCollector.cpp:472
0  xul.dll  GraphWalker<ScanBlackVisitor>::DoWalk  xpcom/base/nsCycleCollector.cpp:1336
1  xul.dll  GraphWalker<ScanBlackVisitor>::Walk  xpcom/base/nsCycleCollector.cpp:1307
2  xul.dll  FloodBlackNode  xpcom/base/nsCycleCollector.cpp:2760
2  xul.dll  nsCycleCollector::ScanBlackNodes  xpcom/base/nsCycleCollector.cpp:2993
2  xul.dll  nsCycleCollector::ScanRoots  xpcom/base/nsCycleCollector.cpp:3021
2  xul.dll  nsCycleCollector::Collect  xpcom/base/nsCycleCollector.cpp:3506
3  xul.dll  nsCycleCollector_collectSlice  xpcom/base/nsCycleCollector.cpp:4008
4  xul.dll  nsJSContext::RunCycleCollectorSlice  dom/base/nsJSEnvironment.cpp:1470
4  xul.dll  mozilla::CCGCScheduler::CCRunnerFired  dom/base/nsJSEnvironment.cpp:1626

Not all crashes under this signature are the same, there's a few that are clearly the result of hardware with flaky memory; they're a minority though (less than 10%). To only see the crashes that match the one described above use this query.

I won't copy all the comments so check the discussion starting in bug 500105 comment 106. At its core it appears like we're finding two NULL pointers one after the other in an EdgePool array. This shouldn't happen, as a NULL pointer in that array is used a sentinel to instruct the code to use the next block from the following pointer.

BugBot [:suhaib / :marco/ :calixte]

Comment 1

•

5 months ago

The bug is linked to a topcrash signature, which matches the following criterion:

Top 10 content process crashes on beta

For more information, please visit BugBot documentation.

Keywords: topcrash

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Comment 2

•

5 months ago

One other thing that's worth looking at in addition to (and very similar to) the analysis in bug 500105 comment 109: examine the alignment of the addresses of the edges (not the nodes that they point to). Since the edges are allocated in large blocks (see EdgePool and the classes nested within it), you should expect the null sentinel that precedes the pointer to the next block to have a particular alignment (depending on the allocator's behavior for that allocation size, which I suspect you know, but I don't), since it should be at the end of the block.

It's possible it may be worth doing the same analysis for the nodes if you can find the address of the node whose edges you're examining at this point. (See NodePool, which is similar except that it uses an enumerator pattern rather than iterators because it needs to be enumerated while being constructed.)

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(away 4/27 - 5/6)

Comment 3

•

5 months ago

-> S3 because bug 500105 is set so.

Severity: -- → S3

Gabriele Svelto [:gsvelto]

Reporter

Comment 4

•

5 months ago

Thanks a lot David, I'll try. Given this is happening with a certain frequency I might add some nightly-only checks to see if we can catch this condition when it happens rather than when we crash.

Andrew McCreight [:mccr8]

Assignee

Updated

•

5 months ago

Updated

•

5 months ago

Comment 5

•

5 months ago

Thanks for looking into this Gabriele, and thanks for the comments David. This is one of those areas of that code that I think has mostly been untouched for a long time. I understood what was going on at some point, but I'll have to look over it again.

Andrew McCreight [:mccr8]

Assignee

Comment 6

•

5 months ago

It does look like I have some CC_GRAPH_ASSERT in there that check some conditions. Maybe those could be defined to be diagnostic asserts in all builds for investigating this.

Magnus Melin [:mkmelin]

Updated

•

5 months ago

Duplicate of this bug: 1869170

Magnus Melin [:mkmelin]

Updated

•

5 months ago

Comment 8

•

5 months ago

If I may, i am not sure we should carry the severity of the previous bug automatically.
Especially as it is a top crash here. The volume is quite high too.

Flags: needinfo?(masayuki)

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(away 4/27 - 5/6)

Comment 9

•

5 months ago

•

Edited

(In reply to Sylvestre Ledru [:Sylvestre] from comment #8)

If I may, i am not sure we should carry the severity of the previous bug automatically.
Especially as it is a top crash here. The volume is quite high too.

Ah, I misunderstand "Not all crashes under this signature are the same" in comment 0 as it's a few which is handled in this bug. Yeah, S2 should be better due to a top crash (or S1?).

Severity: S3 → S2

Flags: needinfo?(masayuki)

Donal Meehan [:dmeehan]

Updated

•

5 months ago

status-firefox121: --- → affected

status-firefox123: --- → affected

status-firefox-esr115: --- → affected

BugBot [:suhaib / :marco/ :calixte]

Comment 10

•

4 months ago

The bug is linked to a topcrash signature, which matches the following criterion:

Top 20 desktop browser crashes on release (startup)

For more information, please visit BugBot documentation.

Keywords: topcrash-startup

Hsin-Yi Tsai (she/her) [:hsinyi]

Comment 11

•

4 months ago

Is comment 6 the first step we can take for this S2?

Flags: needinfo?(continuation)

BugBot [:suhaib / :marco/ :calixte]

Comment 12

•

4 months ago

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash-startup

Andrew McCreight [:mccr8]

Assignee

Comment 13

•

4 months ago

There are some actionable ideas here, but I think at this point it is all guesses for investigation rather than specific concrete fixes. I haven't looked at this data structure in a decade so it'll take me a while to be able to do something about it again, and it looks like it is only barely in the top 20 for crashes, so I haven't put a high priority on this.

Andrew McCreight [:mccr8]

Assignee

Updated

•

4 months ago

Flags: needinfo?(continuation)

Andrew McCreight [:mccr8]

Assignee

Updated

•

4 months ago

Flags: needinfo?(continuation)

Andrew McCreight [:mccr8]

Assignee

Comment 14

•

3 months ago

I spent some time look over this today. I'm not sure what could be going wrong. We do set the last two values in a block to null initially so maybe something isn't getting updated there? On the other hand, it doesn't really make any sense for 0x2 to be in one of these Edge blocks, so maybe there's some kind of wild pointer issue here.

I tried setting the EdgePool size to 4 in the hope that would better test edge cases around the end of the edge blocks but that didn't turn up anything.

I did find a few minor things about edge pools that aren't exactly right, but they shouldn't cause issues.

I'm not sure what exactly could cause this kind of weirdness. We run the cycle collector very frequently, so there must be some particular combination of unusual events. Nodes with edges near the end of an edge block have to deal with some odd conditions. The final node has some special handling. With incremental cycle collection, we have to deal with objects dying, but I think we handle that correctly here.

Andrew McCreight [:mccr8]

Assignee

Updated

•

2 months ago

Assignee: nobody → continuation

Andrew McCreight [:mccr8]

Assignee

Comment 15

•

2 months ago

I'll mark this stalled for now. It is kind of at the edge of qualifying for a topcrash, so it doesn't seem super bad to me. Hopefully I'll get a chance to try to get my head around the kinds of invariants we could test in release builds.

Keywords: stalled

Hsin-Yi Tsai (she/her) [:hsinyi]

Updated

•

2 months ago

status-firefox121: affected → wontfix

status-firefox122: affected → wontfix

status-firefox123: affected → wontfix

Andrew McCreight [:mccr8]

Assignee

Updated

•

1 month ago

Flags: needinfo?(continuation)