Crash @ GraphWalker
Categories
(Core :: Cycle Collector, defect, P5)
Tracking
()
People
(Reporter: samuel.sidler+old, Unassigned)
References
Details
(Keywords: crash, stalled, Whiteboard: [crashkill][crashkill-debug][tbird crash])
Crash Data
Attachments
(2 files, 3 obsolete files)
4.73 KB,
patch
|
dbaron
:
review+
|
Details | Diff | Splinter Review |
986 bytes,
patch
|
dbaron
:
review+
|
Details | Diff | Splinter Review |
Comment 1•16 years ago
|
||
Comment 2•16 years ago
|
||
Comment 3•16 years ago
|
||
Reporter | ||
Comment 4•16 years ago
|
||
Reporter | ||
Comment 5•16 years ago
|
||
Reporter | ||
Comment 6•16 years ago
|
||
Reporter | ||
Updated•16 years ago
|
Comment 7•16 years ago
|
||
Comment 8•16 years ago
|
||
Comment 10•15 years ago
|
||
Updated•15 years ago
|
Comment 11•15 years ago
|
||
Comment 13•15 years ago
|
||
Comment 14•15 years ago
|
||
Updated•15 years ago
|
Comment 17•15 years ago
|
||
Comment 19•15 years ago
|
||
Updated•15 years ago
|
Comment 20•15 years ago
|
||
Comment 21•15 years ago
|
||
Comment 23•15 years ago
|
||
Comment 24•15 years ago
|
||
Comment 25•15 years ago
|
||
Reporter | ||
Comment 26•15 years ago
|
||
Updated•15 years ago
|
Updated•15 years ago
|
Comment 27•15 years ago
|
||
Updated•15 years ago
|
Comment 28•15 years ago
|
||
Comment 30•15 years ago
|
||
Comment 32•15 years ago
|
||
Comment 33•15 years ago
|
||
Comment 34•15 years ago
|
||
Comment 35•15 years ago
|
||
Comment 36•15 years ago
|
||
Comment 38•15 years ago
|
||
Comment 40•15 years ago
|
||
Comment 41•15 years ago
|
||
Updated•15 years ago
|
Comment 43•15 years ago
|
||
Comment 44•15 years ago
|
||
Comment 45•15 years ago
|
||
Comment 46•15 years ago
|
||
Comment 47•15 years ago
|
||
Comment 48•15 years ago
|
||
Comment 50•15 years ago
|
||
Comment 51•15 years ago
|
||
Comment 52•15 years ago
|
||
Updated•14 years ago
|
Comment 54•14 years ago
|
||
Comment 55•14 years ago
|
||
Comment 57•14 years ago
|
||
Comment 59•14 years ago
|
||
Comment 60•14 years ago
|
||
Comment 61•14 years ago
|
||
Comment 62•14 years ago
|
||
Comment 63•14 years ago
|
||
Comment 64•14 years ago
|
||
![]() |
||
Comment 65•14 years ago
|
||
Comment 66•14 years ago
|
||
Assignee | ||
Updated•14 years ago
|
Updated•13 years ago
|
Comment 68•13 years ago
|
||
Comment 69•13 years ago
|
||
Comment 70•13 years ago
|
||
Comment 72•13 years ago
|
||
Comment 73•13 years ago
|
||
Updated•13 years ago
|
Comment 74•13 years ago
|
||
Comment 75•13 years ago
|
||
Updated•13 years ago
|
![]() |
||
Comment 76•13 years ago
|
||
Comment 77•13 years ago
|
||
Comment 78•13 years ago
|
||
Comment 79•13 years ago
|
||
Comment 80•12 years ago
|
||
Updated•12 years ago
|
Comment 81•12 years ago
|
||
Comment 83•10 years ago
|
||
Updated•9 years ago
|
Comment 85•9 years ago
|
||
Updated•8 years ago
|
Updated•7 years ago
|
Comment 92•7 years ago
|
||
Comment 94•6 years ago
|
||
Comment 95•6 years ago
|
||
Comment 101•5 years ago
|
||
(In reply to Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal) from comment #95)
This is now a fairly high volume crash on release 62, for example, for
GraphWalker<T>::DoWalk there are 2400+ crashes in the last week: https://crash-stats.mozilla.com/signature/?signature=GraphWalker%3CT%3E%3A%3ADoWalk
Current rate is about 1,400 per week for Firefox.
TCW's Thunderbird crash bp-c8782125-241c-489a-81d9-7b91c0200712
Comment 102•2 years ago
|
||
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on release (startup)
- Top 10 content process crashes on beta
- Top 10 content process crashes on release
For more information, please visit auto_nag documentation.
Updated•2 years ago
|
Updated•2 years ago
|
Comment 103•2 years ago
|
||
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit auto_nag documentation.
Updated•2 years ago
|
Comment 104•2 years ago
|
||
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit auto_nag documentation.
![]() |
||
Updated•2 years ago
|
Updated•2 years ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Comment 105•1 year ago
|
||
Removing all the signatures that don't have crashes on file anymore.
Comment 106•1 year ago
|
||
I've dug into the three remaining signatures and I found that @ PtrInfo::WasTraversed
and @ GraphWalker<T>::DoWalk
are clearly caused by bad hardware. A lot of the crashes under those signatures have been detected as having a bit-flip and come from older machines, nothing really new there. @ EdgePool::Iterator::operator*
on the other hand looks different, and maybe could be caused by a real bug. A handful of crashes under that signature have been caused by bad memory, this one is a good example. However crashes like that are a minority, the vast majority of crashes under that signature are dereferencing a NULL pointer and not an address that looks like the result of a bit-flip.
The exact point of those crashes is here, and in particular the (mPointer + 1)->block
expression is what's hitting the NULL pointer. Here's what's peculiar about this. *mPointer
yields a NULL pointer, that causes the condition in this line to be satisfied and so we enter the block where we crash. However, as the comment in the block states, that condition means we've found a sentinel and the following element in the array should be non-NULL. I've opened several minidumps and in all of them I found that both *mPointer
and *(mPointer + 1)
yielded NULL pointers. That is, the array we're iterating over contains two adjacent NULL elements, even though the code suggests that this should never happen.
Unfortunately I don't know the cycle-collector well enough to be able to tell what's going on, but it doesn't seem accidental.
A few comments (as the original author of the iterator in 34606b4dd39f):
- the
EdgePool
is basically storage for large blocks of edges in the graph that the cycle collector builds by calling its traversal methods. ThePtrInfo
(which are the nodes in the graph) store iterators for the first and last-plus-one outgoing edges, which are all stored adjacent to each other (logically, according to the iterators). But the edges are allocated in chunks, so the iterators sometimes need to jump from one chunk to the next. - They use the typical C++ iterator pattern where the "start" iterator points to the first item and the "end" iterator points to one past the last item.
- I'm not sure why the
operator*
needs that code to look at the next block at all; in hindsight it feels like it should only be in theoperator++
. It seems like it should be invalid to dereference the one-past-the-end iterator. But if code actually depends on dereferencing the one-past-the-end iterator, then that could be the source of the problem, since (I think, though I didn't really reread the code carefully) that means such code would crash if the very last edge allocated in the graph exactly aligned with the end of a block, and we needed to dereference the iterator corresponding to that very last edge, since at that point I think the next block wouldn't be allocated at all. - (In theory, you could also reach a null-dereference crash as a result of memory corruption if the null sentinel itself were corrupted into non-null and then the traversal continued past the end of the array. However, such a crash seems likely to crash by null-dereference in only a minority of cases.)
On second thoughts, we probably need that dereferencing behavior because we sometimes use the end-of-block iterator as the start iterator of a new chunk. So never mind... I think.
Comment 109•1 year ago
|
||
I dug through a dozen crashes to check for the values that are immediately before the two NULLs, and I found two different patterns. One is this:
00 00 01 54 6d 38 1a 08 <--- these all look like valid pointers
00 00 01 54 8c ac 68 88 <----+ | |
00 00 01 54 73 52 a7 28 <------+ |
00 00 01 54 8b 4f d4 28 <--------+
00 00 00 00 00 00 00 00 <--- mPointer points here
00 00 00 00 00 00 00 00
... all zeroes past this point
Or this:
00 00 02 18 71 2e e4 00 <--- looks like a valid pointer
00 00 00 00 00 00 00 02 <--- 0x2 constant?
00 00 02 18 7d 28 30 00 <--- again, what looks like a valid pointer
00 00 02 18 7d 28 30 28 <--- so does this
00 00 00 00 00 00 00 00 <--- mPointer points here
00 00 00 00 00 00 00 00
00 00 02 18 78 ba 88 00 <--- looks like a valid pointer again
00 00 00 00 00 00 00 00 <--- one more NULL
I found three crashes with each pattern, and I'm fairly confident that they come from different machines. The second pattern surprised me quite a bit, the pointer-0x2-pointer-pointer-null-null sequence doesn't appear like something that would occur accidentally.
Comment 110•1 year ago
|
||
The @ EdgePool::Iterator::operator*
is interesting and - save for the odd crash clearly caused by bad hardware - seems to be unrelated to the others. I'll split it out in a separate bug for further investigation.
Description
•