Open Bug 500105 Opened 16 years ago Updated 10 months ago

Crash @ GraphWalker

Categories

(Core :: Cycle Collector, defect, P5)

defect

Tracking

()

Tracking Status
firefox-esr115 --- affected
firefox119 --- wontfix
firefox120 --- affected
firefox121 --- affected
firefox122 --- affected

People

(Reporter: samuel.sidler+old, Unassigned)

References

Details

(Keywords: crash, stalled, Whiteboard: [crashkill][crashkill-debug][tbird crash])

Crash Data

Attachments

(2 files, 3 obsolete files)

The current #7 topcrash occurs with a signature of GraphWalker::DoWalk(nsDeque&). This crash occurs across platforms (Mac and Windows so far). All crash signatures look like this one, taken from bp-a6c2a662-3402-487e-b4b7-a45442090623, sometimes ending on frame 0, sometimes with the GraphWalker::DoWalk line not repeated:

Frame  Module   Signature                              Source
0      xul.dll  GraphWalker::DoWalk(nsDeque&)          xpcom/base/nsCycleCollector.cpp:1186
1      xul.dll  GraphWalker::DoWalk(nsDeque&)          xpcom/base/nsCycleCollector.cpp:1182
2      xul.dll  GraphWalker::WalkFromRoots(GCGraph&)   xpcom/base/nsCycleCollector.cpp:1170
3      xul.dll  nsCycleCollector::BeginCollection()    xpcom/base/nsCycleCollector.cpp:2469

Lars: Can you grab some URLs for this issue from Socorro?
Flags: wanted1.9.1.x+
Bug 500189 has URLs for Firefox 3.5, 3.5pre and 3.5b99 (in that order)
Haven't people learned by now? Porn kills (your computer). Seriously -- ask yourself. If some porn costs a lot of money or a subscription fee, why is some of it free? Is it perhaps because they have an alternate revenue stream: installing malware?
Note: This also happens on 1.9.0 (currently #12 overall).
Flags: wanted1.9.0.x+
See also bug 437449, another cycle collector topcrash.
Peterv: Can you take a look at the crash reporter stack above and see if there's any problem we can fix here?
Assignee: nobody → peterv
Flags: wanted1.9.1.x+
I got one as well, seemingly without interaction. (Internal URLs only, sorry.) http://crash-stats.mozilla.com/report/index/54237630-c199-4b5b-af6e-63f462090728?p=1
Hmm, my laptop seems to have crashed on a specific link (http://profile.myspace.com/Modules/Applications/Pages/Canvas.aspx?appId=104283&appParams={%22show_user_id%22%3A%22152617042%22}). This is the crashing thread I got:

Frame  Module   Signature                          Source
0      xul.dll  GraphWalker::DoWalk                xpcom/base/nsCycleCollector.cpp:1186
1      xul.dll  GraphWalker::DoWalk                xpcom/base/nsCycleCollector.cpp:1182
2      xul.dll  GraphWalker::WalkFromRoots         xpcom/base/nsCycleCollector.cpp:1170
3      xul.dll  nsCycleCollector::BeginCollection  xpcom/base/nsCycleCollector.cpp:2469
We have a Windows XP desktop that suffers from these crashes when it runs Firefox 3.5.2; we encounter the problem at least twice a day. The only way we have been able to avoid this bug is to downgrade to the Firefox 3.0.xx branch. The crash often occurs as soon as a link is clicked, but we find it hard to reproduce; it seems to be very unpredictable. http://crash-stats.mozilla.com/report/index/46d27c8b-1809-4f90-bb17-9eff12090824 We may be having a related BSOD that crashes the entire system, which we never get while running Firefox 3.0.13 or earlier builds.
Johnny/Jonas: we need to figure this out before we ship Firefox 3.6
Flags: blocking1.9.2+
Priority: -- → P2
Attached patch v1 (obsolete) — Splinter Review
I'm not completely sure this is what causes the crash, but it seems like it potentially could. The iterator always increments mPointer, even after it just jumped to the start of a new block. I think one potential crash is that if mLastChild is set to the first pointer of the block, we could end up reading uninitialized memory because the iterator wouldn't stop since it would skip over the first pointer when doing operator++ (we do |child = pi->mFirstChild, child_end = pi->mLastChild; child != child_end; ++child|). When writing we don't use an iterator, so we do write to the first pointer of the block.
Attachment #403974 - Flags: review?(dbaron)
Comment on attachment 403974 [details] [diff] [review] v1 I think the code looks correct to me as it is now; the idea here is that iterators never point to the first pointer in a block; instead they point to the null sentinel at the end of the previous block (or in the pool itself) and dereferencing an iterator pointing to the sentinel (see operator*) returns the first pointer in that next block. I think this simplified things in other ways, e.g., by allowing us to create a valid iterator for the position after the end of a block before we've created the next block. I certainly should have documented that better, though.
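The design described in this comment can be sketched roughly as follows. This is a simplified illustration, not the code in nsCycleCollector.cpp: the names Node, Slot, Block, Iter, EdgeStore and CollectIds are hypothetical, and the block size is tiny so that block boundaries are easy to hit. An iterator that would logically refer to the first pointer of a block is instead parked on the null sentinel before it, and dereferencing follows the link stored after the sentinel.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a graph node.
struct Node { int id; };

struct Block;
union Slot {
  Node* node;    // an edge target, or nullptr when the slot is a sentinel
  Block* block;  // link to the next block (slot right after a sentinel)
};

constexpr std::size_t kBlockSize = 4;  // 2 usable slots + sentinel + link

struct Block {
  Slot slots[kBlockSize];
  Block() {
    for (std::size_t i = 0; i + 1 < kBlockSize; ++i) slots[i].node = nullptr;
    slots[kBlockSize - 1].block = nullptr;  // no next block yet
  }
  Slot* Start() { return &slots[0]; }
  Slot* End() { return &slots[kBlockSize - 2]; }  // address of the sentinel
  Block*& Next() { return slots[kBlockSize - 1].block; }
};

class Iter {
 public:
  explicit Iter(Slot* p) : p_(p) {}
  Node* operator*() const {
    // On a sentinel, the logical position is the first slot of the next block.
    if (!p_->node) return (p_ + 1)->block->Start()->node;
    return p_->node;
  }
  Iter& operator++() {
    // Jump across the sentinel first, then advance past the first slot.
    if (!p_->node) p_ = (p_ + 1)->block->Start();
    ++p_;
    return *this;
  }
  bool operator!=(const Iter& other) const { return p_ != other.p_; }

 private:
  Slot* p_;
};

class EdgeStore {
 public:
  EdgeStore() : cur_(&head_[0]), end_(&head_[0]), next_link_(&head_[1].block) {
    head_[0].node = nullptr;   // sentinel inside the pool itself
    head_[1].block = nullptr;  // link to the first block
  }
  EdgeStore(const EdgeStore&) = delete;
  EdgeStore& operator=(const EdgeStore&) = delete;

  // Valid even when the next block has not been allocated yet.
  Iter Mark() { return Iter(cur_); }

  // Writing does not go through an iterator, so the first slot is written.
  void Add(Node* n) {
    if (cur_ == end_) {
      blocks_.push_back(std::make_unique<Block>());
      Block* b = blocks_.back().get();
      *next_link_ = b;
      cur_ = b->Start();
      end_ = b->End();
      next_link_ = &b->Next();
    }
    (cur_++)->node = n;
  }

 private:
  Slot head_[2];
  Slot* cur_;
  Slot* end_;
  Block** next_link_;
  std::vector<std::unique_ptr<Block>> blocks_;
};

// Walks [first, last) the same way the child loop quoted above does.
inline std::vector<int> CollectIds(Iter first, Iter last) {
  std::vector<int> ids;
  for (; first != last; ++first) ids.push_back((*first)->id);
  return ids;
}
```

In this sketch, marking a range before any edge is added and then adding five edges spans three blocks, and walking the range still visits every edge exactly once, including the first slot of each block.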
Attachment #403974 - Flags: review?(dbaron) → review-
Keywords: topcrash
Attached patch Add some debugging help (obsolete) — Splinter Review
This adds a number of aborts when certain conditions fail (pointers outside of blocks, null pointers where we didn't expect it, ...). I think we should try to land this on trunk to get some more data out of crash reports. I'm also still looking into adding more stuff on the stack, so we can get more out of minidumps.
Attachment #403974 - Attachment is obsolete: true
Attached patch Add some debugging help (obsolete) — Splinter Review
I'd like to land this on trunk (only), but only until we have a couple of crash reports and minidumps. The aborts will probably move the crash to a different spot, but that should give us slightly more data to go on.
Attachment #404970 - Attachment is obsolete: true
Attachment #406482 - Flags: review?(dbaron)
Whiteboard: [crashkill]
How is this going to help; NS_ABORT_IF_FALSE is DEBUG-only. Don't you want runtime aborts?
Other than that, this looks fine, though. Hopefully it won't be a performance hit. Sorry for the delay in getting to it...
Grmbl, I misread nsDebug.h, I'll switch to NS_RUNTIMEABORT. As for performance, I ran this through tryserver. Most of the numbers didn't really change, shutdown numbers changed a bit but some were down, so not sure how much I need to care about the ones that went up.
OK, r=dbaron with NS_RUNTIMEABORT (you need to write your own if-statements with that).
I actually went with a CC_RUNTIME_ABORT_IF_FALSE. I'll run this through tryserver again.
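For readers unfamiliar with the distinction discussed here, a release-mode abort check might look something like the sketch below. The macro SKETCH_RUNTIME_ABORT_IF_FALSE and the helpers SketchAbort and CheckedAt are hypothetical names for illustration; this is not the CC_RUNTIME_ABORT_IF_FALSE from the patch. Unlike the standard assert, which is compiled out when NDEBUG is defined, this check stays active in every build. Whether std::abort() ends up in a crash reporter is platform-dependent and is not addressed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: report the failed condition and terminate.
[[noreturn]] inline void SketchAbort(const char* expr, const char* file,
                                     int line) {
  std::fprintf(stderr, "Runtime check failed: %s (%s:%d)\n", expr, file, line);
  std::abort();
}

// Active in debug and release builds alike; it does not depend on NDEBUG.
#define SKETCH_RUNTIME_ABORT_IF_FALSE(cond)                \
  do {                                                     \
    if (!(cond)) SketchAbort(#cond, __FILE__, __LINE__);   \
  } while (0)

// Example use: abort early instead of reading out of bounds.
inline int CheckedAt(const int* data, std::size_t len, std::size_t i) {
  SKETCH_RUNTIME_ABORT_IF_FALSE(data != nullptr);
  SKETCH_RUNTIME_ABORT_IF_FALSE(i < len);
  return data[i];
}
```

The point of such a check is to move the crash to the first place where an invariant is known to be broken, so that the crash signature carries more information.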
Attachment #406482 - Attachment is obsolete: true
Attachment #408038 - Flags: review?(dbaron)
Attachment #406482 - Flags: review?(dbaron)
Whiteboard: [crashkill] → [crashkill] ready to land debugging help code?
Debugging help landed (in two pieces): http://hg.mozilla.org/mozilla-central/rev/9bb5e2a5c1ac http://hg.mozilla.org/mozilla-central/rev/80831c195191 I think the second piece missed today's nightly.
Since landing this on trunk there have been no new reports submitted on this crash, so I don't have any data yet from the logging patches.
Are you sure they wouldn't show up under a different signature?
I check for signatures GraphWalker::DoWalk(nsDeque&), EdgePool::CheckIterator(Iterator&) and NodePool::CheckPtrInfo(PtrInfo*). I think that should catch them.
What's next here?
Comment on attachment 408038 [details] [diff] [review] Add some debugging help On the beta the frequency of this crash is slightly higher (though not very high either). If we have a quick new beta I'd like to take this on the branch so it rides along, and we actually get some data.
Attachment #408038 - Flags: approval1.9.2?
Peter: This bug is blocking1.9.2. You don't need approval to land that. :) (We're also planning to update current beta users to a new beta next week, iirc.)
Whiteboard: [crashkill] ready to land debugging help code? → [crashkill][crashkill-fix] ready to land debugging help code?
Whiteboard: [crashkill][crashkill-fix] ready to land debugging help code? → [crashkill][crashkill-debug] ready to land debugging help code?
Comment on attachment 408038 [details] [diff] [review] Add some debugging help I asked for approval because this isn't really a fix. But anyway, landed on 1.9.2: http://hg.mozilla.org/releases/mozilla-1.9.2/rev/297f674eb90f http://hg.mozilla.org/releases/mozilla-1.9.2/rev/6b79d9973d7b Let's hope we get some reports.
Attachment #408038 - Flags: approval1.9.2?
Whiteboard: [crashkill][crashkill-debug] ready to land debugging help code? → [crashkill][crashkill-debug][debugging code landed on trunk and 1.9.2]
No new crash reports on trunk or 3.6b2pre :-(.
David, wouldn't those be for bug 437449? That one seemed related to thread-safety issues?
No, the MarkRoots crash has an almost identical statistical profile (core count distribution, module correlations) to this one, and I've been presuming it's the same underlying problem as this one. It's definitely not a threadsafety problem.
Well, MarkRoots doesn't have any of the debugging code.
Two reports on 3.6b2: http://crash-stats.mozilla.com/report/index/bfbedc89-ca77-4783-a519-124532091111 http://crash-stats.mozilla.com/report/index/1e1f7869-c343-46c7-b5d5-163632091112 I have the minidumps, the debugging code doesn't seem to have helped much. Back to trying to figure things out from the assembly.
This bug causes frequent intermittent crashes on our Windows XP (SP3) box. No single action precipitates a crash; it appears to be completely random.
We seem to sometimes have a bogus pointer to the next block. At first I thought we might have a bogus mFirstChild/mLastChild, so we'd walk randomly in the blocks and mistake a null PtrInfo* for the sentinel. But one of the crashes seems to be in the debug code I added, when walking the blocks. We walk the blocks using the blocksize, so in that case we just seem to have a bogus pointer at the right spot (last item in the array). I've looked at the block code again, don't see how it could happen. Maybe something else is corrupting our blocks' memory.
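As a rough illustration of the kind of check described here (walking the blocks and testing whether a pointer lies inside one of them), consider the sketch below. The types Slot and Block and the functions InBlock and InAnyBlock are hypothetical and much simpler than the real EdgePool. Note that the walk itself trusts each block's next pointer, so a corrupted link would make the checker crash too, which is consistent with one of the crashes landing in the debug code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical, simplified storage: fixed-size blocks chained by `next`.
struct Slot { void* value = nullptr; };

struct Block {
  static constexpr std::size_t kSize = 8;
  Slot slots[kSize];
  Block* next = nullptr;
};

// True when p points at one of b's slots. std::less gives a well-defined
// ordering even for pointers into unrelated objects.
inline bool InBlock(const Block& b, const Slot* p) {
  const std::less<const Slot*> before;
  return !before(p, b.slots) && before(p, b.slots + Block::kSize);
}

// Walks the chain starting at `first`. A bogus `next` pointer would make
// this walk read invalid memory, just like the code it is meant to check.
inline bool InAnyBlock(const Block* first, const Slot* p) {
  for (const Block* b = first; b != nullptr; b = b->next) {
    if (InBlock(*b, p)) return true;
  }
  return false;
}
```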
(In reply to comment #34) > This bug causes frequent intermittent crashes on our Windows XP (SP3) box. How frequent? Any chance we could get you to generate a full dump when it crashes (I think we can use DrWatson for that)?
-'ing.
blocking2.0: --- → alpha1
Flags: blocking1.9.2+ → blocking1.9.2-
(In reply to comment #18) > OK, r=dbaron with NS_RUNTIMEABORT (you need to write your own if-statements > with that). The NS_RUNTIMEABORT comments are scary, they sound to me like we wouldn't be able to trigger Breakpad on all platforms with that. Is that really what you want?
I think we should back out the debugging code on m-c and 1.9.2.
I agree about 1.9.2 (http://hg.mozilla.org/releases/mozilla-1.9.2/rev/96a497f82546 pushed earlier this week), but I don't see why we want to remove it from m-c yet. I think we should make sure it brings up breakpad, and see if anything shows up on crash-stats then.
I'd like to take this on trunk right now (unless we get a patch for bug 532490 in the meantime), and see if anything shows up in crash-stats.
Attachment #415830 - Flags: review?(dbaron)
Whiteboard: [crashkill][crashkill-debug][debugging code landed on trunk and 1.9.2] → [crashkill][crashkill-debug][debugging code landed on trunk]
Not blocking the first alpha on this bug.
blocking2.0: alpha1 → beta1
I got this reported yesterday using Mozilla/5.0 (X11; U; Linux i686; rv:1.9.1.9pre) Gecko/20100208 SeaMonkey/2.0.4pre but when it happened again (I presume the same) today the reporter apparently ignored it. Today's current URL was http://us.imdb.com/media/rm1043761920/nm0209289 and I was attempting to return to http://us.imdb.com/name/nm0209289/mediaindex when the crash occurred. As I was attempting to count the number of tabs open (after restore/restart; 23 counted, 2 left to count, total 25) so as to proceed with this comment, it crashed again, and again the reporter failed to come up. SM was started from Konsole, and this is that window's resulting output:

The program 'seamonkey-bin' received an X Window System error. This probably reflects a bug in the program. The error was 'RenderBadPicture (invalid Picture parameter)'. (Details: serial 1998858 error_code 182 request_code 155 minor_code 5) (Note to programmers: normally, X errors are reported asynchronously; that is, you will receive the error a while after causing it. To debug your program, run it with the --sync command line option to change this behavior. You can then get a meaningful backtrace from your debugger if you break on the gdk_x_error() function.)
In 8 or so hours since comment 43 it crashed again, and again. Then the machine locked up and would not reboot into Linux. Main memory failed in a big way according to Memtest86+ 4.0.
Blocks: 557161
Moving this to beta2. Not seeing a lot of movement here, but yell if you think this should block the first beta.
blocking2.0: beta1+ → beta2+
Is the debugging code talked about in comments 39-41 still on mozilla-central? That means it would be going out in beta. We should figure out if that's a good idea even if we don't have a good understanding of the cause of the crash or the fix yet.
Moving this to beta3, where it will block hard at least on ensuring that the debugging code has been removed - not sure where it lands as a blocker for the fix.
blocking2.0: beta2+ → beta3+
Has the debugging code been removed? Can we get an answer to comment 46, please?
The debugging code is still present in mozilla-central (and thus still needs to be removed).
PeterV: can we get that debugging code removed by Monday, Aug 2 at 23:00 PT please so we can bump this back off the blocking list as per comment 47
I just backed this out.
Peter's backout is: http://hg.mozilla.org/mozilla-central/rev/01877f113dab - thanks! Moving back to blocking2.0:? for retriage on the crash issue.
blocking2.0: beta3+ → ?
Summary: top crash [@ GraphWalker::DoWalk(nsDeque&)] → top crash [@ GraphWalker::DoWalk(nsDeque&)][@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)]
About 1500 crashes per day. Current volumes per release look like this:

checking --- GraphWalker::DoWalk.nsDeque.. 20101003-crashdata.csv
found in: 3.6.10 3.5.13 3.6.8 3.0.19 3.6.3 3.6.6 3.6 3.6.9 3.6.4 3.5b4 3.5.7 3.6b5 3.6.2 3.5.5 3.5.11 3.5.3 3.1b2 3.5.9 3.5.2 3.0b2 3.6b1 3.6.7 3.5.6 3.5.10 3.5 3.0b5 3.0.5 3.0.17 3.0.10 3.5.8 3.5.12 3.5.1 3.1b3 3.0.9 3.0.6 3.0.18 3.0.15 3.0.14 3.0

release   total-crashes   GraphWalker::DoWalk.nsDeque.. crashes   pct.
all       353258          1213                                    0.00343375
3.6.10    211184          822                                     0.00389234
3.5.13    17578           112                                     0.0063716
3.6.8     19390           65                                      0.00335224

checking --- GraphWalker.scanVisitor.::DoWalk.nsDeque.. 20101003-crashdata.csv
found in: 4.0b6 4.0b2 4.0b4 4.0b7pre 4.0b1 4.0b5 4.0b3 3.7a1

release   total-crashes   GraphWalker.scanVisitor.::DoWalk.nsDeque.. crashes   pct.
all       353258          127                                                  0.000359511
4.0b6     24891           87                                                   0.00349524
4.0b2     1209            12                                                   0.00992556
4.0b4     1853            10                                                   0.00539665
4.0b7pre  2328            5                                                    0.00214777
Not a serious regression, and without clues as to how to reproduce it, probably not a blocker. I'd love to have more information, though. Correlation reports would be especially helpful.
blocking2.0: ? → -
So is this saying that the bug still exists from all the way back to 2009-06-23 18:57:38 PDT? That it still hasn't been fixed? If so is there an ETA of a fix?
Nobody knows how to cause the crash to happen, and as a result no developer has been able to observe the crash happening and figure out why.
Thanks, at least I have an answer to the question.
It is #9 top crasher in 4.0b8 for the last week.
Keywords: crash
Still #9 in 4.0b9.

GraphWalker<scanVisitor>::DoWalk(nsDeque&)|EXCEPTION_ACCESS_VIOLATION_READ (85 crashes)
18% (15/85) vs. 6% (805/14431) {AB2CE124-6272-4b12-94A9-7303C7397BD1} (Skype)
26% (22/85) vs. 14% (2016/14431) {d10d0bf8-f5b5-c8b4-a8b2-2b9879e08c5d} (Adblock Plus, https://addons.mozilla.org/addon/1865)
25% (21/85) vs. 16% (2373/14431) engine@conduit.com
13% (11/85) vs. 7% (1020/14431) {CAFEEFAC-0016-0000-0022-ABCDEFFEDCBA} (Java console)
92% (78/85) vs. 87% (12517/14431) testpilot@labs.mozilla.com (Mozilla Labs - Test Pilot, https://addons.mozilla.org/addon/13661)
Minefield just crashed on me while I was away from my PC, the crash report pointed me here. https://crash-stats.mozilla.com/report/index/bp-ea34b673-743a-44e2-ad7a-729f22110213
Doesn't seem to be associated with startup (only about 10% of crashes are within the first 3 minutes of startup). Also no pattern in URLs; looks like just general browsing.

Domains of sites:
228 http://www.facebook.com
105 \N//
57 http://www.youtube.com
55 http://apps.facebook.com
29 http://www.orkut.com.br
29 http://vkontakte.ru
14 https://mail.google.com
9 http://nk.pl
8 http://www.google.com
6 http://www.google.de
6 http://my.mail.ru
6 about:blank//
5 https://www.google.com
5 http://www.odnoklassniki.ru
4 http://www.xvideos.com
4 http://www.google.com.br
4 http://us.mg1.mail.yahoo.com
4 http://en.wikipedia.org
3 https://www.facebook.com
3 https://login.yahoo.com
3 http://www.tuenti.com
3 http://www.meinvz.net
3 http://www.google.co.in
3 http://vnexpress.net
3 http://us.mg5.mail.yahoo.com
3 http://twitter.com
3 http://kino.to
3 http://foto.mail.ru
3 http://a.adwolf.ru

I notice about 50% of reports might have unversioned .dll's around:
http://crash-stats.mozilla.com/report/index/9b98bd1f-d91f-4666-9699-2aa312110306 FFExternalAlert.dll http://crash-stats.mozilla.com/report/index/7e552492-9189-411d-8dfe-d2c502110306 http://crash-stats.mozilla.com/report/index/f6539ce3-c857-447f-ad7f-4ec4c2110306 zipfldra.dll http://crash-stats.mozilla.com/report/index/5b0ff3be-1662-4489-8896-e40912110306 UnlockerHook.dll http://crash-stats.mozilla.com/report/index/bc98fcd3-0177-4494-b0ff-b4ff82110306 http://crash-stats.mozilla.com/report/index/016e47d1-971a-4f33-a8df-951022110306 http://crash-stats.mozilla.com/report/index/a219ccdc-14c0-4121-9b70-89f3e2110306 http://crash-stats.mozilla.com/report/index/7d223bb8-3558-4bbb-8aed-b36852110306 MGKBHook.dll FFExternalAlert.dll http://crash-stats.mozilla.com/report/index/fcef3988-b3c6-4dd5-b1c4-f47c82110306 http://crash-stats.mozilla.com/report/index/d2484921-a187-42ad-8a50-c99842110306 http://crash-stats.mozilla.com/report/index/aa66f9a0-4e4d-486c-bc1b-de40e2110306
http://crash-stats.mozilla.com/report/index/26e031be-8f17-479a-bf60-b44972110306 AlotXpcom.dll BRNstFF.dll Iminent.XPCOM.dll http://crash-stats.mozilla.com/report/index/1108d236-967f-4453-bc0c-bfb5f2110306 http://crash-stats.mozilla.com/report/index/3e83e53d-17c8-4d2b-bd22-874a32110306 http://crash-stats.mozilla.com/report/index/311dc382-619d-47e5-8349-4a38a2110306 http://crash-stats.mozilla.com/report/index/eee5710a-9109-4449-809a-3fce82110306 GrabXpcom.dll GrabKernel.dll http://crash-stats.mozilla.com/report/index/77cdb656-ce4b-40e5-9698-1fd7b2110306 http://crash-stats.mozilla.com/report/index/8a9e290b-d60b-4676-a409-2aaeb2110306 UnlockerHook.dll ActWndHk.dll http://crash-stats.mozilla.com/report/index/6cc26abd-6033-4a0d-8ff0-01add2110306 frozen.dll googletoolbar-ff3.dll http://crash-stats.mozilla.com/report/index/5ffefa1d-d2fb-4b62-b788-7ec652110306 googletoolbar-ff3.dll frozen.dll http://crash-stats.mozilla.com/report/index/16511654-9864-43af-9cb5-c71d62110306 http://crash-stats.mozilla.com/report/index/17826542-f8c7-4a4b-82f1-74f542110306 GrabKernel.dll GrabXpcom.dll http://crash-stats.mozilla.com/report/index/787e7779-446a-4d08-9abe-8d91e2110306 http://crash-stats.mozilla.com/report/index/9d8aab45-ccf9-44ac-8530-f954d2110306 http://crash-stats.mozilla.com/report/index/84f1c8e8-b420-4b07-b893-acb622110306 http://crash-stats.mozilla.com/report/index/a5ea2b64-fb9b-4d05-8737-c57052110306 http://crash-stats.mozilla.com/report/index/0c325356-a9df-4f8b-953e-c45ac2110306 GrabXpcom.dll GrabKernel.dll http://crash-stats.mozilla.com/report/index/4a28d16c-91c1-4ba5-a293-ac0d82110306 http://crash-stats.mozilla.com/report/index/4e82b168-a933-415b-9d30-0dc9f2110306 http://crash-stats.mozilla.com/report/index/8d1595ca-4a3e-4a3d-8523-f35e62110306 http://crash-stats.mozilla.com/report/index/bdc74bb5-5a87-4125-875d-e241f2110306 http://crash-stats.mozilla.com/report/index/003cc31f-1c6f-427c-b99d-698442110306 googletoolbar-ff3.dll frozen.dll 
http://crash-stats.mozilla.com/report/index/510dda1d-c6f0-4f40-8c34-c8e332110306 BTKeyInd.dll http://crash-stats.mozilla.com/report/index/5e1a4445-537e-43b6-9358-390382110306 http://crash-stats.mozilla.com/report/index/af9ad84b-4208-4177-8581-5f6002110306 http://crash-stats.mozilla.com/report/index/db4928c5-484e-45aa-ab1b-17a962110306 http://crash-stats.mozilla.com/report/index/1dd145d9-782c-4121-b4e5-d41842110306 http://crash-stats.mozilla.com/report/index/f4200636-0bed-4a2b-9fc0-624b42110306 RadioWMPCore.dll FFExternalAlert.dll http://crash-stats.mozilla.com/report/index/70107fad-dd42-43bd-b558-191da2110306 http://crash-stats.mozilla.com/report/index/d90ab2c1-00d7-432a-8e60-9c7a62110306 http://crash-stats.mozilla.com/report/index/853fdee6-a245-425a-a63a-88fa12110306 newdll.dll http://crash-stats.mozilla.com/report/index/e1d2ea02-e7c0-4a65-b96d-957e42110306 http://crash-stats.mozilla.com/report/index/9e8966cd-2bbe-4763-ae5c-8d3632110306 http://crash-stats.mozilla.com/report/index/d3f4599c-6b14-4f82-aeba-7c9a62110306 http://crash-stats.mozilla.com/report/index/5a748cee-3988-4043-8fe6-f0d7d2110306 http://crash-stats.mozilla.com/report/index/cd78e5ca-6390-4be0-8a7b-b53b12110306 UnlockerHook.dll http://crash-stats.mozilla.com/report/index/6563c12f-e1df-47c5-ad19-d57c82110306 http://crash-stats.mozilla.com/report/index/651a50e3-b979-440e-a1d6-82f1c2110306 dll.dll UKHook40.dll http://crash-stats.mozilla.com/report/index/2e7d6d85-7467-4b95-b2bb-41b332110306 lpxpcom.dll http://crash-stats.mozilla.com/report/index/946db64b-1277-4e45-a0d1-c8d572110306 http://crash-stats.mozilla.com/report/index/eb18fb21-3db5-42db-8584-6b4802110306 http://crash-stats.mozilla.com/report/index/22026bdf-e305-4625-9bb8-7d96b2110306 http://crash-stats.mozilla.com/report/index/c84acfec-898c-4c39-ae55-aa8392110306 http://crash-stats.mozilla.com/report/index/37d55f55-736f-4c06-9504-1bbb02110306 googletoolbar-ff3.dll frozen.dll 
http://crash-stats.mozilla.com/report/index/010fb54c-34ad-4714-9784-248362110306 UKHook40.dll http://crash-stats.mozilla.com/report/index/291202e5-0991-439e-92db-471d82110306 FFExternalAlert.dll
It starts showing up as #4 top crasher in 4.0 RC1. Some comments say: "I was on the Addons page, and had clicked to go to top rated personas when it crashed." "was downloading some stuff and got booted off the internet" "Just looking around on Amazon"
#10 on 5.0b3 right now, FWIW.
(In reply to comment #65) > #10 on 5.0b3 right now, FWIW. And #3 top crasher without hangs.
Crash Signature: [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)]
Crash Signature: [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] → [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] [@ GraphWalker<ScanBlackVisitor>::DoWalk(nsDeque&) ]
https://tbpl.mozilla.org/php/getParsedLog.php?id=7562031&tree=Mozilla-Inbound - so there's one way to repro, run mochitest-other 50K times (or however many pushes we've actually triggered builds for), you'll hit it once.
Still appears consistently in the top 20 crashes for releases. Can we investigate this further?
This has been a top crash for a long time. The stack that's consistently high is GraphWalker<scanVisitor>::DoWalk(nsDeque&). We have just over 3500 on 10.0 in the past week. It's not a startup crash. Is there anything we can do to investigate this further?
About half of them are null-derefs. Maybe we can add some release-mode assertions to push the crash to an earlier point where it would be more useful. I can take a look at that after I finish with the NoteXPCOMChild crashes.
Whiteboard: [crashkill][crashkill-debug][debugging code landed on trunk] → [crashkill][crashkill-debug]
Assignee: peterv → continuation
mccr8, that would be awesome.
Depends on: 727604
WalkFromRoots is a similar signature that has shown up recently. Probably the same thing, just showing up differently in the crash reports due to different inlining.
Crash Signature: [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] [@ GraphWalker<ScanBlackVisitor>::DoWalk(nsDeque&) ] → [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] [@ GraphWalker<ScanBlackVisitor>::DoWalk(nsDeque&) ] [@ GraphWalker<scanVisitor>::WalkFromRoots(GCGraph&)]
Crash Signature: [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] [@ GraphWalker<ScanBlackVisitor>::DoWalk(nsDeque&) ] [@ GraphWalker<scanVisitor>::WalkFromRoots(GCGraph&)] → GraphWalker<ScanBlackVisitor>::Walk] [@ GraphWalker<scanVisitor>::WalkFromRoots(GCGraph&)] [@ GraphWalker<scanVisitor>::WalkFromRoots] [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk…
Summary: top crash [@ GraphWalker::DoWalk(nsDeque&)][@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] → Crash @ GraphWalker
Still a topcrash - mccr8, did you get somewhere with what you mentioned in comment #73?
I landed some assertions and un-inlining, which are currently on Nightly and Aurora. No progress in figuring out what the problem is. I don't know if I should back out the changes or not. I don't think it will affect performance to any measurable extent, but I could check. It isn't that common on Nightly. If you add the two GraphWalker signatures up on Nightly, you get a ranking of around 65. On Aurora, around 45. On beta, it shows up at 16. In release 11 it is at 12. I'm not sure why there is such a large difference. I've noticed it before. It could be malware/junkware related, or perhaps our cycle collector optimizations, which make the CC touch fewer things in memory, just avoid touching bad things, so it isn't showing up here.
Still in the top 20 for crashes on release, Fx12.
This has gone way down in volume on all channels. Still a valid crash but removing the top crash keyword.
Keywords: topcrash
Currently this is around #90 on 16, #80 on 17. Either the move to a new compiler fixed a compiler bug, or with our CC optimizations we're touching bad memory less.
Version: 1.9.1 Branch → Trunk
top 50 crash for TB17
Whiteboard: [crashkill][crashkill-debug] → [crashkill][crashkill-debug][tbird crash]
Currently about #288 on Nightly.
Assignee: continuation → nobody
Different crashes: https://crash-stats.mozilla.com/search/?product=Firefox&signature=DoWalk&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature Accounts for 1824 crashes the last 7 days. Adding Thunderbird to the mix raises the number to 1860. Currently placed as #104 for 38.0.5 for GraphWalker<T>::DoWalk(nsDeque&) Top-crashes however only counts 948 of these, meaning half of them are other versions of firefox. Using the search numbers would place it in top 50 for Firefox top-crashes.
Crash Signature: GraphWalker<ScanBlackVisitor>::Walk] [@ GraphWalker<scanVisitor>::WalkFromRoots(GCGraph&)] [@ GraphWalker<scanVisitor>::WalkFromRoots] → GraphWalker<ScanBlackVisitor>::Walk] [@ GraphWalker<scanVisitor>::WalkFromRoots(GCGraph&)] [@ GraphWalker<scanVisitor>::WalkFromRoots] [@ GraphWalker::DoWalk] [@ GraphWalker<T>::DoWalk] [@ GraphWalker<T>::Walk] [@ GraphWalker<T>::WalkFromRoots]
Flags: needinfo?(norikachi003)
The bug is back -- 9 (!) years later: https://crash-stats.mozilla.com/report/index/cbfa60bf-d335-4ace-9e09-939341171106 I was not doing anything, I was away from the browser. The page with YouTube's subscriptions just died on its own.
I was not doing anything, I was away from the browser. maybe the ads script on the site: http://wiki.edu.vn/wiki http://wikideu.com/wiki/

(In reply to Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal) from comment #95)

This is now a fairly high volume crash on release 62, for example, for
GraphWalker<T>::DoWalk there are 2400+ crashes in the last week: https://crash-stats.mozilla.com/signature/?signature=GraphWalker%3CT%3E%3A%3ADoWalk

Current rate is about 1,400 per week for Firefox.

TCW's Thunderbird crash bp-c8782125-241c-489a-81d9-7b91c0200712

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on release (startup)
  • Top 10 content process crashes on beta
  • Top 10 content process crashes on release

For more information, please visit auto_nag documentation.

Crash Signature: [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk] [@ GraphWalker<ScanBlackVisitor>::DoWalk(nsDeque&)] [@ GraphWalker<ScanBlackVisitor>::DoWalk] [@ GraphWalker<ScanBlackVisitor>::Walk… → [@ GraphWalker<scanVisitor>::DoWalk] [@ GraphWalker<ScanBlackVisitor>::DoWalk] [@ GraphWalker<ScanBlackVisitor>::Walk] [@ GraphWalker<scanVisitor>::WalkFromRoots] [@ GraphWalker::DoWalk] [@ GraphWalker<T>::DoWalk] [@ GraphWalker<T>::Walk] [@ GraphW…

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Severity: critical → S2

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash
Crash Signature: GraphWalker<T>::WalkFromRoots] [@ EdgePool::Iterator::operator* ] → GraphWalker<T>::WalkFromRoots] [@ EdgePool::Iterator::operator*] [@ PtrInfo::WasTraversed]
Component: XPCOM → Cycle Collector
Keywords: stalled
Severity: S2 → S3
Priority: P2 → P5

Removing all the signatures that don't have crashes on file anymore.

Crash Signature: [@ GraphWalker<scanVisitor>::DoWalk] [@ GraphWalker<ScanBlackVisitor>::DoWalk] [@ GraphWalker<ScanBlackVisitor>::Walk] [@ GraphWalker<scanVisitor>::WalkFromRoots] [@ GraphWalker::DoWalk] [@ GraphWalker<T>::DoWalk] [@ GraphWalker<T>::Walk] [@ GraphW… → [@ GraphWalker<T>::DoWalk] [@ EdgePool::Iterator::operator*] [@ PtrInfo::WasTraversed]

I've dug into the three remaining signatures and I found that @ PtrInfo::WasTraversed and @ GraphWalker<T>::DoWalk are clearly caused by bad hardware. A lot of the crashes under those signatures have been detected as having a bit-flip and come from older machines, nothing really new there. @ EdgePool::Iterator::operator* on the other hand looks different, and maybe could be caused by a real bug. A handful of crashes under that signature have been caused by bad memory, this one is a good example. However crashes like that are a minority, the vast majority of crashes under that signature are dereferencing a NULL pointer and not an address that looks like the result of a bit-flip.
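To make the "address that looks like the result of a bit-flip" idea concrete, here is one simple heuristic, written as a sketch. It is an assumption for illustration only and not the detection logic actually used on the crash reports; the function names DiffersByOneBit and IsOneBitFromNull are hypothetical. Two values are one bit-flip apart exactly when their XOR has a single bit set.

```cpp
#include <cassert>
#include <cstdint>

// True when exactly one bit differs between the two 64-bit values.
inline bool DiffersByOneBit(std::uint64_t a, std::uint64_t b) {
  const std::uint64_t diff = a ^ b;
  // A nonzero value with a single set bit has no bits in common with
  // itself minus one.
  return diff != 0 && (diff & (diff - 1)) == 0;
}

// True when the address is what a null pointer becomes after one bit flip.
inline bool IsOneBitFromNull(std::uint64_t addr) {
  return DiffersByOneBit(addr, 0);
}
```

A crashing address such as 0x10000 would pass IsOneBitFromNull, whereas a crash at address 0 itself would not, which matches the distinction drawn above between bit-flip crashes and plain NULL dereferences.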

The exact point of those crashes is here, and in particular the (mPointer + 1)->block expression is what's hitting the NULL pointer. Here's what's peculiar about this. *mPointer yields a NULL pointer, that causes the condition in this line to be satisfied and so we enter the block where we crash. However, as the comment in the block states, that condition means we've found a sentinel and the following element in the array should be non-NULL. I've opened several minidumps and in all of them I found that both *mPointer and *(mPointer + 1) yielded NULL pointers. That is, the array we're iterating over contains two adjacent NULL elements, even though the code suggests that this should never happen.
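The invariant described in this comment can be expressed as a small classifier over two adjacent words of the edge array. The sketch below uses hypothetical names (SlotState, Classify) and plain integers in place of the real pointer union; it only illustrates the three cases and is not code from the cycle collector.

```cpp
#include <cassert>
#include <cstdint>

enum class SlotState {
  kEdge,              // *p is non-null: an ordinary edge
  kSentinelWithLink,  // *p is null and *(p + 1) links to the next block
  kDoubleNull         // both are null: the state seen in the minidumps
};

// `p` must point at a word that has at least one readable word after it.
inline SlotState Classify(const std::uintptr_t* p) {
  if (p[0] != 0) return SlotState::kEdge;
  if (p[1] != 0) return SlotState::kSentinelWithLink;
  return SlotState::kDoubleNull;
}
```

Only the first two states are expected when an iterator is dereferenced; in the third, following the link means dereferencing a NULL block pointer, which is the crash described above.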

Unfortunately I don't know the cycle-collector well enough to be able to tell what's going on, but it doesn't seem accidental.

A few comments (as the original author of the iterator in 34606b4dd39f):

  • the EdgePool is basically storage for large blocks of edges in the graph that the cycle collector builds by calling its traversal methods. The PtrInfo (which are the nodes in the graph) store iterators for the first and last-plus-one outgoing edges, which are all stored adjacent to each other (logically, according to the iterators). But the edges are allocated in chunks, so the iterators sometimes need to jump from one chunk to the next.
  • They use the typical C++ iterator pattern where the "start" iterator points to the first item and the "end" iterator points to one past the last item.
  • I'm not sure why the operator* needs that code to look at the next block at all; in hindsight it feels like it should only be in the operator++. It seems like it should be invalid to dereference the one-past-the-end iterator. But if code actually depends on dereferencing the one-past-the-end iterator, then that could be the source of the problem, since (I think, though I didn't really reread the code carefully) that means such code would crash if the very last edge allocated in the graph exactly aligned with the end of a block, and we needed to dereference the iterator corresponding to that very last edge, since at that point I think the next block wouldn't be allocated at all.
  • (In theory, you could also reach a null-dereference crash as a result of memory corruption if the null sentinel itself were corrupted into non-null and then the traversal continued past the end of the array. However, such a crash seems likely to crash by null-dereference in only a minority of cases.)

On second thoughts, we probably need that dereferencing behavior because we sometimes use the end-of-block iterator as the start iterator of a new chunk. So never mind... I think.

I dug through a dozen crashes to check for the values that are immediately before the two NULLs, and I found two different patterns. One is this:

00 00 01 54 6d 38 1a 08 <--- looks like a valid pointer
00 00 01 54 8c ac 68 88 <--- looks like a valid pointer
00 00 01 54 73 52 a7 28 <--- looks like a valid pointer
00 00 01 54 8b 4f d4 28 <--- looks like a valid pointer
00 00 00 00 00 00 00 00 <--- mPointer points here
00 00 00 00 00 00 00 00
... all zeroes past this point

Or this:

00 00 02 18 71 2e e4 00 <--- looks like a valid pointer
00 00 00 00 00 00 00 02 <--- 0x2 constant?
00 00 02 18 7d 28 30 00 <--- again, what looks like a valid pointer
00 00 02 18 7d 28 30 28 <--- so does this
00 00 00 00 00 00 00 00 <--- mPointer points here
00 00 00 00 00 00 00 00
00 00 02 18 78 ba 88 00 <--- looks like a valid pointer again
00 00 00 00 00 00 00 00 <--- one more NULL

I found three crashes with each pattern, and I'm fairly confident that they come from different machines. The second pattern surprised me quite a bit, the pointer-0x2-pointer-pointer-null-null sequence doesn't appear like something that would occur accidentally.

The @ EdgePool::Iterator::operator* is interesting and - save for the odd crash clearly caused by bad hardware - seems to be unrelated to the others. I'll split it out in a separate bug for further investigation.
