Open Bug 500105 Opened 13 years ago Updated 2 years ago

Crash @ GraphWalker


(Core :: XPCOM, defect, P2)




Tracking Status
firefox42 --- affected
firefox43 --- affected
firefox44 --- affected
firefox45 --- affected
blocking2.0 --- -
status1.9.1 --- wanted
firefox-esr52 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox62 --- wontfix
firefox63 --- affected
firefox64 --- affected


(Reporter: samuel.sidler+old, Unassigned)




(Keywords: crash, Whiteboard: [crashkill][crashkill-debug][tbird crash])

Crash Data


(2 files, 3 obsolete files)

The current #7 topcrash occurs with a signature of GraphWalker::DoWalk(nsDeque&).

This crash occurs across platforms (Mac and Windows so far).

All crash signatures look like this one, taken from bp-a6c2a662-3402-487e-b4b7-a45442090623, sometimes ending on frame 0, sometimes with the GraphWalker::DoWalk line not repeated:

Frame  	Module  	Signature  	Source
0 	xul.dll 	GraphWalker::DoWalk(nsDeque&) 	xpcom/base/nsCycleCollector.cpp:1186
1 	xul.dll 	GraphWalker::DoWalk(nsDeque&) 	xpcom/base/nsCycleCollector.cpp:1182
2 	xul.dll 	GraphWalker::WalkFromRoots(GCGraph&) 	xpcom/base/nsCycleCollector.cpp:1170
3 	xul.dll 	nsCycleCollector::BeginCollection() 	xpcom/base/nsCycleCollector.cpp:2469 

Lars: Can you grab some URLs for this issue from Socorro?
Flags: wanted1.9.1.x+
Bug 500189 has URLs for Firefox 3.5, 3.5pre and 3.5b99 (in that order)
Haven't people learned by now? Porn kills (your computer). Seriously -- ask yourself. If some porn costs a lot of money or a subscription fee, why is some of it free? Is it perhaps because they have an alternate revenue stream: installing malware?
Note: This also happens on 1.9.0 (currently #12 overall).
Flags: wanted1.9.0.x+
See also bug 437449, another cycle collector topcrash.
Peterv: Can you take a look at the crash reporter stack above and see if there's any problem we can fix here?
Assignee: nobody → peterv
Flags: wanted1.9.1.x+
I got one as well, seemingly without interaction. (Internal URLs only, sorry.)
Hmmnnnn, my laptop seemed to have crashed on a specific link ({%22show_user_id%22%3A%22152617042%22})
This is the crashing thread I got:

Frame  	Module  	Signature  	Source
0 	xul.dll 	GraphWalker::DoWalk 	xpcom/base/nsCycleCollector.cpp:1186
1 	xul.dll 	GraphWalker::DoWalk 	xpcom/base/nsCycleCollector.cpp:1182
2 	xul.dll 	GraphWalker::WalkFromRoots 	xpcom/base/nsCycleCollector.cpp:1170
3 	xul.dll 	nsCycleCollector::BeginCollection 	xpcom/base/nsCycleCollector.cpp:2469

We have a Windows XP desktop that suffers from these crashes when running Firefox 3.5.2; we encounter the problem at least twice a day.
The only way we have been able to avoid this bug is to downgrade to the Firefox 3.0.xx branch.

Often this occurs as soon as a link is clicked, but we find it hard to reproduce; it seems to be a very unpredictable occurrence.

We may be having a related BSOD that crashes the entire system, which we never get while running Firefox 3.0.13 or earlier builds.
Johnny/Jonas: we need to figure this out before we ship Firefox 3.6
Flags: blocking1.9.2+
Priority: -- → P2
Attached patch v1 (obsolete) — Splinter Review
I'm not completely sure this is what causes the crash, but it seems like it potentially could. The iterator always increments mPointer, even after it just jumped to the start of a new block. I think one potential crash is that if mLastChild is set to the first pointer of the block, we could end up reading uninitialized memory because the iterator wouldn't stop since it would skip over the first pointer when doing operator++ (we do |child = pi->mFirstChild, child_end = pi->mLastChild; child != child_end; ++child|). When writing we don't use an iterator, so we do write to the first pointer of the block.
Attachment #403974 - Flags: review?(dbaron)
Comment on attachment 403974 [details] [diff] [review]

I think the code looks correct to me as it is now; the idea here is that iterators never point to the first pointer in a block; instead they point to the null sentinel at the end of the previous block (or in the pool itself) and dereferencing an iterator pointing to the sentinel (see operator*) returns the first pointer in that next block.  I think this simplified things in other ways, e.g., by allowing us to create a valid iterator for the position after the end of a block before we've created the next block.  I certainly should have documented that better, though.
Attachment #403974 - Flags: review?(dbaron) → review-
Keywords: topcrash
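To make the iterator convention described in the review comment above easier to follow, here is a minimal sketch of a block-chained pool whose iterators rest on the null sentinel rather than on the first pointer of the next block. It is not the code in nsCycleCollector.cpp: the names `EdgePoolSketch`, `Slot`, `Block`, `mark` and `add`, and the tiny block size, are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node { int id; };  // hypothetical stand-in for PtrInfo

struct Block;

union Slot {
    Node*  node;  // payload; nullptr marks the sentinel slot
    Block* next;  // only meaningful in the slot right after a sentinel
};

struct Block {
    static const std::size_t kSize = 4;  // 2 payload slots + sentinel + link
    Slot slots[kSize];
    Block() {
        for (std::size_t i = 0; i + 1 < kSize; ++i) slots[i].node = nullptr;
        slots[kSize - 1].next = nullptr;
    }
    Slot* sentinel() { return &slots[kSize - 2]; }
};

class EdgePoolSketch {
public:
    class Iterator {
    public:
        explicit Iterator(Slot* p) : p_(p) {}
        // On a sentinel, dereferencing yields the first payload of the next block.
        Node* operator*() const {
            if (p_->node == nullptr) return (p_ + 1)->next->slots[0].node;
            return p_->node;
        }
        // On a sentinel, hop to slot 0 of the next block, then step past it.
        // This is why the iterator never rests on the first slot of a block.
        Iterator& operator++() {
            if (p_->node == nullptr) p_ = (p_ + 1)->next->slots;
            ++p_;
            return *this;
        }
        bool operator!=(const Iterator& o) const { return p_ != o.p_; }
    private:
        Slot* p_;
    };

    EdgePoolSketch() : write_(&head_[0]), writeEnd_(&head_[0]) {
        head_[0].node = nullptr;  // sentinel that lives in the pool itself
        head_[1].next = nullptr;  // link to the first block, filled in lazily
    }
    EdgePoolSketch(const EdgePoolSketch&) = delete;
    ~EdgePoolSketch() { for (Block* b : blocks_) delete b; }

    // Position of the next write; valid even before the next block exists.
    Iterator mark() const { return Iterator(write_); }

    // The writer does not use an iterator and does fill slot 0 of each block.
    void add(Node* n) {
        assert(n != nullptr);  // nullptr is reserved for the sentinel
        if (write_ == writeEnd_) {
            Block* b = new Block();
            (write_ + 1)->next = b;  // link stored right after the sentinel
            blocks_.push_back(b);
            write_ = b->slots;
            writeEnd_ = b->sentinel();
        }
        (write_++)->node = n;
    }

private:
    Slot head_[2];
    Slot* write_;
    Slot* writeEnd_;
    std::vector<Block*> blocks_;
};
```

In this sketch a range is walked as `for (auto it = first; it != last; ++it)`, where `first` and `last` come from `mark()` before and after the writes. A `mark()` taken when a block is exactly full sits on the sentinel and only becomes dereferenceable once a later `add` allocates the next block; dereferencing or incrementing it before then would follow a null link.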
Attached patch Add some debugging help (obsolete) — Splinter Review
This adds a number of aborts when certain conditions fail (pointers outside of blocks, null pointers where we didn't expect it, ...). I think we should try to land this on trunk to get some more data out of crash reports. I'm also still looking into adding more stuff on the stack, so we can get more out of minidumps.
Attachment #403974 - Attachment is obsolete: true
Attached patch Add some debugging help (obsolete) — Splinter Review
I'd like to land this on trunk (only), but only until we have a couple of crash reports and minidumps. The aborts will probably move the crash to a different spot, but that should give us slightly more data to go on.
Attachment #404970 - Attachment is obsolete: true
Attachment #406482 - Flags: review?(dbaron)
Whiteboard: [crashkill]
How is this going to help? NS_ABORT_IF_FALSE is DEBUG-only.  Don't you want runtime aborts?
Other than that, this looks fine, though.  Hopefully it won't be a performance hit.  Sorry for the delay in getting to it...
Grmbl, I misread nsDebug.h, I'll switch to NS_RUNTIMEABORT.
As for performance, I ran this through tryserver. Most of the numbers didn't really change, shutdown numbers changed a bit but some were down, so not sure how much I need to care about the ones that went up.
OK, r=dbaron with NS_RUNTIMEABORT (you need to write your own if-statements with that).
I actually went with a CC_RUNTIME_ABORT_IF_FALSE. I'll run this through tryserver again.
Attachment #406482 - Attachment is obsolete: true
Attachment #408038 - Flags: review?(dbaron)
Attachment #406482 - Flags: review?(dbaron)
Whiteboard: [crashkill] → [crashkill] ready to land debugging help code?
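The distinction raised in the review above (a check that exists only in debug builds versus one that also fires in release builds) can be sketched as follows. This is an illustration under stated assumptions, not the patch: `SKETCH_RUNTIME_ABORT_IF_FALSE`, `gFailHandler` and `ReadChecked` are hypothetical names, and the standard `assert` stands in for a debug-only macro because it is compiled out when `NDEBUG` is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical failure hook. It aborts by default; it is a function pointer
// only so the sketch can be exercised without killing the process.
using FailHandler = void (*)(const char* msg);

inline void DefaultFail(const char* msg) {
    std::fprintf(stderr, "runtime abort: %s\n", msg);
    std::abort();
}

static FailHandler gFailHandler = DefaultFail;

// Always compiled in, so release builds also stop at the failed check.
#define SKETCH_RUNTIME_ABORT_IF_FALSE(cond, msg) \
    do {                                         \
        if (!(cond)) gFailHandler(msg);          \
    } while (0)

// Returns data[index], checking the index in every build type.
int ReadChecked(const int* data, std::size_t len, std::size_t index) {
    assert(data != nullptr);  // debug-only: vanishes when NDEBUG is defined
    SKETCH_RUNTIME_ABORT_IF_FALSE(index < len, "index outside of block");
    // The fallback only matters if a non-aborting handler was installed.
    return index < len ? data[index] : 0;
}
```

With the default handler, a failed runtime check ends the process at the check itself rather than at some later bad read, which is the effect the debugging patch is after: moving the crash to a spot that says more about what went wrong.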
Debugging help landed (in two pieces):

I think the second piece missed today's nightly.
Attachment #408038 - Flags: review?(dbaron) → review+
Since landing this on trunk there have been no new reports submitted on this crash, so I don't have any data yet from the logging patches.
Are you sure they wouldn't show up under a different signature?
I check for signatures GraphWalker::DoWalk(nsDeque&), EdgePool::CheckIterator(Iterator&) and NodePool::CheckPtrInfo(PtrInfo*). I think that should catch them.
What's next here?
Comment on attachment 408038 [details] [diff] [review]
Add some debugging help

On the beta the frequency of this crash is slightly higher (though not very high either). If we have a quick new beta I'd like to take this on the branch so it rides along, and we actually get some data.
Attachment #408038 - Flags: approval1.9.2?
Peter: This bug is blocking1.9.2. You don't need approval to land that. :)

(We're also planning to update current beta users to a new beta next week, iirc.)
Whiteboard: [crashkill] ready to land debugging help code? → [crashkill][crashkill-fix] ready to land debugging help code?
Whiteboard: [crashkill][crashkill-fix] ready to land debugging help code? → [crashkill][crashkill-debug] ready to land debugging help code?
Comment on attachment 408038 [details] [diff] [review]
Add some debugging help

I asked for approval because this isn't really a fix. But anyway, landed on 1.9.2:

Let's hope we get some reports.
Attachment #408038 - Flags: approval1.9.2?
Whiteboard: [crashkill][crashkill-debug] ready to land debugging help code? → [crashkill][crashkill-debug][debugging code landed on trunk and 1.9.2]
No new crash reports on trunk or 3.6b2pre :-(.
David, wouldn't those be for bug 437449? That one seemed related to thread-safety issues?
No, the MarkRoots crash has an almost identical statistical profile (core count distribution, module correlations) to this one, and I've been presuming it's the same underlying problem as this one.  It's definitely not a threadsafety problem.
Well, MarkRoots doesn't have any of the debugging code.
Two reports on 3.6b2:

I have the minidumps, the debugging code doesn't seem to have helped much. Back to trying to figure things out from the assembly.
This bug causes frequent intermittent crashes on our Windows XP (SP3) box. No single action precipitates a crash; it appears to be completely random.
We seem to sometimes have a bogus pointer to the next block. At first I thought we might have a bogus mFirstChild/mLastChild, so we'd walk randomly in the blocks and mistake a null PtrInfo* for the sentinel. But one of the crashes seems to be in the debug code I added, when walking the blocks. We walk the blocks using the blocksize, so in that case we just seem to have a bogus pointer at the right spot (last item in the array). I've looked at the block code again, don't see how it could happen. Maybe something else is corrupting our blocks' memory.
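As a rough illustration of the kind of check discussed here (walking the chain of fixed-size blocks and noticing a bad link in the last slot), the sketch below validates each link against the set of blocks that were actually allocated before following it. The names `Chunk`, `FindSlot` and `WalkResult` and the small block size are assumptions for the example; the real debugging code is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>

struct Chunk {
    static const std::size_t kSlots = 4;
    void* slots[kSlots];  // the last slot holds the link to the next chunk
    Chunk() { for (auto& s : slots) s = nullptr; }
    Chunk* next() const { return static_cast<Chunk*>(slots[kSlots - 1]); }
    void setNext(Chunk* c) { slots[kSlots - 1] = c; }
};

enum class WalkResult { Found, NotFound, BogusLink };

// True if p points into c's slot array.
inline bool InChunk(const Chunk* c, const void* p) {
    std::less<const void*> lt;  // total order, even for unrelated pointers
    const void* lo = c->slots;
    const void* hi = c->slots + Chunk::kSlots;  // one past the end
    return !lt(p, lo) && lt(p, hi);
}

// Walks the chain from head looking for the chunk that contains p.
// `known` is the set of chunks the allocator handed out; a link that is not
// in it is reported instead of being dereferenced.
WalkResult FindSlot(const Chunk* head, const std::set<const Chunk*>& known,
                    const void* p) {
    assert(p != nullptr);
    std::size_t steps = 0;
    for (const Chunk* c = head; c != nullptr; c = c->next()) {
        if (known.count(c) == 0) return WalkResult::BogusLink;
        if (++steps > known.size()) return WalkResult::BogusLink;  // cycle
        if (InChunk(c, p)) return WalkResult::Found;
    }
    return WalkResult::NotFound;
}
```

Keeping a side table of allocated blocks costs memory and time, so a check like this would only make sense as temporary diagnostics; it also cannot say who overwrote the link, only that the link no longer matches anything the pool allocated.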
(In reply to comment #34)
> This bug causes frequent intermittent crashes on our Windows XP (SP3) box.

How frequent? Any chance we could get you to generate a full dump when it crashes (I think we can use DrWatson for that)?
blocking2.0: --- → alpha1
Flags: blocking1.9.2+ → blocking1.9.2-
(In reply to comment #18)
> OK, r=dbaron with NS_RUNTIMEABORT (you need to write your own if-statements
> with that).

The NS_RUNTIMEABORT comments are scary, they sound to me like we wouldn't be able to trigger Breakpad on all platforms with that. Is that really what you want?
I think we should back out the debugging code on m-c and 1.9.2.
I agree about 1.9.2 ( pushed earlier this week), but I don't see why we want to remove it from m-c yet. I think we should make sure it brings up breakpad, and see if anything shows up on crash-stats then.
I'd like to take this on trunk right now (unless we get a patch for bug 532490 in the meantime), and see if anything shows up in crash-stats.
Attachment #415830 - Flags: review?(dbaron)
Whiteboard: [crashkill][crashkill-debug][debugging code landed on trunk and 1.9.2] → [crashkill][crashkill-debug][debugging code landed on trunk]
Attachment #415830 - Flags: review?(dbaron) → review+
Not blocking the first alpha on this bug.
blocking2.0: alpha1 → beta1
I got this reported yesterday using Mozilla/5.0 (X11; U; Linux i686; rv: Gecko/20100208 SeaMonkey/2.0.4pre but when it happened again (I presume the same) today the reporter apparently ignored it. Today's current URL was  and I was attempting to return to when the crash occurred. As I was attempting to count the number of tabs open (after restore/restart; 23 counted, 2 left to count, total 25) so as to proceed with this comment, it crashed again, and again the reporter failed to come up. SM was started from Konsole, and this is that window's resulting output:
The program 'seamonkey-bin' received an X Window System error.
This probably reflects a bug in the program.
The error was 'RenderBadPicture (invalid Picture parameter)'.
  (Details: serial 1998858 error_code 182 request_code 155 minor_code 5)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the --sync command line
   option to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)
In 8 or so hours since comment 43 it crashed again, and again. Then the machine locked up and would not reboot into Linux. Main memory failed in a big way according to Memtest86+ 4.0.
Blocks: 557161
Moving this to beta2.  Not seeing a lot of movement here, but yell if you think this should block the first beta.
blocking2.0: beta1+ → beta2+
Is the debugging code talked about in comments 39 - 41 still on mozilla-central? That means it would be going out in beta. We should figure out if that's a good idea even if we don't have a good understanding of the cause of the crash or the fix yet.
Moving this to beta3, where it will block hard at least on ensuring that the debugging code has been removed - not sure where it lands as a blocker for the fix.
blocking2.0: beta2+ → beta3+
Has the debugging code been removed? Can we get an answer to comment 46, please?
The debugging code is still present in mozilla-central (and thus still needs to be removed).
PeterV: can we get that debugging code removed by Monday, Aug 2 at 23:00 PT please, so we can bump this back off the blocking list as per comment 47?
I just backed this out.
Peter's backout is: - thanks!

Moving back to blocking2.0:? for retriage on the crash issue.
blocking2.0: beta3+ → ?
Duplicate of this bug: 569688
Summary: top crash [@ GraphWalker::DoWalk(nsDeque&)] → top crash [@ GraphWalker::DoWalk(nsDeque&)][@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)]
About 1500 crashes per day. Current volumes per release look like this:

checking --- GraphWalker::DoWalk.nsDeque.. 20101003-crashdata.csv
found in: 3.6.10 3.5.13 3.6.8 3.0.19 3.6.3 3.6.6 3.6 3.6.9 3.6.4 3.5b4 3.5.7 3.6b5 3.6.2 3.5.5 3.5.11 3.5.3 3.1b2 3.5.9 3.5.2 3.0b2 3.6b1 3.6.7 3.5.6 3.5.10 3.5 3.0b5 3.0.5 3.0.17 3.0.10 3.5.8 3.5.12 3.5.1 3.1b3 3.0.9 3.0.6 3.0.18 3.0.15 3.0.14 3.0
release  total-crashes  GraphWalker::DoWalk.nsDeque.. crashes  fraction
all      353258         1213                                   0.00343375
3.6.10   211184         822                                    0.00389234
3.5.13   17578          112                                    0.0063716
3.6.8    19390          65                                     0.00335224

checking --- GraphWalker.scanVisitor.::DoWalk.nsDeque.. 20101003-crashdata.csv
found in: 4.0b6 4.0b2 4.0b4 4.0b7pre 4.0b1 4.0b5 4.0b3 3.7a1
release   total-crashes  GraphWalker.scanVisitor.::DoWalk.nsDeque.. crashes  fraction
all       353258         127                                                 0.000359511
4.0b6     24891          87                                                  0.00349524
4.0b2     1209           12                                                  0.00992556
4.0b4     1853           10                                                  0.00539665
4.0b7pre  2328           5                                                   0.00214777
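The last number on each row of the tables above appears to be the signature's crash count divided by that row's total crash count (for example 1213 / 353258 for the "all" row). A minimal sketch of that arithmetic, with `CrashShare` as a hypothetical helper name:

```cpp
#include <cassert>

// Fraction of a release's crash reports that carry a given signature.
// Both counts come from the same row of the daily crash data.
double CrashShare(long signatureCrashes, long totalCrashes) {
    assert(totalCrashes > 0);
    return static_cast<double>(signatureCrashes) /
           static_cast<double>(totalCrashes);
}
```

Read this way, the per-release rows are comparable to each other even though the releases have very different user counts: 3.6.10 and 4.0b6 both sit near 0.35-0.39% of their own crash volume.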
Not a serious regression, and without clues as how to reproduce probably not a blocker. I'd love to have more information, though. Correlation reports would be especially helpful.
blocking2.0: ? → -
Duplicate of this bug: 606820
So is this saying that the bug still exists from all the way back to 2009-06-23 18:57:38 PDT? That it still hasn't been fixed? If so is there an ETA of a fix?
Nobody knows how to cause the crash to happen, and as a result no developer has been able to observe the crash happening and figure out why.
Thanks, at least I have an answer to the question.
It is #9 top crasher in 4.0b8 for the last week.
Keywords: crash
Still #9 in 4.0b9.

GraphWalker<scanVisitor>::DoWalk(nsDeque&)|EXCEPTION_ACCESS_VIOLATION_READ (85 crashes)
     18% (15/85) vs.   6% (805/14431) {AB2CE124-6272-4b12-94A9-7303C7397BD1} (Skype)
     26% (22/85) vs.  14% (2016/14431) {d10d0bf8-f5b5-c8b4-a8b2-2b9879e08c5d} (Adblock Plus,
     25% (21/85) vs.  16% (2373/14431)
     13% (11/85) vs.   7% (1020/14431) {CAFEEFAC-0016-0000-0022-ABCDEFFEDCBA} (Java console)
     92% (78/85) vs.  87% (12517/14431) (Mozilla Labs - Test Pilot,
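Assuming each correlation line above compares the share of crashes with this signature that had a given add-on against the share of all crashes that had it, the numbers can be reproduced with a few lines of arithmetic. `Correlate` and the `lift` field are hypothetical names introduced for this sketch.

```cpp
#include <cassert>
#include <cmath>

struct Correlation {
    double inSignature;  // share of this signature's crashes with the add-on
    double overall;      // share of all crashes with the add-on
    double lift;         // inSignature / overall; > 1 means over-represented
};

Correlation Correlate(long sigWith, long sigTotal, long allWith, long allTotal) {
    assert(sigTotal > 0 && allTotal > 0 && allWith > 0);
    Correlation c;
    c.inSignature = static_cast<double>(sigWith) / sigTotal;
    c.overall = static_cast<double>(allWith) / allTotal;
    c.lift = c.inSignature / c.overall;
    return c;
}

// Rounds a share to a whole percentage, as the report lines do.
long RoundPercent(double share) { return std::lround(share * 100.0); }
```

For the first line, `Correlate(15, 85, 805, 14431)` gives roughly 18% versus 6%, a lift of about 3. With only 85 crashes in the sample, though, a difference like that is suggestive at best and does not show that the add-on causes the crash.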
Minefield just crashed on me while I was away from my PC, the crash report pointed me here.
Doesn't seem to be associated with startup (only about 10% of crashes are within the first 3 minutes of startup).

Also no pattern in URLs. Looks like just general browsing.

domains of sites
 105 \N//
   6 about:blank//

I notice about 50% of reports might have unversioned .dll's around FFExternalAlert.dll zipfldra.dll UnlockerHook.dll MGKBHook.dll FFExternalAlert.dll AlotXpcom.dll BRNstFF.dll Iminent.XPCOM.dll GrabXpcom.dll GrabKernel.dll UnlockerHook.dll ActWndHk.dll frozen.dll googletoolbar-ff3.dll googletoolbar-ff3.dll frozen.dll GrabKernel.dll GrabXpcom.dll GrabXpcom.dll GrabKernel.dll googletoolbar-ff3.dll frozen.dll BTKeyInd.dll RadioWMPCore.dll FFExternalAlert.dll newdll.dll UnlockerHook.dll dll.dll UKHook40.dll lpxpcom.dll googletoolbar-ff3.dll frozen.dll UKHook40.dll FFExternalAlert.dll
It starts showing up as #4 top crasher in 4.0 RC1.

Some comments say:
"I was on the Addons page, and had clicked to go to top rated personas when it crashed."
"was downloading some stuff and got booted off the internet"
"Just looking around on Amazon"
#10 on 5.0b3 right now, FWIW.
(In reply to comment #65)
> #10 on 5.0b3 right now, FWIW.
And #3 top crasher without hangs.
Crash Signature: [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)]
Duplicate of this bug: 682598
Crash Signature: [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] → [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] [@ GraphWalker<ScanBlackVisitor>::DoWalk(nsDeque&) ]
So there's one way to repro: run mochitest-other 50K times (or however many pushes we've actually triggered builds for), and you'll hit it once.
Still appears consistently in the top 20 crashes for releases. Can we investigate this further?
This has been a top crash for a long time. The stack that's consistently high is GraphWalker<scanVisitor>::DoWalk(nsDeque&). We have just over 3500 on 10.0 in the past week. It's not a startup crash.

Is there anything we can do to investigate this further?
About half of them are null-derefs.  Maybe we can add some release-mode assertions to push the crash to an earlier point where it would be more useful.  I can take a look at that after I finish with the NoteXPCOMChild crashes.
Whiteboard: [crashkill][crashkill-debug][debugging code landed on trunk] → [crashkill][crashkill-debug]
Assignee: peterv → continuation
mccr8, that would be awesome.
Depends on: 727604
WalkFromRoots is a similar signature that has shown up recently.  Probably the same thing, just showing up differently in the crash reports due to different inlining.
Crash Signature: [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] [@ GraphWalker<ScanBlackVisitor>::DoWalk(nsDeque&) ] → [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] [@ GraphWalker<ScanBlackVisitor>::DoWalk(nsDeque&) ] [@ GraphWalker<scanVisitor>::WalkFromRoots(GCGraph&)]
Crash Signature: [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] [@ GraphWalker<ScanBlackVisitor>::DoWalk(nsDeque&) ] [@ GraphWalker<scanVisitor>::WalkFromRoots(GCGraph&)] → GraphWalker<ScanBlackVisitor>::Walk] [@ GraphWalker<scanVisitor>::WalkFromRoots(GCGraph&)] [@ GraphWalker<scanVisitor>::WalkFromRoots] [@ GraphWalker::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] [@ GraphWalker<scanVisitor>::DoWalk…
Summary: top crash [@ GraphWalker::DoWalk(nsDeque&)][@ GraphWalker<scanVisitor>::DoWalk(nsDeque&)] → Crash @ GraphWalker
Still a topcrash - mccr8, did you get somewhere with what you mentioned in comment #73?
I landed some assertions and un-inlining, that are currently on Nightly and Aurora.  No progress in figuring out what the problem is.  I don't know if I should back out the changes or not.  I don't think it will affect performance to any measurable extent, but I could check.

It isn't that common on Nightly.  If you add the two GraphWalker signatures up on Nightly, you get a ranking of around 65.  On Aurora, around 45.  On beta, it shows up at 16.  In release 11 it is at 12.  I'm not sure why there is such a large difference.  I've noticed it before.  It could be malware/junkware related, or perhaps our cycle collector optimizations, which make the CC touch fewer things in memory, just avoid touching bad things, so it isn't showing up here.
Still in the top 20 for crashes on release, Fx12.
This has gone way down in volume on all channels. Still a valid crash but removing the top crash keyword.
Keywords: topcrash
Currently this is around #90 on 16, #80 on 17. Either the move to a new compiler fixed a compiler bug, or with our CC optimizations we're touching bad memory less.
Version: 1.9.1 Branch → Trunk
top 50 crash for TB17
Whiteboard: [crashkill][crashkill-debug] → [crashkill][crashkill-debug][tbird crash]
Currently about #288 on Nightly.
Assignee: continuation → nobody
Different crashes:
Accounts for 1824 crashes in the last 7 days. Adding Thunderbird to the mix raises the number to 1860.

Currently placed as #104 for 38.0.5 for GraphWalker<T>::DoWalk(nsDeque&).
Top-crashes, however, only counts 948 of these, meaning about half of them are in other versions of Firefox.

Using the search numbers would place it in top 50 for Firefox top-crashes.
Crash Signature: GraphWalker<ScanBlackVisitor>::Walk] [@ GraphWalker<scanVisitor>::WalkFromRoots(GCGraph&)] [@ GraphWalker<scanVisitor>::WalkFromRoots] → GraphWalker<ScanBlackVisitor>::Walk] [@ GraphWalker<scanVisitor>::WalkFromRoots(GCGraph&)] [@ GraphWalker<scanVisitor>::WalkFromRoots] [@ GraphWalker::DoWalk] [@ GraphWalker<T>::DoWalk] [@ GraphWalker<T>::Walk] [@ GraphWalker<T>::WalkFromRoots]
Flags: needinfo?(norikachi003)
The bug is back -- 9 (!) years later:

I was not doing anything, I was away from the browser. The page with YouTube's subscriptions just died on its own.
Duplicate of this bug: 1490016
I was not doing anything, I was away from the browser. Maybe the ads script on the site:

(In reply to Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal) from comment #95)

This is now a fairly high volume crash on release 62, for example, for
GraphWalker<T>::DoWalk there are 2400+ crashes in the last week:

Current rate is about 1,400 per week for Firefox.

TCW's Thunderbird crash bp-c8782125-241c-489a-81d9-7b91c0200712
