Closed Bug 984361 Opened 6 years ago Closed 11 months ago

crash [@ PushMarkStack]

Categories

(Core :: JavaScript: GC, defect, critical)

30 Branch
All
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox31 --- affected

People

(Reporter: tracy, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-290b0866-3032-408f-be89-59fcf2140316.
=============================================================

This long standing signature part of bug 719114 had a significant spike in volume this past weekend. Lifting it to the #1 topcrasher on Nightly (Fx30). 

regression range is between nightly builds of 20140314030202 and 20140315030204: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f073b3d6db1f&tochange=82c90c17fc95
I only did a "find in page" for "GC" on the regression range from comment #0, and bug 982847 is the only one mentioning GC except two that seem to be about the not-yet-activated GGC.

Andrew, could this be related?
Flags: needinfo?(continuation)
Hmm.  My bug shouldn't affect behavior under normal circumstances, so I don't think it is the problem.

What are the GGC-related bugs?  At least some of those do affect non-GGC behavior.
Flags: needinfo?(continuation)
They're both commits related to bug 979913.
Flags: needinfo?(tzimmermann)
PushMarkStack has normally been a stack-exhaustion crash, but not in this case. Interestingly to me, most (99%) of the crashes here have a common crash address, 0xfc0b0.

https://crash-stats.mozilla.com/report/index/ea31c72d-b5ca-4e6b-847d-2005d2140315 has a better stack than the report in comment 0

dmajor can you check whether we're dereferencing "thing" or "gcmarker" at 
http://hg.mozilla.org/mozilla-central/annotate/82c90c17fc95/js/src/gc/Marking.cpp#l929 ?
Flags: needinfo?(dmajor)
One of the patches in bug 979913 fixes a missing include statement, another one removes an unused variable. Both are probably not related.

The only serious fix [1] is a change to the stack memory of the NUWA thread. But this only concerns b2g. According to the reviewer the code shouldn't even run on Windows NT.

[1] https://bug979913.bugzilla.mozilla.org/attachment.cgi?id=8386159
Flags: needinfo?(tzimmermann)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #4)
> PushMarkStack has normally been a stack-exhaustion crash, but not in this
> case. Interestingly to me, most (99%) of the crashes here have a common
> crash address, 0xfc0b0.
> 
> https://crash-stats.mozilla.com/report/index/ea31c72d-b5ca-4e6b-847d-
> 2005d2140315 has a better stack than the report in comment 0
> 
> dmajor can you check whether we're dereferencing "thing" or "gcmarker" at 
> http://hg.mozilla.org/mozilla-central/annotate/82c90c17fc95/js/src/gc/
> Marking.cpp#l929 ?

Agreed that this particular one doesn't look like stack exhaustion.

This appears to be bit magic for thing->chunk()->bitmap.markIfUnmarked:

000007fe`de1d4861 4981e0b0c0ffff  and     r8,0FFFFFFFFFFFFC0B0h
000007fe`de1d4868 80e13f          and     cl,3Fh
000007fe`de1d486b 4981c8b0c00f00  or      r8,0FC0B0h
000007fe`de1d4872 48d3e2          shl     rdx,cl
000007fe`de1d4875 8bcf            mov     ecx,edi
000007fe`de1d4877 48c1e906        shr     rcx,6
000007fe`de1d487b 498b04c8        mov     rax,qword ptr [r8+rcx*8]

rax=0000000022ab02b0 rbx=00000000068c73c8 rcx=0000000000000000
rdx=0000000000000001 rsi=000000002f20d240 rdi=0000000000000000
rip=000007fede1d487b rsp=000000000027eee0 rbp=00000000068c7000
 r8=00000000000fc0b0  r9=0000000000000001 r10=0000000000000000
r11=0000000000000000 r12=00000000068c73c8 r13=0000000000000000
r14=000000000027efb0 r15=00007fffffffffff
Is this showing up on any other channels, or just Nightly?
(In reply to Andrew McCreight [:mccr8] from comment #8)
> Is this showing up on any other channels, or just Nightly?

The spike is Nightly only.
Flags: needinfo?(dmajor)
The most likely regressor is bug 969012, which was backed out yesterday. Tomorrow's nightly should contain the backout: will be interesting to see what happens to the crash stats then.
This has dropped off significantly with 1408 crashes in the last week (down 15.05%) but still remains our #1 topcrash in Firefox 31 (~15.17% of all browser crashes).
It seems pretty clear that if there is a nullptr in the mark stack, that means someone called mark on a nullptr. Let's add a check for this condition in crash-diagnostics builds (e.g. nightly), print something useful if it happens, and get a more useful stack too.
Attachment #8420296 - Flags: review?(sphink)
Comment on attachment 8420296 [details] [diff] [review]
diagnostic_crash_on_null_mark-v0.diff

Review of attachment 8420296 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Marking.cpp
@@ +152,4 @@
>  {
> +    JS_ASSERT(trc);
> +    JS_ASSERT(thingp);
> +    JS_ASSERT(*thingp);

I'd move the JS_ASSERT(*thingp) down below the JS_CRASH_DIAGNOSTICS block so we get the same output in opt vs debug when JS_CRASH_DIAGNOSTICS is set.
Attachment #8420296 - Flags: review?(sphink) → review+
My try push froze for 1h38m before I killed it, so no try run. Jit-tests pass though and it's a simple enough patch.

https://hg.mozilla.org/integration/mozilla-inbound/rev/472087bf4626
I looked at a few other random GC crashes after pushing the previous patch. It looks like some of the crashes may be coming from small-but-not-null pointers. This patch enhances the new diagnostic to check for any small or tagged pointer, rather than just 0.
Attachment #8420373 - Flags: review?(sphink)
Shell linking as part of the browser build appears to not find our new symbol. Will debug out-of-line:
https://hg.mozilla.org/integration/mozilla-inbound/rev/824cb0827493
Comment on attachment 8420373 [details] [diff] [review]
enhance_null_marking_diagnostic-v0.diff

Review of attachment 8420373 [details] [diff] [review]:
-----------------------------------------------------------------

r+ from IRC; merged and landed with previous patch.
Attachment #8420373 - Flags: review?(sphink) → review+
So, this patch has apparently caused a Talos regression (see bug 1009260). Any idea what can have happened? PGO fluke?
Flags: needinfo?(terrence)
No, I added an extra branch to the central Mark function for crash-diagnostics builds. Some regression was expected, but since it is crash-diagnostics only, it will not show up once we move out of nightly.
Flags: needinfo?(terrence)
Warnings fix for opt, non-crash-diagnostics builds:
https://hg.mozilla.org/integration/mozilla-inbound/rev/40bc628f60c4
Current rank for this signature is #61 for Firefox 31. I think we can drop it as a topcrash.
Keywords: topcrash-win
The leave-open keyword is there and there is no activity for 6 months.
:jonco, maybe it's time to close this bug?
Flags: needinfo?(jcoppeard)
Volume has dropped to nearly nothing, but I think this is just a signature shift.  We already have several bugs on file for these bad-pointer-during marking bugs so we can close this though.
Status: NEW → RESOLVED
Closed: 11 months ago
Flags: needinfo?(jcoppeard)
Resolution: --- → INCOMPLETE
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.