Closed Bug 1011474 Opened 5 years ago Closed 5 years ago

crash in js::GCMarker::drainMarkStack(js::SliceBudget&)

Categories

(Core :: JavaScript: GC, defect, critical)

All
Android
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla33
Tracking Status
firefox31 --- unaffected
firefox32 + verified
firefox33 --- verified
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed
fennec 32+ ---

People

(Reporter: gcp, Assigned: terrence)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, topcrash-android-armv7)

Crash Data

Attachments

(3 files)

This bug was filed from the Socorro interface and is 
report bp-f2596b8a-6d49-4906-b0fd-628c42140516.
=============================================================

This signature exists for a while on desktop, but it's now a (as far as I can see, new) topcrasher on Android. Was reported by some friends that use Fennec as well, though without specific STR.
Component: General → JavaScript: GC
Product: Firefox for Android → Core
This is a top crash on Firefox for Android Nightly.
tracking-fennec: --- → ?
Flags: needinfo?(cpeterson)
Jon: do you know of any recent changes to incremental GC that might particularly affect Android?
Flags: needinfo?(cpeterson) → needinfo?(jcoppeard)
The crashes spiked with the build on 20140505 so I looked for things that landed before then.  But TBH nothing jumps out.  Most of out changes have been generational or exact rooting, I don't think we've done anything recently that affects the original conservative scanner / incremental GC.

Looking at the crash addresses is interesting because most of them are inside what would be the mark bitmap for a chunk:

~60% 0x5a5fea08 - this looks like trying to mark the address 0x5a5a5a00
~10% 0xfc0a0    - trying to mark a null pointer

0x5a5a5a5a is the pattern jemalloc uses to poison freed memory so it looks like use after free or incorrect tracing code in gecko somewhere.
Flags: needinfo?(jcoppeard)
tracking-fennec: ? → 32+
Flags: needinfo?(nihsanullah)
Keywords: regression
Terrence please find an owner for this
Flags: needinfo?(nihsanullah) → needinfo?(terrence)
There appears to be nothing further we can do here without STR.
Flags: needinfo?(terrence)
Keywords: steps-wanted
What can we do to make it more reproducible? Are there low-overhead logging mechanisms, or intermediate checks for poison values in the tracer that would move the crash closer to the actual bad code?
Blocks: GC.stability
Attempts to mark a small or invalid pointer directly will already assert when inserted into the mark stack in nightly builds. This indicates that the corruption is internal to the object and therefore likely mis-usage by the JSAPI consumer.

We can potentially also check for such corruption at insertion time, however, there are some extremely subtle issues, unfortunately, with blindly checking the internals of an object. I'm working on isolating a safe subset under bug 994253, but this still requires a substantial amount of work.
(In reply to Terrence Cole [:terrence] from comment #5)
> There appears to be nothing further we can do here without STR.

This crash has spiked in 32, can you have a look at what's landed in that time frame?
Flags: needinfo?(terrence)
loading http://www.bmwusa.com reliably crashes nightly with this or the scanshape signature.
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #8)
> (In reply to Terrence Cole [:terrence] from comment #5)
> > There appears to be nothing further we can do here without STR.
> 
> This crash has spiked in 32, can you have a look at what's landed in that
> time frame?

About 200 patches land each day. Even if someone could pay attention for long enough to read and understand the ~6000 checkins, the actual mistake is likely not something that's going to be reflected there.

(In reply to Kevin Brosnan [:kbrosnan] from comment #9)
> loading http://www.bmwusa.com reliably crashes nightly with this or the
> scanshape signature.

This, on the other hand, is quite actionable. I'll try to take a look as soon as I get back home, but I'm traveling until late tonight.
Flags: needinfo?(terrence)
I haven't been able to reproduce this on Android with the BMW site and latest nightly.
(In reply to Kevin Brosnan [:kbrosnan] from comment #9)
> loading http://www.bmwusa.com reliably crashes nightly with this or the
> scanshape signature.

!!!!!!

Kevin, is this still the case for you?

If so, **PLEASE** grab the recording HTTP proxy of your choice (http://mitmproxy.org/ is one -- it's a pain to get installed, though, and there are other choices) and record the site exactly as seen by the device.

If this holds up, it's almost guaranteed to turn into a bugfix, possibly a substantial stability win, in a component with lots of unreproducible crashes. Therefore shoring up the STR here is a huge deal for us.
Flags: needinfo?(kbrosnan)
Looks like they are detecting Firefox for Android as a mobile browser now. Previously I was crashing on their desktop site. Using the request desktop site or the Phony extension set to "Desktop Firefox" still crashes.
Flags: needinfo?(kbrosnan)
(In reply to Kevin Brosnan [:kbrosnan] from comment #13)
Thanks, I've reproduced this now.
Duplicate of this bug: 1021454
Bisection shows that this started with changeset accdf191ac4e, part of the fix for bug 969012.
Great work bisecting this, Jon!

The patch in question actually had quite a bit of trouble landing on ARM because it was running into a codegen bug. I rewrote it completely, but it seems maybe that just hid the underlying problem. For now let's just suppress all of the new codepaths on ARM so we can debug at leisure.

The attached patch passes all jsapi and jit tests in the ARM simulator and it should be safe and easy to uplift as needed, assuming, of course, that it actually works around the issue successfully.

I've put up a try build so that someone that can reproduce can test this out easily: https://tbpl.mozilla.org/?tree=Try&rev=8a4b6dc55317
Assignee: nobody → terrence
Status: NEW → ASSIGNED
Attachment #8437121 - Flags: review?(sphink)
Can confirm that the patch stops the crashing on bmwusa.com and loading several of the less consistent crashing urls obtained from crash stats.
Attachment #8437121 - Flags: review?(sphink) → review+
We still need to find the root cause.
Keywords: leave-open
ok, so this is almost certainly *not* the patch that we want, but after a bit of digging, the issue seems to be that on x86 and x64, CallTempReg0 is volatile, but on arm, it is not.  we probably just want to choose a new register that is volatile on all platforms.  Bonus points if you choose something that is either r0 or r1 on arm, since those are arg0 and arg1, and r0 is also the return value (it'll save a bunch of moves, if nothing else.)
Attachment #8437527 - Flags: review-
Attachment #8437527 - Flags: feedback?(terrence)
Attached patch bug1011474-fixSplinter Review
As discussed on IRC, use one of the volatile registers that we save rather than trashing CallTemp1.

I verified that this does fix the crash.
Attachment #8437692 - Flags: review?(mrosenberg)
We should try to uplift this crash fix to Aurora 32 because it will ship with B2G 2.0. AFAIK, B2G 1.4 is shipping Gecko 30, which is unaffected.
Comment on attachment 8437692 [details] [diff] [review]
bug1011474-fix

Review of attachment 8437692 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/CodeGenerator.cpp
@@ +5452,5 @@
>      masm.PushRegsInMask(regs);
>  
> +    const Register regTemp = regs.takeGeneral();
> +    const Register regRuntime = regTemp;
> +    regs.add(regTemp);

Sadly, I don't think there's a way of getting around this dance.
Attachment #8437692 - Flags: review?(mrosenberg) → review+
Comment on attachment 8437527 [details] [diff] [review]
ClobberinTime-r0.patch

Review of attachment 8437527 [details] [diff] [review]:
-----------------------------------------------------------------

Looks like Jon beat me to it. :-)
Attachment #8437527 - Flags: feedback?(terrence)
Comment on attachment 8437692 [details] [diff] [review]
bug1011474-fix

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 969012
User impact if declined: Increased crashes in b2g and fennec.
Testing completed (on m-c, etc.): m-i for a day, now on m-c.
Risk to taking this patch (and alternatives if risky): low.
String or IDL/UUID changes made by this patch: none.
Attachment #8437692 - Flags: approval-mozilla-aurora?
I got this crash while browsing the new Google Maps:

https://crash-stats.mozilla.com/report/index/0165bb83-0f79-4ae4-b768-c3b2a2140612
(In reply to Markus Popp from comment #30)
> I got this crash while browsing the new Google Maps:
> 
> https://crash-stats.mozilla.com/report/index/0165bb83-0f79-4ae4-b768-
> c3b2a2140612

Thanks for the crash report! The fix here is only relevant for ARM, so it looks like we have a similar crash affecting linux amd64 that we need to look into too.
(In reply to Terrence Cole [:terrence] from comment #31)
> Thanks for the crash report! The fix here is only relevant for ARM, so it
> looks like we have a similar crash affecting linux amd64 that we need to
> look into too.

You're welcome!

Firefox crashes quite frequently when accessing the new Google Maps. While I think these other crashes are probably unrelated to this one, I reported them as Bug 1022987. Unlike this crash (1011474) which occurred in the middle of browsing the new Google Maps, those of Bug 1022987 happen when first accessing Google Maps.
(In reply to Markus Popp from comment #32)
> (In reply to Terrence Cole [:terrence] from comment #31)
> > Thanks for the crash report! The fix here is only relevant for ARM, so it
> > looks like we have a similar crash affecting linux amd64 that we need to
> > look into too.
> 
> You're welcome!
> 
> Firefox crashes quite frequently when accessing the new Google Maps. While I
> think these other crashes are probably unrelated to this one, I reported
> them as Bug 1022987. Unlike this crash (1011474) which occurred in the
> middle of browsing the new Google Maps, those of Bug 1022987 happen when
> first accessing Google Maps.

You are correct, the crash in comment 30 is definitely different from the ones in bug 1022987. I've moved it to Core::Graphics, which should hopefully get some eyes on it. At a guess, the linux radeon driver is probably at fault there, but I'm not at all familiar with our webgl code though.
Comment on attachment 8437692 [details] [diff] [review]
bug1011474-fix

Aurora approval granted. Please verify that this fixes the crash on Aurora as well.
Attachment #8437692 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/7df8944f610b

Resolving per IRC discussion with Jon. Please file a new bug for any remaining crashes as it makes tracking much easier than piling multiple different issues into one bug :)
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
verified by checking crash stats, no more Android crashes on 33a1 or 32a2.
You need to log in before you can comment on or make changes to this bug.