Open Bug 1541905 Opened 6 months ago Updated 21 days ago

Crash in [@ js::gc::StoreBuffer::MonoTypeBuffer<T>::trace]

Categories

(Core :: JavaScript Engine, defect, critical)

Unspecified
Android
defect
Not set
critical

Tracking

()

Tracking Status
firefox-esr60 --- wontfix
firefox66 --- wontfix
firefox67 - wontfix
firefox68 --- wontfix
firefox69 --- wontfix

People

(Reporter: marcia, Unassigned, NeedInfo)

Details

(Keywords: crash, regression)

Crash Data

This bug is for crash report bp-495d13ca-cb4b-4b08-a93a-20fc60190401.

Seen while looking at Fennec release crash stats: https://bit.ly/2ONwRXP. #25 currently on 66.0.2 release and no bug was associated with the crash.

Some correlations:

  • (97.29% in signature vs 05.49% overall) CPU Info = ARMv7 ARM part(0x4100c070) features: swp,half,thumb,fastmult,vfpv2,edsp,thumbee,neon,vfpv3,tls,vfpv4,idiva,idivt
  • (95.66% in signature vs 03.13% overall) adapter_device_id = PowerVR SGX 544MP [99.44% vs 49.80% if adapter_vendor_id = Imagination Technologies]
Top 10 frames of crashing thread:

0 libxul.so js::gc::StoreBuffer::MonoTypeBuffer<js::gc::StoreBuffer::ValueEdge>::trace js/public/Value.h:541
1 libxul.so js::Nursery::collect js/src/gc/StoreBuffer.h:476
2 libxul.so js::gc::GCRuntime::minorGC js/src/gc/GC.cpp:7787
3 libxul.so mozilla::CycleCollectedJSContext::IsIdleGCTaskNeeded xpcom/base/CycleCollectedJSRuntime.h:241
4 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1179
5 libxul.so NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:482
6 libxul.so mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:88
7 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:315
8 libxul.so nsBaseAppShell::Run widget/nsBaseAppShell.cpp:137
9 libxul.so nsAppStartup::Run toolkit/components/startup/nsAppStartup.cpp:271

Type: task → defect

Tracking for 67 as the volume is worrying.

Steven, can you help find someone to take a look?

Flags: needinfo?(sdetar)

Steve, is this something you can quickly investigate.

Flags: needinfo?(sdetar) → needinfo?(sphink)

No conclusions, just some notes.

This is a NULL Value* dereference, looking at its tag_ field (the crash address is 0x4. tag_ is at 0x5, but it'd be loading the full 32 bits and masking.) It happens while scanning the Value buffer (of type MonoTypeBuffer<ValueEdge>). That is supposed to contain a set of pointers to Values that we update when tenuring, and for some reason, on of those pointers is null here.

C++ code goes through put() at https://searchfox.org/mozilla-central/source/js/src/gc/StoreBuffer.h#94 to store these pointers. There is a 1-element cache (last_) in front of the actual buffer storage, which just happens to serve as a null-check. (Because nullptr is the same as having that cache empty, it will never pass through the null value.) Which leaves underlying buffer manipulations, the JIT, or bad values making it into that last_ cache slot. The initialization and clearing of last_ is pretty straightforward. The underlying buffer is a HashSet, which is rather particular about keeping things valid.

I'm not sure what to look for in the JIT. There is some jitted store buffer manipulation. The scariest would probably be the stuff used for nursery strings (bug 1442481), but that looks like it's not until Firefox 67. But if this is a bug with ARM JIT code for storebuffer insertion, I'm not going to be able to spot it.

Maybe someone familiar with ARM code could look at the minidump and at least see if we're tracing last_ and if not, how far through the buffer we made it? Not that there's an easy way to see that. I believe we are probably crashing in https://hg.mozilla.org/mozilla-central/annotate/34e99cf78401ab40f6217e0b445e93aa86518d86/js/src/gc/Marking.cpp#l2899 when it does deref(), specifically when it tests isGCThing(). The calling code is https://hg.mozilla.org/mozilla-central/annotate/34e99cf78401ab40f6217e0b445e93aa86518d86/js/src/gc/Marking.cpp#l2746 or https://hg.mozilla.org/mozilla-central/annotate/34e99cf78401ab40f6217e0b445e93aa86518d86/js/src/gc/Marking.cpp#l2749

sdetar, who is the right ARM JIT person here?

jonco, needinfo'ing you just in case you recognize the symptom of a null edge in the store buffer. Unlikely, but I thought I'd ask.

Flags: needinfo?(sphink) → needinfo?(jcoppeard)

Nicolas, could you look at this from the ARM64 JIT perspective?

Flags: needinfo?(nicolas.b.pierron)

(In reply to Steven DeTar [:sdetar] from comment #5)

Nicolas, could you look at this from the ARM64 JIT perspective?

From the ARM64 perspective, this bug does not exists. It only crashed 2 times:

From the JIT perspective, there is nothing we can do with a crash report which is outside the generated JIT code as the minidump will not contain anything useful to see what the generated code looks like.

The only way to gain more information from a JIT perspective is if there is any frequently reproducible website. Looking at the set of reported URL does not reveal anything actionable. The only remaining option would be to audit the code, but then this volume does not sounds high enough to justify such amount of work.

Flags: needinfo?(nicolas.b.pierron)

I see jemalloc poison values as part of the reported crash value with an offset of +0x4.
I wonder whether this issue is not a duplicate of the general LifoAlloc bucket of issue as the Store buffer is supposed to be a LifoAlloc-ated array.

Kannan, would you be able to help with reading the ARM minidump as Steve mentioned in comment 4?

Flags: needinfo?(kvijayan)

The crash volume seems oddly high on beta compared to what we're seeing in release. Too late for a fix for 68 though.

Looking at this today.

Flags: needinfo?(kvijayan)
You need to log in before you can comment on or make changes to this bug.