Intermittent browser_storage_dynamic_windows.js | application crashed [@ js::jit::JitcodeGlobalEntry::IonCacheEntry::sweep(JSRuntime*)]

RESOLVED FIXED in Firefox 42

Status

()

Firefox
Developer Tools: Performance Tools (Profiler/Timeline)
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: RyanVM, Assigned: shu)

Tracking

({crash, intermittent-failure})

unspecified
Firefox 43
crash, intermittent-failure
Points:
---

Firefox Tracking Flags

(firefox41 unaffected, firefox42 fixed, firefox43 fixed, firefox-esr38 unaffected)

Details

Attachments

(1 attachment)

(Reporter)

Description

3 years ago
Jordan, these are showing up in lots of places in the devtools tests lately. You said you were looking into them?
Flags: needinfo?(jsantell)
Comment hidden (Treeherder Robot)
Pinging Shu and Kannan for the JIT crashes
Flags: needinfo?(kvijayan)
Component: Developer Tools → Developer Tools: Performance Tools (Profiler/Timeline)
Flags: needinfo?(jsantell) → needinfo?(shu)
Comment hidden (Treeherder Robot)
(Assignee)

Comment 4

3 years ago
I pushed a try build with printfs, let's see if anything fails: https://treeherder.mozilla.org/#/jobs?repo=try&revision=86350c9d940e
Flags: needinfo?(shu)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(Assignee)

Comment 8

3 years ago
So I have confirmed that the segfault at nullptr is due to not being able to find the rejoin entry of an IonCacheEntry.

My first hunch was that due to OSI, we could somehow end up with a not-marked mainline IonScript but a marked IC. I pushed to try with an assertion that at the time of adding an IonCacheEntry to to the global table, those IC entries' rejoin entries exist. That assertion never tripped, and some of the tests still crashed during sweep.

That is to say, I haven't gotten very far. I still don't understand how an IC JitCode could be marked but its rejoin entry wasn't.

I also pushed to try with a patch that removes IonCachEntries from the table if their rejoin entries can't be found: https://treeherder.mozilla.org/#/jobs?repo=try&revision=de57859e095c

Guess I'll wait to see if anything crashes above.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(In reply to Shu-yu Guo [:shu] from comment #8)
> So I have confirmed that the segfault at nullptr is due to not being able to
> find the rejoin entry of an IonCacheEntry.
> 
> My first hunch was that due to OSI, we could somehow end up with a
> not-marked mainline IonScript but a marked IC. I pushed to try with an
> assertion that at the time of adding an IonCacheEntry to to the global
> table, those IC entries' rejoin entries exist. That assertion never tripped,
> and some of the tests still crashed during sweep.
> 
> That is to say, I haven't gotten very far. I still don't understand how an
> IC JitCode could be marked but its rejoin entry wasn't.
> 
> I also pushed to try with a patch that removes IonCachEntries from the table
> if their rejoin entries can't be found:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=de57859e095c
> 
> Guess I'll wait to see if anything crashes above.

I wonder if a blind fix for this might be to keep a refcount on IonEntries, counting the number of IonCacheEntries that refer to them.  A non-zero refcount implicitly keeps the IonEntry marked and live.
Flags: needinfo?(kvijayan)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(Assignee)

Comment 34

3 years ago
I'm making some progress, but still haven't found the root cause.

I have confirmed that when sweeping the IonCacheEntry with no rejoin entry, SPS is *off*, and IonCacheEntry's JitCode was not marked by the table itself.

This means that the IonCacheEntry's JitCode was marked by something else, while the rejoin entry's JitCode was never marked. With sfink's help I've pushed another diagnostic patch to try that should dump the entire heap when the JitcodeGlobalTable detects this situation.

Hopefully that'll tell us what's marking the stub JitCode.
(Assignee)

Comment 35

3 years ago
:( One failure's heapdump.txt didn't upload. The one that did upload doesn't even show the JitCode pointer in question in the dump. It doesn't seem like heap corruption, so I don't know why a supposedly marked TenuredCell isn't showing up in the heap dump.
Comment hidden (Treeherder Robot)
(Assignee)

Comment 37

3 years ago
I was able to get a failure where the IC stub JitCode in question appears in the heap dump [1]. It only appears a single time and isn't reachable from anything else. I suppose that makes sense, because the only time it's ever marked is when it's pushed on stack during GC. Still doesn't tell us where its mainline JitCode went though.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=ae82ca8c287f
Some background, this didn't come up until bug 1172180 landed (the real performance actor) -- and looks like it's only on e10s. Some of these tests that cause it it seems are currently disabled, but can be reenabled on try runs via reverting https://hg.mozilla.org/integration/mozilla-inbound/rev/9c35a8c75f64
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(Assignee)

Comment 42

3 years ago
Created attachment 8648871 [details] [diff] [review]
Always mark the global jitcode table during major GCs.

I've debugged this for 2 day now, and the only sure conclusion I've been able
to drawn is that this assertion happens during major GCs where an IC stub
jitcode is marked but its rejoin jitcode is not marked.

Without being able to reproduce locally and use rr, I have no tractable way to
find out why an IC stub's mainline jitcode isn't getting marked.

Always marking the table during major GCs would keep the rejoin code alive, so
I'm just going to do this.

The try push looked pretty green: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4ee7210f97cb
I think the test causing this is still disabled in your push
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
(Assignee)

Comment 48

3 years ago
Comment on attachment 8648871 [details] [diff] [review]
Always mark the global jitcode table during major GCs.

Review of attachment 8648871 [details] [diff] [review]:
-----------------------------------------------------------------

sfink pointed out there's such a thing as "delayed marking", which could result in a situation where an IC stub JitCode is marked but its mainline JitCode isn't.

I tried to confirm but couldn't get the crash again across ~100 retriggers. :/

I'm just gonna go with this patch.
Attachment #8648871 - Flags: review?(kvijayan)
Comment hidden (Treeherder Robot)
Comment on attachment 8648871 [details] [diff] [review]
Always mark the global jitcode table during major GCs.

Review of attachment 8648871 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/JitcodeMap.cpp
@@ +812,5 @@
>      AutoSuppressProfilerSampling suppressSampling(trc->runtime());
>      uint32_t gen = trc->runtime()->profilerSampleBufferGen();
>      uint32_t lapCount = trc->runtime()->profilerSampleBufferLapCount();
>  
> +    if (!trc->runtime()->spsProfiler.enabled())

A small comment for why this is done would be good.
Attachment #8648871 - Flags: review?(kvijayan) → review+
Nice job tracking this down btw.  Crazy finnicky crash.
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
https://hg.mozilla.org/mozilla-central/rev/938701d3cf0c
Status: NEW → RESOLVED
Last Resolved: 3 years ago
status-firefox43: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → Firefox 43
Comment hidden (Treeherder Robot)
(Reporter)

Comment 60

3 years ago
Shu, does this issue affect older branches as well? If so, please nominate for approval as-necessary :)
Assignee: nobody → shu
status-firefox41: --- → ?
status-firefox42: --- → ?
Flags: needinfo?(shu)
(Assignee)

Comment 61

3 years ago
This affects Aurora but not Beta.
Flags: needinfo?(shu)
(Assignee)

Comment 62

3 years ago
Comment on attachment 8648871 [details] [diff] [review]
Always mark the global jitcode table during major GCs.

Approval Request Comment
[Feature/regressing bug #]: 1182730
[User impact if declined]: crashes when profiling
[Describe test coverage new/current, TreeHerder]: on central on TH
[Risks and why]: low, bugfix only
[String/UUID change made/needed]: none
Attachment #8648871 - Flags: approval-mozilla-aurora?
(Reporter)

Updated

3 years ago
status-firefox41: ? → unaffected
status-firefox42: ? → affected
status-firefox-esr38: --- → unaffected
Comment on attachment 8648871 [details] [diff] [review]
Always mark the global jitcode table during major GCs.

Fix an intermittent, taking it.
Attachment #8648871 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Updated

3 years ago
Duplicate of this bug: 1191990
Depends on: 1203791
You need to log in before you can comment on or make changes to this bug.