Closed Bug 1194430 Opened 10 years ago Closed 10 years ago

Intermittent browser_storage_dynamic_windows.js | application crashed [@ js::jit::JitcodeGlobalEntry::IonCacheEntry::sweep(JSRuntime*)]

Categories

(DevTools :: Performance Tools (Profiler/Timeline), defect)

defect
Not set
normal

Tracking

(firefox41 unaffected, firefox42 fixed, firefox43 fixed, firefox-esr38 unaffected)

RESOLVED FIXED
Firefox 43
Tracking Status
firefox41 --- unaffected
firefox42 --- fixed
firefox43 --- fixed
firefox-esr38 --- unaffected

People

(Reporter: RyanVM, Assigned: shu)

References

Details

(Keywords: crash, intermittent-failure)

Attachments

(1 file)

Jordan, these are showing up in lots of places in the devtools tests lately. You said you were looking into them?
Flags: needinfo?(jsantell)
Pinging Shu and Kannan for the JIT crashes
Flags: needinfo?(kvijayan)
Component: Developer Tools → Developer Tools: Performance Tools (Profiler/Timeline)
Flags: needinfo?(jsantell) → needinfo?(shu)
I pushed a try build with printfs, let's see if anything fails: https://treeherder.mozilla.org/#/jobs?repo=try&revision=86350c9d940e
Flags: needinfo?(shu)
So I have confirmed that the segfault at nullptr is due to not being able to find the rejoin entry of an IonCacheEntry. My first hunch was that due to OSI, we could somehow end up with a not-marked mainline IonScript but a marked IC. I pushed to try with an assertion that at the time of adding an IonCacheEntry to to the global table, those IC entries' rejoin entries exist. That assertion never tripped, and some of the tests still crashed during sweep. That is to say, I haven't gotten very far. I still don't understand how an IC JitCode could be marked but its rejoin entry wasn't. I also pushed to try with a patch that removes IonCachEntries from the table if their rejoin entries can't be found: https://treeherder.mozilla.org/#/jobs?repo=try&revision=de57859e095c Guess I'll wait to see if anything crashes above.
(In reply to Shu-yu Guo [:shu] from comment #8) > So I have confirmed that the segfault at nullptr is due to not being able to > find the rejoin entry of an IonCacheEntry. > > My first hunch was that due to OSI, we could somehow end up with a > not-marked mainline IonScript but a marked IC. I pushed to try with an > assertion that at the time of adding an IonCacheEntry to to the global > table, those IC entries' rejoin entries exist. That assertion never tripped, > and some of the tests still crashed during sweep. > > That is to say, I haven't gotten very far. I still don't understand how an > IC JitCode could be marked but its rejoin entry wasn't. > > I also pushed to try with a patch that removes IonCachEntries from the table > if their rejoin entries can't be found: > https://treeherder.mozilla.org/#/jobs?repo=try&revision=de57859e095c > > Guess I'll wait to see if anything crashes above. I wonder if a blind fix for this might be to keep a refcount on IonEntries, counting the number of IonCacheEntries that refer to them. A non-zero refcount implicitly keeps the IonEntry marked and live.
Flags: needinfo?(kvijayan)
I'm making some progress, but still haven't found the root cause. I have confirmed that when sweeping the IonCacheEntry with no rejoin entry, SPS is *off*, and IonCacheEntry's JitCode was not marked by the table itself. This means that the IonCacheEntry's JitCode was marked by something else, while the rejoin entry's JitCode was never marked. With sfink's help I've pushed another diagnostic patch to try that should dump the entire heap when the JitcodeGlobalTable detects this situation. Hopefully that'll tell us what's marking the stub JitCode.
:( One failure's heapdump.txt didn't upload. The one that did upload doesn't even show the JitCode pointer in question in the dump. It doesn't seem like heap corruption, so I don't know why a supposedly marked TenuredCell isn't showing up in the heap dump.
I was able to get a failure where the IC stub JitCode in question appears in the heap dump [1]. It only appears a single time and isn't reachable from anything else. I suppose that makes sense, because the only time it's ever marked is when it's pushed on stack during GC. Still doesn't tell us where its mainline JitCode went though. [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=ae82ca8c287f
Some background, this didn't come up until bug 1172180 landed (the real performance actor) -- and looks like it's only on e10s. Some of these tests that cause it it seems are currently disabled, but can be reenabled on try runs via reverting https://hg.mozilla.org/integration/mozilla-inbound/rev/9c35a8c75f64
I've debugged this for 2 day now, and the only sure conclusion I've been able to drawn is that this assertion happens during major GCs where an IC stub jitcode is marked but its rejoin jitcode is not marked. Without being able to reproduce locally and use rr, I have no tractable way to find out why an IC stub's mainline jitcode isn't getting marked. Always marking the table during major GCs would keep the rejoin code alive, so I'm just going to do this. The try push looked pretty green: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4ee7210f97cb
I think the test causing this is still disabled in your push
Comment on attachment 8648871 [details] [diff] [review] Always mark the global jitcode table during major GCs. Review of attachment 8648871 [details] [diff] [review]: ----------------------------------------------------------------- sfink pointed out there's such a thing as "delayed marking", which could result in a situation where an IC stub JitCode is marked but its mainline JitCode isn't. I tried to confirm but couldn't get the crash again across ~100 retriggers. :/ I'm just gonna go with this patch.
Attachment #8648871 - Flags: review?(kvijayan)
Comment on attachment 8648871 [details] [diff] [review] Always mark the global jitcode table during major GCs. Review of attachment 8648871 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jit/JitcodeMap.cpp @@ +812,5 @@ > AutoSuppressProfilerSampling suppressSampling(trc->runtime()); > uint32_t gen = trc->runtime()->profilerSampleBufferGen(); > uint32_t lapCount = trc->runtime()->profilerSampleBufferLapCount(); > > + if (!trc->runtime()->spsProfiler.enabled()) A small comment for why this is done would be good.
Attachment #8648871 - Flags: review?(kvijayan) → review+
Nice job tracking this down btw. Crazy finnicky crash.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 43
Shu, does this issue affect older branches as well? If so, please nominate for approval as-necessary :)
Assignee: nobody → shu
Flags: needinfo?(shu)
This affects Aurora but not Beta.
Flags: needinfo?(shu)
Comment on attachment 8648871 [details] [diff] [review] Always mark the global jitcode table during major GCs. Approval Request Comment [Feature/regressing bug #]: 1182730 [User impact if declined]: crashes when profiling [Describe test coverage new/current, TreeHerder]: on central on TH [Risks and why]: low, bugfix only [String/UUID change made/needed]: none
Attachment #8648871 - Flags: approval-mozilla-aurora?
Comment on attachment 8648871 [details] [diff] [review] Always mark the global jitcode table during major GCs. Fix an intermittent, taking it.
Attachment #8648871 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Product: Firefox → DevTools
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: