Closed Bug 976536 Opened 6 years ago Closed 6 years ago

Firefox 30.0a1 Crash Report [@ js::jit::InlineFrameIteratorMaybeGC<int>::findNextFrame()

Categories

(Core :: JavaScript Engine: JIT, defect)

30 Branch
x86_64
Windows 7
defect
Not set

Tracking

()

VERIFIED FIXED
mozilla31
Tracking Status
firefox28 --- unaffected
firefox29 + verified
firefox30 + verified
firefox31 + verified

People

(Reporter: codacodercodacoder, Assigned: jandem)

References

Details

(Keywords: crash, topcrash-win)

Crash Data

Attachments

(3 files)

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0 (Beta/Release)
Build ID: 20140224030203

Steps to reproduce:

https://crash-stats.mozilla.com/report/index/15e84a3a-5d34-44c9-a996-8cff82140225
Crash Signature: [@ js::jit::InlineFrameIteratorMaybeGC<int>::findNextFrame() ]
Looks like this crash signature is showing up in Firefox 29 and 30. 

https://crash-stats.mozilla.com/report/list?product=Firefox&range_unit=days&range_value=14&signature=js%3A%3Ajit%3A%3AInlineFrameIteratorMaybeGC%3Cint%3E%3A%3AfindNextFrame%28%29#tab-reports

KaiRo, how would you interpret the crashstats for this?
Flags: needinfo?(kairo)
Component: Untriaged → JavaScript Engine: JIT
Product: Firefox → Core
It looks like the signature has been around for a while but the new crashes seem to be all at address 0x10. Explosiveness tells me this started on 30 at 2014-02-15 and on 29 at 2014-02-19 (2 crashes) or -20 (22 crashes).

Looking for something that landed on the 14th on trunk and on 18th/19th on Aurora might already help us a lot to find suspects.

The 0x10 address could help devs to find what's going on.
Flags: needinfo?(kairo)
As the original reporter perhaps I ought to share the following:

My STR was intentionally terse - my steps at the time were to investigate another bug which i will now explain - and PLEASE note, it may be ENTIRELY UNRELATED... you were warned! :)  

Here's what I was doing...

I was testing for bug 972447 in Nightly, turned away to my laptop to type my last response to Mihai and when I looked back, the crash was on the screen.  If you read 972447 you'll see that involved running my app (large ASP.NET thing with umpteen scripts) and watching the console in devTools.

I don't routinely watch the console and have never seen this bug before or since.  And nightly is my 24/7 dev platform... so whatever it is, it's very rare/specific.
The crash signature is not that rare and probably not that specific, it shows up as #9 on Aurora and #22 on Nightly (which has a lot of noise from regressions recently) with firmly over 100 crashes per week on both channels.
Keywords: crash
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #4)
> The crash signature is not that rare and probably not that specific, it
> shows up as #9 on Aurora and #22 on Nightly (which has a lot of noise from
> regressions recently) with firmly over 100 crashes per week on both channels.

Then I'd be very interested when the culprit shows up.  During my 70+ hour week using Nightly, I've seen it once.  Do you think it's end user code generated, or FF/DevTools generated?
I have no idea what triggers it for users, I only can read stats.

In any case, we should find the checkins that match the regression ranges and have devs take a look, it may be detectable by inspecting the stacks and code when knowing what the checkins actually did.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #4)
> The crash signature is not that rare and probably not that specific, it
> shows up as #9 on Aurora and #22 on Nightly (which has a lot of noise from
> regressions recently) with firmly over 100 crashes per week on both channels.

This is now at #2 on Aurora.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Jason, who should own this?
Flags: needinfo?(jorendorff)
script_ is null here:

http://dxr.mozilla.org/mozilla-central/source/js/src/jit/IonFrames.cpp#1612

        script_ = callee_->existingScript();

        pc_ = script_->offsetToPC(si_.pcOffset());
(In reply to David Major [:dmajor] from comment #9)
> script_ is null here:

Yup. I'll try to repro this.
Flags: needinfo?(jorendorff) → needinfo?(jdemooij)
I can reproduce this on OS X, 64-bit opt build, inbound rev 8ab0a696d2ec. This is with a patch that changes usesBeforeCompile_ in IonOptimizationLevels.cpp to 100, but I'm not sure if that's necessary to reproduce this.

The crash happens when I load this page, you may have to try it a few times or in multiple tabs:

http://activities.aliexpress.com/superdeals.php#from=head 

It looks like we have a function with a LazyScript. Then we call JSFunction::existingScript(), this function tries to unlazify the function, but the LazyScript has a nullptr JSScript so we crash. It looks like the JSFunction is a lambda clone.

Till, I think we relazify the JSFunction (bug 886193) and that's where we reset the LazyScript's script...
Flags: needinfo?(jdemooij) → needinfo?(till)
Keywords: reproducible
This has escalated 19 positions to be our #10 topcrash in Firefox 31 over the last week. It now accounts for 0.86% of our crashes on Firefox 31.
This bug is also ranking high in Beta 29 while is did not exist at all in Beta 28.

What's up with Till that he didn't reply to ni? in more than 3 weeks?
till was on holiday - I think he's still catching up to his backlog at the moment.
While I can't reproduce the issue (neither with rev 8ab0a696d2ec, nor current Nightly), I'm pretty sure I know what's going on: 1) the function gets inlined by ionmonkey, 2) the function's JIT code gets thrown away by a GC, 3) the function gets relazified, 4) the function that inlined this, whose jitted code survived the GC, gets called, 5) the InlineFrameIterator tries to access the inlined function's script, 6) profit.

I don't think there's a way to tell whether a function is inlined somewhere *right now*, so I added a flag that gets set the first time inlining happens, and prevents relazification from then on. Since so few functions ever get ion-compiled, that shouldn't meaningfully affect the effectiveness of relazification.
Attachment #8398319 - Flags: review?(jdemooij)
Assignee: nobody → till
Status: NEW → ASSIGNED
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #14)
> till was on holiday - I think he's still catching up to his backlog at the
> moment.

Correct - I went on vacation a couple of days before the ni? happened, and then returned to a JS work week. Sorry for the delay on a top crasher!
Flags: needinfo?(till)
I'll try this patch and see if it fixes the problem (assuming I can still reproduce it). I'm not yet convinced comment 15 is correct. I'll be traveling tomorrow but should have time to do this after that.
(In reply to Till Schneidereit [:till] from comment #15)
> While I can't reproduce the issue (neither with rev 8ab0a696d2ec, nor
> current Nightly), I'm pretty sure I know what's going on: 1) the function
> gets inlined by ionmonkey, 2) the function's JIT code gets thrown away by a
> GC, 3) the function gets relazified, 4) the function that inlined this,
> whose jitted code survived the GC, gets called, 5) the InlineFrameIterator
> tries to access the inlined function's script, 6) profit.

I don't think (2) and (4) can happen at the same time (discarding Ion/JIT code on GC for some functions but not all of them). The Ion bailout logic relies on the fact that there's a BaselineScript for all inlined scripts (because that's where we have to resume!), and scripts with JIT code are never relazified.

In other words, if (5) would bailout we'd also be in trouble.

Trying to repro this again now.
Comment on attachment 8398319 [details] [diff] [review]
Don't relazify inlined functions.

Review of attachment 8398319 [details] [diff] [review]:
-----------------------------------------------------------------

Unfortunately I can't reproduce this anymore. I'll r+ this patch because it's pretty simple, shouldn't hurt and it's possible it will fix the crashes. Crash-stats should tell us soon enough.

I'll file a separate bug to add some new asserts as I think there's still another underlying bug.
Attachment #8398319 - Flags: review?(jdemooij) → review+
Thanks Jan. 

Till, Can we have a checkin + uplift of this patch? It would be nice to have it for the beta 6 (next monday). Thanks
Flags: needinfo?(till)
(In reply to Sylvestre Ledru [:sylvestre] from comment #20)
> Till, Can we have a checkin + uplift of this patch? It would be nice to have
> it for the beta 6 (next monday). Thanks

Sorry for dragging my feet on this. I just confirmed that the patch still applies and works. Setting checkin-needed because closed tree. Will request uplift and ping people about it as soon as it has landed.
Flags: needinfo?(till)
Keywords: checkin-needed
Comment on attachment 8398319 [details] [diff] [review]
Don't relazify inlined functions.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 886193
User impact if declined: Potentially much higher crash rate
Testing completed (on m-c, etc.): landed on inbound, thorough manual testing
Risk to taking this patch (and alternatives if risky): very low, prevents an optimization under circumstances that potentially make it invalid
String or IDL/UUID changes made by this patch: none
Attachment #8398319 - Flags: approval-mozilla-aurora?
I will approve it right after it landed in m-c.
Thanks btw!
[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 886193
User impact if declined: Potentially much higher crash rate
Testing completed (on m-c, etc.): landed on inbound, thorough manual testing
Risk to taking this patch (and alternatives if risky): very low, prevents an optimization under circumstances that potentially make it invalid
String or IDL/UUID changes made by this patch: none
Attachment #8402535 - Flags: review+
Attachment #8402535 - Flags: approval-mozilla-beta?
https://hg.mozilla.org/mozilla-central/rev/eaa499b12ffd
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Attachment #8402535 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8398319 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Paul, can you please confirm this is fixed in tomorrow's Nightly?
Keywords: verifyme
QA Contact: paul.silaghi
Depends on: 993619
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #27)
> Paul, can you please confirm this is fixed in tomorrow's Nightly?
Unfortunately, I couldn't reproduce the initial issue using the indications in comment 3 and the link in comment 11, on nightlies from 2014-02-25, 2014-03-03, 2014-04-04 Win 7 x64, OS X 10.9.
Probably will be best to check via crashstats in couple of days later.
Dropping the reproducible keyword based on comment 29 and comment 19. We'll check crashstats in a few days to see if the patch has moved the needle.
Keywords: reproducible
Hrm, https://crash-stats.mozilla.com/topcrasher/products/Firefox/versions/29.0b7 reports this still as #3 (and the connected bug 988961 signature as #5) even though this version should contain the fix.

Till, does that mean it didn't work correctly?
Flags: needinfo?(till)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #31)
> Till, does that mean it didn't work correctly?

Yes, it does :(

Jandem was sceptical of this being the solution (see comment 18), and it seems he was right. I'll discuss our options with him on IRC.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Till, the go to build for beta 9 is tomorrow (Thursday around 1pm PDT). I don't want to put too much pressure but that is one of the top crasher of 29 and beta 9 is the best fit to test a potential fix.
According to crash-stats, a lot of these crashes are on www.farsnews.com on all branches. Can we do some testing with that website on various platforms? I'm also trying this myself but my efforts are unsuccessful so far.
Flags: needinfo?(anthony.s.hughes)
Not much of a help, but I got one crash, by just leaving www.farsnews.com open in Firefox for about 15 minutes. 31.0a1 (2014-04-15), Win 7 x64
https://crash-stats.mozilla.com/report/index/887a25ca-65bd-46ca-895c-7b55a2140416
I'll investigate more.
Also, comments talk a lot about Yahoo mail.

And most of the crashes are actually on Facebook (but that is up front on many of our crashes), with farsnews being second after that.
Flags: needinfo?(anthony.s.hughes)
(In reply to Sylvestre Ledru [:sylvestre] from comment #33)
> Till, the go to build for beta 9 is tomorrow (Thursday around 1pm PDT). I
> don't want to put too much pressure but that is one of the top crasher of 29
> and beta 9 is the best fit to test a potential fix.

Unfortunately, I highly doubt that we'll have a fix for this in time for the beta. Jandem and me are both trying to reproduce the crash, but haven't been successful yet. I'm currently setting up a proper Windows development environment with the hope that that will help me debug the issue.
Flags: needinfo?(till)
(In reply to Paul Silaghi, QA [:pauly] from comment #35)
> Not much of a help, but I got one crash, by just leaving www.farsnews.com
> open in Firefox for about 15 minutes

This is really useful, thanks. I can also get it to crash on Windows when I leave it open. It can take more than an hour though, but it's better than nothing. I'll try it with my own build now so that we can hopefully start logging some information.

It takes a while before it crashes so I also doubt we'll have a fix in time for beta, but at least we have some way to repro it now.
I think I know what's going on. Debugging this was pretty frustrating; it often takes hours before it crashes, probably due to GC timing and setTimeout calls etc.

The problem is that JSFunction::existingScript() assumes that, if the function has a LazyScript, that LazyScript must have a non-null JSScript. This can fail with incremental GC, here's a possible scenario:

* Assume we have 2 lambda clones F1, F2 with script S1.
* GC. Relazify both functions so that they have LazyScript L with L->script_ == nullptr.

* Unlazify F1 and F2. This sets F1->script, F2->script and L->script to S2.
* Another GC. Incremental Slice X: Relazify F1. Sets F1->script = L and L->script = nullptr.
* F2 still has script S2 with S2->lazyScript == L but L->script == nullptr.

* Ion inlines F2 (has script S2).
* When this IonScript runs, the actual callee function is F1.
* InlineFrameIterator tries to unlazify by calling existingScript but crashes because L->script == nullptr.

To fix this, I think we need to do the following:

(1) Ion shouldn't inline scripts that have lazyScript->script_ == nullptr.
(2) We shouldn't relazify functions with scripts that have been inlined.

Together these should make sure lazyScript->script stays non-null, so that existingScript() works.

I know the patch that landed for this bug does (2), but I realized it sets the flag on the *outer* script, not the inlined script. My bad, my review should have caught this.
Also, I could reproduce this in the shell with a simple shell function: preventRelazify(f) that sets a flag on f and doesn't relazify functions with that set. This is a simple way to get the same effect as the incremental GC in the browser, but in a very deterministic way.

Tomorrow I'll post patches.
That is a great news. Well done!
If you have an idea on the fix, do you think it is going to be a risky patch? Thanks.
(In reply to Jan de Mooij [:jandem] from comment #41)
> Tomorrow I'll post patches.

\o/

Thanks a ton for the work there!

We should try to still get this into 29 somehow, the planned go-to-build for the release builds is Monday (to be followed by testing those builds on the beta channel for almost a week), it would be awesome if we can get the patch(es) in before that.
(In reply to Jan de Mooij [:jandem] from comment #40)
> To fix this, I think we need to do the following:
> 
> (1) Ion shouldn't inline scripts that have lazyScript->script_ == nullptr.
> (2) We shouldn't relazify functions with scripts that have been inlined.
> 
> Together these should make sure lazyScript->script stays non-null, so that
> existingScript() works.

After thinking about this more, there's a simpler and smaller (and likely safer) fix. We should fix existingScript() to get the canonical function's script, it's guaranteed to exist (because Ion used it to inline the script in the first place, and it must have a Baseline script so it's definitely not relazified). Very small patch, testing it right now.

(In reply to Sylvestre Ledru [:sylvestre] from comment #42)
> That is a great news. Well done!
> If you have an idea on the fix, do you think it is going to be a risky
> patch? Thanks.

The patch I'm testing now should be pretty low-risk. Personally I think it's better than not fixing this bug.

(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #43)
> We should try to still get this into 29 somehow, the planned go-to-build for
> the release builds is Monday (to be followed by testing those builds on the
> beta channel for almost a week), it would be awesome if we can get the
> patch(es) in before that.

I'll try to get this reviewed and landed on m-c before Monday, assuming potential reviewers are around during the weekend.
The InitFromBailout topcrash is definitely the same issue. existingScript has two callers: bailout and frame iterator, and they'll both crash if existingScript returns NULL.
Assignee: till → jdemooij
Blocks: 988961
Rename existingScript to existingScriptForInlinedFunction and change it to look at the canonical function.

I can no longer repro the crashes with this patch. I also added some code to print to stderr whenever the old code would have crashed, and after a few hours we "survived" 4 of them.

Green on Try: https://tbpl.mozilla.org/?tree=Try&rev=52455bcc1c0c
Attachment #8409291 - Flags: review?(till)
Sylvestre, OK if I backout the patch that originally landed in this bug from aurora and beta, as it didn't fix the crashes?
Flags: needinfo?(sledru)
Comment on attachment 8409291 [details] [diff] [review]
Fix existingScript

Brian also knows about lazy scripts and Ion inlining so adding him too. It'd be great if this patch made it in the next nightly, so that we can see how it affects crash stats.
Attachment #8409291 - Flags: review?(bhackett1024)
Comment on attachment 8409291 [details] [diff] [review]
Fix existingScript

Review of attachment 8409291 [details] [diff] [review]:
-----------------------------------------------------------------

This is truly great!

Also, it means that we can back out my patch, as with this solution, it's not needed for the fix, right? We can do that next week, as I don't think it's important enough to rush it in before the uplift.

I don't think we need to block this on an additional review from bhackett, so cancelling that r?, too.

::: js/src/jsfun.h
@@ +269,5 @@
>      //   use, but has extra checks, requires a cx and may trigger a GC.
>      //
> +    // - For inlined functions which may have a LazyScript but whose JSScript
> +    //   is known to exist, existingScriptForInlinedFunction() will get the
> +    //   script and delazify the function if necessary.

Huh, now that you say it existingScript is really only used for inlined scripts, yes. I didn't realize that, I guess.

@@ +297,2 @@
>              js::LazyScript *lazy = lazyScript();
> +            JSScript *script = lazy->functionNonDelazifying()->nonLazyScript();

I think it'd be good to assert functionNonDelazifying() not returning nullptr here. I.e.,
JSFunction *fun = lazy->functionNonDelazifying();
MOZ_ASSERT(fun);
JSScript *script = fun->nonLazyScript();
Ideally, functionNonDelazifying would have a "maybe" in its name, but that doesn't seem too important.
Attachment #8409291 - Flags: review?(till)
Attachment #8409291 - Flags: review?(bhackett1024)
Attachment #8409291 - Flags: review+
(In reply to Till Schneidereit [:till] from comment #49)
> I don't think we need to block this on an additional review from bhackett,
> so cancelling that r?, too.

Ah, I just saw comment 48, so we agree, nice. :)

Also, together with bug 988961, this is the no1 top-crasher for 29 by a wide margin - very, very nice to have it fixed! \o/
Pushed this directly to m-c so that it makes it into tomorrow's Nightly. It's our #1 topcrash on beta (see comment 50) and we don't have much time left there.

https://hg.mozilla.org/mozilla-central/rev/fd7c870fa960

Thanks for the quick review!

(In reply to Till Schneidereit [:till] from comment #49)
> I think it'd be good to assert functionNonDelazifying() not returning
> nullptr here.

Done. I remember a rule about not adding such asserts because the dereference that follows it will immediately crash anyway, but I imagine this assert may make life easier for the fuzzers etc.

Also closing this bug; if the patch doesn't fix the crashes somehow we can reopen it.
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Comment on attachment 8409291 [details] [diff] [review]
Fix existingScript

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 886193.
User impact if declined: (Top)crashes.
Testing completed (on m-c, etc.): Landed on m-c. Browser with this patch applied appears stable after using it as my main browser for a few hours.
Risk to taking this patch (and alternatives if risky): Low. The patch is small and doesn't affect a lot of code.
String or IDL/UUID changes made by this patch: None.
Attachment #8409291 - Flags: approval-mozilla-beta?
Attachment #8409291 - Flags: approval-mozilla-aurora?
Attachment #8409291 - Flags: approval-mozilla-beta?
Attachment #8409291 - Flags: approval-mozilla-beta+
Attachment #8409291 - Flags: approval-mozilla-aurora?
Attachment #8409291 - Flags: approval-mozilla-aurora+
Flags: needinfo?(sledru)
Congrat Jan!
This search confirms, if you look into the Build ID facet, that no Nightly builds after 2014-04-19 have seen this crash signature so far:
https://crash-stats.mozilla.com/search/?signature=~js%3A%3Ajit%3A%3AInlineFrameIteratorMaybeGC%3Cint%3E%3A%3AfindNextFrame%28%29&version=31.0a1&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform

So this looks good. Please get it landed on the branches ASAP so we can have it in the release (candidate) that is going to build today.

And thanks, Jan, for the awesome work, and Till for the weekend-review!
Jan, are you landing this on the branches yourself today, or should we go for checkin-needed and have sheriffs uplift it?
Flags: needinfo?(jdemooij)
RyanVM says they're doing the uplifting anyhow, so I'll let Jan have his Easter Monday. :)
Flags: needinfo?(jdemooij)
I don't see any crash reports for this signature for Firefox 31.0a1 after the 20140419 build.  Verifying for 31.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.