Closed Bug 1490009 Opened 6 years ago Closed 3 years ago

Crash in OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | js::gc::StoreBuffer::putCell

Categories

(Core :: JavaScript Engine, defect, P1)

x86
Windows 7
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox-esr60 --- unaffected
firefox62 --- unaffected
firefox63 + wontfix
firefox64 + wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix

People

(Reporter: marcia, Unassigned)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is
report bp-19001319-8a54-49f5-a5a3-dc1990180910.
=============================================================

New Windows content crash showing up in the 63 betas: https://bit.ly/2x2DahT. No comments and the only URL that shows up is facebook. Doesn't appear to affect 62 and no crashes so far in 64 nightly.

Top 10 frames of crashing thread:

0 xul.dll js::AutoEnterOOMUnsafeRegion::crash js/src/vm/JSContext.cpp:1628
1 xul.dll js::gc::StoreBuffer::putCell js/src/gc/StoreBuffer.h:461
2 xul.dll JS::HeapObjectPostBarrier js/src/gc/Barrier.h:269
3 xul.dll static bool mozilla::CycleCollectedJSContext::EnqueuePromiseJobCallback xpcom/base/CycleCollectedJSContext.cpp:272
4 xul.dll JSRuntime::enqueuePromiseJob js/src/vm/Runtime.cpp:653
5 xul.dll static bool EnqueuePromiseReactionJob js/src/builtin/Promise.cpp:1008
6 xul.dll static bool ResolvePromise js/src/builtin/Promise.cpp:1052
7 xul.dll static bool FulfillMaybeWrappedPromise js/src/builtin/Promise.cpp:1086
8 xul.dll static bool ResolvePromiseInternal js/src/builtin/Promise.cpp:823
9 xul.dll js::PromiseObject::resolve js/src/builtin/Promise.cpp:4154

=============================================================
Adding another signature seen in 63 beta. 63 beta 4 has over 100 crashes so far.
Crash Signature: [@ OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | js::gc::StoreBuffer::putCell] → [@ OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | js::gc::StoreBuffer::putCell] [@ shutdownhang | js::gc::StoreBuffer::putCell]
Hello Steven - this is a new JS crash which is emerging in 63 - can someone triage this bug? Thanks in advance.
Flags: needinfo?(sdetar)
Tracking for 63 as the volume on beta is sizable. Steven could you triage this bug and help find an owner please? Thanks
Jason: during your triage of the JS engine bugs this week, could we make sure this bug is triaged.  Thank you
Flags: needinfo?(sdetar) → needinfo?(jorendorff)
P1 to investigate, at least
Flags: needinfo?(jorendorff) → needinfo?(jcoppeard)
Priority: -- → P1
I don't know why this just started happening, but here's what I think is going on:

Whenever a promise is resolved or rejected the JS engine tells the embedding to create a job to handle it (by calling back into the engine) at the appropriate time.  These job objects frequently contain pointers to JSObject in the nursery, and so we have to record them in our storebuffer.  If a script uses a lot of promises then this will result in there being many store buffer entries.

(It occurs to me that this may have started because we're sweeping the nursery less frequently now.)

These job objects are cycle collected and so may persist until the next CC even after the job has been run and the object is no longer used.
Flags: needinfo?(jcoppeard)
Here's a patch I think will help.  This clears all the JSObject references in the callback object after use.  This will remove the storebuffer entries ASAP rather than waiting for the next minor GC.

It also occurs to me that this could be a more general problem with CC'ed things holding GC things live longer than necessary.
Assignee: nobody → jcoppeard
Attachment #9010582 - Flags: review?(continuation)
Comment on attachment 9010582 [details] [diff] [review]
bug1490009-reset-promise-callback

Review of attachment 9010582 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/base/CycleCollectedJSContext.cpp
@@ +226,5 @@
>      if (global && !global->IsDying()) {
>        mCallback->Call("promise callback");
>        aAso.CheckForInterrupt();
>      }
> +    mCallback->Reset();

Please add a comment here about why you are doing this.
Attachment #9010582 - Flags: review?(continuation) → review+
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c173cb530f63
Clear CallbackObject fields after use for promise job to avoid tenuring objects unnecessarily r=mccr8
I'd like to leave this open until we see whether it fixes the problem.
Keywords: leave-open
There are explicit guarantees that CallbackObject provides about when CallbackOrNull() can return null, and the newly-added API totally violates those guarantees.  Worse yet, it didn't even adjust the corresponding documentation, so consumers would think that the guarantees are still being provided.

As a bare minimum, the new API needs to come with documentation explaining what invariants it's violating, and the existing documentation about the invariants needs to be updated accordingly.

Better yet, in addition to the documentation the API would be private with a friend declaration to allow this one callsite.

Even better would be to rejigger the whole thing to have a concept of "run-once" callback, which auto-disappears  after you Call() it.  Exactly how to set that up is not 100% clear to me yet, but conceptually it would be the best solution.
Flags: needinfo?(jcoppeard)
Flags: needinfo?(continuation)
Yeah, I guess I should have checked more about whether it was okay or not to just randomly clear. I didn't realize there was this particular invariant. This should be backed out then. It'll still have been in a Nightly, so we can see if it helps the original issue, maybe.

I don't entirely understand how this violates the description on CallbackOrNull(). Doesn't JS run in the mCallback->Call()? Is the problem that it is expected that CallbackOrNull() can return null only when certain particular kinds of JS are run (as specified in the second sentence of the comment) or that sometimes we don't actually run mCallback->Call()?
Flags: needinfo?(continuation) → needinfo?(bzbarsky)
The invariant being violated is the second sentence:

  // It will only return null
  // after a JavaScript caller has called nukeSandbox on a Sandbox object, and
  // the cycle collector has had a chance to run.

That can't happen without script running, but not all script running can cause that.
Flags: needinfo?(bzbarsky)
It looks like Jon is on PTO. Sheriffs, please back this patch out.
Keywords: checkin-needed
could this be related to bug 1488164 which has the same spike time?  (of course, it could just be two bugs that uplifted and caused OOMs in beta) (note: I think jonco is on PTO per above)
Flags: needinfo?(bzbarsky)
Doesn't seem related to me.
Flags: needinfo?(bzbarsky)
Sorry about that.  How does this version look?  I updated the doc comments and make Reset() private.
Flags: needinfo?(jcoppeard)
Attachment #9013311 - Flags: review?(bzbarsky)
Comment on attachment 9013311 [details] [diff] [review]
bug1490009-reset-promise-callback v2

>+  // this point. This should only be called if the object is know not to be used

"known"
Attachment #9013311 - Flags: review?(bzbarsky) → review+
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/cae1a7f33840
Clear CallbackObject fields after use for promise job to avoid tenuring objects unnecessarily r=bz
We gained some perf boosts!

== Change summary for alert #16377 (as of Tue, 02 Oct 2018 09:21:35 GMT) ==

Improvements:

 16%  tp5o_webext responsiveness linux64 pgo e10s stylo         1.23 -> 1.03
 16%  tp5o_webext responsiveness windows7-32 opt e10s stylo     1.20 -> 1.01
 15%  tp5o_webext responsiveness linux64 opt e10s stylo         1.39 -> 1.18
 12%  tp5o_webext responsiveness linux64-qr opt e10s stylo      1.96 -> 1.73

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=16377
Jon, can we uplift this patch to beta as almost all the  crashes affect 63?
Flags: needinfo?(jcoppeard)
Attachment #9010582 - Attachment is obsolete: true
Comment on attachment 9013311 [details] [diff] [review]
bug1490009-reset-promise-callback v2

[Beta/Release Uplift Approval Request]

Feature/Bug causing the regression: None

User impact if declined: Hopefully this will reduce the volume of OOM crashes on beta.

Is this code covered by automated tests?: Yes

Has the fix been verified in Nightly?: Yes

Needs manual test from QE?: No

If yes, steps to reproduce: 

List of other uplifts needed: None

Risk to taking this patch: Low

Why is the change risky/not risky? (and alternatives if risky): This is a simple change and problems would have resulted in immediate test failures on central.

String changes made/needed: None.
Flags: needinfo?(jcoppeard)
Attachment #9013311 - Flags: approval-mozilla-beta?
Comment on attachment 9013311 [details] [diff] [review]
bug1490009-reset-promise-callback v2

OOM Crash fix, uplift approved for 63 beta 12, thanks.
Attachment #9013311 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
I see a perf win here:
== Change summary for alert #16386 (as of Tue, 02 Oct 2018 18:40:23 GMT) ==

Improvements:

 14%  tp5o_webext responsiveness windows7-32 opt e10s stylo     1.20 -> 1.03

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=16386
QA Contact: sdetar
QA Contact: sdetar
the volume of [@ OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | js::gc::StoreBuffer::putCell ] crashes doesn't look like it has significantly reduced on beta since the patch landed in 63.0b12.
Flags: needinfo?(jcoppeard)
I don't know what's happening here.  Note that this is an OOM crash so the cause may not be related to this signature at all.

It's hard to know for sure but low volume of crashes like this seem to show up in nightly in the 20180823220048 build.
Flags: needinfo?(jcoppeard)
Unassigning myself as I'm not actively working on this.
Assignee: jcoppeard → nobody
Steven, this is currently tracking64+ and unassigned. With only a couple weeks left in this cycle, is this on the radar still?
Flags: needinfo?(sdetar)
Jon, can you help answer Ryan's question?  Is there anything we could be doing here?  It seems this is not on our radar at this point, is that accurate?
Flags: needinfo?(sdetar) → needinfo?(jcoppeard)
I don't think there's anything we can do here.  We crashing because we're running out of memory so something has made us allocate more, but it's not necessarily the code in the stack trace.
Flags: needinfo?(jcoppeard)
Too late to do anything for 64.
Marking this fix-optional for 65 and 66 to remove it from repeated triage.

Based on comment 30 and comment 34, this bug seems to be non-actionable.

Keywords: stalled
Regressions: 1555729

The leave-open keyword is there and there is no activity for 6 months.
:sdetar, maybe it's time to close this bug?

Flags: needinfo?(sdetar)
Flags: needinfo?(sdetar)

Marking this as Resolved > Worksforme since there are no more crashes with this signature in the past 6 months.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.

Keywords: stalled
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: