Closed Bug 1280591 Opened 4 years ago Closed 2 years ago

Crash in js::gc::GCRuntime::markBufferedGrayRoots or js::

Categories

(Core :: JavaScript: GC, defect, P3, critical)

47 Branch
Unspecified
macOS
defect

Tracking

()

RESOLVED DUPLICATE of bug 1203273
Tracking Status
firefox47 --- wontfix
firefox48 + wontfix
firefox49 + wontfix
firefox-esr45 --- wontfix
firefox50 + wontfix
firefox51 --- affected

People

(Reporter: marcia, Unassigned)

References

Details

(Keywords: crash, topcrash, triage-deferred)

Crash Data

Attachments

(2 files, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-b27d8452-2dfd-405f-af75-82b0a2160617.
=============================================================

Top mac crash on 47 - currently #4. https://crash-stats.mozilla.com/report/list?signature=js%3A%3A

Lion's share of URLs are from facebook.com. 

ni on Terrence - any ideas?
Flags: needinfo?(terrence)
These are all under markBufferedGrayRoots.

Andrew, could a missing ExposeToActiveJS cause this? Is there anything that landed on 13 Jun that touches how we do rooting or traversal of the DOM?
Flags: needinfo?(terrence) → needinfo?(continuation)
Hi Naveed, is this something your team can investigate? This was mentioned as a top crash on release and something we should consider investigating in time for 48 go-live.
Flags: needinfo?(nihsanullah)
From a URL perspective, most of them are facebook.com and many of the comments mention doing some operation on Facebook.
Duplicate of this bug: 1265024
Depends on: 1283916
(In reply to Terrence Cole [:terrence] from comment #1)
> Andrew, could a missing ExposeToActiveJS cause this?

Not as far as I can think of.

> Is there anything that landed on 13 Jun that touches how we do rooting or traversal of the DOM?

Why June 13? It looks like this crash is happening on all branches.
Flags: needinfo?(continuation)
[Tracking Requested - why for this release]: 48 Beta crashes are now showing up, albeit in smaller volume. Nominating for tracking so we can keep on eye on this crash as well as investigate further if possible.
Assignee: nobody → terrence
Flags: needinfo?(nihsanullah)
FYI, the crash address in the report in comment 0 is misleading due to bug 974420. From the raw dump tab you can see "rcx": "0xe5e5e5e5e5e00000", and the crashing instruction from that dump is:

http://hg.mozilla.org/releases/mozilla-release/annotate/b0310cb90fd0/js/src/gc/Heap.h#l1154
 1154     return isTenured() ? asTenured().getTraceKind() : JS::TraceKind::Object;
   1055d54d6:   f6 81 e8 ff 0f 00 01    testb  $0x1,0xfffe8(%rcx)
Ted, awesome work!

That pointer is |cell->chunk()->trailer.runtime|. So it's |this| there that's 0xe5: the jemalloc poison pattern. Presumably, the DOM object that was storing the pointer on the C heap was freed before we marked it in the first slice.

We can check when buffering gray roots that the pointer we're seeing is not poisoned. Unfortunately, that will almost certainly just take us to [1], which is also too generic to tell us anything. Whatever is being freed is being freed while it is still held in that table, so just checking the table entries at insertion is not going to give us anything interesting. Moreover, the table entries are very generic, so we wouldn't necessarily even know what we could check.

Andrew is there anywhere in the Tracer called at [1] where we could check the pointer and still have enough information about the type of the owner to print anything useful in the crash message? Also, are there any other places where we buffer gray roots that might be implicated?

1- https://dxr.mozilla.org/mozilla-central/source/xpcom/base/CycleCollectedJSRuntime.cpp#1065
Flags: needinfo?(continuation)
What is it that buffers gray roots? I assume that it buffers up gray roots, then executes random code. What ensures that a dying object removes itself from that table?
Flags: needinfo?(continuation)
(In reply to Andrew McCreight [:mccr8] from comment #11)
> What is it that buffers gray roots? I assume that it buffers up gray roots,
> then executes random code. What ensures that a dying object removes itself
> from that table?

Chatted with Andrew about this on IRC. We copy the Cell* out of the CC thing and into a list in the GC in the first slice of GC, so the pointer must have been 0xe5 in the table. This means either that someone added an uninitialized pointer or it was deleted without having DropJSHolder called on it. Since Coverity, ASAN, and Valgrind will all catch the first, it's likely the latter. I think Andrew has some idea of where we can assert in order to catch this case and find out what's failing to drop.
(ASan detects use after free, but not uninitialized variables, to be pedantic.)

My theory is that uninitialized variables would be _always_ detected, while getting this UAF would require some very particular timing, so it would be harder to detect.

Anyways, I'm hoping Terrence can write and land a patch that somehow would crash in we try to add a poisoned value to gcGrayRoots, which will give us more information about exactly what kind of gray root it is.

Meanwhile, I'll look over places where we call HoldJSObject and DropJSObject, as those are the most obvious possible culprit. We may also be able to come up with a way to be less error-prone for that.
Attached patch assert_no_poisoned_gray-v0.diff (obsolete) — Splinter Review
This will at least make us crash sooner in debug builds.
Attachment #8769389 - Flags: review?(sphink)
Comment on attachment 8769389 [details] [diff] [review]
assert_no_poisoned_gray-v0.diff

Review of attachment 8769389 [details] [diff] [review]:
-----------------------------------------------------------------

Do we do msan runs anywhere?
Attachment #8769389 - Flags: review?(sphink) → review+
Can't we just call cell->chunk()->trailer.runtime and do it in release builds or something?

> Do we do msan runs anywhere?

No, but Julian does full Mochitest runs with Valgrind regularly.
Is this a dupe of Bug 1265414 which was filed back in April?
Duplicate of this bug: 1265414
Crash Signature: [@ js::] → [@ js::] [@ js::gc::GCRuntime::markBufferedGrayRoots]
Summary: Crash in js:: → Crash in js::gc::GCRuntime::markBufferedGrayRoots or js::
sorry had to back this out since we hit this assertion like https://treeherder.mozilla.org/logviewer.html#?job_id=31706830&repo=mozilla-inbound and on other tests as intermittent failure
Flags: needinfo?(terrence)
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/fd1350f7c3e1
Backed out changeset 010d92da655d for frequent assertion failures
It's worrying that this assertion fails so much!
Depends on: 1286474
Depends on: 1286475
Of the three crashes from Treeherder linked here, one is tracing mWrappedJSRoots, one is for CallbackObject, and one is not showing anything useful in the stack.

I don't understand how this assertion is failing so much, though. Why don't we crash in these conditions normally on Treeherder? Why has ASan never noticed this problem if we hit it 3 times in less than a day?
Assignee: terrence → continuation
Flags: needinfo?(terrence)
This patch calls the same function we seem to be crashing on in the wild. Does this look reasonable to you, Terrence? I'd push to try, but it is closed.
Attachment #8771427 - Flags: feedback?(terrence)
Comment on attachment 8771427 [details] [diff] [review]
Check if gray roots are valid when we add them.

Review of attachment 8771427 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/RootMarking.cpp
@@ +415,5 @@
>  {
>      MOZ_ASSERT(runtime()->isHeapBusy());
> +    MOZ_RELEASE_ASSERT(thing);
> +    // Check if |thing| is a valid JS pointer.
> +    MOZ_RELEASE_ASSERT(thing.asCell()->getTraceKind() <= JS::TraceKind::Null);

I don't think |enum class| supports ordering operations, but I guess |enum class : uint8_t| does? I guess if it compiles it's fine, although you might want to use != here instead just to be safe.

I'd also update the comment: it looks like the "validness" is coming from the check whereas that's kinda a sideshow given the null check just above. Maybe something like:
// Check if |thing| is corrupt by calling a method that touches the heap.
Attachment #8771427 - Flags: feedback?(terrence) → feedback+
Attachment #8769389 - Attachment is obsolete: true
Attachment #8771427 - Attachment is obsolete: true
Comment on attachment 8772620 [details] [diff] [review]
Check if gray roots are valid when we add them.

Review of attachment 8772620 [details] [diff] [review]:
-----------------------------------------------------------------

Welp, that looks like fewer false positives. Let's hope it turns up some intermittents.
Attachment #8772620 - Flags: review?(terrence) → review+
Pushed by amccreight@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/923bcd0c37b0
Check if gray roots are valid when we add them. r=terrence
No sign of this on crash stats, as far as I can see, and there's one crash on Nightly with the markBufferedGrayRoots signature, so I don't know.
(In reply to Andrew McCreight [:mccr8] from comment #30)
> No sign of this on crash stats, as far as I can see, and there's one crash
> on Nightly with the markBufferedGrayRoots signature, so I don't know.

js:: never had crashes on nightly that I saw, so unfortunately that makes it a bit difficult to evaluate on nightly with only 1 crash. I think we would need more volume on nightly to really evaluate. 

:js crash is currently #31 top crash on 47.0.1, all Mac crashes.
(In reply to Marcia Knous [:marcia - use ni] from comment #31)
> js:: never had crashes on nightly that I saw, so unfortunately that makes it
> a bit difficult to evaluate on nightly with only 1 crash. I think we would
> need more volume on nightly to really evaluate. 

Sorry, I didn't really say what I meant. What I meant was, my patch was intended to change the crash from markBufferedGrayRoots to something else, by trying to detect the failure earlier. I've seen no evidence of a different signature.

js:: is not on Nightly any more due to bug 1283916, which fixed the signature. I do see a few instances of markBufferedGrayRoots there, which is what I think the proper signature is.
Crash volume for signature 'js::':
 - esr (version 45): 35 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          0          0          0          0          1          2          3
 - aurora           4          2          2          2          0          3          0
 - beta           135        115        124        136        104        105         24
 - release       2305       2346       2213       2285       2339       2053        573
 - esr              7          5          4          3          2          4          3

Affected platform: Mac OS X
Depends on: 1283634
We will ship 48 with this crash, updating the flags
Crash volume for signature 'js::gc::GCRuntime::markBufferedGrayRoots':
 - nightly (version 51): 3 crashes from 2016-08-01.
 - aurora  (version 50): 73 crashes from 2016-08-01.
 - beta    (version 49): 35 crashes from 2016-08-02.
 - release (version 48): 66 crashes from 2016-07-25.
 - esr     (version 45): 1306 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       0       2       1
 - aurora       29      26       2
 - beta         12       9       5
 - release      14      26      14
 - esr         111     109      94

Affected platforms: Windows, Mac OS X

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly
 - aurora  #128      #35
 - beta    #1049     #504
 - release #1054
 - esr     #79
Looks like this is more of an issue on ESR that it is in other current releases. Maybe it was fixed along the way since 45 but not backported to ESR. I don't think we need to track this.  
Terence, does that sound right to you? Do you want this on your radar? At least to fix for the next ESR (52).
The patches here were diagnostic (and didn't catch anything). I think the likely fixes are somewhere in the massive tree of ExposeToActiveJS additions that bz and I made last month. Unfortunately, it's hard to say exactly what in that pile we'd need to uplift, and that's if I've even identified the right cause. I think we have to wont-fix on branches.
Flags: needinfo?(terrence)
This is a pretty low volume crash on current release 49.x and also on beta 50, we are a week away from RC, this is now a wontfix for 50.
Assignee: continuation → nobody
See Also: → 1203273
The crash still continues to happen in fully updated FF 54: https://crash-stats.mozilla.com/report/index/d1d2f2c8-d79a-4fb8-b7a0-1c5a71170701

I was not doing anything in the browser, it just died on its own while I was away. Thus it is impossible to reproduce.
Keywords: triage-deferred
Priority: -- → P3
Folding this into Bug 1203273 to track signatures a little better.
Status: NEW → RESOLVED
Closed: 2 years ago
Keywords: leave-open
Resolution: --- → DUPLICATE
Duplicate of bug: 1203273
You need to log in before you can comment on or make changes to this bug.