Closed Bug 1547561 Opened 6 years ago Closed 2 years ago

Crash in [@ js::frontend::RewritingParseNodeVisitor<T>::visit]

Categories

(Core :: JavaScript Engine, defect, P1)

Unspecified
All
defect

Tracking

RESOLVED FIXED
Tracking Status
firefox67 - wontfix
firefox68 --- wontfix
firefox69 --- wontfix
firefox70 --- wontfix
firefox71 --- wontfix

People

(Reporter: jseward, Assigned: jorendorff)

References

(Blocks 1 open bug)

Details

(Keywords: crash, csectype-wildptr, sec-high)

Crash Data

Attachments

(3 files)

This bug is for crash report bp-2eb348c9-c737-4cd1-ae99-94cfc0190426.

This appears at least 3 times in at least 2 different installations
in the Windows nightly 20190425094625.

Top 10 frames of crashing thread:

0 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h:108
1 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h:112
2 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h
3 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h:112
4 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h
5 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h:112
6 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h
7 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h:112
8 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h
9 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h:112

Type: task → defect
Flags: needinfo?(khyperia)

[Tracking Requested - why for this release]: this is a startup crash in beta 67.

OS: Windows 10 → All
Priority: -- → P1

Steven, this is a startup crash affecting 67; can it be investigated rapidly, please? Thanks

Flags: needinfo?(sdetar)

Sorry for the lack of response. I am investigating this; I just forgot to respond here.

Jason, is this something you could look at?

Flags: needinfo?(sdetar) → needinfo?(jorendorff)

Since Ashley is working on this, I am removing the needinfo for Jason.

Flags: needinfo?(jorendorff)

Ashley and I looked at these reports.

Ashley noticed that the reports with Reason=EXCEPTION_BREAKPOINT have "MOZ_CRASH(invalid node kind)".

This is a similar case to the GC crashes we see pretty regularly: RewritingParseNodeVisitor<FoldVisitor> is the first thing that walks the whole AST after parsing. Therefore, if the user has some bad RAM, or if something in another thread scribbles on some memory and happens to hit the AST, this is the code that will fail, every time.

Two possible actions.

  1. Change these assertions to MOZ_RELEASE_ASSERT.

    This should not affect performance appreciably, because the assertion should be optimized away in cases where that constructor is sufficiently inlined (maybe all call sites?).

    This asserts that the node kind is valid. Note that the pn_type field is const. If we make this change, but still get crashes with "MOZ_CRASH(invalid node kind)", then it's not a JS engine bug.

  2. Change the definition of ParseNodeKind so that 0 is an invalid value. If crashes increase in frequency, or we get more "MOZ_CRASH(invalid node kind)" hits and fewer other wildptr crashes, then it's not a JS engine bug. (Both options are sketched below.)

(Also possible: some scheme where we isolate the frontend to its own process, separate from the rest of ... yeah.)
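
A rough, self-contained sketch of how the two options could look together (hypothetical, simplified stand-in types; the real definitions live in js/src/frontend/ParseNode.h, and MOZ_RELEASE_ASSERT is the real macro from mozilla/Assertions.h):

    // Sketch only: hypothetical, simplified versions of ParseNodeKind and
    // ParseNode, illustrating the two diagnostic options above.
    #include <cstdint>
    #include "mozilla/Assertions.h"  // MOZ_RELEASE_ASSERT

    // Option 2: make 0 (and other small values) invalid by starting the enum
    // at a nonzero base, so zeroed or stomped memory rarely looks like a
    // valid kind. (1000 here matches the "start (1000)" mentioned in later
    // comments; any nonzero base would do.)
    enum class ParseNodeKind : uint16_t {
      Start = 1000,
      Function = Start,
      Statement,
      // ...
      Limit
    };

    class ParseNode {
      const ParseNodeKind pn_type;  // const: cannot legally change after construction

     public:
      explicit ParseNode(ParseNodeKind kind) : pn_type(kind) {
        // Option 1: assert validity even in release builds. If this never
        // fires, any later out-of-range kind means the memory was modified
        // after construction.
        MOZ_RELEASE_ASSERT(ParseNodeKind::Start <= kind && kind < ParseNodeKind::Limit,
                           "invalid node kind");
      }

      ParseNodeKind getKind() const {
        MOZ_RELEASE_ASSERT(ParseNodeKind::Start <= pn_type && pn_type < ParseNodeKind::Limit,
                           "invalid node kind");
        return pn_type;
      }
    };

With a nonzero base, a zeroed or stomped kind field falls outside [Start, Limit) and trips the release assert, turning a silent wild-pointer crash into an identifiable MOZ_CRASH message.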

Flags: needinfo?(khyperia)

I don't think this is a regression. This is the new signature for bug 1274887. These crashes used to happen in Fold; we rewrote that code to use a Visitor class, and now the Visitor class is blamed. Note the crash volume went down there when it started here.

The fact that we did a major refactor and still have crashes is consistent with these crashes not being a JS engine bug.

This is only barely actionable, and the volume of crashes looks low. pascalc, is this really a top priority for us?

Flags: needinfo?(pascalc)

About 40-50% of the crashes are wildptrs; every one I looked at is on line 108 (pn->getKind()), which implies that pn is corrupt in a bunch of cases.

Group: core-security

Volume went down; we are not going to fix it in 67, and per comment #7 this is a signature change for a crash that we have had for years. Wontfix 67.

Flags: needinfo?(pascalc)
Group: core-security → javascript-core-security

This is an experiment. We'll see if the fuzzers find anything. If nothing else,
it's good to have the invariants of the ParseNode data structure documented in
code, as assertions, separate from the parser and emitter code.

We can add more checks to this as we go. It may not be fast enough to leave on
in release builds, even in Nightly, and I don't want to push my luck. So
DEBUG-only for now.

Depends on D32244
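
A minimal sketch of what such a DEBUG-only invariant pass could look like (hypothetical node type and accessors, not the actual patch; the real checker operates on the js/src/frontend ParseNode classes):

    #ifdef DEBUG
    #include <cstdint>
    #include <vector>
    #include "mozilla/Assertions.h"

    namespace sketch {

    enum class ParseNodeKind : uint16_t { Start = 1000, Limit = 1500 };

    struct ParseNode {
      ParseNodeKind kind;
      std::vector<ParseNode*> kids;  // hypothetical uniform child list
    };

    // Walk the finished AST and assert its structural invariants, kept
    // separate from the parser and bytecode emitter. More per-kind checks
    // can be added over time.
    void CheckParseTree(const ParseNode* pn) {
      if (!pn) {
        return;
      }
      MOZ_ASSERT(ParseNodeKind::Start <= pn->kind && pn->kind < ParseNodeKind::Limit,
                 "invalid node kind");
      for (const ParseNode* kid : pn->kids) {
        CheckParseTree(kid);
      }
    }

    }  // namespace sketch
    #endif  // DEBUG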

Landing the diagnostic patches (see comment 6) but these are not expected to fix the crashes.

Keywords: leave-open

Comment on attachment 9066901 [details]
Bug 1547561 - Part 1: Change integer values used by ParseNodeKind, for extra assertiness. r?khyperia

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Extraordinarily hard. We don't know what the bug is. These are diagnostic patches.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which older supported branches are affected by this flaw?: all
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: No
  • If not, how different, hard to create, and risky will they be?: no backports
  • How likely is this patch to cause regressions; how much testing does it need?: no backports
Attachment #9066901 - Flags: sec-approval?
Attachment #9066902 - Flags: sec-approval?
Attachment #9066903 - Flags: sec-approval?
Assignee: nobody → jorendorff

Comment on attachment 9066901 [details]
Bug 1547561 - Part 1: Change integer values used by ParseNodeKind, for extra assertiness. r?khyperia

sec-approval+ for trunk.

Attachment #9066901 - Flags: sec-approval? → sec-approval+
Attachment #9066902 - Flags: sec-approval? → sec-approval+
Attachment #9066903 - Flags: sec-approval? → sec-approval+

The situation is: I implemented both options 1 and 2 from comment 6. Now we have to wait for Firefox 69 to go to beta and see what the crash report data says. This isn't really actionable until then.

FF69 has now hit beta and we are getting crash data.

In [1], I summarize all the JS_PARSE_NODE_ASSERT crashes on beta. The only thing that is crashing is the ParseNode::getKind() accessor. The crashes are split roughly in half between kind values above the limit and kind values below the start (1000).

[1] https://crash-stats.mozilla.com/search/?moz_crash_reason=~parse%20node&moz_crash_reason=~ParseNodeKind&release_channel=beta&product=Firefox&date=%3E%3D2019-07-11T17%3A22%3A00.000Z&date=%3C2019-07-25T17%3A22%3A00.000Z&_facets=signature&_facets=moz_crash_reason&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-moz_crash_reason

Can you take another look now that there is more data? Anything useful in there?
I'm going through some untouched-for-a-while sec-high issues to make sure we've investigated all we can before marking them stalled.

Flags: needinfo?(jorendorff)

This looks very likely to be a memory bug/stomping issue.

As Ted said, it looks like the new crashes are happening here: https://searchfox.org/mozilla-central/rev/e5327b05c822cdac24e233afa37d72c0552dbbaf/js/src/frontend/ParseNode.h#655-656

Plainly visible, however, are the asserts in the constructor of ParseNode, just above that line in the source. Those identical asserts are not firing, and the field is const, meaning the value cannot legally change between the constructor and the check in getKind. So, because we are crashing in getKind, the const value must somehow have changed, i.e. some memory stomping or other form of memory corruption, not a logic bug.

I believe this is what Jason thought was pretty likely to happen with these diagnostics; he said earlier in the thread that this would imply it is not a JS engine bug, but rather some other thread scribbling on our data, or some other crazy thing happening. I'll leave it to someone else to conclude that strongly, however.

OK. The wild write happens either (a) in the parser thread itself; or (b) in another thread.

If it's (a), then I would expect the bug to happen during some particular phase of parsing, and the crash to happen during the next phase of parsing. But we have crash reports from several phases.

If it's (b), then why does the wild write happen to land in the ParseNode arena? If it's pure coincidence, these crashes would be vanishingly rare on 64-bit, which they aren't. So what is it? Possibly a use-after-free, since LifoAlloc slabs are allocated using jemalloc.

So as an experiment, we could try changing LifoAlloc to allocate slabs using something like js::gc::MapAlignedPagesRandom. This would make (b) incredibly unlikely on 64-bit platforms, and seems like a prudent move anyway (for the same reasons as other users of that function).
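
To illustrate the idea, a conceptual sketch using plain POSIX mmap (not the actual js::gc::MapAlignedPagesRandom implementation; the helper name and address-space assumptions here are made up):

    // Sketch only: pick a random, page-aligned hint in a large 64-bit address
    // range and ask the kernel to map the chunk there. The kernel treats the
    // address as a hint and may place the mapping elsewhere, but on 64-bit the
    // result is spread across a huge range either way, so accidental collisions
    // with wild writes from other code become extremely unlikely.
    #include <sys/mman.h>
    #include <cstddef>
    #include <cstdint>
    #include <random>

    void* MapChunkAtRandomishAddress(size_t length) {
      static std::mt19937_64 rng{std::random_device{}()};
      // Assumption: a 47-bit user-space layout, as on typical x86-64 systems.
      uintptr_t hint = rng() & ((uintptr_t(1) << 46) - 1);
      hint &= ~uintptr_t(0xfff);  // page-align the hint
      void* p = mmap(reinterpret_cast<void*>(hint), length, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      return p == MAP_FAILED ? nullptr : p;
    }

The real function presumably also handles larger alignments and retries, but this shows why randomizing chunk placement makes scenario (b) statistically implausible on 64-bit.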

I might suggest that if we try this, we limit it to JSContext::tempLifoAlloc to begin with, since it is already long-lived. That said, our typical LifoAlloc chunks are multiple 4kB pages, so any jemalloc benefits might not apply.

Assignee: jorendorff → nobody
Flags: needinfo?(jorendorff)

Is this a candidate for being marked stalled?

Flags: needinfo?(sdetar)

Nicolas, do you think this is something you could take a quick look at, to help us identify whether this is still actionable or stalled?

Flags: needinfo?(sdetar) → needinfo?(nicolas.b.pierron)

What is described in comment 6 suggests these crashes are similar to Bug 1263794, i.e. crashing while reading data allocated within LifoAlloc chunks. However, one of the symptoms of Bug 1263794 is that we see plenty of jemalloc poison values in the addresses, which is not the case here.

The crash addresses report -1 (~11%) and 0 (~6%) as the most frequent values. These might be a hint of logical issues.

The crash addresses also show pointer reads where the pointer is a null value with a single bit set. This is a huge hint toward memory corruption, and as such is likely to be outside the JS engine, as concluded in Bug 1263794 and in comment 22.

Using the GC allocation functions as suggested in comment 23, or the ExecutableMemoryAllocator pre-reserved space, would help figure out the corruption case, but would not change anything in the case of bad memory. It would not help much here, as only 2 crashes over the past 6 months look like jemalloc poison values.

However, many attempts to investigate LifoAlloc corruptions have failed so far (see Bug 1263794), strongly hinting towards bad memory.

I think it is ok to consider this bug stalled, but we should still try to change the backend memory allocator of LifoAlloc to investigate the evolution of the Bug 1263794 crashes.

Flags: needinfo?(nicolas.b.pierron)
Keywords: stalled

The leave-open keyword is there and there has been no activity for 6 months.
:sdetar, maybe it's time to close this bug?

Flags: needinfo?(sdetar)

(In reply to Release mgmt bot [:sylvestre / :calixte / :marco for bugbug] from comment #28)

The leave-open keyword is there and there has been no activity for 6 months.
:sdetar, maybe it's time to close this bug?

leave-open is present because patches were landed to investigate the issue without fixing it.
stalled is present to indicate that no activity is expected unless new information is added to this bug.

I think we have the right set of keywords, and I do not think this stalled bug can be closed.

Flags: needinfo?(sdetar) → needinfo?(cdenizet)

The bug is linked to a topcrash signature, which matches the following criteria:

  • Top 20 desktop browser crashes on release (startup)
  • Top 10 content process crashes on release

For more information, please visit auto_nag documentation.

Severity: critical → S2

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INCOMPLETE
Resolution: INCOMPLETE → FIXED
Assignee: nobody → jorendorff

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.

Keywords: stalled
Group: javascript-core-security → core-security-release
Group: core-security-release
See Also: → 1928445