Crash in [@ js::frontend::RewritingParseNodeVisitor<T>::visit]
Categories
(Core :: JavaScript Engine, defect, P1)
Tracking
()
People
(Reporter: jseward, Assigned: jorendorff)
References
(Blocks 1 open bug)
Details
(Keywords: crash, csectype-wildptr, sec-high)
Crash Data
Attachments
(3 files)
Bug 1547561 - Part 1: Change integer values used by ParseNodeKind, for extra assertiness. r?khyperia
47 bytes, text/x-phabricator-request | abillings: sec-approval+ | Details | Review
47 bytes, text/x-phabricator-request | abillings: sec-approval+ | Details | Review
47 bytes, text/x-phabricator-request | abillings: sec-approval+ | Details | Review
This bug is for crash report bp-2eb348c9-c737-4cd1-ae99-94cfc0190426.
This appears at least 3 times in at least 2 different installations
in the Windows nightly 20190425094625.
Top 10 frames of crashing thread:
0 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h:108
1 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h:112
2 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h
3 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h:112
4 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h
5 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h:112
6 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h
7 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h:112
8 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h
9 xul.dll js::frontend::RewritingParseNodeVisitor<FoldVisitor>::visit js/src/frontend/ParseNodeVisitor.h:112
Comment 1•6 years ago
[Tracking Requested - why for this release]: this is a startup crash in beta 67.
Comment 2•6 years ago
Steven, this is a startup crash affecting 67; can it be investigated quickly, please? Thanks
Comment 3•6 years ago
Sorry for the lack of response; I am investigating this, I just forgot to respond here.
Comment 4•6 years ago
Jason, is this something you could look at?
Comment 5•6 years ago
Since Ashley is working on this, I am removing the needinfo for Jason.
Comment 6•6 years ago (Assignee)
Ashley and I looked at these reports.
Ashley noticed that the reports with Reason=EXCEPTION_BREAKPOINT have "MOZ_CRASH(invalid node kind)".
This is a similar case to the GC crashes we see pretty regularly: RewritingParseNodeVisitor<FoldVisitor>
is the first thing that walks the whole AST after parsing. Therefore, if the user has some bad RAM, or if something in another thread scribbles on some memory and happens to hit the AST, this is the code that will fail, every time.
Two possible actions:

- Change these assertions to MOZ_RELEASE_ASSERT. This should not affect performance appreciably, because the assertion should be optimized away in cases where that constructor is sufficiently inlined (maybe all call sites?). This asserts that the node kind is valid. Note that the pn_type field is const. If we make this change, but still get crashes with "MOZ_CRASH(invalid node kind)", then it's not a JS engine bug.
- Change the definition of ParseNodeKind so that 0 is an invalid value. If crashes increase in frequency, or we get more "MOZ_CRASH(invalid node kind)" hits and fewer other wildptr crashes, then it's not a JS engine bug.
(Also possible: some scheme where we isolate the frontend to its own process, separate from the rest of ... yeah.)
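For illustration, the two options above could be sketched roughly as follows. This is a minimal stand-alone model, not the real SpiderMonkey code: the kind names are hypothetical, the actual ParseNodeKind values differ, and MOZ_RELEASE_ASSERT is modeled with a plain abort().

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Illustrative model of ParseNodeKind/ParseNode; names are hypothetical.
enum class ParseNodeKind : uint16_t {
  // Option 2: start the valid range well above 0, so zeroed or wild
  // memory never holds a valid kind.
  Start = 1000,
  Function = Start,
  Statement,
  Expression,
  Limit  // one past the last valid kind
};

struct ParseNode {
  const ParseNodeKind pn_type;  // const: cannot legally change after construction

  explicit ParseNode(ParseNodeKind kind) : pn_type(kind) {
    // Option 1: validate the kind even in release builds
    // (MOZ_RELEASE_ASSERT in Gecko, modeled here with abort()).
    if (pn_type < ParseNodeKind::Start || pn_type >= ParseNodeKind::Limit) {
      fprintf(stderr, "MOZ_CRASH(invalid node kind)\n");
      abort();
    }
  }

  ParseNodeKind getKind() const { return pn_type; }
};
```

If the constructor check never fires but getKind still observes an invalid value, the memory must have changed after construction, which is the diagnostic signal these patches are after.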
Comment 7•6 years ago (Assignee)
I don't think this is a regression. This is the new signature for bug 1274887. These crashes used to happen in Fold; we rewrote that code to use a Visitor class, and now the Visitor class is blamed. Note the crash volume went down there when it started here.
The fact that we did a major refactor and still have crashes is consistent with these crashes not being a JS engine bug.
This is only barely actionable, and the volume of crashes looks low. pascalc, is this really a top priority for us?
Comment 9•6 years ago
About 40-50% of the crashes are wildptrs; every one I looked at was on line 108 (pn->getKind()), which implies that pn is corrupt in a bunch of cases.
Comment 10•6 years ago
Volume went down, we are not going to fix it in 67, and per comment #7 this is a signature change for a crash that we have had for years. Wontfix for 67.
Comment 11•6 years ago (Assignee)
Comment 12•6 years ago (Assignee)
Depends on D32243
Comment 13•6 years ago (Assignee)
This is an experiment. We'll see if the fuzzers find anything. If nothing else,
it's good to have the invariants of the ParseNode data structure documented in
code, as assertions, separate from the parser and emitter code.
We can add more checks to this as we go. It may not be fast enough to leave on
in release builds, even in Nightly, and I don't want to push my luck. So
DEBUG-only for now.
Depends on D32244
Comment 14•6 years ago (Assignee)
Comment 15•6 years ago (Assignee)
Landing the diagnostic patches (see comment 6) but these are not expected to fix the crashes.
Comment 16•6 years ago (Assignee)
Comment on attachment 9066901 [details]
Bug 1547561 - Part 1: Change integer values used by ParseNodeKind, for extra assertiness. r?khyperia
Security Approval Request
- How easily could an exploit be constructed based on the patch?: Extraordinarily hard. We don't know what the bug is. These are diagnostic patches.
- Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
- Which older supported branches are affected by this flaw?: all
- If not all supported branches, which bug introduced the flaw?: None
- Do you have backports for the affected branches?: No
- If not, how different, hard to create, and risky will they be?: no backports
- How likely is this patch to cause regressions; how much testing does it need?: no backports
Comment 17•6 years ago
Comment on attachment 9066901 [details]
Bug 1547561 - Part 1: Change integer values used by ParseNodeKind, for extra assertiness. r?khyperia
sec-approval+ for trunk.
Comment 18•6 years ago
https://hg.mozilla.org/integration/autoland/rev/e391e47e6fa43709b8575f200bd9698074b41b03
https://hg.mozilla.org/integration/autoland/rev/e1e95251ab35c6ca222e64593a8fcf0c2738e38e
https://hg.mozilla.org/integration/autoland/rev/6f6ebd68ccad4b93cb2238b029ab2476577d0128
https://hg.mozilla.org/mozilla-central/rev/e391e47e6fa4
https://hg.mozilla.org/mozilla-central/rev/e1e95251ab35
https://hg.mozilla.org/mozilla-central/rev/6f6ebd68ccad
Comment 19•6 years ago (Assignee)
The situation is: I implemented both options 1 and 2 from comment 6. Now we have to wait for Firefox 69 to go to beta and see what the crash report data says. This isn't really actionable until then.
Comment 20•6 years ago
FF69 has now hit beta and we are getting crash data.
In [1], I summarize all the beta crashes for the JS_PARSE_NODE_ASSERT crash. The only thing that is crashing is the ParseNode::getKind() accessor. The observed kind values are split roughly in half between being above the limit and below the start (1000).
Comment 21•6 years ago
Can you take another look now that there is more data? Anything useful in there?
I'm going through some untouched-for-a-while sec-high issues to make sure we've investigated all we can before marking them stalled.
Comment 22•6 years ago
This looks very likely to be a memory bug/stomping issue.
As Ted said, it looks like the new crashes are crashing here: https://searchfox.org/mozilla-central/rev/e5327b05c822cdac24e233afa37d72c0552dbbaf/js/src/frontend/ParseNode.h#655-656
Plainly visible, however, are the asserts in the constructor of ParseNode, just above that line in the source. Those identical asserts are not firing, and the field is const, meaning the value cannot change between the constructor and us checking it in getKind. So, because we are crashing in getKind, the const value must somehow have changed - i.e. some issue with memory stomping, or some other form of memory corruption, not a logic bug.
I believe this is what Jason thought was pretty likely to happen with these diagnostics, and said earlier in the thread that this implies that this is not a JS engine bug, but rather some other thread scribbling on our data, or some other crazy thing happening. I'll leave that up to someone else to strongly conclude, however.
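One way to picture that argument: if the same validity check is also performed at read time, it can only fail when the memory under the const field changed after construction. A minimal model follows (illustrative only; the field is deliberately non-const here so the simulated wild write is well-defined, and the bounds are the hypothetical 1000-based range):

```cpp
#include <cassert>
#include <cstdint>

// Valid kinds occupy [kStart, kLimit); the bounds are illustrative.
constexpr uint16_t kStart = 1000;
constexpr uint16_t kLimit = 1003;

inline bool isValidKind(uint16_t kind) { return kind >= kStart && kind < kLimit; }

struct NodeModel {
  uint16_t kind;  // const in the real ParseNode; mutable here so the
                  // simulated corruption below is well-defined

  // Models the getKind() check: returns false where Gecko would hit
  // MOZ_CRASH("invalid node kind").
  bool getKindChecked(uint16_t* out) const {
    if (!isValidKind(kind)) {
      // The constructor already validated the kind, so reaching this
      // branch means memory was overwritten after construction.
      return false;
    }
    *out = kind;
    return true;
  }
};
```

Since construction validated the value and nothing can legally reassign it, a failure at read time isolates the fault to an external write, which is exactly the conclusion drawn above.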
Comment 23•6 years ago (Assignee)
OK. The wild write happens either (a) in the parser thread itself; or (b) in another thread.
If it's (a), then I would expect the bug to happen during some particular phase of parsing, and the crash to happen during the next phase of parsing. But we have crash reports from several phases.
If it's (b), then why does the wild write happen to land in the ParseNode arena? If it's pure coincidence, these crashes would be vanishingly rare on 64-bit, which they aren't. So what is it? Possibly a use-after-free, since LifoAlloc slabs are allocated using jemalloc.
So as an experiment, we could try changing LifoAlloc to allocate slabs using something like js::gc::MapAlignedPagesRandom. This would make (b) incredibly unlikely on 64-bit platforms, and seems like a prudent move anyway (for the same reasons as other users of that function).
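A rough sketch of what randomized slab placement could look like. This is illustrative only: the real js::gc::MapAlignedPagesRandom lives in the GC and differs; this version assumes POSIX mmap on 64-bit Linux, where the kernel treats the address as a hint and is free to place the mapping elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <sys/mman.h>

// Sketch in the spirit of js::gc::MapAlignedPagesRandom (not its actual
// implementation). On 64-bit, picking a random page-aligned hint spreads
// allocations across the address space, so a wild write from another
// thread is vanishingly unlikely to land inside a LifoAlloc slab.
void* MapPagesAtRandomHint(size_t length) {
  static std::mt19937_64 rng{std::random_device{}()};
  constexpr uintptr_t kPageSize = 4096;
  // Confine hints to the lower 47 bits, the user-space range on
  // typical x86-64 Linux.
  constexpr uintptr_t kAddrMask = (uintptr_t(1) << 47) - 1;
  for (int attempt = 0; attempt < 8; ++attempt) {
    uintptr_t raw = (static_cast<uintptr_t>(rng()) & kAddrMask) & ~(kPageSize - 1);
    void* p = mmap(reinterpret_cast<void*>(raw), length, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED) {
      return p;
    }
  }
  return nullptr;  // real code would fall back or report OOM
}
```

Because the hint is advisory (no MAP_FIXED), an unusable random address degrades gracefully: the kernel simply picks another spot rather than failing.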
Comment 24•6 years ago
I might suggest that if we try this that we limit it to the JSContext::tempLifoAlloc to begin with since it is already long-lived. That said, our typical LifoAlloc chunks are multiple 4kB pages so any jemalloc benefits might not apply.
Comment 26•6 years ago
Nicolas, do you think this is something you could take a quick look at, to help us identify whether this is still actionable or stalled?
Comment 27•6 years ago
What is described in comment 6 suggests these crashes are similar to Bug 1263794, i.e. crashing while reading data allocated within LifoAlloc chunks. However, one of the symptoms of Bug 1263794 is the fact that we have plenty of jemalloc poison values in the addresses, which is not the case here.
The most frequently reported crash addresses are -1 (~11%) and 0 (~6%). These might be a hint of logical issues.
The crash addresses also include pointer reads where the pointers are null values with a single bit set. This is a huge hint toward memory corruption, and as such likely to be outside the JS engine, as concluded in Bug 1263794 and in comment 22.
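That "null with a single bit set" pattern is easy to test for mechanically. A small helper (illustrative, not from any Mozilla tooling) might look like:

```cpp
#include <cassert>
#include <cstdint>

// Heuristic for triaging crash addresses: an address that is zero except
// for a single set bit looks like a null pointer that suffered a one-bit
// flip -- a classic bad-RAM / memory-corruption signature.
bool LooksLikeBitFlippedNull(uintptr_t addr) {
  // addr & (addr - 1) clears the lowest set bit; the result is zero
  // exactly when at most one bit was set.
  return addr != 0 && (addr & (addr - 1)) == 0;
}
```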
Using the GC allocation functions as suggested in comment 23, or the ExecutableMemoryAllocator pre-reserved space, would help figure out the corruption case, but would not change anything in case of bad memory. It would likely not help much here, as only 2 crashes look like jemalloc poison values over the past 6 months.
However, many of the attempts to investigate LifoAlloc corruption have failed so far (see Bug 1263794), strongly hinting towards bad memory.
I think it is OK to consider this bug stalled, but we should try to change the backing memory allocator of LifoAlloc to investigate the evolution of the Bug 1263794 crashes.
Comment 28•5 years ago
The leave-open keyword is there and there is no activity for 6 months.
:sdetar, maybe it's time to close this bug?
Comment 29•5 years ago
(In reply to Release mgmt bot [:sylvestre / :calixte / :marco for bugbug] from comment #28)
The leave-open keyword is there and there is no activity for 6 months.
:sdetar, maybe it's time to close this bug?
leave-open is present because patches were landed to investigate the issue without fixing it.
stalled is present to indicate that no activity is expected unless new information is added to this bug.
I think we have the right set of keywords, and I do not think this stalled bug can be closed.
Comment 30•5 years ago
:nbp, you're right and it's fixed now: https://github.com/mozilla/relman-auto-nag/commit/6b049a694295d598aa26c58daed45d6106e074e0
Comment 31•3 years ago
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on release (startup)
- Top 10 content process crashes on release
For more information, please visit auto_nag documentation.
Comment 32•2 years ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit auto_nag documentation.
Comment 33•2 years ago
Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.
For more information, please visit auto_nag documentation.
Comment 34•2 years ago
Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.