Closed Bug 1367896 Opened 2 years ago Closed 2 years ago

Crash in js::Scope::clone at "Use FunctionScope::clone"

Categories

(Core :: JavaScript Engine, defect, critical)

All
Windows
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- unaffected
firefox55 + wontfix
firefox56 + wontfix
firefox57 --- wontfix
firefox58 --- fixed
firefox59 --- fixed

People

(Reporter: mccr8, Assigned: tcampbell)

References

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-e3651824-7e12-4240-b84d-ef2710170525.
=============================================================

I only see 7 crashes like this in the last week, across all branches, so it doesn't seem like it is affecting stability much, but it seems odd to hit an intentional crash like this. Maybe an addon is doing something weird.
Any ideas, Shu? Is this something worth fixing?
Flags: needinfo?(shu)
(In reply to Andrew McCreight [:mccr8] from comment #1)
> Any ideas, Shu? Is this something worth fixing?

The bug is definitely worth fixing but I'm not sure how to reproduce it to find the problem.

Each JSScript has a vector of Scopes, which is divided into two parts: the extra-body scopes, or scopes that logically encompass the entire script but are still associated with the script, like GlobalScope and FunctionScope, and the intra-body scopes, like block scopes.

The reason that script cloning cares about this is the extra-body scopes have their own cloning logic and cannot go through the generic Scope::clone logic. So we're getting that crash because for some reason, there's a FunctionScope that's intra-body, which is strange.
Flags: needinfo?(shu)
the signature was regressing in volume during the 55 nightly cycle (after build 20170524030204), so i'm marking the bug accordingly to have it on the radar - in early data for 55.0b it's accounting for ~1% of browser crashes.

reports come with MOZ_CRASH(Use FunctionScope::clone.)
Keywords: regression
OS: Windows 10 → Windows
Hardware: Unspecified → All
[Tracking Requested - why for this release]: topcrash

This is >1% of all crashes on 55 so far, so I think we need to track this more closely.
It looks like there are two different issues with this signature. This one:

https://crash-stats.mozilla.com/report/index/89f58cea-d7d3-4ba1-ac2d-e34b10170623

looks like we're trying to clone a function script as if it were a global script. The only obvious way I can think of for that to happen is if someone used the reuseGlobal feature of the component loader, and still had function scripts in the startup cache after we removed that functionality.

The ones with this signature:

https://crash-stats.mozilla.com/report/index/1003f744-8b6d-4430-83a1-f95b40170623

may be a different problem, or it may be the same problem, and we just happen to hit the abort when we hit the top-level function scope in the scope chain while we're cloning a function script.
Tracking since this is a top crash in beta 55. Is there anything further we could add to help with diagnosis for these crashes (either category described in comment 5)?
Flags: needinfo?(shu)
I looked at this in person a bit more with Kris. He pointed out that a lot of the crash stacks are nonsense -- like pod_malloc calling CopyScript and so forth.

I also pushed a diagnostic patch to try to catch malformed body scope index stuff at JSScript-creation time instead of clone/XDR time. It didn't turn up anything: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7634866810f201801ff3e2c63b90682aab0cdf78

I'm not sure if it's worth pushing the diagnostic patch to beta given the nonsense stacks in the crash reports. I don't have any bright ideas here. :(
Flags: needinfo?(shu)
It might help to add crash annotations for the filenames of the scripts we're cloning. It may be that this is happening for extension scripts that follow certain patterns, so that would give us something to go on.
Kris and Shu, can you go ahead with the extra crash annotations? This is a high volume startup crash on beta, with no user comments and no correlation data.
Flags: needinfo?(shu)
Flags: needinfo?(kmaglione+bmo)
It's also just barely possible that this is related to bug 1373934 and friends.

Either way, I'll add the crash annotations.
Flags: needinfo?(kmaglione+bmo)
Attachment #8883780 - Flags: review?(benjamin)
Comment on attachment 8883780 [details]
Bug 1367896: Include script filename in crash reason.  data-r?bsmedberg

https://reviewboard.mozilla.org/r/154722/#review160348

data-r=me
Attachment #8883780 - Flags: review?(benjamin) → review+
Comment on attachment 8883780 [details]
Bug 1367896: Include script filename in crash reason.  data-r?bsmedberg

https://reviewboard.mozilla.org/r/154722/#review160364

Thank you for the patch.
Attachment #8883780 - Flags: review?(shu) → review+
https://hg.mozilla.org/mozilla-central/rev/12383fa36e6b
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Please nominate this for Beta approval when you get a chance.
Assignee: nobody → kmaglione+bmo
Flags: needinfo?(shu) → needinfo?(kmaglione+bmo)
Status: RESOLVED → REOPENED
Flags: needinfo?(kmaglione+bmo)
Resolution: FIXED → ---
Comment on attachment 8883780 [details]
Bug 1367896: Include script filename in crash reason.  data-r?bsmedberg

Approval Request Comment
[Feature/Bug causing the regression]: Unknown
[User impact if declined]: This patch shouldn't affect user experience at all, but it should give us more information on a top crash which is mostly happening on the beta channel.
[Is this code covered by automated tests?]: No. This branch is not meant to be reachable, so no tests cover it.
[Has the fix been verified in Nightly?]: No.
[Needs manual test from QE? If yes, steps to reproduce]: No.
[List of other uplifts needed for the feature/fix]: None.
[Is the change risky?]: No.
[Why is the change risky/not risky?]: This change only affects behavior in cases where we're already intentionally crashing.
[String changes made/needed]: None.
Attachment #8883780 - Flags: approval-mozilla-beta?
Comment on attachment 8883780 [details]
Bug 1367896: Include script filename in crash reason.  data-r?bsmedberg

diagnostics patch, beta55+
Attachment #8883780 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Kris Maglione [:kmag] from comment #17)
> [Is this code covered by automated tests?]: No. This branch is not meant to
> be reachable, so no tests cover it.
> [Has the fix been verified in Nightly?]: No.
> [Needs manual test from QE? If yes, steps to reproduce]: No.

Setting qe-verify- based on Kris' assessment on manual testing needs.
Flags: qe-verify-
We're getting crashes with filenames now, so it looks like this isn't an encoding or data corruption issue. These are really FunctionScope objects.

The crashes are in a few different files so far, and it looks like all of the crashes with a given filename are from the same installation:

resource://gre/modules/osfile/ospath_win.jsm
resource://gre/modules/CrashManager.jsm
resource:///modules/experiments/Experiments.jsm

The only obvious thing I can see that those have in common is nested functions that are passed to forEach methods, so perhaps this has something to do with the combination of that pattern and the scripts being XDR encoded after execution (as in bug 1373934).
Depends on: 1382329
there are still crash reports in nightly after bug 1382329 has landed, like bp-e7adead8-8d1e-4b86-8612-b67400170727.
Flags: needinfo?(kmaglione+bmo)
It looks like now some of the crashes are null pointer dereferences here [1], so it's seeming more likely that this is some sort of corruption issue.

It also looks like most of the time when this happens, it happens three times in a row on the same installation, for the same script. After that, automatic safe mode should wipe the startup cache, and then the crash seems to stop. So it definitely looks like we're writing bad bytecode, and then crashing when we try to clone it, but not when we decode it.

Still looking into possible causes.

[1]: https://hg.mozilla.org/releases/mozilla-beta/annotate/271221d4286a/js/src/vm/Scope.cpp#l376
Flags: needinfo?(kmaglione+bmo)
This is still happening at fairly high volume in beta, with 342 crashes in beta 10. 
I don't think it should block 56 though, since it's also high volume on 55.
 
I can't see that 57 is affected, but if it is, then we can still take a patch for 57.
I've been looking into crashes signatures today in js::detail::CopyScript and FindScopeIndex and that has led me to this bug. What I'm seeing there is data overwrite/corruption in the |JSScript::data| buffer.

I see a variety of crashes in this area: https://searchfox.org/mozilla-central/rev/2ef8bd8a46a02c68ddbb1d5f25fa254dd7be1fbd/js/src/jsscript.cpp#3521-3533

Some of the behavior I see is:
  - MOZ_CRASH("Scope not found") in FindScopeIndex
  - A lot of read violations of reg + 8, where the value in register is sometimes raw ascii data (like property names). The |original = vector[i];| will access at +8 since i starts at 1.
  - Some violations where the address is partially jemalloc poison

The JSScript::scopes() access is some of the first things to touch script data in the CopyScript function. I believe the behaviors are all consistent with the theory that JSScript::data_ was clobbered, after which |script->scopes()->vector| is now a possibly garbage pointer that may not point to script info.

We could try to assert that |script->scopes()->vector| is in the script data buffer still. This could be done at a few places to narrow narrow down the source of corruption.

MOZ_DIAGNOSITIC_ASSERT(script->scopes()->vector > script->data);
MOZ_DIAGNOSITIC_ASSERT(script->scopes()->vector < script->data + script->dataSize());

One interesting case is all the ASCII/char16_t data that sometime shows up as a pointer is quite likely atom data.

It is unclear if corruption is on XDR store or load side. It could somehow be a thread race due to background XDR or just memory corruption elsewhere.
Duplicate of this bug: 1376090
(In reply to Ted Campbell [:tcampbell] from comment #25)
> I've been looking into crashes signatures today in js::detail::CopyScript
> and FindScopeIndex and that has led me to this bug. What I'm seeing there is
> data overwrite/corruption in the |JSScript::data| buffer.
>...
>   - A lot of read violations of reg + 8, where the value in register is
> sometimes raw ascii data (like property names). The |original = vector[i];|
> will access at +8 since i starts at 1.
>...
> The JSScript::scopes() access is some of the first things to touch script
> data in the CopyScript function. I believe the behaviors are all consistent
> with the theory that JSScript::data_ was clobbered, after which
> |script->scopes()->vector| is now a possibly garbage pointer that may not
> point to script info.

This makes me wonder if it's related to bug 1401939. nbp's hypothesis is that
the misalignment could lead to memory corruption when memcpy ends up using XMM
registers to copy UTF-16 string data. If we're ending up with UTF-16 string data
in in other fields in XDR-decoded scripts, that seems like a plausible cause.

I'm not sure how it would account for ASCII, though. Single-byte strings
shouldn't have any alignment constraints...
Some sample bad-"pointers":

"startup"
"value"
0xe5e5e5e500000000 <- jemalloc free poison
"roce" <- 16-bit char
"ignoreAb" <- ignoreAbsent is a js property
"message"
"subject"
"der"

The short strings were padded with zeroes. A lot of high-address js::detail::CopyScript crashes are actually string data. These seem to be largely startup crashes.
Depends on: 1402167
Assigning to myself to continue tracking this bug. I'm hoping the mitigation in Bug 1418894 will reduce crashes we see on this signature.
Assignee: kmaglione+bmo → tcampbell
Depends on: 1418894
There are zero crashes in 58.0b9 now that Bug 1418894 has landed. These crashes are consistent with the XDR data corruption theory.
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.