Closed Bug 1384544 Opened 5 years ago Closed 4 years ago
Crash in [@ NewFunctionClone]
This bug was filed from the Socorro interface and is report bp-3a20697d-b17d-4dbf-937b-dd9280170726.
=============================================================
There are 18 crashes in nightly 56 and 26 in beta 55. :kmag, could you investigate, please?
This signature seems to be spiking in 55.0b12.
#20 crash in beta 56. This is often a startup crash. Andrew, can you help find someone to take a look at this in case it is fixable for 56? Thanks.
This code changed in bug 1337578 so maybe jonco has an idea?
Flags: needinfo?(overholt) → needinfo?(jcoppeard)
98% of crashes are reading from address 0xffffffffffffff86. The stack is:
NewFunctionClone
CloneFunctionReuseScript
CloneFunctionObjectIfNotSingleton
js::Lambda
Interpret
Not a fix. Refactor to add an assertion to JSFunction::setPrefixedBoundFunctionName() that the atom is marked.
Attachment #8907708 - Flags: review?(sphink)
Attachment #8907708 - Flags: review?(sphink) → review+
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/mozilla-inbound/rev/971c6a19e3b5 Improve atom marking assertions for JSFunction's atom r=sfink
Naveed - can you get an update on this bug?
Those assertions didn't catch anything and I'm pretty much at a loss to explain what's happening here. anba, is it possible something in bug 1371593 could have caused this?
Flags: needinfo?(nihsanullah) → needinfo?(andrebargull)
(In reply to Jon Coppeard (:jonco) from comment #9)
> anba, is it possible something in bug 1371593 could have caused this?

I don't think so. Bug 1371593 only affects bound functions and those aren't cloneable [1, 2], so they shouldn't appear in NewFunctionClone. Bug 1371593 also changes JSFunction::Flags::STABLE_ACROSS_CLONES, but that flag is only used in FrameIter::matchCallee [3]. More importantly, the NewFunctionClone crash at address 0xffffffffffffff86 already happens in Fx55, but bug 1371593 only landed in Fx56.

[1] http://searchfox.org/mozilla-central/rev/7ba03cebf1dcf3b4c40c85a4efad42a5022cf1ec/js/src/jsfun.cpp#2218
[2] http://searchfox.org/mozilla-central/rev/7ba03cebf1dcf3b4c40c85a4efad42a5022cf1ec/js/src/jsfun.cpp#2252
[3] http://searchfox.org/mozilla-central/rev/7ba03cebf1dcf3b4c40c85a4efad42a5022cf1ec/js/src/vm/Stack.cpp#1228
Clearing needinfo. I don't think this has anything to do with anything I've worked on.
The volume is quite high on release and beta. Jon, I understand you are a bit stuck, but do you know who could help with that? Thanks.
By the way, ESR 52 is affected too, so this isn't something recent; however, the spike is.
Jan, do you have any ideas on what might be going wrong here?
Flags: needinfo?(jcoppeard) → needinfo?(jdemooij)
signature has shifted to [@ js::gc::AtomMarkingRuntime::markAtom<T>] in 57.0b6
Crash Signature: [@ NewFunctionClone] → [@ NewFunctionClone] [@ js::gc::AtomMarkingRuntime::markAtom<T>]
I've been looking at crash dumps. The problem here is that the function we're cloning in js::Lambda is bogus - we have an object that doesn't look like a function. It *does* seem to be a valid NativeObject, but it's definitely not a JSFunction. It seems to be an object with 3 fixed slots, storing an int32 0, a string, and another int32 0. The 0xffffff86 is the JSString* type tag of the second slot - where we expect the function's atom.

That means we basically have the following possibilities:

(1) The bytecode is bogus, either the op (JSOP_LAMBDA) or its immediate (the index in the script's object list).
(2) The script's object list contains a bogus object where we expect our JSFunction*.

Either way, something about this script is bogus. Since these scripts come from the subscript loader, it might be XDR data that got encoded/decoded incorrectly, or got corrupted.
(In reply to Jan de Mooij [:jandem] from comment #15)
> It *does* seem to be a valid NativeObject, but it's definitely not a
> JSFunction. It seems to be an object with 3 fixed slots, storing an int32 0,
> a string, and another int32 0.

This has been confusing me for a while, until I just realized this is a RegExpObject! That makes sense because there's also a (4-byte) 0x0 pointer following these 3 slots (which I thought was just uninitialized memory), but it's the RegExpObject's private slot for the RegExpShared and we only use the first 4 bytes.

So some progress - we have a RegExpObject instead of our lambda JSFunction in js::Lambda.
Unassigning myself as I'm not working on this.
Assignee: jcoppeard → nobody
npb, tcampbell: just FYI, see comments 15 and 16 - reminds me of some other XDR/script related memory corruption issues. Most of these crashes are on startup on Win32 - would be great if QA could help test that.
Crash Signature: [@ NewFunctionClone] [@ js::gc::AtomMarkingRuntime::markAtom<T>] → [@ NewFunctionClone] [@ js::gc::AtomMarkingRuntime::markAtom<T>] [@ JSContext::markAtom]
From duplicate bug 1412838:
> This crash is the #7 top browser crash in B12: http://bit.ly/2lsUPN8. It was
> present in Beta 11 but in very small numbers.
> 68% correlation to: Addon "email@example.com" Version = 0.9.6
> [34.15% vs 17.70% if process_type = null]
EXEC crashes with wildptr addresses -> sec critical
Mark, do you have any ideas how this could be related to followon search? Maybe we're caching code in a way that interacts badly with how we deploy these system addons.
(In reply to Andrew McCreight [:mccr8] from comment #22)
> Mark, do you have any ideas how this could be related to followon search?

Unfortunately no. We did just land 0.9.6 in beta (bug 1411518) but I don't see what that would trigger.

> Maybe we're caching code in a way that interacts badly with how we deploy
> these system addons.

I'm suspicious about the correlations - all system add-ons are shipped to all users (e.g. see about:support). AFAIK the correlation is based on which add-ons are installed, not on which ones are actually involved in the crash. So if it is correlating system add-ons in the "it's installed" manner, then that's going to be bogus. If it is based on what caused the stack / what is in the stack, then that might be ok.
The correlation with followonsearch seems likely to be "followonsearch 0.9.6 and whatever makes this more common (or changed the signature) landed at the same time".
Looking at this issue again, it seems the problem is that the JSScript has bad object data. A common set of these crashes involve JSOP_LAMBDA on the stack, which involves calling JSScript::getFunction() and assuming the result is a normal function. The NewFunctionClone and markAtom crashes are just fallout from this bogus data.

(Based on something :jandem mentioned in another thread that I can't find..) On 32-bit, JSFunction::atom_ is at the same offset as RegExpObject::slot[SOURCE_SLOT].tag, which should always be 0xFFFFFF86 (the JS::Value tag for an encoded string), which is what we see as the crash address in most of these. The RegExpObject and FunctionObject both live in the same array on a script.

I'm currently looking into XDR crashes and have seen crashes indicating memory corruption. I'll add more assertions to my local experiment around these function objects. I'm not sure throwing a RELEASE_ASSERT into js::Lambda that the object is actually a function will get us closer to the source - we would just move the crash a few call frames sooner.
Another point to make is that if the XDR data is zero (due to some default buffer or something), we will decode a lambda target as a RegExpObject and set a StringValue, which will lead directly to this crash. One thing we should do is not use zero as a valid enum value in XDR. This XDR case statement [1] already cancels the decode on unsupported values. This doesn't identify the underlying XDR bug we are solving, but it should help reduce startup crashes by falling back to the slow path when things go bad.

[1] https://searchfox.org/mozilla-central/rev/c9214daa09e3eb8246b1448e469df1652a140825/js/src/jsscript.cpp#318
Assignee: nobody → tcampbell
For MarkAtom (only): a hint: looking at the crash data, this appears to have first shown up in 55.0b6 around July 1st. (There are low-level random crashes that predate that, but it clearly first spikes in beta there, followed by a spike when 55 goes to release, and another spike when 56 goes to release.) This implies at least the 55 spike is due to an uplift new in 55.0b6. That may seriously narrow down the possible causes, depending on the churn level, and perhaps provide a few suspects to look at more carefully or to land diagnostic asserts against.

Also, the spike appears to cause a new address to show up: 0xffffffffffffff86. This may have been fixed in a later beta, as the signature for MarkAtom drops off (a lot, to 0-2 crashes/day).

Then, on Oct 28th, in 57b12, there's a new spike in MarkAtom (again with the unique 0xffffffffffffff86), which continues into 57 release. This is far more interesting, as unlike the 55b6 crash, this one continues and gets into release. Someone needs to look at the delta from b11 (I presume, if it shipped) to b12.

I *suspect* that the ongoing low-level set of crashes across all versions with wildptrs or 0xfffffffffff (basically unknown addresses) is distinct from the bug causing the spike first in 55.0b6, and then again in 57b12 and later.
Thanks for the analysis. Nothing stands out from either of those changelogs. If my theory about address 0xffffff86 holds, we may be a long way from the original source of the problem (i.e. a previous browser session).

I'm working on a patch in bug 1418894 to turn the 0xffffff86 crashes from wild-pointer crashes into assertion failures while loading the cache. This should help us determine whether these crashes are almost all the result of XDR problems (errors encoding in a previous session, or a buffer / logic error in the current startup).

In bug 1390856, we identify a real defect in XDR that could potentially lead to these problems, but I have doubts it is the largest contributor.
Depends on: 1418894
I'm requesting uplift of bug 1390856 to beta to see if it has an impact on these rates.

Another idea that :mconley proposed is to see what happens to the XDR cache under a startup or shutdown crash. For example, on a startup crash, is it possible we write a partial cache file that is then read back as zeroes later? A zero byte at many places in the cache file would lead directly to this crash signature.
(In reply to Ted Campbell [:tcampbell] from comment #30)
> I'm requesting uplift of bug 1390856 to beta to see if it has an impact on
> these rates.

It doesn't look like there's any improvement in the rates in 58.0b6.
Bug 1418894 attempts to detect XDR buffer corruption earlier and fall back to normal script loading instead of the startup cache. Based on the theory in comment 25 / comment 26, this should greatly reduce the crash signatures tracked in this bug. It does not address the underlying reasons why the XDR buffer might become corrupt; that is still under investigation.

I'm requesting uplift of bug 1418894 to beta to see if it impacts these crash rates.

(Clearing ni? for others since I'm actively investigating this issue.)
so far it looks like bug 1418894 had a very positive effect on the signatures in this bug, as they are essentially gone in 58.0b9
There are almost no crashes on 58.0b9. I'm going to say bug 1418894 resolved this, which points to XDR data corruption as the root cause.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Whiteboard: [adv-main58+] → [adv-main58+][post-critsmash-triage]