spike in crashes in FinalizeTypedArenas<T> starting in 2015-12-11 nightly

VERIFIED FIXED in Firefox 45

Status

()

Core
JavaScript: GC
P1
normal
VERIFIED FIXED
2 years ago
2 months ago

People

(Reporter: dbaron, Assigned: jandem)

Tracking

(Depends on: 1 bug, {crash, regression, topcrash})

45 Branch
mozilla47
crash, regression, topcrash
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(e10s-, firefox44 unaffected, firefox45+ fixed, firefox46+ fixed, firefox47+ fixed)

Details

(Whiteboard: [rr], crash signature)

Attachments

(5 attachments, 1 obsolete attachment)

(Reporter)

Description

2 years ago
There's been a spike in crashes in nightly in FinalizeTypedArenas<T>.

These crashes used to show up at a rate of 1 or 2 crashes per nightly build.  The past few days they've been showing up at about 40 crashes per nightly build, which makes them the #2 topcrash on nightly.

Given that the crash regression showed up on nightly but did not show up on aurora, this appears to be a regression related to code change rather than Web content.  In particular, see these two queries separated by release channel, and look at the aggregation by build ID.  (Both specifically query crashes with date > 2015-10-01 so as to include more than just the past week.)
https://crash-stats.mozilla.com/signature/?release_channel=nightly&date=%3E2015-10-01&signature=FinalizeTypedArenas%3CT%3E&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#aggregations
https://crash-stats.mozilla.com/signature/?release_channel=aurora&date=%3E2015-10-01&signature=FinalizeTypedArenas%3CT%3E&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#aggregations

It appears the regression started in the 20151211030207 nightly.  The regression window from the previous nightly (20151210030212) to that one is:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b40ba117fa75&tochange=754b4805a65c

[Tracking Requested - why for this release]:  One of the top crashes in nightly.
(Reporter)

Updated

2 years ago
Component: JavaScript Engine → JavaScript: GC
(Reporter)

Updated

2 years ago
Crash Signature: [@ FinalizeTypedArenas<T>] → [@ FinalizeTypedArenas<T>] [@ FinalizeArenas]
There are no checkins from GC peers in that range, so this is likely some form of heap corruption. There's very little touching C++ and most of that isn't interacting with JS. It looks like Shu and Jan may be standing closest.
Flags: needinfo?(shu)
Flags: needinfo?(jdemooij)

Comment 2

2 years ago
Hm, any tips on how to start debugging this short of backing out patches and seeing if the crashes go away? That signature is not helpful.
After discussion with Shu and Jan on IRC, this appears to be totally unactonable.
Flags: needinfo?(shu)
Flags: needinfo?(jdemooij)
(Reporter)

Comment 4

2 years ago
This spiked on aurora a few days later, starting with build 20151217004008.

Pushlog for that range is:
https://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=1d43e2723082dadfbca37e0babe873e8c2ea45eb&tochange=1ba4f6d91e637348dc0cdc7e9d311c4cddbc8a5e
which isn't useful.
status-firefox44: unaffected → affected
(Reporter)

Comment 5

2 years ago
It's possible the source of memory corruption is bug 1232231.

Also, the aurora bump matches when we merged.  See bug 1232231 comment 2.
Depends on: 1232231
https://crash-stats.mozilla.com/report/index/68e3fa2e-ef6b-46db-b920-eeb7d2160106

I triggered this while on the ny times website.. I allowed a google federated login to see some comments on an article.. maybe that will help?
Topcrash, tracking for now.
tracking-firefox45: ? → +
tracking-e10s: --- → ?
tracking-e10s: ? → m9+
Assignee: nobody → terrence
See Also: → bug 1011075
Is there any way to look at what the top crashes on Nightly were in the period before 12-11? I don't see any other GC crashes in the top 30 crashes on Nightly, so maybe this is just a signature change due to inlining.
Created attachment 8715386 [details] [diff] [review]
more_instrumentation_and_fencing-v0.diff

As discussed in IRC, let's find out if the list head is corrupted and add more fencing around the BFS state to see if we can narrow down the problem.
Attachment #8715386 - Flags: review?(emanuel.hoogeveen)
Keywords: leave-open
Comment on attachment 8715386 [details] [diff] [review]
more_instrumentation_and_fencing-v0.diff

Review of attachment 8715386 [details] [diff] [review]:
-----------------------------------------------------------------

Looks fine, hope it tells us something.
Attachment #8715386 - Flags: review?(emanuel.hoogeveen) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/8c41c8315eeafe568b8d1ce99fedbd79eee62586
Bug 1232229 - Add some instrumentation and more fencing to ArenaLists; r=ehoogeveen
fwiw - I triggered this on slate.com somehow today. Can't repro.

https://crash-stats.mozilla.com/report/index/990247a6-8b9e-4b14-bd87-659602160203
https://hg.mozilla.org/integration/mozilla-inbound/rev/ad652aac6b74eb862d79a52b9e516531b51c95df
Backed 8c41c8315eea (bug 1232229) for breaking all the things on a CLOSED TREE.
https://hg.mozilla.org/integration/mozilla-inbound/rev/64392b4fdaad6cefea48df5d9fd144006defda3c
Bug 1232229 - Add some instrumentation and more fencing to ArenaLists; r=ehoogeveen
For context on why Terrence relanded this unchanged: I didn't see any crashes locally or on treeherder, and upon doing a new build, Terrence didn't either (he suspects corruption related to his failing HDDs).

Comment 16

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/8c41c8315eea
https://hg.mozilla.org/mozilla-central/rev/ad652aac6b74
https://hg.mozilla.org/mozilla-central/rev/64392b4fdaad
tracking-e10s: m9+ → +
Priority: -- → P1
Tracking for 46 and 47. This is the #2 topcrash for 46 aurora and #5 in 47 nightly.
status-firefox46: --- → affected
status-firefox47: --- → affected
tracking-firefox46: --- → +
tracking-firefox47: --- → +
Blocks: 1246180
So far, nothing has shown up with sweepBackgroundThings or its callers as the signature (none of those have uintptr_t(-1) as the address):

https://crash-stats.mozilla.com/signature/?build_id=%3E%3D20160205030204&version=47.0a1&signature=js%3A%3Agc%3A%3AGCRuntime%3A%3AsweepBackgroundThings#aggregations
https://crash-stats.mozilla.com/signature/?build_id=%3E%3D20160205030204&version=47.0a1&signature=free_impl+|+js%3A%3Agc%3A%3AGCRuntime%3A%3AsweepBackgroundThings#aggregations
https://crash-stats.mozilla.com/signature/?build_id=%3E%3D20160205030204&version=47.0a1&signature=js%3A%3AGCHelperState%3A%3AdoSweep#aggregations

So I think the head of the ArenaList is probably fine - and changing the memory ordering of the BackgroundFinalizeState either fixed the crash, moved it, or we've gotten very lucky, as there haven't been any crashes in FinalizeTypedArenas<T> except for 1 on Android:

https://crash-stats.mozilla.com/signature/?build_id=%3E%3D20160205030204&version=47.0a1&signature=FinalizeTypedArenas%3CT%3E#aggregations

We're still getting a steady supply of FinalizeArenas crashes, however:

https://crash-stats.mozilla.com/signature/?build_id=%3E%3D20160205030204&version=47.0a1&signature=FinalizeArenas#aggregations

... so I think it just moved. Perhaps we could do some strategic checking of ArenaLists on Nightly at points where the GC has touched it. Just dereferencing the whole list should be enough (assuming it doesn't get optimized out), which might not be *too* bad for performance.
(Reporter)

Comment 19

2 years ago
I think I just set this flag wrong in comment 4.
status-firefox44: affected → unaffected

Updated

2 years ago
Crash Signature: [@ FinalizeTypedArenas<T>] [@ FinalizeArenas] → [@ FinalizeTypedArenas<T>] [@ FinalizeArenas] [@ je_free | FinalizeTypedArenas<T>]
(Assignee)

Updated

2 years ago
See Also: → bug 1233944
Terrence, any news on this?
Flags: needinfo?(terrence)
Jan pushed a diagnostic patch in bug 1233944. Crash-stats is giving me "Internal Server Error" at the moment so I can't check to see if that is showing up the problem.
Flags: needinfo?(terrence)
Checking https://crash-stats.mozilla.com/signature/?build_id=%3E%3D20160205030204&version=47.0a1&signature=FinalizeArenas#aggregations I don't think the diagnostic patch shifted it.

If DXR is right (and I think it is), the only places that actually set the "next" field of an ArenaHeader are the following four:

https://dxr.mozilla.org/mozilla-central/rev/ea39d4a6232c278dd8d805608a07cf9f4cc4c76b/js/src/jsgc.cpp#926
https://dxr.mozilla.org/mozilla-central/rev/ea39d4a6232c278dd8d805608a07cf9f4cc4c76b/js/src/jsgc.cpp#2262
https://dxr.mozilla.org/mozilla-central/rev/ea39d4a6232c278dd8d805608a07cf9f4cc4c76b/js/src/jsgc.cpp#2810
https://dxr.mozilla.org/mozilla-central/rev/ea39d4a6232c278dd8d805608a07cf9f4cc4c76b/js/src/jsgc.h#461

The last one is probably too hot, but perhaps we could add an assertion on the first three (checking for uintptr_t(-1)). Does that sound reasonable?
Flags: needinfo?(terrence)
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #22)
> Checking
> https://crash-stats.mozilla.com/signature/
> ?build_id=%3E%3D20160205030204&version=47.
> 0a1&signature=FinalizeArenas#aggregations I don't think the diagnostic patch
> shifted it.
> 
> If DXR is right (and I think it is), the only places that actually set the
> "next" field of an ArenaHeader are the following four:
> 
> https://dxr.mozilla.org/mozilla-central/rev/
> ea39d4a6232c278dd8d805608a07cf9f4cc4c76b/js/src/jsgc.cpp#926
> https://dxr.mozilla.org/mozilla-central/rev/
> ea39d4a6232c278dd8d805608a07cf9f4cc4c76b/js/src/jsgc.cpp#2262
> https://dxr.mozilla.org/mozilla-central/rev/
> ea39d4a6232c278dd8d805608a07cf9f4cc4c76b/js/src/jsgc.cpp#2810
> https://dxr.mozilla.org/mozilla-central/rev/
> ea39d4a6232c278dd8d805608a07cf9f4cc4c76b/js/src/jsgc.h#461
> 
> The last one is probably too hot, but perhaps we could add an assertion on
> the first three (checking for uintptr_t(-1)). Does that sound reasonable?

Wow, thanks for doing that research! Unless the slowdown is truly *massive*, let's just check all of them, at least for as long as we need to solve the quality issue.
Flags: needinfo?(terrence)
(Assignee)

Comment 24

2 years ago
I'm not convinced it's a bogus ArenaHeader; for instance, the OS X crashes look more like a nullptr crash inside JSObject::finalize.

It'd be great if someone could load a minidump in Visual Studio so we could stare at the actual assembly :)
I missed a few instances where we're setting it through a cursor - on the plus side, several of them also write to the same value, so we don't need to check those. The ones involving SortedArenaList are probably the most painful in terms of performance, and I'd be surprised if that was implicated (if any part of the sorting was broken I'd expect a *lot* of crashes), so I'll probably take those out. Going to run at least Octane to see if the effect on performance is noticeable.

(In reply to Jan de Mooij [:jandem] from comment #24)
> I'm not convinced it's a bogus ArenaHeader; for instance, the OS X crashes
> look more like a nullptr crash inside JSObject::finalize.

We might be looking at different crashes - these are almost all on Windows, and most of those have crash address 0xffffffffffffffff (I don't see any with just 0xffffffff, but I don't know if the fact that it's 64-bit means anything).

I missed a few instances where

> It'd be great if someone could load a minidump in Visual Studio so we could
> stare at the actual assembly :)

Agreed on that count!

Comment 26

2 years ago
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #25)
> We might be looking at different crashes - these are almost all on Windows,
> and most of those have crash address 0xffffffffffffffff (I don't see any
> with just 0xffffffff, but I don't know if the fact that it's 64-bit means
> anything).

The fact that we show the address as 0xffffffffffffffff is a bug, or maybe even an artifact of the way x86_64 works (I don't remember exactly how that worked) - you need to check the registers in the JSON output within the "raw dump" tab to find out useful addresses.
It looks to me like those 64bit crashes have "rax": "0x4b4b4b4b4b4b4b4b" while the 32bit ones have an address of 0x4b4b4b4b - which all sounds like JS_SWEPT_TENURED_PATTERN, see https://dxr.mozilla.org/mozilla-central/source/js/public/Utility.h#47
Ack, good catch! In that case we're not even asserting on the right pattern right now!
Created attachment 8719988 [details] [diff] [review]
Instrument setting ArenaHeader::next to catch misuse and fix existing instrumentation.

This *should* check all the places where the ArenaHeader::next field is set, including "indirectly", except where we dereference the value being set already. It may be a bit overzealous, but doesn't move the needle for me on Octane locally. This also fixes the existing assertion to check for the right pattern! Hopefully "uintptr_t(UINT64_C(0x4b4b4b4b4b4b4b))" is the right way to do this - IIUC it should truncate in the expected way on 32-bit.
Attachment #8719988 - Flags: review?(terrence)
Attachment #8715386 - Flags: checkin+
A try run just in case: https://treeherder.mozilla.org/#/jobs?repo=try&revision=2bab7c4a985a
Created attachment 8720000 [details] [diff] [review]
Instrument setting ArenaHeader::next to catch misuse and fix existing instrumentation (v2).

Ugh, did I really manage to mess up the pattern there? It's been a long day. This one should be good (also fixed a line that got too long).
Attachment #8719988 - Attachment is obsolete: true
Attachment #8719988 - Flags: review?(terrence)
Attachment #8720000 - Flags: review?(terrence)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d75c753da55
(Assignee)

Comment 32

2 years ago
(In reply to Jan de Mooij [:jandem] from comment #24)
> It'd be great if someone could load a minidump in Visual Studio so we could
> stare at the actual assembly :)

So, I just looked at the minidump of the following crash:

https://crash-stats.mozilla.com/report/index/4b606ff1-e9aa-4df7-91d8-a369a2160213

FWIW, MSVC with PGO does a ton of inlining and speculative optimizations. For instance, when we call clasp->finalize() it actually checks for (and inlines) proxy_Finalize, SavedFrame::finalize, and XPC_WN_NoHelper_Finalize first. If all that fails, it does the pointer call.

We're crashing in Arena::finalize, because we have a bogus JSObject. Its group_ field is 0x4b4b4b4b. We crash when we try to load its clasp.

(I've no idea why Breakpad thinks the crash address is 0xffffffffffffffff. My best guess is that it's looking at the SliceBudget value on the stack, that one *does* have this value AFAICS.)

// The !isMarked case.

// Load the object's group.
000007FEE11D8DDC  mov         rax,qword ptr [rdi]  

// Load the group's Class.
000007FEE11D8DDF  mov         r15,qword ptr [rax]                   <---------------

// Load and null check clasp->finalize.
000007FEE11D8DE2  mov         rax,qword ptr [r15+48h]  
000007FEE11D8DE6  test        rax,rax  
000007FEE11D8DE9  jne         FinalizeArenas+278h (07FEE11D8E58h)  

// No finalizer. Jump if !clasp->isNative().
000007FEE11D8DEB  test        dword ptr [r15+8],40000h  
000007FEE11D8DF3  jne         FinalizeArenas+248h (07FEE11D8E28h)  

// Jump if nobj->hasDynamicSlots().
000007FEE11D8DF5  mov         rcx,qword ptr [rdi+10h]  
000007FEE11D8DF9  test        rcx,rcx  
000007FEE11D8DFC  jne         FinalizeArenas+392h (07FEE11D8F72h)  

// Jump if nobj->hasDynamicElements().
000007FEE11D8E02  mov         rcx,qword ptr [rdi+18h]  
000007FEE11D8E06  lea         rax,[ReturnFloat32Reg (07FEE3F75B78h)]  
000007FEE11D8E0D  cmp         rcx,rax  
000007FEE11D8E10  jne         FinalizeArenas+6CCh (07FEE11D92ACh)  

// Jump if nobj->shape_->listp == &nobj->shape_.
000007FEE11D8E16  mov         rcx,qword ptr [rdi+8]  
000007FEE11D8E1A  lea         rax,[rdi+8]  
000007FEE11D8E1E  cmp         qword ptr [rcx+20h],rax  
000007FEE11D8E22  je          FinalizeArenas+98Bh (07FEE11D956Bh)  

// Check whether disablePoison has been initialized, set.
000007FEE11D8E28  test        byte ptr [halfDomain+8h (07FEE3F6AF34h)],1  
000007FEE11D8E2F  je          `js::irregexp::RegExpEmpty::GetInstance'::`2'::`dynamic atexit destructor for 'instance''+2DAA8h (07FEE198A108h)  
000007FEE11D8E35  cmp         byte ptr [disablePoison (07FEE3F6AF38h)],0  
000007FEE11D8E3C  jne         FinalizeArenas+26Eh (07FEE11D8E4Eh)  

// Poison with 0x4b.
000007FEE11D8E3E  mov         r8,r13  
000007FEE11D8E41  mov         edx,4Bh  
000007FEE11D8E46  mov         rcx,rdi  
000007FEE11D8E49  call        memset (07FEE19396A4h)
(Assignee)

Comment 33

2 years ago
thingSize seems to be 168, that's 21 words on x64 and that matches OBJECT16_BACKGROUND.

So this object is pretty big (we may have crashes with different object sizes, of course).
(Assignee)

Comment 34

2 years ago
(In reply to Jan de Mooij [:jandem] from comment #33)
> thingSize seems to be 168, that's 21 words on x64 and that matches
> OBJECT16_BACKGROUND.

Hm actually, OBJECT16 is 160 bytes, not 168. So either r8 does not hold the thingSize (maybe something clobbered it), or sizeof(JSObject_Slots16) is different on Windows. Will look into this a bit more.

(Btw, it'd be nice to use some c++ magic to make thingSize a known constant for each alloc kind.)
(Assignee)

Comment 35

2 years ago
(In reply to Jan de Mooij [:jandem] from comment #34)
> Hm actually, OBJECT16 is 160 bytes, not 168. So either r8 does not hold the
> thingSize (maybe something clobbered it)

Nevermind, r8 is also used for isMarked() check so it no longer holds the thingSize when we crash. 

Based on the stack (thingSize is in ebp-0x48), I think it's actually 0x20, which is very small: AllocKind::OBJECT0_BACKGROUND.
(In reply to Jan de Mooij [:jandem] from comment #32)
> (I've no idea why Breakpad thinks the crash address is 0xffffffffffffffff.

This is something to do with x86_64 on Windows, as Robert Kaiser pointed out in comment #26. I knew I remembered that from somewhere, and indeed: bug 974420 has some details.
(Assignee)

Comment 37

2 years ago
Some more data:

* The JSObject* we're finalizing has address 0x27415040. It's the arena's *second* object (it comes after the arena header and some other object). The ArenaCellIterImpl's limit is 0x27416000.

* Our ArenaCellIterImpl has an empty FreeSpan: (0, 0)

So, either:

(1) Since it's the second object, the arena must be completely full (well, except for the first object maybe). This seems a bit unlikely, but also highly suspicious because we're trying to finalize an object that has already been poisoned.

(2) The empty FreeSpan is bogus and we should have skipped this already-finalized object.

Assuming it's a bogus FreeSpan, the question is how that can happen. Maybe we can add some instrumentation.
Comment on attachment 8720000 [details] [diff] [review]
Instrument setting ArenaHeader::next to catch misuse and fix existing instrumentation (v2).

Review of attachment 8720000 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not sure how likely this is to be at fault after Jan's analysis, but I think it's still worth eliminating as a possibility.
Attachment #8720000 - Flags: review?(terrence) → review+
Okay, let's try it. Maybe we can do something similar for FreeSpan.
Keywords: checkin-needed
Blocks: 1249209

Comment 40

2 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/e00a02282951
Keywords: checkin-needed

Comment 41

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/e00a02282951
renom'ing. Top crash in beta experiment. Poiru, can you indicate where this is in the top crash list?
tracking-e10s: + → ?
Flags: needinfo?(birunthan)
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #42)
> renom'ing. Top crash in beta experiment. Poiru, can you indicate where this
> is in the top crash list?

This is #2 (3.51%) for content processes and #3 (3.35%) acros the board. See bug 1249209 comment 2 for full list.
Flags: needinfo?(birunthan)
FWIW, I got to thinking following Jan's analysis and I don't think it's all that likely that we're setting ArenaHeader::next to the poison pattern directly at any point. More likely, we're pointing to an Arena that gets swept and poisoned, so *its* next field contains the poison value.

Unfortunately I don't really see how we could recover where we started pointing to the offending (now poisoned) Arena. The only thing I can think of would be to use a few bits on each ArenaHeader to indicate the assignment site for the last write to an ArenaHeader's next field. Then we'd have to store that data (e.g. in a temporary variable) so it's available in the minidump.

But considering how few assignment sites there are, maybe we should just stare at all of them and see if a place jumps out where we could be inserting the same Arena into a linked list twice. Of course, that might involve looking at a lot of code :\ It's a pity none of us have ever gotten this crash, since that would open up a lot more options (I wouldn't mind running with some sort of duplicate insertion checking code locally, but I imagine it'd be pretty slow).
Jan pointed out to me that the assembly he analyzed above shows I'm probably barking up the wrong tree with the ArenaHeader::next thing. Also, I realized I actually got one of these crashes earlier today! It didn't register because it was a content crash (and I haven't been running with e10s for very long). So I might be able to help test this locally after all.
(Reporter)

Comment 46

2 years ago
(In reply to Andrew McCreight [:mccr8] from comment #8)
> Is there any way to look at what the top crashes on Nightly were in the
> period before 12-11? I don't see any other GC crashes in the top 30 crashes
> on Nightly, so maybe this is just a signature change due to inlining.

I just regenerated http://dbaron.org/mozilla/crashes-by-build with a 180 day window instead of 31 days, so it should be useful for those dates again.

I don't see any obvious JS-related signatures that disappeared on the 11th.  (I looked for signatures present on both the 10th and 7th, and with 4 or more crashes on the 10th.)  But I skimmed relatively quickly...

Comment 47

2 years ago
I stumbled upon seeing this telemetry spike on 2015-12-04. Not the experience to relate it to anything but know this bug is from the period and outstanding without great idea to resolve. Seems worth a shot posting even of not linked.

https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=bucket-0&cumulative=0&end_date=2016-02-18&keys=!__none__!__none__&max_channel_version=nightly%252F47&measure=SHUTDOWN_OK&min_channel_version=nightly%252F44&processType=false&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2016-01-29&trim=1&use_submission_date=0

Comment 48

2 years ago
Created attachment 8722065 [details]
linux-gdb-stacks.txt

Managed to temporarily have bug intermittently. Seems whatever advert has gone now.
First inside debugger on central. See attached.
Was wondering about bug 1230162 but seems not after adding a few outputs.
Tried nighties 7-10 all OK.
Finally bisected twice. with e10s on.

(First didn't seem to start well, build I expected to fail didn't.)
2016-02-22T15:48:10: INFO : Narrowed inbound regression window from [560d36a9, 39c8aabf] (3 revisions) to [560d36a9, a1caa7e7] (2 revisions) (~1 steps left)
2016-02-22T15:48:10: DEBUG : Starting merge handling...
2016-02-22T15:48:10: DEBUG : Using url: https://hg.mozilla.org/integration/mozilla-inbound/json-pushes?changeset=a1caa7e7949fb888023b335f22ea01b05dfb1bc8&full=1
2016-02-22T15:48:11: DEBUG : Found commit message:
Backed out changeset bcb4ebf6ffac (bug 1198459) for bustage
---------------
2016-02-22T16:36:28: INFO : Narrowed inbound regression window from [286ad584, 151ce2b0] (4 revisions) to [6fce35b1, 151ce2b0] (2 revisions) (~1 steps left)
2016-02-22T16:36:28: DEBUG : Starting merge handling...
2016-02-22T16:36:28: DEBUG : Using url: https://hg.mozilla.org/integration/mozilla-inbound/json-pushes?changeset=151ce2b0e3f6b73505be35561f148678577dcbcb&full=1
2016-02-22T16:36:29: DEBUG : Found commit message:
Bug 1225396 part 4 - Remove @@iterator workaround in Codegen.py. r=bz

This maybe one to look into. Intermittent and there were four tests after the bad that I could not crash, followed by not crashing on my central build so can't discount older in the window.
(Assignee)

Comment 49

2 years ago
(In reply to Jonathan Howard from comment #48)
> Managed to temporarily have bug intermittently. Seems whatever advert has
> gone now.

Can you post exact STR, even though they may no longer work?

This is super useful information, thank you!
(Assignee)

Comment 50

2 years ago
OK I managed to catch this in rr today and with some serious IRC debugging help from the GC team we know what's going on. These bugs are impossible to fix without rr.

EnsureParserCreatedClasses is allocating a nursery object for the background-Zone (that's the bug). Then a minor GC on the main thread races with refillFreeListOffMainThread called from the background thread. That messes up the background-Zone's free list and Finalize then thinks an Arena is free, while it's mostly garbage, so we crash.

So this is a regression from bug 1225396, but extremely subtle, and we should add some asserts to catch similar bugs in the future.

More tomorrow.
Depends on: 1225396
Whiteboard: [rr]

Updated

2 years ago
Flags: needinfo?(jmathies)

Comment 51

2 years ago
not e10s specific, untracking.
tracking-e10s: ? → -
Flags: needinfo?(jmathies)
Is it possible that this and bug
Blocks: 1233944
Congrat, we will be watching your progress and take a patch in 45 to fix it! Encore bravo!
Created attachment 8722926 [details] [diff] [review]
bug1232229-add-nursery-asserts

Not a fix, but adds assertions to catch this problem.  We get 28 jit-test failures with this patch applied.
Attachment #8722926 - Flags: review?(terrence)
(Assignee)

Comment 55

2 years ago
Created attachment 8723026 [details] [diff] [review]
Fix

This patch replaces the NewObjectWithGivenProto call with global->createBlankPrototypeInheriting, to match most other places where we create prototype objects.

createBlankPrototypeInheriting calls CreateBlankProto, where we call NewNativeObjectWithGivenProto with a SingletonObject argument.
Assignee: terrence → jdemooij
Status: NEW → ASSIGNED
Attachment #8723026 - Flags: review?(jcoppeard)
Comment on attachment 8723026 [details] [diff] [review]
Fix

Review of attachment 8723026 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!
Attachment #8723026 - Flags: review?(jcoppeard) → review+

Comment 57

2 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/584e6e436e27

Comment 58

2 years ago
If this is safe enough, we'd really like to get this into beta very soon, given that this is one of the topcrashes on 45 and stability on that train has not been good in general so far.
The gtb is tomorrow. Jandem, can we have an uplift request right now?
Flags: needinfo?(jdemooij)
(Assignee)

Comment 60

2 years ago
Comment on attachment 8723026 [details] [diff] [review]
Fix

Approval Request Comment
[Feature/regressing bug #]: Bug 1225396.
[User impact if declined]: Frequent crashes.
[Describe test coverage new/current, TreeHerder]: I don't have a test for this but it does fix the debug asserts added by another patch.
[Risks and why]: Low risk. The patch is oneliner-ish and we have decent tests for this code.
[String/UUID change made/needed]: None.
Flags: needinfo?(jdemooij)
Attachment #8723026 - Flags: approval-mozilla-beta?
Attachment #8723026 - Flags: approval-mozilla-aurora?

Comment 61

2 years ago
Steps eventually was retrying (more precise was in attachment.)
http://www.nytimes.com/2016/01/24/technology/larry-page-google-founder-is-still-innovator-in-chief.html?_r=0 taken from bug 1233944

I think I got lucky on Monday having it occur multiple times. Tried yesterday crashed once but required more effort. Today still took a while to crash once on central build. Added patch and gave that a fair amount of attempts, no crash. Did have one hang for couple seconds that I hadn't experienced any other time trying, maybe a sign the patch worked.
(Back to normal ad-blocked build.)
Comment on attachment 8722926 [details] [diff] [review]
bug1232229-add-nursery-asserts

Review of attachment 8722926 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for taking this!
Attachment #8722926 - Flags: review?(terrence) → review+
Comment on attachment 8723026 [details] [diff] [review]
Fix

Fix for topcrash, please uplift to aurora and beta. 
This should make it into tomorrow's beta 10 build.
Attachment #8723026 - Flags: approval-mozilla-beta?
Attachment #8723026 - Flags: approval-mozilla-beta+
Attachment #8723026 - Flags: approval-mozilla-aurora?
Attachment #8723026 - Flags: approval-mozilla-aurora+
I guess we should back out the speculative patches now?
Flags: needinfo?(emanuel.hoogeveen)
(In reply to Terrence Cole [:terrence] from comment #64)
> I guess we should back out the speculative patches now?

Yes please! I don't think they're going to catch anything, so they just clutter up the code.
Flags: needinfo?(emanuel.hoogeveen)
(Reporter)

Comment 66

2 years ago
https://hg.mozilla.org/releases/mozilla-aurora/rev/1606a71166df
status-firefox46: affected → fixed
(Reporter)

Comment 67

2 years ago
https://hg.mozilla.org/releases/mozilla-beta/rev/7190c35c9e14
status-firefox45: affected → fixed

Comment 68

2 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/17bfd6a2a529

Comment 69

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/584e6e436e27
https://hg.mozilla.org/integration/mozilla-inbound/rev/7d2bb13d6c37e90963992e9da9428c3aedf2bfaa
Backout e00a02282951 (bug 1232229) as we no longer need the diagnostics.
(Assignee)

Updated

2 years ago
Keywords: leave-open

Comment 71

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/17bfd6a2a529
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
status-firefox47: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
(Reporter)

Comment 72

2 years ago
Should the diagnostics be removed from aurora/beta, or are they harmless?  (Or were they never there in the first place?)
(Reporter)

Comment 73

2 years ago
Also, marking as verified on nightly since the crash is no longer present in crash-stats for today's nightly.
Status: RESOLVED → VERIFIED

Comment 74

2 years ago
backoutbugherder
Merged backout
https://hg.mozilla.org/mozilla-central/rev/7d2bb13d6c37
Depends on: 1251922
The crash stills occurs in 45 beta 10 but the overall rate is quite low (~50 crashes)
(Assignee)

Updated

2 years ago
Duplicate of this bug: 1233944

Updated

a year ago
Depends on: 1251589
Duplicate of this bug: 1234857
Depends on: 1265667
(Assignee)

Updated

a year ago
No longer depends on: 1265667
Version: unspecified → 45 Branch

Comment 78

2 months ago
Hello Team, just for your information, i had multiple crashes in fireforESR 45.9.0 on my Kali Kinux system.
CrashID: https://crash-stats.mozilla.com/report/index/421984b5-13e4-4c7d-9576-4b6371170509
You need to log in before you can comment on or make changes to this bug.