Closed Bug 1296630 Opened 8 years ago Closed 4 years ago

Crash in arena_dalloc_small | je_free | PLDHashTable::~PLDHashTable | mozilla::ContainerState::~ContainerState

Categories

(Core :: General, defect)

Version: 49 Branch
Hardware: x86
OS: Windows 8
Type: defect
Priority: Not set
Severity: critical

Tracking


RESOLVED WORKSFORME
Tracking Status
platform-rel --- -
firefox49 + wontfix
firefox50 --- wontfix
firefox51 --- ?
firefox52 --- ?
firefox53 --- ?

People

(Reporter: marcia, Assigned: dbaron, NeedInfo)

References

Details

(Keywords: crash, regression, Whiteboard: [platform-rel-Intel])

Crash Data

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is report bp-a7026ba5-3925-447f-8d58-6e3542160819.
=============================================================

Seen while looking at B4 crash stats: a new signature in B4, currently sitting at #15: http://bit.ly/2bHyzcs

Not sure where to bucket it. There may be a few other similar crashes that are related, such as the crash immediately following it, arena_dalloc_small | je_free | nsTArray_base<T>::ShrinkCapacity | nsTArray_Impl<T>::ReplaceElementsAt<T> | mozilla::PaintedLayerData::Accumulate.
[Tracking Requested - why for this release]:
The scope of this issue might be bigger: this signature seems to be part of an Intel-CPU-specific crash pattern (mentioned in the 2016-08-18 channel meeting) that started in 49.0b4 and unfortunately seems to continue in beta 5, judging from early crash data there.

Starting in 49.0b4 a whole range of new signatures beginning with "arena..." showed up, coming from "GenuineIntel family 6 model 61 stepping 4 | 4" and "GenuineIntel family 6 model 61 stepping 4 | 2" devices: http://bit.ly/2b5z1mp
All in all they make up ~8% of all crashes in 49.0b4 and seem to happen on Windows 7 and above, but predominantly (60%) on Windows 8.1.

These were the changes landing in 49.0b4: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_49_0b3_RELEASE&tochange=FIREFOX_49_0b4_RELEASE
Crash Signature: [@ arena_dalloc_small | je_free | PLDHashTable::~PLDHashTable | mozilla::ContainerState::~ContainerState] → [@ arena_dalloc_small | je_free | PLDHashTable::~PLDHashTable | mozilla::ContainerState::~ContainerState] [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShrinkCapacity | nsTArray_Impl<T>::ReplaceElementsAt<T> | mozilla::PaintedLayerData::Accumulate]…
Keywords: regression
Crash Signature: arena_dalloc_small | je_free | sftkdb_FindObjectsInit] → arena_dalloc_small | je_free | sftkdb_FindObjectsInit] [@ arena_dalloc_small | je_free | _moz_pixman_region32_fini]
Not sure who could dig into this one...
Flags: needinfo?(milan)
Flags: needinfo?(continuation)
Flags: needinfo?(bugs)
Hard to tell if this is a single problem, or a bunch of problems, and what grouping is the correct one, but given that it started in beta, it probably is the same cause.
Jet is going to look at the FrameLayerBuilder::BuildContainerLayerFor ones; I'll find somebody to look at LayerManagerComposite::PostProcessLayers; and there are a few other display item/display list related ones.

Looking at the list of changes in 49b4 (comment 1), there are a few media-related changes, so it's probably worth somebody looking at those.  The only other one that stands out is bug 1291016, which is probably fine, but to a casual observer it looks like a new, uninitialized variable got introduced, so... (needinfo to :heycam instead of :jfkthame, who's not around right now).
Flags: needinfo?(milan)
Flags: needinfo?(cam)
Flags: needinfo?(ajones)
High volume group of possibly related crashes, new on beta 4, let's call this a blocker for 49.
I don't know anything about layout code, sorry.
Flags: needinfo?(continuation)
If this is an OOM condition then it is probably related to bug 1296453.
(In reply to Milan Sreckovic [:milan] from comment #3)
> The only other one that stands out is bug 1291016, which is probably fine, but to a
> casual observer a new, uninitialized variable got introduced, so... (:heycam
> instead of :jfkthame who's not around right now).

Following up over there.
Flags: needinfo?(cam)
Daniel, any thoughts on this one?
Flags: needinfo?(dholbert)
The crash spike issue has disappeared again in 49.0b6.
In beta 5, crashes from "GenuineIntel family 6 model 61 stepping 4" devices made up 8.2% of the overall crash volume; in beta 6 they are back to a "normal" level of 1.2% of all crashes.
We should probably wait and see how beta 7 performs before considering this solved or untracking it, though...
(In reply to David Bolter [:davidb] from comment #8)
> Daniel, and thoughts on this one?

Looks like the backtrace is in layers code, which I'm not super-familiar with. kats or mattwoodrow would perhaps be able to offer more useful opinions/thoughts than I can.  (Comment 9 is encouraging, though; maybe this is fixed? I guess we'll see.)

(Side note, following up on comment 3 / comment 7: jfkthame says over in bug 1291016 that he doesn't think it's connected to this bug.)
Flags: needinfo?(dholbert)
This is a hashtable being torn down inside of ~ContainerState().  ContainerState owns two hash tables, and this could be either one:
>  nsTHashtable<nsRefPtrHashKey<PaintedLayer>> mPaintedLayersAvailableForRecycling;
...and:
>  nsDataHashtable<nsGenericHashKey<MaskLayerKey>, RefPtr<ImageLayer>>
>        mRecycledMaskImageLayers;
https://dxr.mozilla.org/mozilla-central/rev/01748a2b1a463f24efd9cd8abad9ccfd76b037b8/layout/base/FrameLayerBuilder.cpp#1396-1423

We might be putting something bogus in one of those hashtables, and then crashing when the hashtable gets destroyed, or something...  mstange & dvander have "hg blame" for each of those hashtable declarations, so one of them might be a good person to take a look at this, too, if we discover that it's not fixed as hoped in comment 9.  [CC'ing them]
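
As a minimal sketch of the failure mode described above (the Layer type, manual refcounting, and RecycleTable here are illustrative assumptions, not Gecko code): a table that owns references to its entries crashes in its own destructor when one of those entries has already been freed through another owner, far from the code that caused the corruption.

#include <unordered_set>

struct Layer {
  int mRefCnt = 0;
  void AddRef() { ++mRefCnt; }
  void Release() { if (--mRefCnt == 0) delete this; }
};

struct RecycleTable {
  std::unordered_set<Layer*> mEntries;
  void Put(Layer* aLayer) { aLayer->AddRef(); mEntries.insert(aLayer); }
  ~RecycleTable() {
    // Analogue of the hashtable destructor releasing each entry: if an entry
    // is dangling, this Release() touches freed memory, and the crash shows
    // up here rather than at the site of the original bug.
    for (Layer* l : mEntries) {
      l->Release();
    }
  }
};

int main() {
  RecycleTable table;
  Layer* layer = new Layer();
  layer->AddRef();    // some other owner takes a reference
  table.Put(layer);   // the table takes its own reference
  layer->Release();   // the other owner drops its reference
  layer->Release();   // the illustrative bug: an over-release frees the object early...
  return 0;           // ...so ~RecycleTable() now releases a dangling pointer
}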
I don't see any playback commits in the regression range in comment 1.
Flags: needinfo?(ajones)
The crash level of this CPU family still looks normal in 49.0b7, so I think we can close this bug.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
The issue is back again in 49.0b8. I don't understand what's going on :-(
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Bug 1294193 is another bug where there's a strong correlation to "GenuineIntel family 6 model 61 stepping 4 | 4".
See Also: → 1294193
For "arena_dalloc_small | je_free | PLDHashTable::~PLDHashTable | mozilla::ContainerState::~ContainerState":
(91.30% in signature vs 03.81% overall) address = 0xffffffffffffffff
(91.30% in signature vs 05.34% overall) adapter_device_id = 0x1616
(91.30% in signature vs 05.50% overall) cpu_info = GenuineIntel family 6 model 61 stepping 4 | 4
(47.83% in signature vs 02.74% overall) build_id = 20160814184416
(43.48% in signature vs 08.58% overall) platform_pretty_version = Windows 8.1
(43.48% in signature vs 08.58% overall) platform_version = 6.3.9600
(26.09% in signature vs 03.73% overall) bios_manufacturer = Insyde
(21.74% in signature vs 02.76% overall) Addon "Video DownloadHelper" = true
We seem to be talking about a heap corruption scenario.  I imagine this is some underlying problem that the changes on beta are tickling into higher frequency, rather than something actually caused by the changes between beta 3 and beta 4, or between beta 7 and beta 8.

The first set of patches is: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_49_0b3_RELEASE&tochange=FIREFOX_49_0b4_RELEASE

The second set of patches is: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_49_0b7_RELEASE&tochange=FIREFOX_49_0b8_RELEASE

Grasping at straws: there are audio-related things in both, but that's weak.
The release that got better (beta 6, see comment 9) contains a fix to bug 1293985, with the "...PLDHashTable::Iterator can't handle modifications while iterating...", so that's starting to look interesting.
Mats, thoughts on this bug?  Since your patch in bug 1293985 correlates with things getting better (then they got worse afterwards), and we're crashing in the PLDHashTable destructor, I thought you might have some insight.
Flags: needinfo?(mats)
The changes in bug 1292856 may also affect this one if Layer rendering is affected. Jamie: can you chime in on what that patch can change related to memory allocation?
Flags: needinfo?(bugs) → needinfo?(jnicol)
Any correlation with the changes in bug 1293985 seems coincidental to me.
The PLDHashTable object there is different from this one.

The correlations in comment 16 seem unusually strong, so it might be worth finding hardware that matches:
>(91.30% in signature vs 05.34% overall) adapter_device_id = 0x1616
>(91.30% in signature vs 05.50% overall) cpu_info = GenuineIntel family 6 model 61 stepping 4 | 4
and install:
>(21.74% in signature vs 02.76% overall) Addon "Video DownloadHelper" = true
and test with some of the URLs from crash-stats...
If it's at least semi-reproducible it might be possible to fix.
Flags: needinfo?(mats)
(In reply to Jet Villegas (:jet) from comment #21)
> The changes in bug 1292856 may also affect this one if Layer rendering is
> affected. Jamie: can you chime in on what that patch can change related to
> memory allocation?

That change will in some cases make us *avoid* making a large allocation.
Flags: needinfo?(jnicol)
Crash Signature: arena_dalloc_small | je_free | sftkdb_FindObjectsInit] [@ arena_dalloc_small | je_free | _moz_pixman_region32_fini] → arena_dalloc_small | je_free | sftkdb_FindObjectsInit] [@ arena_dalloc_small | je_free | _moz_pixman_region32_fini] [@ arena_run_tree_insert | arena_dalloc_small | je_free | mozilla::UniquePtr<T>::reset ] [@ arena_dalloc_small | je_free | nsTArray_bas…
After looking at the "cpu info" correlation for the signatures here, I'm leaning
towards a hardware related bug, as Marco suggested in bug 1294193 comment 6.
Many of signatures in this bug have a 100% correlation to
"GenuineIntel family 6 model 61 stepping 4".

dbaron, what do you think about that theory? (given your experience with
the AMD bug)
Flags: needinfo?(dbaron)
Let's keep an eye on this next week and see what Intel says.
(In reply to Mats Palmgren (:mats) from comment #24)
> After looking at the "cpu info" correlation for the signatures here, I'm
> leaning
> towards a hardware related bug, as Marco suggested in bug 1294193 comment 6.
> Many of signatures in this bug have a 100% correlation to
> "GenuineIntel family 6 model 61 stepping 4".

They all do.

But they also seem to have graphics hardware in common.  (e.g., "adapter device id" is mostly 0x1616 with a bit of 0x1606 and a few stragglers; "adapter vendor id" nearly always 0x8086, which is apparently Intel(R) HD Graphics 5500)

I think if you want to claim it's a hardware bug you need to make a much stronger case.

And even if it is, we should still be working to figure out how to fix it.  There was obviously something that made it start happening, so we should figure out what that was and undo it if possible.
Flags: needinfo?(dbaron)
> But they also seem to have graphics hardware in common.

I assumed this was because this chip comes with a built-in GPU.

FWIW, here are a few matches I get from Google on the cpu info string:
Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) i7-5650U CPU @ 2.20GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) M-5Y71 CPU @ 1.20GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) M-5Y51 CPU @ 1.10GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) M-5Y31 CPU @ 0.90GHz [x86 Family 6 Model 61 Stepping 4]

These are all Broadwell, with varying GPUs (source: http://www.cpu-world.com/ ).
I think the [@ sse2_blt] crashes looked interesting at first because they have a similar CPU pattern and predominance on 49 betas, except they spiked in 49.0b7 and 49.0b8 rather than in 49.0b4, b5, and b8 like many (didn't check all) of the others.
A few other observations:

 * a pretty big portion of the URLs reported in crash-stats are either the Facebook homepage or YouTube videos.  This makes it seem like there's a decent chance that these crashes usually or always occur during video playback

 * a decent portion of the crash reports (half of the ones I sampled?) have the cubeb audio thread as one of the threads contending for the malloc lock at the time the main thread crashes in the allocator (at one of two different stacks), e.g., bp-a68d7849-7be0-4058-9697-842d82160903 (thread 65) or bp-4d11b27f-9f69-4cd5-8256-1819d2160903 (thread 119).  I suspect this wouldn't be the case if audio weren't playing at the time of the crash.  (I *suspect* the contention is because those threads keep running a little bit after the crash happens, crash reporting then starts on the main thread, and the other threads essentially keep running until they hit a lock that they need to acquire.  I'm not really sure about this, though, i.e., about how long other threads would keep running when one thread crashes.)
One other thought, though.  I looked at the minidump for bp-0d55d0d0-99a5-4899-8aa4-f48da2160903, and allegedly we crash on the instruction:
71B248FB 3B C1                cmp         eax,ecx

I just don't see how that instruction can yield EXCEPTION_ACCESS_VIOLATION_READ with a crash address of 0xffffffff, especially when eax and ecx are both 0x01500228.
platform-rel: --- → ?
Whiteboard: [platform-rel-Intel]
49 beta 10 still looks affected, judging from early crash data there.
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #30)
> ...  I suspect this
> wouldn't be the case if audio weren't playing at the time of the crash.

We have audio-related patches landing in releases where we could observe a difference in the crash rates (good and bad), so it could be the changes in timing getting us into trouble.
Marking this as a blocker for 49 as it seems very high volume.
The audio playback aspect of this makes me wonder if it's related to bug 1255737. I believe that the speculation there was related to drivers causing audio shutdown badness. https://hg.mozilla.org/releases/mozilla-beta/rev/ab7b68014a1e would have shipped in 49b4.
Just to sum up again the impact from this bug that we have seen so far this beta cycle:
beta 1: unaffected
beta 2: unaffected
beta 3: unaffected
beta 4: affected
beta 5: affected
beta 6: unaffected
beta 7: unaffected
beta 8: affected
(beta 9 not released)
beta 10: affected
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #31)
> Though one other thought.  I looked at the minidump for
> bp-0d55d0d0-99a5-4899-8aa4-f48da2160903, and allegedly we crash on the
> instruction:
> 71B248FB 3B C1                cmp         eax,ecx
> 
> I just don't see how that instruction can yield
> EXCEPTION_ACCESS_VIOLATION_READ with a crash address of 0xffffffff,
> especially when eax and ecx are both 0x01500228.

This reminds me of bug 1034706 comment 44, although that was AMD specific.

By the way, we had a similar situation with AMD for 48 Beta (see bug 1290419).
Some builds were affected, some were not. It might depend on the compiler.
(In reply to [:philipp] from comment #36)
> just to sum up again the impact from this bug that we have seen so far this
> beta cycle: 
> beta 1: unaffected
> beta 2: unaffected
> beta 3: unaffected
> beta 4: affected
> beta 5: affected
> beta 6: unaffected
> beta 7: unaffected
> beta 8: affected
> (beta 9 not released)
> beta 10: affected

Well, that doesn't seem to match the bug 1255737 in/out pattern.  We had AsyncShutdownBlocked (causing shutdown hangs in the field) in:
beta 1, 2, 3, 7, 8, 9, 10
We weren't blocking asyncshutdown in:
beta 4, 5, 6
Pasting a response from Adam Moloniewicz, Intel (with his permission):
"As this looks like a heap corruption – have you tried to run the app with Application Verifier engaged ? Especially with Memory and Heap options enabled. You could also use Intel inspector(Intel studio) to analyze memory/threading anomalies. Other ideas that come to my mind could be to force internal SW modules to use isolated heaps instead of using only the global one(which is usually the common case). I’m not aware of the internal architecture so it’s hard to come up with particular ideas but perhaps some custom C++ memory allocator would do the job. This way we could narrow down the root cause.

So far I don’t see any strong evidence that would indicate the graphics UMD modules are a culprit, though indeed, they utilize the process global heap. So not sure how to assist you. Is there any easy way to disable HW rendering acceleration so that the rendering would fall back to WARP renderer instead of HW?"
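
To make the "isolated heaps" suggestion concrete, here is a hedged sketch of backing one module's allocations with a private Win32 heap (HeapCreate) instead of the process-global heap, so heap-metadata corruption in the default heap does not affect that module's data. The IsolatedHeap and PaintedLayerRecord names are made up for illustration; this is not an existing Gecko API.

#include <windows.h>
#include <cstddef>
#include <new>

class IsolatedHeap {
 public:
  IsolatedHeap() : mHeap(HeapCreate(0, 0, 0)) {}
  ~IsolatedHeap() { if (mHeap) HeapDestroy(mHeap); }

  void* Alloc(size_t aSize) {
    return mHeap ? HeapAlloc(mHeap, HEAP_ZERO_MEMORY, aSize) : nullptr;
  }
  void Free(void* aPtr) {
    if (mHeap && aPtr) HeapFree(mHeap, 0, aPtr);
  }

 private:
  HANDLE mHeap;  // private heap, separate from the process default heap
};

// Example: a hypothetical type whose instances live entirely in the isolated
// heap, kept apart from allocations made through the global allocator.
struct PaintedLayerRecord {
  int id = 0;

  static IsolatedHeap& Heap() { static IsolatedHeap sHeap; return sHeap; }
  static void* operator new(size_t aSize) {
    void* p = Heap().Alloc(aSize);
    if (!p) throw std::bad_alloc();
    return p;
  }
  static void operator delete(void* aPtr) { Heap().Free(aPtr); }
};

The trade-off is extra per-heap overhead and losing jemalloc's behavior for those allocations, but it can help narrow down which module's data is being stomped.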
Assigning to dbaron who agreed to own this exceptionally complex issue...
Assignee: nobody → dbaron
I looked through a bunch of crashes with cpearce.  One interesting point he noticed is that at least some of the ones on YouTube were using VP9, which wouldn't go through DXVA, which makes DXVA less likely.  (Another thing making DXVA less likely is that it should be writing to graphics memory.)

On the other hand, he's suspicious of cubeb's writing to audio buffers.
And one other point is that the machines in question do seem to come from multiple manufacturers, given:

Rank 	Bios manufacturer 	Count 	%
1 	Dell Inc. 	4480 	33.43 %
2 	Insyde 	2504 	18.69 %
3 	American Megatrends Inc. 	2336 	17.43 %
4 	Hewlett-Packard 	1792 	13.37 %
5 	LENOVO 	1244 	9.28 %
6 	Insyde Corp. 	721 	5.38 %
7 	INSYDE Corp. 	129 	0.96 %
8 	Lenovo 	91 	0.68 %
9 	TOSHIBA 	37 	0.28 %
10 	Intel Corporation 	30 	0.22 %

from the query https://crash-stats.mozilla.com/search/?cpu_info=%5EGenuineIntel%20family%206%20model%2061%20stepping%204&signature=~je_free&signature=~sse2_blt&product=Firefox&version=49.0b&_sort=-date&_facets=signature&_facets=bios_manufacturer&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform_pretty_version&_columns=uptime&_columns=app_notes&_columns=graphics_critical_error#facet-bios_manufacturer
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #31)
> Though one other thought.  I looked at the minidump for
> bp-0d55d0d0-99a5-4899-8aa4-f48da2160903, and allegedly we crash on the
> instruction:
> 71B248FB 3B C1                cmp         eax,ecx
> 
> I just don't see how that instruction can yield
> EXCEPTION_ACCESS_VIOLATION_READ with a crash address of 0xffffffff,
> especially when eax and ecx are both 0x01500228.

I see this in all the arena_dalloc_small crashes that I looked at.

The sse2_blt crashes are equally weird, crashing on a 'mov edi,ecx' instruction with a write access violation.

Neither of these instructions accesses memory, so an access violation sounds impossible.

Either the EIP value in the minidump is incorrect (but only for this CPU?), or we're hitting a CPU bug. I can't think of any other way this could be possible.

I've had a look at the errata for the 5th-gen Intel CPUs; nothing stands out as being this, but there are a lot, so I could easily have missed it.
More bizarre data from the minidumps.  I took a look at 9 minidumps (5 for one signature, all from beta 10; 4 for a different signature, all from beta 10; and 1 more from the first signature but from beta 8):

In all cases:
 * The upper short of EAX, EBX, ECX, and EDI was always the same for a given minidump, but varied between minidumps (0080, 0050, 0080, 0060, 00e0, 0060, 00b0, 0100, 0120, 0080)
 * The lower short of those 4 registers was always: AX: 0228, BX: 0220, CX: 0228, DI: 0040
 * EBP was always 00000040
 * EFLAGS was always either 00210246 or 00010246 (which is consistent with having executed a CMP between two equal values, which is allegedly the instruction we crash on)
 * EDX and ESI looked like pointers to the same area of memory, though differing by a decent amount.  Presumably heap pointers.
 * ESP and EIP looked like pointers to other (different from each other and from EDX/ESI) areas of memory
 * EIP always has the low short the same for a given build, although differing slightly between beta 8 and beta 10

ted, does something like this ring any bells?
Flags: needinfo?(ted)
(In reply to David Bolter [:davidb] from comment #41)
> Links for the tools I mentioned in comment 39:
> https://msdn.microsoft.com/en-us/library/windows/hardware/ff538115(v=vs.85).aspx

Likely you all have this installed already as part of the SDK

> https://software.intel.com/en-us/intel-system-studio

Available on all platforms, but starts at $699

> Isolated heaps info:
> https://msdn.microsoft.com/en-us/library/windows/desktop/aa366599(v=vs.85).aspx
We should feed dbaron's analysis from comment 49 back to the Intel guy as well (and comment 46 and comment 48, or a link to this bug).  (Perhaps once ted weighs in.)
(In reply to Randell Jesup [:jesup] from comment #51)
> We should feed dbaron's analysis from comment 49 back to the intel guy as
> well (and comment 46 and comment 48, or a link to this).  (Perhaps once ted
> weighs in)

Joe from Intel is on this bug - however I don't believe that Adam is a user. I'll see if we can get him signed up via the ML. 

Joe - any others that we should be adding here?
Flags: needinfo?(joseph.k.olivas)
I'm testing on one of these - any insight if there is a usage pattern? Right now, I'm doing videos and general browsing.
(In reply to Desigan Chinniah [:cyberdees] [:dees] [London - GMT] from comment #52)
> Joe - any others that we should be adding here?

I am following this bug closely and feeding back to some people internally here. I can be the main point of contact.
Flags: needinfo?(joseph.k.olivas)
Inspired by bug 1300233 and with some help from Aryx and froydnj on IRC, I should point out that thanks to the combination of bug 1259782 and bug 1270664, Firefox 49 is the first release we're shipping with Visual Studio 2015 rather than 2013.  This upgrade also involved using SSE instructions in generated code, since that can't be turned off in 2015.

So that seems like it could be related to the problems we're seeing here.

I'd still be interested to know more details about what sorts of problems were present in this CPU revision that may have been fixed in microcode updates.
Nothing there rings any particular bells, sorry.
Flags: needinfo?(ted)
All versions of Firefox 49 and newer are currently building on Visual Studio 2015 Update 2. VS2015u3 is out but not deployed (bug 1283203 tracks). Someone may want to comb the release notes for VS2015u3 to see if it fixes anything that could be related to this crash. Also, if getting central (and possibly earlier releases) on VS2015u3 is a good idea, let me know and I can land that.
(In reply to Gregory Szorc [:gps] from comment #57)
> Someone may want to comb the release notes for VS2015u3 to see if it fixes
> anything that could be related to this crash.

The first item in VC++ fixes looks interesting:
https://www.visualstudio.com/news/releasenotes/vs2015-update3-vs#visualcpp

"We now check the access of a deleted trivial copy/move ctor. Without the check, we may incorrectly call the defaulted copy ctor (in which the implementation can be ill-formed) and cause potential runtime bad code generation."
(In reply to Gregory Szorc [:gps] from comment #57)
> if getting central (and possibly earlier releases) on VS2015u3 is a good idea, let me know and I can land that.

The bug fix in comment 58 sounds like we should deploy this upgrade on m-c.
If you upgrade to VS2015u3 you might also want to apply KB3165756. Fixes at least one possible compiler bug:

https://msdn.microsoft.com/en-us/library/mt752379.aspx

>> Issue 3
>> Potential miscompilation of code-calling functions that resemble std::min/std::max on 
>> floating point values.
One other piece of data from the query in comment 44:

Rank 	E10s cohort 	Count 	%
1 	disqualified 	7587 	72.98 %
2 	control 	5509 	52.99 %
3 	test 		4853 	46.68 %
4 	addons 		255 	2.45 %
5 	set2a 		255 	2.45 %
6 	optedout 	26 	0.25 %

So it seems to happen both with and without e10s.
Current status is that while I have a laptop with one of the CPU models in question, as does Milan, neither of us have been able to reproduce the crash.

It's possible that the crash is specific to microcode version.  We're hoping to get that data added to crash-stats soon so we can tell.  The one user who we've been able to contact has version 0x18, while Milan has 0x1D and I have 0x21.

I don't know how to boot with an older version of the microcode than the one that's used by default (which I believe comes from the BIOS).  There are older (and newer) versions made available for use by Linux distros (0x18 is available as part of https://downloadcenter.intel.com/download/24661/Linux-Processor-Microcode-Data-File ), but I don't *think* those are usable with Windows in any way, unless I could somehow use part of the Linux boot process (e.g., grub) to load it and then boot into Windows.  But I think the part of the Linux boot process that loads it happens later, via the kernel, based on reading the manual for iucode-tool(8).

And I'm not even sure if switching to a different microcode version would help, or if playing lots of youtube and other videos are really the right steps to reproduce the crash.
(In reply to David Baron :dbaron: ⌚️UTC-7 (busy September 14-25) from comment #62)
> And I'm not even sure if switching to a different microcode version would
> help

Oh, but one reason to think it would is that the crashes don't occur on Windows 10, and I *believe* Windows 10 loads a more recent version of the microcode.
David, do you see the extra stuff in the app notes if you build with this patch?
Confirmed that attachment 8790490 [details] [diff] [review] adds CpuRevisionStatus in App Notes.
 https://crash-stats.mozilla.com/report/index/eea965fd-bad2-4b7b-b888-24a372160913

Milan asked me to do a try build for Windows, so that we can pass the build to others to test.
With patch on m-c
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=c3bdc76b4200
With patch on beta
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=6c6a66c5db3b
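
The patch contents aren't quoted here, but for context, a hedged sketch of how the loaded microcode ("update") revision can be read on Windows. The registry value name and layout are assumptions based on the usual mirroring of MSR 0x8B (IA32_BIOS_SIGN_ID); the actual patch may read it differently.

#include <windows.h>
#include <cstdio>
#include <cstring>

int main() {
  // Assumption: "Update Revision" is an 8-byte REG_BINARY whose high DWORD
  // holds the microcode revision (EDX of MSR 0x8B).
  BYTE data[8] = {0};
  DWORD size = sizeof(data);
  LONG rv = RegGetValueW(
      HKEY_LOCAL_MACHINE,
      L"HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0",
      L"Update Revision", RRF_RT_REG_BINARY, nullptr, data, &size);
  if (rv == ERROR_SUCCESS && size >= sizeof(data)) {
    DWORD revision = 0;
    std::memcpy(&revision, data + 4, sizeof(revision));
    std::printf("CpuRevisionStatus(0x%lx)\n",
                static_cast<unsigned long>(revision));
  }
  return 0;
}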
I am not sure if the following is related to this bug.

In Gecko, hundreds of HTMLMediaElements and MediaDecoders could pile up even when the JS side uses one media element at a time. This can be confirmed with the following URL from bug 1155000, which uses mp4 videos for testing.
  http://people.mozilla.org/~kbrosnan/tmp/1155000/video-memory-test.html
I think our RC3 build has avoided this crash. Marking 49 as no longer blocked.
How is RC4 looking?
Flags: needinfo?(lhenry)
(In reply to David Bolter [:davidb] from comment #68)
> How is RC4 looking?
Good so far with regard to this bug. There's no higher crash volume than normal from systems with a "GenuineIntel family 6 model 61 stepping 4" CPU.
(76.66% in signatures vs 02.28% overall) address = 0xffffffffffffffff
(70.46% in signatures vs 02.52% overall) adapter_device_id = 0x1616
(70.31% in signatures vs 02.52% overall) cpu_info = GenuineIntel family 6 model 61 stepping 4 | 4
(97.85% in signatures vs 37.76% overall) reason = EXCEPTION_ACCESS_VIOLATION_READ
(49.93% in signatures vs 08.79% overall) platform_version = 6.3.9600
(49.93% in signatures vs 08.79% overall) platform_pretty_version = Windows 8.1
(46.60% in signatures vs 05.83% overall) build_id = 20160829102229
(36.41% in signatures vs 05.47% overall) has dual GPUs = true
(33.01% in signatures vs 03.01% overall) Addon "Kaspersky Protection" = true
(95.49% in signatures vs 68.52% overall) adapter_vendor_id = 0x8086
(33.97% in signatures vs 10.23% overall) bios_manufacturer = Dell Inc.
(30.13% in signatures vs 08.52% overall) GFX_ERROR "Failed 2 buffer db="
(20.61% in signatures vs 03.20% overall) bios_manufacturer = Insyde

Where 'signatures' is every signature starting with 'arena_dalloc_small | je_free'.

Perhaps installing the "Kaspersky Protection" addon (light_plugin_ACF0E80077C511E59DED005056C00008@kaspersky.com) might help in reproducing the issue?
The issue is back again in 50.0b1.
See Also: → 1305120
Crash Signature: mozilla::layers::ContainerLayerProperties::ComputeChangeInternal ] [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::DestructRange | mozilla::PaintedLayerData::~PaintedLayerData ] → mozilla::layers::ContainerLayerProperties::ComputeChangeInternal ] [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::DestructRange | mozilla::PaintedLayerData::~PaintedLayerData ] [@ arena_dalloc_small | je_free | nsT…
Crash Signature: mozilla::detail::RunnableMethodImpl<T>::`scalar deleting destructor'' ] [@ arena_dalloc_small | je_free | nsCSSValue::DoReset ] [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::RemoveElementsAt | mozilla::DisplayList… → nsCSSValue::DoReset ] [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::RemoveElementsAt | mozilla::DisplayListClipState::GetCurrentCombinedClip ]
Comment on attachment 8790490 [details] [diff] [review]
A bit of a hack to get the update revision/signature into app notes of the crash report

Maybe we land this, even in the wrong place, as it should be easy to uplift, and it sounds like we have more instances of the problem showing up.
Attachment #8790490 - Flags: review?(dvander)
We could make it an annotation (bug 1305120), so it would be easier to use with Socorro and SuperSearch.
Comment on attachment 8790490 [details] [diff] [review]
A bit of a hack to get the update revision/signature into app notes of the crash report

Review of attachment 8790490 [details] [diff] [review]:
-----------------------------------------------------------------

Is there any reason this can't be in gfxWindowsPlatform?

::: gfx/thebes/gfxPlatform.cpp
@@ +682,5 @@
> +      }
> +
> +      if (cpuUpdateRevision > 0) {
> +        nsAutoCString revAndStatus;
> +        revAndStatus.AppendPrintf("CpuRevisionStatus(0x%x:0x%x) ",

nit: can use nsPrintfCString here
Attachment #8790490 - Flags: review?(dvander) → review+
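
To make the nit concrete, a hedged sketch of the nsPrintfCString form (the helper function and variable names are assumptions, not the actual patch): nsPrintfCString formats in its constructor, so the temporary string plus AppendPrintf collapses into a single Append of a formatted temporary.

#include "nsPrintfCString.h"  // Gecko printf-style string helper
#include <stdint.h>

void AppendCpuRevisionNote(nsACString& aAppNotes,
                           uint32_t aRevision, uint32_t aStatus) {
  aAppNotes.Append(
      nsPrintfCString("CpuRevisionStatus(0x%x:0x%x) ", aRevision, aStatus));
}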
Comment on attachment 8790490 [details] [diff] [review]
A bit of a hack to get the update revision/signature into app notes of the crash report

The patch in bug 1305120 (same code, different place) is probably more appropriate - separate annotation field and in a better file.
Attachment #8790490 - Attachment is obsolete: true
Looks like in 50.0b3 we have another signature (jemalloc_crash) strongly correlated with cpu_info = GenuineIntel family 6 model 61 stepping 4.
Crash Signature: nsCSSValue::DoReset ] [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::RemoveElementsAt | mozilla::DisplayListClipState::GetCurrentCombinedClip ] → nsCSSValue::DoReset ] [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::RemoveElementsAt | mozilla::DisplayListClipState::GetCurrentCombinedClip ] [@ jemalloc_crash ]
See Also: → 1306621
Interestingly, so far the 'jemalloc_crash' (strongly correlated with Intel CPUs) is gone in 50.0b4 at the same time as js::NativeObject::setSlotWithType (bug 1307285), which is strongly correlated to AMD CPUs.
See Also: → 1307285
Depends on: 1305888
Rank: 1
50.0b7 is affected by this bug again; these are the microcode facets:
https://crash-stats.mozilla.com/search/?signature=^arena&cpu_info=^GenuineIntel family 6 model 61 stepping 4&version=50.0b7&product=Firefox&process_type=browser&date=>2016-10-14&_sort=-date&_facets=signature&_facets=platform_pretty_version&_facets=cpu_info&_facets=cpu_microcode_version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=platform_pretty_version&_columns=cpu_microcode_version#facet-cpu_microcode_version
linkified: http://bit.ly/2e92Qz4
So based on comparing this list:
https://crash-stats.mozilla.com/search/?cpu_info=%5EGenuineIntel%20family%206%20model%2061%20stepping%204&signature=~je_free&signature=~sse2_blt&version=50.0b7&product=Firefox&date=%3E%3D2016-10-10T20%3A04%3A00.000Z&date=%3C2016-10-17T20%3A04%3A00.000Z&_sort=-date&_facets=signature&_facets=cpu_microcode_version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform_pretty_version&_columns=app_notes&_columns=graphics_critical_error#facet-cpu_microcode_version
which is the microcode versions of the 50.0b7 crashes that are *mostly* this bug, with this list:
https://crash-stats.mozilla.com/search/?cpu_info=%5EGenuineIntel%20family%206%20model%2061%20stepping%204&signature=OOM%20%7C%20small&product=Firefox&version=50.0b&date=%3E%3D2016-10-10T20%3A04%3A00.000Z&date=%3C2016-10-17T20%3A04%3A00.000Z&_sort=-date&_facets=signature&_facets=cpu_microcode_version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform_pretty_version&_columns=app_notes&_columns=graphics_critical_error#facet-cpu_microcode_version
which is the microcode versions of the OOM | small crashes on the affected CPU family, it seems reasonably clear that:

These microcode versions are affected:
0xe (rare), 0x11, 0x12, 0x13, 0x16, 0x18, 0x19

These microcode versions are not affected:
0x1d, 0x1f, 0x21, 0x22
This would *seem* to imply that revs before 0x1d (and 0xe or above, roughly) are the ones affected.

Joe, any thoughts?
Flags: needinfo?(joseph.k.olivas)
Just so I understand what's going on:

OOM is used just to basically get what versions are out there (everyone hits OOM), while the other shows which ones are hitting this crash.

Is this correct? I do see 0x1d, 0x1f and 0x21 in the bad crash, but very low numbers.

I'll leave the needinfo open for now.
Yes, I was using small out-of-memory crashes to try to establish a baseline distribution of the microcode versions among our users.  (The OOM crashes are smaller numbers than this crash, but I think big enough to get a usable baseline.)
(In reply to Joe Olivas from comment #84)
> Is this correct? I do see 0x1d, 0x1f and 0x21 in the bad crash, but very low
> numbers.

Yes -- I think the problem there is that the query isn't perfect.  It's a query on signature substring, which catches some other crashes.  Those looked like they had very different patterns of signatures than the ones this bug covers, so I didn't actually go through and check the minidumps to verify that they were not showing the patterns in comment 31 and comment 49.
Until we have a proper fix in place for this problem, could we do something similar to the whole Websense saga and have installations with the known affected Intel CPU + microcode versions identify themselves in the update ping?
That way, in case of an emergency (like a dot release on the Firefox release channel that is affected by this bug, because we cannot first test it with a wider beta audience like we do with RC builds), we would at least be able to retroactively disable automatic updates just for those crashing configurations...
See Also: → 1310478
(In reply to David Baron :dbaron: ⌚️UTC+8 from comment #82)
>...
> 
> These microcode versions are affected:
> 0xe (rare), 0x11, 0x12, 0x13, 0x16, 0x18, 0x19

These are still the only ones showing.

> 
> These microcode versions are not affected:
> 0x1d, 0x1f, 0x21, 0x22

These still haven't shown.
Are these signatures still showing up newer releases?
Flags: needinfo?(mozillamarcia.knous)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #91)
> Are these signatures still showing up newer releases?

A cursory manual look did not show anything later than 50, but I will needinfo Marco to answer for sure.
Flags: needinfo?(mozillamarcia.knous) → needinfo?(mcastelluccio)
50, 51.0b up to 12, 52.0a2 and 53.0a1 are currently unaffected, but the signatures are build-dependent so they might reappear in the future.
Flags: needinfo?(mcastelluccio)
platform-rel: + → -

Closing because no crashes reported for 12 weeks.

Status: REOPENED → RESOLVED
Closed: 8 years ago4 years ago
Resolution: --- → WORKSFORME