Closed Bug 1296630 Opened 8 years ago Closed 4 years ago

Crash in arena_dalloc_small | je_free | PLDHashTable::~PLDHashTable | mozilla::ContainerState::~ContainerState

Categories

(Core :: General, defect)

Version: 49 Branch
Hardware: x86
OS: Windows 8
Type: defect
Priority: Not set
Severity: critical

Tracking


RESOLVED WORKSFORME
Tracking Status
platform-rel --- -
firefox49 + wontfix
firefox50 --- wontfix
firefox51 --- ?
firefox52 --- ?
firefox53 --- ?

People

(Reporter: marcia, Assigned: dbaron, NeedInfo)

References

Details

(Keywords: crash, regression, Whiteboard: [platform-rel-Intel])

Crash Data

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is report bp-a7026ba5-3925-447f-8d58-6e3542160819.
=============================================================

Seen while looking at B4 crash stats: a new signature in B4, currently sitting at #15: http://bit.ly/2bHyzcs

Not sure where to bucket it. There may be a few other similar crashes that are related, such as the crash immediately following it, arena_dalloc_small | je_free | nsTArray_base<T>::ShrinkCapacity | nsTArray_Impl<T>::ReplaceElementsAt<T> | mozilla::PaintedLayerData::Accumulate.
[Tracking Requested - why for this release]:
The scope of this issue might be bigger: this signature seems to be part of an Intel-CPU-specific crash pattern (mentioned in the 2016-08-18 channel meeting) that started in 49.0b4 and unfortunately seems to continue in beta 5, judging from early crash data there.

Starting in 49.0b4 a whole range of new signatures beginning with "arena..." showed up, coming from "GenuineIntel family 6 model 61 stepping 4 | 4" and "GenuineIntel family 6 model 61 stepping 4 | 2" devices: http://bit.ly/2b5z1mp
All in all they make up ~8% of all crashes in 49.0b4 and seem to happen on Windows 7 and above, but predominantly (60%) on Windows 8.1.

These were the changes landing in 49.0b4: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_49_0b3_RELEASE&tochange=FIREFOX_49_0b4_RELEASE
Crash Signature: [@ arena_dalloc_small | je_free | PLDHashTable::~PLDHashTable | mozilla::ContainerState::~ContainerState] → [@ arena_dalloc_small | je_free | PLDHashTable::~PLDHashTable | mozilla::ContainerState::~ContainerState] [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShrinkCapacity | nsTArray_Impl<T>::ReplaceElementsAt<T> | mozilla::PaintedLayerData::Accumulate]…
Keywords: regression
Crash Signature: arena_dalloc_small | je_free | sftkdb_FindObjectsInit] → arena_dalloc_small | je_free | sftkdb_FindObjectsInit] [@ arena_dalloc_small | je_free | _moz_pixman_region32_fini]
Not sure who could dig into this one...
Flags: needinfo?(milan)
Flags: needinfo?(continuation)
Flags: needinfo?(bugs)
Hard to tell if this is a single problem, or a bunch of problems, and what grouping is the correct one, but given that it started in beta, it probably is the same cause.
Jet is going to look at the FrameLayerBuilder::BuildContainerLayerFor ones; I'll find somebody to look at LayerManagerComposite::PostProcessLayers; and there are a few other display item/display list related ones.

Looking at the list of changes in 49b4 (comment 1), there are a few media-related changes, so it's probably worth somebody looking at those.  The only other one that stands out is bug 1291016, which is probably fine, but to a casual observer it looks like a new, uninitialized variable got introduced, so... (needinfo to :heycam instead of :jfkthame, who's not around right now).
Flags: needinfo?(milan)
Flags: needinfo?(cam)
Flags: needinfo?(ajones)
High volume group of possibly related crashes, new on beta 4, let's call this a blocker for 49.
I don't know anything about layout code, sorry.
Flags: needinfo?(continuation)
If this is an OOM condition then it is probably related to bug 1296453.
(In reply to Milan Sreckovic [:milan] from comment #3)
> The only other one that stands out is bug 1291016, which is probably fine, but to a
> casual observer a new, uninitialized variable got introduced, so... (:heycam
> instead of :jfkthame who's not around right now).

Following up over there.
Flags: needinfo?(cam)
Daniel, any thoughts on this one?
Flags: needinfo?(dholbert)
The crash spike issue has disappeared again in 49.0b6.
In beta 5, crashes from "GenuineIntel family 6 model 61 stepping 4" devices made up 8.2% of the overall crash volume; in beta 6 they are back to a "normal" level of 1.2% of all crashes.
We should probably wait and see how beta 7 performs before considering this solved or untracking it, though...
(In reply to David Bolter [:davidb] from comment #8)
> Daniel, and thoughts on this one?

Looks like the backtrace is in layers code, which I'm not super-familiar with. kats or mattwoodrow would perhaps be able to offer more useful opinions/thoughts than I can.  (Comment 9 is encouraging, though; maybe this is fixed? I guess we'll see.)

(Side note, following up on comment 3 / comment 7: jfkthame says over in bug 1291016 that he doesn't think it's connected to this bug.)
Flags: needinfo?(dholbert)
This is a hashtable being torn down inside of ~ContainerState().  ContainerState owns two hash tables, and this could be either one:
>  nsTHashtable<nsRefPtrHashKey<PaintedLayer>> mPaintedLayersAvailableForRecycling;
...and:
>  nsDataHashtable<nsGenericHashKey<MaskLayerKey>, RefPtr<ImageLayer>>
>        mRecycledMaskImageLayers;
https://dxr.mozilla.org/mozilla-central/rev/01748a2b1a463f24efd9cd8abad9ccfd76b037b8/layout/base/FrameLayerBuilder.cpp#1396-1423

We might be putting something bogus in one of those hashtables, and then crashing when the hashtable gets destroyed, or something...  mstange & dvander have "hg blame" for each of those hashtable declarations, so one of them might be a good person to take a look at this, too, if we discover that it's not fixed as hoped in comment 9.  [CC'ing them]
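
As a minimal sketch of the failure mode described above (the Layer type, manual refcounting, and RecycleTable here are illustrative assumptions, not Gecko code): a table that owns references to its entries crashes in its own destructor when one of those entries has already been freed through another owner, far from the code that caused the corruption.

#include <unordered_set>

struct Layer {
  int mRefCnt = 0;
  void AddRef() { ++mRefCnt; }
  void Release() { if (--mRefCnt == 0) delete this; }
};

struct RecycleTable {
  std::unordered_set<Layer*> mEntries;
  void Put(Layer* aLayer) { aLayer->AddRef(); mEntries.insert(aLayer); }
  ~RecycleTable() {
    // Analogue of the hashtable destructor releasing each entry: if an entry
    // is dangling, this Release() touches freed memory, and the crash shows
    // up here rather than at the site of the original bug.
    for (Layer* l : mEntries) {
      l->Release();
    }
  }
};

int main() {
  RecycleTable table;
  Layer* layer = new Layer();
  layer->AddRef();    // some other owner takes a reference
  table.Put(layer);   // the table takes its own reference
  layer->Release();   // the other owner drops its reference
  layer->Release();   // the illustrative bug: an over-release frees the object early...
  return 0;           // ...so ~RecycleTable() now releases a dangling pointer
}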
I don't see any playback commits in the regression range in comment 1.
Flags: needinfo?(ajones)
The crash level of this CPU family still looks normal in 49.0b7, so I think we can close this bug.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
The issue is back again in 49.0b8. I don't understand what's going on :-(
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Bug 1294193 is another bug where there's a strong correlation to "GenuineIntel family 6 model 61 stepping 4 | 4".
See Also: → 1294193
For "arena_dalloc_small | je_free | PLDHashTable::~PLDHashTable | mozilla::ContainerState::~ContainerState":
(91.30% in signature vs 03.81% overall) address = 0xffffffffffffffff
(91.30% in signature vs 05.34% overall) adapter_device_id = 0x1616
(91.30% in signature vs 05.50% overall) cpu_info = GenuineIntel family 6 model 61 stepping 4 | 4
(47.83% in signature vs 02.74% overall) build_id = 20160814184416
(43.48% in signature vs 08.58% overall) platform_pretty_version = Windows 8.1
(43.48% in signature vs 08.58% overall) platform_version = 6.3.9600
(26.09% in signature vs 03.73% overall) bios_manufacturer = Insyde
(21.74% in signature vs 02.76% overall) Addon "Video DownloadHelper" = true
We seem to be talking about a heap corruption scenario.  I imagine this is some underlying problem that the changes on beta are tickling into higher frequency, rather than something actually caused by the changes between beta 3 and beta 4, or between beta 7 and beta 8.

The first set of patches is: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_49_0b3_RELEASE&tochange=FIREFOX_49_0b4_RELEASE

The second set of patches is: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_49_0b7_RELEASE&tochange=FIREFOX_49_0b8_RELEASE

Grasping at straws: there are audio-related things in both, but that's weak.
The release that got better (beta 6, see comment 9) contains a fix to bug 1293985, with the "...PLDHashTable::Iterator can't handle modifications while iterating...", so that's starting to look interesting.
Mats, thoughts on this bug?  Since your patch in bug 1293985 correlates with things getting better (then they got worse afterwards), and we're crashing in the PLDHashTable destructor, I thought you might have some insight.
Flags: needinfo?(mats)
The changes in bug 1292856 may also affect this one if Layer rendering is affected. Jamie: can you chime in on what that patch can change related to memory allocation?
Flags: needinfo?(bugs) → needinfo?(jnicol)
Any correlation with the changes in bug 1293985 seems coincidental to me.
The PLDHashTable object there is different from this one.

The correlations in comment 16 seem unusually strong, so it might be worth finding hardware that matches:
>(91.30% in signature vs 05.34% overall) adapter_device_id = 0x1616
>(91.30% in signature vs 05.50% overall) cpu_info = GenuineIntel family 6 model 61 stepping 4 | 4
and install:
>(21.74% in signature vs 02.76% overall) Addon "Video DownloadHelper" = true
and test with some of the URLs from crash-stats...
If it's at least semi-reproducible it might be possible to fix.
Flags: needinfo?(mats)
(In reply to Jet Villegas (:jet) from comment #21)
> The changes in bug 1292856 may also affect this one if Layer rendering is
> affected. Jamie: can you chime in on what that patch can change related to
> memory allocation?

That change will in some cases make us *avoid* making a large allocation.
Flags: needinfo?(jnicol)
Crash Signature: arena_dalloc_small | je_free | sftkdb_FindObjectsInit] [@ arena_dalloc_small | je_free | _moz_pixman_region32_fini] → arena_dalloc_small | je_free | sftkdb_FindObjectsInit] [@ arena_dalloc_small | je_free | _moz_pixman_region32_fini] [@ arena_run_tree_insert | arena_dalloc_small | je_free | mozilla::UniquePtr<T>::reset ] [@ arena_dalloc_small | je_free | nsTArray_bas…
After looking at the "cpu info" correlation for the signatures here, I'm leaning
towards a hardware related bug, as Marco suggested in bug 1294193 comment 6.
Many of signatures in this bug have a 100% correlation to
"GenuineIntel family 6 model 61 stepping 4".

dbaron, what do you think about that theory? (given your experience with
the AMD bug)
Flags: needinfo?(dbaron)
Let's keep an eye on this next week and see what Intel says.
(In reply to Mats Palmgren (:mats) from comment #24)
> After looking at the "cpu info" correlation for the signatures here, I'm
> leaning
> towards a hardware related bug, as Marco suggested in bug 1294193 comment 6.
> Many of signatures in this bug have a 100% correlation to
> "GenuineIntel family 6 model 61 stepping 4".

They all do.

But they also seem to have graphics hardware in common.  (e.g., "adapter device id" is mostly 0x1616 with a bit of 0x1606 and a few stragglers; "adapter vendor id" nearly always 0x8086, which is apparently Intel(R) HD Graphics 5500)

I think if you want to claim it's a hardware bug you need to make a much stronger case.

And even if it is, we should still be working to figure out how to fix it.  There was obviously something that made it start happening, so we should figure out what that was and undo it if possible.
Flags: needinfo?(dbaron)
> But they also seem to have graphics hardware in common.

I assumed this was because this chip comes with a built-in GPU.

FWIW, here are a few matches I get from Google on the cpu info string:
Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) i7-5650U CPU @ 2.20GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) M-5Y71 CPU @ 1.20GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) M-5Y51 CPU @ 1.10GHz [x86 Family 6 Model 61 Stepping 4]
Intel(R) Core(TM) M-5Y31 CPU @ 0.90GHz [x86 Family 6 Model 61 Stepping 4]

These are all Broadwell, with varying GPUs (source: http://www.cpu-world.com/ ).
I think the [@ sse2_blt] crashes looked interesting at first because they have a similar CPU pattern and predominance on 49 betas, except they spiked in 49.0b7 and 49.0b8 rather than in 49.0b4, b5, and b8 like many (didn't check all) of the others.
A few other observations:

 * a pretty big portion of the URLs reported in crash-stats are either the Facebook homepage or YouTube videos.  This makes it seem like there's a decent chance that these crashes usually or always occur during video playback

 * a decent portion of the crash reports (half of the ones I sampled?) have the cubeb audio thread as one of the threads contending for the malloc lock at the time the main thread crashes in the allocator (at one of two different stacks), e.g., bp-a68d7849-7be0-4058-9697-842d82160903 (thread 65) or bp-4d11b27f-9f69-4cd5-8256-1819d2160903 (thread 119).  I suspect this wouldn't be the case if audio weren't playing at the time of the crash.  (I *suspect* the contention is because those threads keep running a little bit after the crash happens, crash reporting then starts on the main thread, and the other threads essentially keep running until they hit a lock that they need to acquire.  I'm not really sure about this, though, i.e., about how long other threads would keep running when one thread crashes.)
One other thought, though.  I looked at the minidump for bp-0d55d0d0-99a5-4899-8aa4-f48da2160903, and allegedly we crash on the instruction:
71B248FB 3B C1                cmp         eax,ecx

I just don't see how that instruction can yield EXCEPTION_ACCESS_VIOLATION_READ with a crash address of 0xffffffff, especially when eax and ecx are both 0x01500228.
platform-rel: --- → ?
Whiteboard: [platform-rel-Intel]
49 beta 10 still looks affected, judging from early crash data there.
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #30)
> ...  I suspect this
> wouldn't be the case if audio weren't playing at the time of the crash.

We have audio-related patches landing in releases where we could observe a difference in the crash rates (good and bad), so it could be the changes in timing getting us into trouble.
Marking this as a blocker for 49 as it seems very high volume.
The audio playback aspect of this makes me wonder if it's related to bug 1255737. I believe that the speculation there was related to drivers causing audio shutdown badness. https://hg.mozilla.org/releases/mozilla-beta/rev/ab7b68014a1e would have shipped in 49b4.
Just to sum up again the impact from this bug that we have seen so far this beta cycle:
beta 1: unaffected
beta 2: unaffected
beta 3: unaffected
beta 4: affected
beta 5: affected
beta 6: unaffected
beta 7: unaffected
beta 8: affected
(beta 9 not released)
beta 10: affected
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #31)
> Though one other thought.  I looked at the minidump for
> bp-0d55d0d0-99a5-4899-8aa4-f48da2160903, and allegedly we crash on the
> instruction:
> 71B248FB 3B C1                cmp         eax,ecx
> 
> I just don't see how that instruction can yield
> EXCEPTION_ACCESS_VIOLATION_READ with a crash address of 0xffffffff,
> especially when eax and ecx are both 0x01500228.

This reminds me of bug 1034706 comment 44, although that was AMD specific.

By the way, we had a similar situation with AMD for 48 Beta (see bug 1290419).
Some builds were affected, some were not. It might depend on the compiler.
(In reply to [:philipp] from comment #36)
> just to sum up again the impact from this bug that we have seen so far this
> beta cycle: 
> beta 1: unaffected
> beta 2: unaffected
> beta 3: unaffected
> beta 4: affected
> beta 5: affected
> beta 6: unaffected
> beta 7: unaffected
> beta 8: affected
> (beta 9 not released)
> beta 10: affected

Well, that doesn't seem to match the bug 1255737 in/out pattern.  We had AsyncShutdownBlocked (causing shutdown hangs in the field) in:
beta 1, 2, 3, 7, 8, 9, 10
We weren't blocking asyncshutdown in:
beta 4, 5, 6
Pasting a response from Adam Moloniewicz, Intel (with his permission):
"As this looks like a heap corruption – have you tried to run the app with Application Verifier engaged ? Especially with Memory and Heap options enabled. You could also use Intel inspector(Intel studio) to analyze memory/threading anomalies. Other ideas that come to my mind could be to force internal SW modules to use isolated heaps instead of using only the global one(which is usually the common case). I’m not aware of the internal architecture so it’s hard to come up with particular ideas but perhaps some custom C++ memory allocator would do the job. This way we could narrow down the root cause.

So far I don’t see any strong evidence that would indicate the graphics UMD modules are a culprit, though indeed, they utilize the process global heap. So not sure how to assist you. Is there any easy way to disable HW rendering acceleration so that the rendering would fall back to WARP renderer instead of HW?"
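
To make the "isolated heaps" suggestion concrete, here is a hedged sketch of backing one module's allocations with a private Win32 heap (HeapCreate) instead of the process-global heap, so heap-metadata corruption in the default heap does not affect that module's data. The IsolatedHeap and PaintedLayerRecord names are made up for illustration; this is not an existing Gecko API.

#include <windows.h>
#include <cstddef>
#include <new>

class IsolatedHeap {
 public:
  IsolatedHeap() : mHeap(HeapCreate(0, 0, 0)) {}
  ~IsolatedHeap() { if (mHeap) HeapDestroy(mHeap); }

  void* Alloc(size_t aSize) {
    return mHeap ? HeapAlloc(mHeap, HEAP_ZERO_MEMORY, aSize) : nullptr;
  }
  void Free(void* aPtr) {
    if (mHeap && aPtr) HeapFree(mHeap, 0, aPtr);
  }

 private:
  HANDLE mHeap;  // private heap, separate from the process default heap
};

// Example: a hypothetical type whose instances live entirely in the isolated
// heap, kept apart from allocations made through the global allocator.
struct PaintedLayerRecord {
  int id = 0;

  static IsolatedHeap& Heap() { static IsolatedHeap sHeap; return sHeap; }
  static void* operator new(size_t aSize) {
    void* p = Heap().Alloc(aSize);
    if (!p) throw std::bad_alloc();
    return p;
  }
  static void operator delete(void* aPtr) { Heap().Free(aPtr); }
};

The trade-off is extra per-heap overhead and losing jemalloc's behavior for those allocations, but it can help narrow down which module's data is being stomped.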
Assigning to dbaron who agreed to own this exceptionally complex issue...
Assignee: nobody → dbaron
I looked through a bunch of crashes with cpearce.  One interesting point he noticed is that at least some of the ones on YouTube were using VP9, which wouldn't go through DXVA, which makes DXVA less likely.  (Another thing making DXVA less likely is that it should be writing to graphics memory.)

On the other hand, he's suspicious of cubeb's writing to audio buffers.
And one other point is that the machines in question do seem to come from multiple manufacturers, given:

Rank 	Bios manufacturer 	Count 	%
1 	Dell Inc. 	4480 	33.43 %
2 	Insyde 	2504 	18.69 %
3 	American Megatrends Inc. 	2336 	17.43 %
4 	Hewlett-Packard 	1792 	13.37 %
5 	LENOVO 	1244 	9.28 %
6 	Insyde Corp. 	721 	5.38 %
7 	INSYDE Corp. 	129 	0.96 %
8 	Lenovo 	91 	0.68 %
9 	TOSHIBA 	37 	0.28 %
10 	Intel Corporation 	30 	0.22 %

from the query https://crash-stats.mozilla.com/search/?cpu_info=%5EGenuineIntel%20family%206%20model%2061%20stepping%204&signature=~je_free&signature=~sse2_blt&product=Firefox&version=49.0b&_sort=-date&_facets=signature&_facets=bios_manufacturer&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform_pretty_version&_columns=uptime&_columns=app_notes&_columns=graphics_critical_error#facet-bios_manufacturer
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #31)
> Though one other thought.  I looked at the minidump for
> bp-0d55d0d0-99a5-4899-8aa4-f48da2160903, and allegedly we crash on the
> instruction:
> 71B248FB 3B C1                cmp         eax,ecx
> 
> I just don't see how that instruction can yield
> EXCEPTION_ACCESS_VIOLATION_READ with a crash address of 0xffffffff,
> especially when eax and ecx are both 0x01500228.

I see this in all the arena_dalloc_small crashes that I looked at.

The sse2_blt crashes are equally weird, crashing on a 'mov edi,ecx' instruction with a write access violation.

Neither of these instructions accesses memory, so an access violation sounds impossible.

Either the EIP value in the minidump is incorrect (but only for this CPU?), or we're hitting a CPU bug. I can't think of any other way this could be possible.

I've had a look at the errata for the 5th-gen Intel CPUs; nothing stands out as being this, but there are a lot, so I could easily have missed it.
More bizarre data from the minidumps.  I took a look at 9 minidumps (5 for one signature, all from beta 10; 4 for a different signature, all from beta 10; and 1 more from the first signature but from beta 8):

In all cases:
 * The upper short of EAX, EBX, ECX, and EDI was always the same for a given minidump, but varied between minidumps (0080, 0050, 0080, 0060, 00e0, 0060, 00b0, 0100, 0120, 0080)
 * The lower short of those 4 registers was always: AX: 0228, BX: 0220, CX: 0228, DI: 0040
 * EBP was always 00000040
 * EFLAGS was always either 00210246 or 00010246 (which is consistent with having executed a CMP between two equal values, which is allegedly the instruction we crash on)
 * EDX and ESI looked like pointers to the same area of memory, though differing by a decent amount.  Presumably heap pointers.
 * ESP and EIP looked like pointers to other (different from each other and from EDX/ESI) areas of memory
 * EIP always has the low short the same for a given build, although differing slightly between beta 8 and beta 10

ted, does something like this ring any bells?
Flags: needinfo?(ted)
(In reply to David Bolter [:davidb] from comment #41)
> Links for the tools I mentioned in comment 39:
> https://msdn.microsoft.com/en-us/library/windows/hardware/ff538115(v=vs.85).aspx

Likely you all have this installed already as part of the SDK

> https://software.intel.com/en-us/intel-system-studio

Available on all platforms, but starts at $699

> Isolated heaps info:
> https://msdn.microsoft.com/en-us/library/windows/desktop/aa366599(v=vs.85).aspx
We should feed dbaron's analysis from comment 49 back to the Intel guy as well (and comment 46 and comment 48, or a link to this bug).  (Perhaps once ted weighs in.)
(In reply to Randell Jesup [:jesup] from comment #51)
> We should feed dbaron's analysis from comment 49 back to the intel guy as
> well (and comment 46 and comment 48, or a link to this).  (Perhaps once ted
> weighs in)

Joe from Intel is on this bug - however I don't believe that Adam is a user. I'll see if we can get him signed up via the ML. 

Joe - any others that we should be adding here?
Flags: needinfo?(joseph.k.olivas)
I'm testing on one of these - any insight if there is a usage pattern? Right now, I'm doing videos and general browsing.
(In reply to Desigan Chinniah [:cyberdees] [:dees] [London - GMT] from comment #52)
> Joe - any others that we should be adding here?

I am following this bug closely and feeding back to some people internally here. I can be the main point of contact.
Flags: needinfo?(joseph.k.olivas)
Inspired by bug 1300233 and with some help from Aryx and froydnj on IRC, I should point out that thanks to the combination of bug 1259782 and bug 1270664, Firefox 49 is the first release we're shipping with Visual Studio 2015 rather than 2013.  This upgrade also involved using SSE instructions in generated code, since that can't be turned off in 2015.

So that seems like it could be related to the problems we're seeing here.

I'd still be interested to know more details about what sorts of problems were present in this CPU revision that may have been fixed in microcode updates.
Nothing there rings any particular bells, sorry.
Flags: needinfo?(ted)
All versions of Firefox 49 and newer are currently building on Visual Studio 2015 Update 2. VS2015u3 is out but not deployed (bug 1283203 tracks). Someone may want to comb the release notes for VS2015u3 to see if it fixes anything that could be related to this crash. Also, if getting central (and possibly earlier releases) on VS2015u3 is a good idea, let me know and I can land that.
(In reply to Gregory Szorc [:gps] from comment #57)
> Someone may want to comb the release notes for VS2015u3 to see if it fixes
> anything that could be related to this crash.

The first item in VC++ fixes looks interesting:
https://www.visualstudio.com/news/releasenotes/vs2015-update3-vs#visualcpp

"We now check the access of a deleted trivial copy/move ctor. Without the check, we may incorrectly call the defaulted copy ctor (in which the implementation can be ill-formed) and cause potential runtime bad code generation."
(In reply to Gregory Szorc [:gps] from comment #57)
> if getting central (and possibly earlier releases) on VS2015u3 is a good idea, let me know and I can land that.

The bug fix in comment 58 sounds like we should deploy this upgrade on m-c.
If you upgrade to VS2015u3 you might also want to apply KB3165756. Fixes at least one possible compiler bug:

https://msdn.microsoft.com/en-us/library/mt752379.aspx

>> Issue 3
>> Potential miscompilation of code-calling functions that resemble std::min/std::max on 
>> floating point values.
One other piece of data from the query in comment 44:

Rank 	E10s cohort 	Count 	%
1 	disqualified 	7587 	72.98 %
2 	control 	5509 	52.99 %
3 	test 		4853 	46.68 %
4 	addons 		255 	2.45 %
5 	set2a 		255 	2.45 %
6 	optedout 	26 	0.25 %

So it seems to happen both with and without e10s.
Current status is that while I have a laptop with one of the CPU models in question, as does Milan, neither of us have been able to reproduce the crash.

It's possible that the crash is specific to microcode version.  We're hoping to get that data added to crash-stats soon so we can tell.  The one user who we've been able to contact has version 0x18, while Milan has 0x1D and I have 0x21.

I don't know how to boot with an older version of the microcode than the one that's used by default (which I believe comes from the BIOS).  There are older (and newer) versions made available for use by Linux distros (0x18 is available as part of https://downloadcenter.intel.com/download/24661/Linux-Processor-Microcode-Data-File ), but I don't *think* those are usable with Windows in any way, unless I could somehow use part of the Linux boot process (e.g., grub) to load it and then boot into Windows.  But I think the part of the Linux boot process that loads it happens later, via the kernel, based on reading the manual for iucode-tool(8).

And I'm not even sure if switching to a different microcode version would help, or if playing lots of youtube and other videos are really the right steps to reproduce the crash.
(In reply to David Baron :dbaron: ⌚️UTC-7 (busy September 14-25) from comment #62)
> And I'm not even sure if switching to a different microcode version would
> help

Oh, but one reason to think it would is that the crashes don't occur on Windows 10, and I *believe* Windows 10 loads a more recent version of the microcode.
David, do you see the extra stuff in the app notes if you build with this patch?
Confirmed that attachment 8790490 [details] [diff] [review] adds CpuRevisionStatus in App Notes.
 https://crash-stats.mozilla.com/report/index/eea965fd-bad2-4b7b-b888-24a372160913

Milan asked me to do a try build for Windows, so that we can pass the build to others to test.
With patch on m-c
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=c3bdc76b4200
With patch on beta
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=6c6a66c5db3b
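
The patch contents aren't quoted here, but for context, a hedged sketch of how the loaded microcode ("update") revision can be read on Windows. The registry value name and layout are assumptions based on the usual mirroring of MSR 0x8B (IA32_BIOS_SIGN_ID); the actual patch may read it differently.

#include <windows.h>
#include <cstdio>
#include <cstring>

int main() {
  // Assumption: "Update Revision" is an 8-byte REG_BINARY whose high DWORD
  // holds the microcode revision (EDX of MSR 0x8B).
  BYTE data[8] = {0};
  DWORD size = sizeof(data);
  LONG rv = RegGetValueW(
      HKEY_LOCAL_MACHINE,
      L"HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0",
      L"Update Revision", RRF_RT_REG_BINARY, nullptr, data, &size);
  if (rv == ERROR_SUCCESS && size >= sizeof(data)) {
    DWORD revision = 0;
    std::memcpy(&revision, data + 4, sizeof(revision));
    std::printf("CpuRevisionStatus(0x%lx)\n",
                static_cast<unsigned long>(revision));
  }
  return 0;
}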
I am not sure if the following is related to this bug.

In Gecko, hundreds of HTMLMediaElements and MediaDecoders could pile up even when the JS side uses one media element at a time. This can be confirmed with the following URL from bug 1155000, which uses mp4 videos for testing.
  http://people.mozilla.org/~kbrosnan/tmp/1155000/video-memory-test.html
I think our RC3 build has avoided this crash. Marking 49 as no longer blocked.
How is RC4 looking?
Flags: needinfo?(lhenry)
(In reply to David Bolter [:davidb] from comment #68)
> How is RC4 looking?
Good so far with regard to this bug. There's no higher crash volume than normal from systems with a "GenuineIntel family 6 model 61 stepping 4" CPU.
(76.66% in signatures vs 02.28% overall) address = 0xffffffffffffffff
(70.46% in signatures vs 02.52% overall) adapter_device_id = 0x1616
(70.31% in signatures vs 02.52% overall) cpu_info = GenuineIntel family 6 model 61 stepping 4 | 4
(97.85% in signatures vs 37.76% overall) reason = EXCEPTION_ACCESS_VIOLATION_READ
(49.93% in signatures vs 08.79% overall) platform_version = 6.3.9600
(49.93% in signatures vs 08.79% overall) platform_pretty_version = Windows 8.1
(46.60% in signatures vs 05.83% overall) build_id = 20160829102229
(36.41% in signatures vs 05.47% overall) has dual GPUs = true
(33.01% in signatures vs 03.01% overall) Addon "Kaspersky Protection" = true
(95.49% in signatures vs 68.52% overall) adapter_vendor_id = 0x8086
(33.97% in signatures vs 10.23% overall) bios_manufacturer = Dell Inc.
(30.13% in signatures vs 08.52% overall) GFX_ERROR "Failed 2 buffer db="
(20.61% in signatures vs 03.20% overall) bios_manufacturer = Insyde

Where 'signatures' is every signature starting with 'arena_dalloc_small | je_free'.

Perhaps installing the "Kaspersky Protection" addon (light_plugin_ACF0E80077C511E59DED005056C00008@kaspersky.com) might help in reproducing the issue?
The issue is back again in 50.0b1.
See Also: → 1305120
Crash Signature: mozilla::layers::ContainerLayerProperties::ComputeChangeInternal ] [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::DestructRange | mozilla::PaintedLayerData::~PaintedLayerData ] → mozilla::layers::ContainerLayerProperties::ComputeChangeInternal ] [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::DestructRange | mozilla::PaintedLayerData::~PaintedLayerData ] [@ arena_dalloc_small | je_free | nsT…
Crash Signature: mozilla::detail::RunnableMethodImpl<T>::`scalar deleting destructor'' ] [@ arena_dalloc_small | je_free | nsCSSValue::DoReset ] [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::RemoveElementsAt | mozilla::DisplayList… → nsCSSValue::DoReset ] [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::RemoveElementsAt | mozilla::DisplayListClipState::GetCurrentCombinedClip ]
Comment on attachment 8790490 [details] [diff] [review]
A bit of a hack to get the update revision/signature into app notes of the crash report

Maybe we land this, even in the wrong place, as it should be easy to uplift, and it sounds like we have more instances of the problem showing up.
Attachment #8790490 - Flags: review?(dvander)
We could make it an annotation (bug 1305120), so it would be easier to use with Socorro and SuperSearch.
Comment on attachment 8790490 [details] [diff] [review]
A bit of a hack to get the update revision/signature into app notes of the crash report

Review of attachment 8790490 [details] [diff] [review]:
-----------------------------------------------------------------

Is there any reason this can't be in gfxWindowsPlatform?

::: gfx/thebes/gfxPlatform.cpp
@@ +682,5 @@
> +      }
> +
> +      if (cpuUpdateRevision > 0) {
> +        nsAutoCString revAndStatus;
> +        revAndStatus.AppendPrintf("CpuRevisionStatus(0x%x:0x%x) ",

nit: can use nsPrintfCString here
Attachment #8790490 - Flags: review?(dvander) → review+
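
To make the nit concrete, a hedged sketch of the nsPrintfCString form (the helper function and variable names are assumptions, not the actual patch): nsPrintfCString formats in its constructor, so the temporary string plus AppendPrintf collapses into a single Append of a formatted temporary.

#include "nsPrintfCString.h"  // Gecko printf-style string helper
#include <stdint.h>

void AppendCpuRevisionNote(nsACString& aAppNotes,
                           uint32_t aRevision, uint32_t aStatus) {
  aAppNotes.Append(
      nsPrintfCString("CpuRevisionStatus(0x%x:0x%x) ", aRevision, aStatus));
}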
Comment on attachment 8790490 [details] [diff] [review]
A bit of a hack to get the update revision/signature into app notes of the crash report

The patch in bug 1305120 (same code, different place) is probably more appropriate - separate annotation field and in a better file.
Attachment #8790490 - Attachment is obsolete: true
Looks like in 50.0b3 we have another signature (jemalloc_crash) strongly correlated with cpu_info = GenuineIntel family 6 model 61 stepping 4.
Crash Signature: nsCSSValue::DoReset ] [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::RemoveElementsAt | mozilla::DisplayListClipState::GetCurrentCombinedClip ] → nsCSSValue::DoReset ] [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | nsTArray_Impl<T>::RemoveElementsAt | mozilla::DisplayListClipState::GetCurrentCombinedClip ] [@ jemalloc_crash ]
See Also: → 1306621
Interestingly, so far the 'jemalloc_crash' (strongly correlated with Intel CPUs) is gone in 50.0b4 at the same time as js::NativeObject::setSlotWithType (bug 1307285), which is strongly correlated to AMD CPUs.
See Also: → 1307285
Depends on: 1305888
Rank: 1
50.0b7 is affected by this bug again; these are the microcode facets:
https://crash-stats.mozilla.com/search/?signature=^arena&cpu_info=^GenuineIntel family 6 model 61 stepping 4&version=50.0b7&product=Firefox&process_type=browser&date=>2016-10-14&_sort=-date&_facets=signature&_facets=platform_pretty_version&_facets=cpu_info&_facets=cpu_microcode_version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=platform_pretty_version&_columns=cpu_microcode_version#facet-cpu_microcode_version
linkified: http://bit.ly/2e92Qz4
So based on comparing this list:
https://crash-stats.mozilla.com/search/?cpu_info=%5EGenuineIntel%20family%206%20model%2061%20stepping%204&signature=~je_free&signature=~sse2_blt&version=50.0b7&product=Firefox&date=%3E%3D2016-10-10T20%3A04%3A00.000Z&date=%3C2016-10-17T20%3A04%3A00.000Z&_sort=-date&_facets=signature&_facets=cpu_microcode_version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform_pretty_version&_columns=app_notes&_columns=graphics_critical_error#facet-cpu_microcode_version
which is the microcode versions of the 50.0b7 crashes that are *mostly* this bug, with this list:
https://crash-stats.mozilla.com/search/?cpu_info=%5EGenuineIntel%20family%206%20model%2061%20stepping%204&signature=OOM%20%7C%20small&product=Firefox&version=50.0b&date=%3E%3D2016-10-10T20%3A04%3A00.000Z&date=%3C2016-10-17T20%3A04%3A00.000Z&_sort=-date&_facets=signature&_facets=cpu_microcode_version&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform_pretty_version&_columns=app_notes&_columns=graphics_critical_error#facet-cpu_microcode_version
which is the microcode versions of the OOM | small crashes on the affected CPU family, it seems reasonably clear that:

These microcode versions are affected:
0xe (rare), 0x11, 0x12, 0x13, 0x16, 0x18, 0x19

These microcode versions are not affected:
0x1d, 0x1f, 0x21, 0x22
This would *seem* to imply that revs before 0x1d (and 0xe or above, roughly) are the ones affected.

Joe, any thoughts?
Flags: needinfo?(joseph.k.olivas)
Just so I understand what's going on:

OOM is used just to basically get what versions are out there (everyone hits OOM), while the other shows which ones are hitting this crash.

Is this correct? I do see 0x1d, 0x1f and 0x21 in the bad crash, but very low numbers.

I'll leave the needinfo open for now.
Yes, I was using small out-of-memory crashes to try to establish a baseline distribution of the microcode versions among our users.  (The OOM crashes are smaller numbers than this crash, but I think big enough to get a usable baseline.)
(In reply to Joe Olivas from comment #84)
> Is this correct? I do see 0x1d, 0x1f and 0x21 in the bad crash, but very low
> numbers.

Yes -- I think the problem there is that the query isn't perfect.  It's a query on signature substring, which catches some other crashes.  Those looked like they had very different patterns of signatures than the ones this bug covers, so I didn't actually go through and check the minidumps to verify that they were not showing the patterns in comment 31 and comment 49.
Until we have a proper fix in place for this problem, could we do something similar to the whole Websense saga and have installations with the known affected Intel CPU + microcode versions identify themselves in the update ping?
That way, in case of an emergency (like a dot release on the Firefox release channel that is affected by this bug, because we cannot first test it with a wider beta audience like we do with RC builds), we would at least be able to retroactively disable automatic updates just for those crashing configurations...
See Also: → 1310478
(In reply to David Baron :dbaron: ⌚️UTC+8 from comment #82)
>...
> 
> These microcode versions are affected:
> 0xe (rare), 0x11, 0x12, 0x13, 0x16, 0x18, 0x19

These are still the only ones showing.

> 
> These microcode versions are not affected:
> 0x1d, 0x1f, 0x21, 0x22

These still haven't shown.
Are these signatures still showing up newer releases?
Flags: needinfo?(mozillamarcia.knous)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #91)
> Are these signatures still showing up newer releases?

A cursory manual look did not show anything later than 50, but I will needinfo Marco to answer for sure.
Flags: needinfo?(mozillamarcia.knous) → needinfo?(mcastelluccio)
50, 51.0b up to 12, 52.0a2 and 53.0a1 are currently unaffected, but the signatures are build-dependent so they might reappear in the future.
Flags: needinfo?(mcastelluccio)
platform-rel: + → -

Closing because no crashes reported for 12 weeks.

Status: REOPENED → RESOLVED
Closed: 8 years ago4 years ago
Resolution: --- → WORKSFORME