Closed Bug 772330 Opened 12 years ago Closed 9 years ago

layout crashes with AuthenticAMD Family 20 (0x14), Models 1 and 2 CPUs (also shows as AMD Radeon HD 6xxx series), spiking at various times

Categories

(Core :: Layout, defect)

Platform: x86
OS: Windows 7
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED

People

(Reporter: dbaron, Assigned: away)

References

(Depends on 1 open bug, Blocks 3 open bugs)

Details

(Keywords: crash, topcrash-win)

Attachments

(6 files)

We've had lots of layout crashes associated with the AMD Radeon HD 6xxx series graphics drivers.  I think they've all (or mostly) briefly spiked and then gone away.  Odds are they have the same underlying cause.

This is a meta-bug to track the problem.

(One question of interest is whether they ever go away, or just keep moving signature constantly and stay around all the time.)
(In reply to David Baron [:dbaron] from comment #0)
> (One question of interest is whether they ever go away, or just keep moving
> signature constantly and stay around all the time.)
They go away, then come back later, sometimes two builds after, other times hundreds of builds after. They don't usually stay more than one build.
How do you know that?  Maybe most of the time they're spread between a large number of low-frequency signatures, and occasionally they concentrate on a single signature.  Is there a way to verify that that's not happening?  (Back when we generated CSV files, I could have, but I don't see those anymore.)
So another theory is that this driver does some sort of binary patching or hooking that was designed for a particular version of Firefox, but the check it uses to make sure it has the right version relies on a very small amount of variable data and therefore has a significant false-positive rate. The result would be binary patching or hooking on certain Firefox versions with an appearance of randomness. If this is the case, it's only a matter of time before the pattern matches on a release build.
(In reply to David Baron [:dbaron] from comment #2)
> How do you know that?  Maybe most of the time they're spread between a large
> number of low-frequency signatures
It impacts the crash ratio.

> and occasionally they concentrate on a single signature.
When this issue happens, there are about half a dozen crash signatures.

(In reply to David Baron [:dbaron] from comment #3)
> If this is the case it's only a matter of time before the pattern matches on a
> release build.
It has already happened in Fx 11.0 (see bug 700288 comment 24).
Bug 768383 was an instance of this that showed up in FF14b9 and went away in FF14b10; I examined the minidump and it was an almost impossible crash (null deref after null-check with the intervening code being fairly well defined). I chalked it up to a weird PGO fluke, but it's also possible that the driver is overwriting a stack location or register. But if that were the case I'd expect to see the crashes spread out more. And there weren't any graphics calls nested in this stack frame, at least that I could see.

I'm a bit stumped by this one. It's the sort of thing that I'd love to catch in record and replay but we probably can't have those graphics drivers in a VM anyway.
(In reply to Scoobidiver from comment #4)
> (In reply to David Baron [:dbaron] from comment #2)
> > How do you know that?  Maybe most of the time they're spread between a large
> > number of low-frequency signatures
> It impacts the crash ratio.

Ah, ok, so I don't need to gather data from https://crash-analysis.mozilla.com/crash_analysis/
(In reply to David Baron [:dbaron] from comment #3)
> If this is the case it's only a matter of time before the pattern matches on a
> release build.
10.0.6 ESR is affected!
Blocks: 837371
Blocks: 839270
It might be useful to try to figure out what's similar about the builds that are affected that isn't a characteristic of the unaffected builds.  Does anybody happen to have a list of the affected builds?
So if we have any contacts at AMD, it might be worth asking them what regression might have been introduced on their end between (probably, though we don't have 100% confidence in these ranges):

  version 8.17.10.1047 and 8.17.10.1052 of aticfx32.dll
  version 8.17.10.310 and 8.17.10.318 of atidxx32.dll
  version 8.14.1.6150 and 8.14.1.6160 of atiuxpag.dll

(I got these ranges from the correlations for bug 839270; the third is consistent with bug 714320 comment 26 from over a year ago.)
roc did investigation of another minidump in bug 839270 comment 22.
(In reply to David Baron [:dbaron] from comment #9)
> So if we have any contacts at AMD, it might be worth asking them what

oh, and bug 700288 comment 35 suggests Joe does have contacts at AMD.
Summary of bug 839270 comment #22: We seem to do an unexpected jump forward by a short distance when we reach a specific point in our code, jumping into the middle of an instruction in another function. This doesn't always happen or the browser couldn't even start, but when it does happen it always happens in the same place in libxul for all the crash reports in that bug (even though those are different addresses since libxul is moved by ASLR). In bug 839270 the jump originates from a small leaf function which has clearly been compiled correctly and cannot be causing the jump itself.

Whatever's causing this must be very subtle and is almost certainly unrelated to the Gecko code implicated by the crash stacks.

I have some contacts at AMD too. I'll try them.
I got minidumps for some of the other crash bugs.

Bug 700288 is similar to bug 839270 --- we're in a small leaf function (UnionRectEdges), and inexplicably jump to the middle of an instruction (in this case within the same function, though). However, the address within libxul is different from (and nowhere near) the address for the crash in bug 839270.

Bug 714320 affects AddChild, like bug 839280, but I'm not sure what's going on there. See https://bugzilla.mozilla.org/show_bug.cgi?id=714320#c79.

Bug 722024 is like bug 700288. It looks like we're crashing in UnionRectEdges with an inexplicable jump forward past the end of the function, into int3 padding in that case.

In summary, the code address where we go wrong seems to vary between libxul builds (but is at the same location in libxul for all regardless of ASLR). I bet the varying impact of these crashes depends on exactly which function (if any) gets cursed.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> I have some contacts at AMD too. I'll try them.

Email sent.
One question that might be helpful to answer: do we ever see these crashes in more than one function for a given libxul build?
I *think* that we're seeing it in only one function per build, but one would probably need to look through all the dependent bugs and compare the builds where those happen.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #16)
> I *think* that we're seeing it in only one function per build, but one would
> probably need to look through all the dependent bugs and compare the builds
> where those happen.

Actually, scratch that. We have at least three different signatures for bug 839270 in 19.0b5 alone.
(In reply to David Baron [:dbaron] from comment #11)
> (In reply to David Baron [:dbaron] from comment #9)
> > So if we have any contacts at AMD, it might be worth asking them what
> 
> oh, and bug 700288 comment 35 suggests Joe does have contacts at AMD.

The people I know are the same people Robert emailed. Unfortunately I don't think we've heard back yet.
Blocks: 830531
Depends on: 845970
Blocks: 806071
Blocks: 854820
Blocks: 863714
Blocks: 865701
Given that the crashes tracked here are highly visible when they spike and are a continuous subject of tracking by stability and release management, I'll invoke the "bugs that spearhead investigation or fixes across a large collection of crashes" clause of https://wiki.mozilla.org/CrashKill/Topcrash on this meta tracker bug and add the topcrash keyword here. We should not use it on individual signatures, though, since we know those fluctuate from build to build anyhow.
Keywords: topcrash
I have a system with a Radeon HD 6310 (it's an iGPU of AMD E-350) which is used daily as an HTPC.

Bug 840161 blacklists window-acceleration and d2d-acceleration on this GPU due to this bug.

FWIW, I didn't have any crash with layers.acceleration.force-enabled=true, either with FX22 (my main browser) or in the nightly builds which I update regularly. I also tried gfx.direct2d.force-enabled=true without crashes, but I typically leave it off since it sometimes degrades performance.

If I can help tests in any way, please use me. My gfx about:support info is available at bug 840161 comment 15.
Blocks: 902349
Assignee: nobody → dmajor
TL;DR - We have a lot of observations but are far from a solution. Here's the story so far.

On 21.0b4, the bug manifests as a crash usually near xul!mozilla::dom::DocumentBinding::CreateInterfaceObjects. The specific instruction offset and the nature of the crash (access violation, invalid instruction, privileged instruction, etc.) can vary.

I can not-very-reliably repro this on the netbook named "MOZILLA-RD6310" by opening up some youtube videos in one window, then opening another window with nbcnews.com and mousing around and reloading until it crashes. It can take anywhere from a minute to an hour or more.

After the crash, everything seems as if xul!nsStyleContext::AddChild+0x12 (xul+0x7d760) had been corrupted to contain an instruction reading "call CreateInterfaceObjects+0x20 (xul+0xa9b01)". There are several reasons for believing this. First, the top of the stack contains AddChild+0x17, as if a return address had been pushed during a call instruction (five bytes). Second, AddChild+0x12 is a valid instruction reachable in the original binary, but AddChild+0x17 is in the middle of an instruction and could never be a return address without corruption. Third, CreateInterfaceObjects+0x20 is also in the middle of an instruction, so it could not be a valid branch target in an unmodified binary. The affected locations are always offsets from xul.dll, so the absolute values change based on xul's base. 

Here's where it gets suspicious: by the time we notice the crash, the memory at AddChild+0x12 appears to have its original values. So we can't definitively prove whether the bug is indeed the corruption described above, or some other badness that happens to have the same symptoms. It's possible that the driver is modifying the xul.dll memory (perhaps as a write-test) and quickly modifying it back to the original value. There are other possibilities like a hardware issue in the instruction fetch, but that seems less likely. 

Assuming that the driver is modifying memory, it would have to touch five bytes, more than it could typically do with regular 32-bit operations:
89 08 c3 83 c0 are the bytes at xul+0x7d760 originally.
e8 9c c3 02 00 are the bytes that would cause our theorized call.
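
As a sanity check on the theorized encoding, the call target and pushed return address can be recomputed from those five bytes. This is a minimal sketch of that arithmetic (all values are the xul-relative offsets quoted above; nothing new is assumed):

  #include <cstdint>
  #include <cstdio>

  int main() {
      const uint32_t callSite = 0x7d760;  // AddChild+0x12
      const uint8_t bytes[5] = {0xe8, 0x9c, 0xc3, 0x02, 0x00};  // theorized corruption

      // e8 is CALL rel32; the next four bytes are a little-endian displacement
      // relative to the end of the five-byte instruction.
      int32_t rel32 = bytes[1] | (bytes[2] << 8) | (bytes[3] << 16) | (bytes[4] << 24);
      uint32_t returnAddr = callSite + 5;   // pushed on the stack by CALL
      uint32_t target = returnAddr + rel32;

      printf("return address: xul+0x%x\n", returnAddr);  // 0x7d765 == AddChild+0x17
      printf("call target:    xul+0x%x\n", target);      // 0xa9b01 == CreateInterfaceObjects+0x20
      return 0;
  }

Both computed values line up with the observations above, which is what makes a single corrupted call instruction such a tidy explanation.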

Memory access breakpoints on the affected addresses don't trigger. Presumably that's because the driver accesses that physical memory via a different virtual-to-physical mapping (hardware breakpoints are based on virtual address). I tried dumping the driver's address mappings to see what other address it might be using, but there were so many mappings for that region that it's not practical to go chasing them all down.

Another complication is that the memory at CreateInterfaceObjects+0x20 changes each time you load Firefox. That memory just so happens to contain an absolute address of a global variable (sPrefCachesInited), and the Windows loader patches up the address based on where xul.dll gets based each time. What this means is, if we execute CreateInterfaceObjects+0x20, occasionally it looks like an innocuous instruction, so we continue on to 0x21 and so on. Depending on the interpretation of that memory, we crash in different ways and at different offsets. Usually it's plus-twenty-something, but in a few cases I've seen execution continue for dozens of instructions and jmp far away to mozjs. Also, sometimes those instructions contain a "pop", so that AddChild+0x17 is no longer on our stack.

I've tried detouring AddChild in several places, adding instructions that verify AddChild+0x12 before executing them. If the verification were to fail then we'd have solid proof of memory corruption. Unfortunately, I haven't been able to hit the crash after doing this. Either my reading of those values interferes with the execution of the scenario, or I just haven't waited long enough on the unreliable repro, can't really say. [Note: This detouring is not a fix that we can apply to source code; I can only do it in the debugger with after-the-fact knowledge of what function fails on this build]

All of the above applies to 21.0b4 only. The crash is not machine-specific (same functions affected on our netbook and various user crash dumps) but it is build-specific, since function layout changes with each compilation. I need to do more digging in the other bugs to see whether the victim is always xul+0x7d760, or at least some predictable location. If so, maybe we could play some tricks with the linker to avoid putting anything critical there.
I think this is a CPU bug. I don't say that lightly, because generally hardware is the last thing you should blame, but that's where the evidence is pointing. 

https://bugzilla.mozilla.org/show_bug.cgi?id=830531#c72

100% of 71760 crashes in bug 865701 occurred on the two CPU models affected by that microcode update (AuthenticAMD Family 20 (0x14), Models 1 and 2). Those models have combined CPU+GPU on the same chip, which would explain why this appeared to correlate with ATI drivers. 

http://support.amd.com/us/Processor_TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf

Erratum 688 is the only major bug that applies to both Models 1 and 2, and it just might be the issue that we're hitting. Our case of AddChild in bug 865701 meets the requirement of "after a not-taken branch that ends on the last byte of an aligned quad-word" and the "internal timing conditions" might explain the variability that we've seen. There is a workaround listed, but it requires BIOS authors to modify undocumented bits in the processor's instruction cache settings.

Our netbook is Family 20 Model 1, and I confirmed that PCI configuration register D18F4x164[2] = 0, indicating that this rev of the silicon does not have the fix for 688. I also confirmed that MSRC001_1021[14] = 0 and MSRC001_1021[3] = 0, indicating that my BIOS has not applied AMD's workaround. 

Unfortunately, installing KB2818604 from Windows Update didn't stop the crashes. I don't have a good explanation. Maybe that patch was for something else on the errata sheet. But after using a kernel debugger to mimic AMD's BIOS workaround (don't try this at home), I don't crash anymore. Or at least I haven't crashed yet -- the repro is unreliable to begin with, so I want to give it a few more attempts.
Wow. Your analysis is very impressive.
I don't suppose we can read those configuration registers and get them into crash dumps?
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #24)
> I don't suppose we can read those configuration registers and get them into
> crash dumps?

MSRs and PCI config need kernel privilege. We would have to write a driver to read them.
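
For reference, the MSR half of that check is tiny once you are in ring 0. A minimal sketch, assuming a Windows kernel-mode driver context (the function name is hypothetical, and the D18F4x164[2] check would need a separate PCI configuration-space read that is not shown):

  #include <ntddk.h>
  #include <intrin.h>

  // MSRC001_1021 is the AMD IC_CFG (instruction cache configuration) MSR
  // named in the erratum 688 workaround.
  #define AMD_IC_CFG_MSR 0xC0011021UL

  // TRUE if bits 14 and 3 of IC_CFG are both set, i.e. the BIOS has applied
  // AMD's suggested workaround. __readmsr faults outside kernel mode, which
  // is exactly why this can't be read from the crash reporter as-is.
  static BOOLEAN Erratum688WorkaroundApplied(void)
  {
      unsigned __int64 icCfg = __readmsr(AMD_IC_CFG_MSR);
      return (BOOLEAN)(((icCfg >> 14) & 1) && ((icCfg >> 3) & 1));
  }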
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #23)
> Wow. Your analysis is very impressive.

Agreed.

An interesting followup question:  is there a way we could examine a binary to determine whether it would trigger this bug?  (If we could, then we could reject builds that would trigger it, perhaps even during the build process.)
Summary: layout crashes with AMD Radeon HD 6xxx series, spiking at various times → layout crashes with AuthenticAMD Family 20 (0x14), Models 1 and 2 CPUs (also shows as AMD Radeon HD 6xxx series), spiking at various times
(In reply to David Baron [:dbaron] (needinfo? me; away Aug 28 - Sep 3) from comment #26)
> An interesting followup question:  is there a way we could examine a binary
> to determine whether it would trigger this bug?  (If we could, then we could
> reject builds that would trigger it, perhaps even during the build process.)

I imagine that the bug depends at least as much on the runtime call patterns and control flow as on the static contents of the binary.
> 100% of 71760 crashes in bug 865701 occurred on the two CPU models affected
> by that microcode update (AuthenticAMD Family 20 (0x14), Models 1 and 2).

How did you collect this data? I know you asked me about this; I wasn't able to run that query yesterday, but I did run a query today which shows different results:

For the date period 2013-04-25 through 2013-05-04 with the 21.0b4 builds, I selected all crashes with the following signatures associated with bug 865701:

            'mozilla::dom::DocumentBinding::CreateInterfaceObjects(JSContext*, JSObject*, JSObject**)',
            'JSCompartment::getNewType(JSContext*, js::Class*, js::TaggedProto, JSFunction*)',
            'JS_GetCompartmentPrincipals(JSCompartment*)',
            'nsStyleSet::ReparentStyleContext(nsStyleContext*, nsStyleContext*, mozilla::dom::Element*)',
            'nsFrameManager::ReResolveStyleContext(nsPresContext*, nsIFrame*, nsIContent*, nsStyleChangeList*, nsChangeHint, nsChangeHint, nsRestyleHint, mozilla::css::RestyleTracker&, nsFrameManager::DesiredA11yNotifications, nsTArray<nsIContent*>&, TreeMatchConte...',

The AuthenticAMD processors you mention are certainly the most common, but there are other Intel and AMD processor models which experience the same crash signatures. I'll attach the data by CPU and by signature/CPU. I'll also run this for the Firefox 19.0 crash (bug 830531) because IIRC the distribution was different.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #28)
> How did you collect this data? I know you asked me about this I wasn't able
> to run that query yesterday, but I did run a query today which shows
> different results:

My search only included DocumentBinding::CreateInterfaceObjects at the top of the stack. I've spot-checked a few dozen reports from the other signatures you listed. In getNewType and JS_GetCompartmentPrincipals, reports from AMD family 20 all went through AddChild or CreateInterfaceObjects, and other CPUs didn't. There might be other crashes getting mixed in to those signatures. For ReparentStyleContext and ReResolveStyleContext, the stacks are quite scattered on both Intel and AMD processors. There may be several root causes there. Maybe CreateInterfaceObjects was just by luck a good filter, in that no other crashes managed to sneak in.

I'd be curious to see whether we can say the same about the 19.0 crash.
(In reply to David Major [:dmajor] from comment #31)
> I'd be curious to see whether we can say the same about the 19.0 crash.

From April 25 to April 30 (I admit they're not good dates for 19.0, but that's what I had handy), I see 562 hits for TlsGetValue in 19.0. 560 of those are AMD family 20, and my spot-checks all showed XPC_WN_Helper_NewResolve on the stack. The remaining two reports from other processors had different stacks.
Impressive analysis. Looking forward to this issue being handled when possible.
(In reply to David Major [:dmajor] from comment #22)
> Unfortunately, installing KB2818604 from Windows Update didn't stop the
> crashes. I don't have a good explanation. Maybe that patch was for something
> else on the errata sheet. But after using a kernel debugger to mimic AMD's
> BIOS workaround (don't try this at home), I don't crash anymore. Or at least
> I haven't crashed yet -- the repro is unreliable to begin with, so I want to
> give it a few more attempts.

Did you do those additional attempts?

Maybe we can supply a kernel module that applies this change? Extreme perhaps, but what else can we do? The maintenance service runs with administrator privileges so I assume we can do this.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #34)
> Did you do those additional attempts?
Yes. I gave it several attempts on Friday, and I let the news site self-refresh over the weekend. It hasn't hit the crash so far.

> Maybe we can supply a kernel module that applies this change? Extreme
> perhaps, but what else can we do? The maintenance service runs with
> administrator privileges so I assume we can do this.
The trouble with my debugger hack is that half of the time it hangs the machine. I'm not surprised -- it's probably pretty dangerous to mess with cache settings when the system is already running. I'm guessing that's why the document says it should be done by the BIOS at boot.
Comment on attachment 798856 [details]
amd-cpus-19.grouped.csv by CPU only

>AuthenticAMD family 20 model 2 stepping 0 | 2,294856
>AuthenticAMD family 20 model 1 stepping 0 | 2,4791
>AuthenticAMD family 20 model 1 stepping 0 | 1,163
>GenuineIntel family 6 model 23 stepping 10 | 2,55
>GenuineIntel family 6 model 15 stepping 13 | 2,36
>AuthenticAMD family 20 model 2 stepping 0 | 1,26
>GenuineIntel family 6 model 28 stepping 2 | 2,24
>GenuineIntel family 6 model 42 stepping 7 | 4,22

Given the fast drop-off after the "AuthenticAMD family 20" CPUs, the others might be crashes that just happen to be in the same function/signature but are unrelated to this specific issue.

BTW, any idea what those numbers after the pipe actually are?
I believe those are the number of cores.
Here's the equivalent data for 21.0b4 by graphics vendor instead of by CPU:

0x1002 (AMD),108430
0x0000 (unknown/bad data),306
0x10de (nvidia),195
0x8086 (intel),175
0x1039 (SIS),5
0x5333 (S3),5
0x1106 (VIA),5
0x300b (?),1
Depends on: 921569
Depends on: 921609
Keywords: topcrashtopcrash-win
Depends on: 945439
Blocks: 1011075
Blocks: 1131831
36 rc1 has this defect. We built a second rc before going live.
Depends on: 1155836
38b2 & 38b5 were affected too.
In 38.0b8, we also have that crash with this signature: https://crash-stats.mozilla.com/report/list?signature=nsDisplayItem%3A%3AZIndex%28%29
Blocks: 1160317
38.0b9 was also impacted.
Adding a dependency on bug 1156135. We may need to detect this CPU/BIOS combination and alert the user at runtime.
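
The CPU-model half of such a runtime check needs only CPUID, no driver. A minimal sketch using MSVC intrinsics (the function name is hypothetical; detecting whether the BIOS workaround is missing would still need the kernel-mode MSR/PCI reads discussed earlier):

  #include <intrin.h>
  #include <cstring>

  // True on AuthenticAMD family 0x14 (20), models 1 or 2 -- the Bobcat parts
  // implicated in this bug.
  static bool IsAmdFamily14hModel1or2()
  {
      int regs[4];
      __cpuid(regs, 0);
      char vendor[13];
      std::memcpy(vendor + 0, &regs[1], 4);  // EBX
      std::memcpy(vendor + 4, &regs[3], 4);  // EDX
      std::memcpy(vendor + 8, &regs[2], 4);  // ECX
      vendor[12] = '\0';
      if (std::strcmp(vendor, "AuthenticAMD") != 0)
          return false;

      __cpuid(regs, 1);
      unsigned eax = static_cast<unsigned>(regs[0]);
      unsigned baseFamily = (eax >> 8) & 0xf;
      unsigned baseModel = (eax >> 4) & 0xf;
      unsigned extFamily = (eax >> 20) & 0xff;
      unsigned extModel = (eax >> 16) & 0xf;
      // AMD: the extended fields only apply when the base family is 0xF.
      unsigned family = (baseFamily == 0xf) ? baseFamily + extFamily : baseFamily;
      unsigned model = (baseFamily == 0xf) ? (extModel << 4) + baseModel : baseModel;
      return family == 0x14 && (model == 1 || model == 2);
  }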
Blocks: 945439
Depends on: 1156135
No longer depends on: 945439
Bug 1155836 attempted to fix one of the major places where this happens.
(In reply to David Baron [:dbaron] ⏰UTC-7 from comment #50)
> Bug 1155836 attempted to fix one of the major places where this happens.

And FWIW, I think we have not seen it since then. That doesn't mean we can declare victory, but at least it looks like the frequency of those issues has decreased compared to what we saw in the 38.0 beta cycle.
I'm going to call this fixed by bug 1155836.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
David, you just rock! I am really impressed by your work!
(In reply to dmajor (away) from comment #22)
> http://support.amd.com/us/Processor_TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf

The URL for this is now:
http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf

The full text of Erratum 688 is:

688 Processor May Cause Unpredictable Program Behavior Under Highly Specific Branch Conditions

Description
Under a highly specific and detailed set of internal timing conditions, the processor may incorrectly
update the branch status when a taken branch occurs where the first or second instruction after the
branch is an indirect call or jump. This may cause the processor to update the rIP (the instruction
pointer register) after a not-taken branch that ends on the last byte of an aligned quad-word such that
it appears the processor skips, and does not execute, one or more instructions. The new updated rIP
due to this erratum may not be at an instruction boundary.

Potential Effect on System
Unpredictable program behavior, possibly leading to a program error or system error. It is also
possible that the processor may hang or recognize an exception (for example, a #GP or #UD
exception), however AMD has not observed this effect.

Suggested Workaround
BIOS should set MSRC001_1021[14] = 1b and MSRC001_1021[3] = 1b. This workaround is
required only when bit 2 of Fixed Errata Status Register (D18F4x164[2]) = 0b.

Fix Planned
Yes
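
Reading the erratum together with the earlier caveat that the trigger depends on runtime behavior: the only piece that can be checked statically in a binary is whether a branch instruction ends on the last byte of an aligned quad-word; taken/not-taken state, the nearby indirect call or jump, and the internal timing conditions are all dynamic. A minimal sketch of that static predicate (hypothetical helper, not something the build system actually runs):

  #include <cstdint>
  #include <cstddef>

  // True if the final byte of an instruction at xul-relative offset `offset`
  // with length `len` lands on the last byte of an 8-byte-aligned quad-word --
  // the static precondition named by erratum 688. Everything else about the
  // trigger depends on runtime control flow and timing.
  static bool EndsOnAlignedQuadwordBoundary(uint32_t offset, std::size_t len)
  {
      return ((offset + len - 1) & 0x7) == 0x7;
  }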
So after debugging bug 1266626, which appears to be a form of this crash in build 3 for 46.0 (which we're not using; it shows up in crash-stats as 46.0b99 because of its use on the beta channel), I thought I'd look to see if we'd shipped other forms of this bug in release recently.

I did this by doing crash-stats queries with:
cpu_info=%5EAuthenticAMD+family+20+model+1&cpu_info=%5EAuthenticAMD+family+20+model+2
tacked on to see if anything interesting popped out.

So far the only interesting thing that I've found is that it appears we shipped a form of this bug that crashes in nsFrame::DisplayBorderBackgroundOutline in 43.0.2 and 43.0.3 (and also, older, 37.0.2).
In 47.0b8 this showed up again as crashes in mozilla::FramePropertyTable::GetInternal.
In 47.0b3 we had crashes in ValueToNameOrSymbolId and js::ValueToId<T>.
nsCSSOffsetState::InitOffsets seems like another variant of this signature, based on the 7-6 Nightly.
Blocks: 1312270
Blocks: 1331253
Blocks: 1316022
(In reply to David Major [:dmajor] from comment #22)
> Unfortunately, installing KB2818604 from Windows Update didn't stop the
> crashes. I don't have a good explanation. Maybe that patch was for something
> else on the errata sheet. But after using a kernel debugger to mimic AMD's
> BIOS workaround (don't try this at home), I don't crash anymore. Or at least
> I haven't crashed yet -- the repro is unreliable to begin with, so I want to
> give it a few more attempts.

Which I guess makes sense, considering the recommended workaround is setting the required disable bits in the IC_CFG MSR *only* after having checked the Fixed Errata Status Register in PCI configuration space. 
That's not something I'm sure you can ask the CPU alone to do, and definitely not in just a few lines of assembly. 

KB2818604 is just a DLL containing microcode for all AMD CPUs as of Q1 2013 (when the latest 0x5000029 and 0x5000119 revisions were released for the Bobcat ON-B0 and ON-C0 steppings respectively). 
According to the README change below, both of those revisions only bring a fix for erratum 784:
https://anonscm.debian.org/cgit/users/hmh/amd64-microcode.git/commit/microcode_amd.bin.README?id=9b4f1804855407f5ba2ce58ef428dfba226f3652

Your kernel-debugging trickery was also easily reproduced with msr-tools in this interesting thread: https://patchwork.kernel.org/patch/9390769/
And they didn't seem to hit any kind of instability.
Depends on: 1335925

Mirh, your patchwork link is dead. Did the content just move or is it gone?

Flags: needinfo?(mirh)
See Also: → 1746733