Closed Bug 772330 Opened 12 years ago Closed 9 years ago

layout crashes with AuthenticAMD Family 20 (0x14), Models 1 and 2 CPUs (also shows as AMD Radeon HD 6xxx series), spiking at various times

Categories

(Core :: Layout, defect)

Platform: x86
OS: Windows 7
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED

People

(Reporter: dbaron, Assigned: away)

References

(Depends on 1 open bug, Blocks 3 open bugs)

Details

(Keywords: crash, topcrash-win)

Attachments

(6 files)

We've had lots of layout crashes associated with the AMD Radeon HD 6xxx series graphics drivers.  I think they've all (or mostly) briefly spiked and then gone away.  Odds are they have the same underlying cause.

This is a meta-bug to track the problem.

(One question of interest is whether they ever go away, or just keep moving signature constantly and stay around all the time.)
(In reply to David Baron [:dbaron] from comment #0)
> (One question of interest is whether they ever go away, or just keep moving
> signature constantly and stay around all the time.)
They go away, then come back later, sometimes two builds after, other times hundreds of builds after. They don't usually stay more than one build.
How do you know that?  Maybe most of the time they're spread between a large number of low-frequency signatures, and occasionally they concentrate on a single signature.  Is there a way to verify that that's not happening?  (Back when we generated CSV files, I could have, but I don't see those anymore.)
So another theory is that this driver does some sort of binary patching or hooking that was designed for a particular version of Firefox, but the check it uses to make sure it has the right version relies on a very small amount of variable data and therefore has a significant false-positive rate. The result would be binary patching or hooking on certain Firefox versions with an appearance of randomness. If this is the case, it's only a matter of time before the pattern matches on a release build.
(In reply to David Baron [:dbaron] from comment #2)
> How do you know that?  Maybe most of the time they're spread between a large
> number of low-frequency signatures
It impacts the crash ratio.

> and occasionally they concentrate on a single signature.
When this issue happens, there are about half a dozen crash signatures.

(In reply to David Baron [:dbaron] from comment #3)
> If this is the case it's only a matter of time before the pattern matches on a
> release build.
It has already happened in Fx 11.0 (see bug 700288 comment 24).
Bug 768383 was an instance of this that showed up in FF14b9 and went away in FF14b10; I examined the minidump and it was an almost impossible crash (null deref after null-check with the intervening code being fairly well defined). I chalked it up to a weird PGO fluke, but it's also possible that the driver is overwriting a stack location or register. But if that were the case I'd expect to see the crashes spread out more. And there weren't any graphics calls nested in this stack frame, at least that I could see.

I'm a bit stumped by this one. It's the sort of thing that I'd love to catch in record and replay but we probably can't have those graphics drivers in a VM anyway.
(In reply to Scoobidiver from comment #4)
> (In reply to David Baron [:dbaron] from comment #2)
> > How do you know that?  Maybe most of the time they're spread between a large
> > number of low-frequency signatures
> It impacts the crash ratio.

Ah, ok, so I don't need to gather data from https://crash-analysis.mozilla.com/crash_analysis/
(In reply to David Baron [:dbaron] from comment #3)
> If this is the case it's only a matter of time before the pattern matches on a
> release build.
10.0.6 ESR is affected!
Blocks: 837371
Blocks: 839270
It might be useful to try to figure out what's similar about the builds that are affected that isn't a characteristic of the unaffected builds.  Does anybody happen to have a list of the affected builds?
So if we have any contacts at AMD, it might be worth asking them what regression might have been introduced on their end between (probably, though we don't have 100% confidence in these ranges):

  version 8.17.10.1047 and 8.17.10.1052 of aticfx32.dll
  version 8.17.10.310 and 8.17.10.318 of atidxx32.dll
  version 8.14.1.6150 and 8.14.1.6160 of atiuxpag.dll

(I got these ranges from the correlations for bug 839270; the third is consistent with bug 714320 comment 26 from over a year ago.)
roc did investigation of another minidump in bug 839270 comment 22.
(In reply to David Baron [:dbaron] from comment #9)
> So if we have any contacts at AMD, it might be worth asking them what

oh, and bug 700288 comment 35 suggests Joe does have contacts at AMD.
Summary of bug 839270 comment #22: We seem to do an unexpected jump forward by a short distance when we reach a specific point in our code, jumping into the middle of an instruction in another function. This doesn't always happen or the browser couldn't even start, but when it does happen it always happens in the same place in libxul for all the crash reports in that bug (even though those are different addresses since libxul is moved by ASLR). In bug 839270 the jump originates from a small leaf function which has clearly been compiled correctly and cannot be causing the jump itself.

Whatever's causing this must be very subtle and is almost certainly unrelated to the Gecko code implicated by the crash stacks.

I have some contacts at AMD too. I'll try them.
I got minidumps for some of the other crash bugs.

Bug 700288 is similar to bug 839270 --- we're in a small leaf function (UnionRectEdges), and inexplicably jump to the middle of an instruction (in this case within the same function, though). However, the address within libxul is different from (and nowhere near) the address for the crash in bug 839270.

Bug 714320 affects AddChild, like bug 839280, but I'm not sure what's going on there. See https://bugzilla.mozilla.org/show_bug.cgi?id=714320#c79.

Bug 722024 is like bug 700288. It looks like we're crashing in UnionRectEdges with an inexplicable jump forward past the end of the function, into int3 padding in that case.

In summary, the code address where we go wrong seems to vary between libxul builds (but is at the same location in libxul for all regardless of ASLR). I bet the varying impact of these crashes depends on exactly which function (if any) gets cursed.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #12)
> I have some contacts at AMD too. I'll try them.

Email sent.
One question that might be helpful to answer: do we ever see these crashes in more than one function for a given libxul build?
I *think* that we're seeing it in only one function per build, but one would probably need to look through all the dependent bugs and compare the builds where those happen.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #16)
> I *think* that we're seeing it in only one function per build, but one would
> probably need to look through all the dependent bugs and compare the builds
> where those happen.

Actually, scratch that. We have at least three different signatures for bug 839270 in 19.0b5 alone.
(In reply to David Baron [:dbaron] from comment #11)
> (In reply to David Baron [:dbaron] from comment #9)
> > So if we have any contacts at AMD, it might be worth asking them what
> 
> oh, and bug 700288 comment 35 suggests Joe does have contacts at AMD.

The people I know are the same people Robert emailed. Unfortunately I don't think we've heard back yet.
Blocks: 830531
Depends on: 845970
Blocks: 806071
Blocks: 854820
Blocks: 863714
Blocks: 865701
Given that the crashes tracked here are highly visible when they spike and are a continuous subject of tracking by stability and release management, I'll invoke the "bugs that spearhead investigation or fixes across a large collection of crashes" clause of https://wiki.mozilla.org/CrashKill/Topcrash on this meta tracker bug and add the topcrash keyword here. We should not use it on individual signatures, though, since we know those fluctuate from build to build anyhow.
Keywords: topcrash
I have a system with a Radeon HD 6310 (it's an iGPU of AMD E-350) which is used daily as an HTPC.

Bug 840161 blacklists window-acceleration and d2d-acceleration on this GPU due to this bug.

FWIW, I didn't have any crash with layers.acceleration.force-enabled=true, either with FX22 (my main browser) or in the nightly builds which I update regularly. I also tried gfx.direct2d.force-enabled=true without crashes, but I typically leave it off since it sometimes degrades performance.

If I can help tests in any way, please use me. My gfx about:support info is available at bug 840161 comment 15.
Blocks: 902349
Assignee: nobody → dmajor
TL;DR - We have a lot of observations but are far from a solution. Here's the story so far.

On 21.0b4, the bug manifests as a crash usually near xul!mozilla::dom::DocumentBinding::CreateInterfaceObjects. The specific instruction offset and the nature of the crash (access violation, invalid instruction, privileged instruction, etc.) can vary.

I can not-very-reliably repro this on the netbook named "MOZILLA-RD6310" by opening up some youtube videos in one window, then opening another window with nbcnews.com and mousing around and reloading until it crashes. It can take anywhere from a minute to an hour or more.

After the crash, everything seems as if xul!nsStyleContext::AddChild+0x12 (xul+0x7d760) had been corrupted to contain an instruction reading "call CreateInterfaceObjects+0x20 (xul+0xa9b01)". There are several reasons for believing this. First, the top of the stack contains AddChild+0x17, as if a return address had been pushed during a call instruction (five bytes). Second, AddChild+0x12 is a valid instruction reachable in the original binary, but AddChild+0x17 is in the middle of an instruction and could never be a return address without corruption. Third, CreateInterfaceObjects+0x20 is also in the middle of an instruction, so it could not be a valid branch target in an unmodified binary. The affected locations are always offsets from xul.dll, so the absolute values change based on xul's base. 

Here's where it gets suspicious: by the time we notice the crash, the memory at AddChild+0x12 appears to have its original values. So we can't definitively prove whether the bug is indeed the corruption described above, or some other badness that happens to have the same symptoms. It's possible that the driver is modifying the xul.dll memory (perhaps as a write-test) and quickly modifying it back to the original value. There are other possibilities like a hardware issue in the instruction fetch, but that seems less likely. 

Assuming that the driver is modifying memory, it would have to touch five bytes, more than it could typically do with regular 32-bit operations:
89 08 c3 83 c0 are the bytes at xul+0x7d760 originally.
e8 9c c3 02 00 are the bytes that would cause our theorized call.
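
As a sanity check on the theorized encoding, the call target and pushed return address can be recomputed from those five bytes. This is a minimal sketch of that arithmetic (all values are the xul-relative offsets quoted above; nothing new is assumed):

  #include <cstdint>
  #include <cstdio>

  int main() {
      const uint32_t callSite = 0x7d760;  // AddChild+0x12
      const uint8_t bytes[5] = {0xe8, 0x9c, 0xc3, 0x02, 0x00};  // theorized corruption

      // e8 is CALL rel32; the next four bytes are a little-endian displacement
      // relative to the end of the five-byte instruction.
      int32_t rel32 = bytes[1] | (bytes[2] << 8) | (bytes[3] << 16) | (bytes[4] << 24);
      uint32_t returnAddr = callSite + 5;   // pushed on the stack by CALL
      uint32_t target = returnAddr + rel32;

      printf("return address: xul+0x%x\n", returnAddr);  // 0x7d765 == AddChild+0x17
      printf("call target:    xul+0x%x\n", target);      // 0xa9b01 == CreateInterfaceObjects+0x20
      return 0;
  }

Both computed values line up with the observations above, which is what makes a single corrupted call instruction such a tidy explanation.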

Memory access breakpoints on the affected addresses don't trigger. Presumably that's because the driver accesses that physical memory via a different virtual-to-physical mapping (hardware breakpoints are based on virtual address). I tried dumping the driver's address mappings to see what other address it might be using, but there were so many mappings for that region that it's not practical to go chasing them all down.

Another complication is that the memory at CreateInterfaceObjects+0x20 changes each time you load Firefox. That memory just so happens to contain an absolute address of a global variable (sPrefCachesInited), and the Windows loader patches up the address based on where xul.dll gets based each time. What this means is, if we execute CreateInterfaceObjects+0x20, occasionally it looks like an innocuous instruction, so we continue on to 0x21 and so on. Depending on the interpretation of that memory, we crash in different ways and at different offsets. Usually it's plus-twenty-something, but in a few cases I've seen execution continue for dozens of instructions and jmp far away to mozjs. Also, sometimes those instructions contain a "pop", so that AddChild+0x17 is no longer on our stack.

I've tried detouring AddChild in several places, adding instructions that verify AddChild+0x12 before executing them. If the verification were to fail then we'd have solid proof of memory corruption. Unfortunately, I haven't been able to hit the crash after doing this. Either my reading of those values interferes with the execution of the scenario, or I just haven't waited long enough on the unreliable repro, can't really say. [Note: This detouring is not a fix that we can apply to source code; I can only do it in the debugger with after-the-fact knowledge of what function fails on this build]

All of the above applies to 21.0b4 only. The crash is not machine-specific (same functions affected on our netbook and various user crash dumps) but it is build-specific, since function layout changes with each compilation. I need to do more digging in the other bugs to see whether the victim is always xul+0x7d760, or at least some predictable location. If so, maybe we could play some tricks with the linker to avoid putting anything critical there.
I think this is a CPU bug. I don't say that lightly, because generally hardware is the last thing you should blame, but that's where the evidence is pointing. 

https://bugzilla.mozilla.org/show_bug.cgi?id=830531#c72

100% of 71760 crashes in bug 865701 occurred on the two CPU models affected by that microcode update (AuthenticAMD Family 20 (0x14), Models 1 and 2). Those models have combined CPU+GPU on the same chip, which would explain why this appeared to correlate with ATI drivers. 

http://support.amd.com/us/Processor_TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf

Erratum 688 is the only major bug that applies to both Models 1 and 2, and it just might be the issue that we're hitting. Our case of AddChild in bug 865701 meets the requirement of "after a not-taken branch that ends on the last byte of an aligned quad-word" and the "internal timing conditions" might explain the variability that we've seen. There is a workaround listed, but it requires BIOS authors to modify undocumented bits in the processor's instruction cache settings.

Our netbook is Family 20 Model 1, and I confirmed that PCI configuration register D18F4x164[2] = 0, indicating that this rev of the silicon does not have the fix for 688. I also confirmed that MSRC001_1021[14] = 0 and MSRC001_1021[3] = 0, indicating that my BIOS has not applied AMD's workaround. 

Unfortunately, installing KB2818604 from Windows Update didn't stop the crashes. I don't have a good explanation. Maybe that patch was for something else on the errata sheet. But after using a kernel debugger to mimic AMD's BIOS workaround (don't try this at home), I don't crash anymore. Or at least I haven't crashed yet -- the repro is unreliable to begin with, so I want to give it a few more attempts.
Wow. Your analysis is very impressive.
I don't suppose we can read those configuration registers and get them into crash dumps?
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #24)
> I don't suppose we can read those configuration registers and get them into
> crash dumps?

MSRs and PCI config need kernel privilege. We would have to write a driver to read them.
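
For reference, the MSR half of that check is tiny once you are in ring 0. A minimal sketch, assuming a Windows kernel-mode driver context (the function name is hypothetical, and the D18F4x164[2] check would need a separate PCI configuration-space read that is not shown):

  #include <ntddk.h>
  #include <intrin.h>

  // MSRC001_1021 is the AMD IC_CFG (instruction cache configuration) MSR
  // named in the erratum 688 workaround.
  #define AMD_IC_CFG_MSR 0xC0011021UL

  // TRUE if bits 14 and 3 of IC_CFG are both set, i.e. the BIOS has applied
  // AMD's suggested workaround. __readmsr faults outside kernel mode, which
  // is exactly why this can't be read from the crash reporter as-is.
  static BOOLEAN Erratum688WorkaroundApplied(void)
  {
      unsigned __int64 icCfg = __readmsr(AMD_IC_CFG_MSR);
      return (BOOLEAN)(((icCfg >> 14) & 1) && ((icCfg >> 3) & 1));
  }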
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #23)
> Wow. Your analysis is very impressive.

Agreed.

An interesting followup question:  is there a way we could examine a binary to determine whether it would trigger this bug?  (If we could, then we could reject builds that would trigger it, perhaps even during the build process.)
Summary: layout crashes with AMD Radeon HD 6xxx series, spiking at various times → layout crashes with AuthenticAMD Family 20 (0x14), Models 1 and 2 CPUs (also shows as AMD Radeon HD 6xxx series), spiking at various times
(In reply to David Baron [:dbaron] (needinfo? me; away Aug 28 - Sep 3) from comment #26)
> An interesting followup question:  is there a way we could examine a binary
> to determine whether it would trigger this bug?  (If we could, then we could
> reject builds that would trigger it, perhaps even during the build process.)

I imagine that the bug depends at least as much on the runtime call patterns and control flow as on the static contents of the binary.
> 100% of 71760 crashes in bug 865701 occurred on the two CPU models affected
> by that microcode update (AuthenticAMD Family 20 (0x14), Models 1 and 2).

How did you collect this data? I know you asked me about this; I wasn't able to run that query yesterday, but I did run a query today which shows different results:

For the date period 2013-04-25 through 2013-05-04 with the 21.0b4 builds, I selected all crashes with the following signatures associated with bug 865701:

            'mozilla::dom::DocumentBinding::CreateInterfaceObjects(JSContext*, JSObject*, JSObject**)',
            'JSCompartment::getNewType(JSContext*, js::Class*, js::TaggedProto, JSFunction*)',
            'JS_GetCompartmentPrincipals(JSCompartment*)',
            'nsStyleSet::ReparentStyleContext(nsStyleContext*, nsStyleContext*, mozilla::dom::Element*)',
            'nsFrameManager::ReResolveStyleContext(nsPresContext*, nsIFrame*, nsIContent*, nsStyleChangeList*, nsChangeHint, nsChangeHint, nsRestyleHint, mozilla::css::RestyleTracker&, nsFrameManager::DesiredA11yNotifications, nsTArray<nsIContent*>&, TreeMatchConte...',

The AuthenticAMD processors you mention are certainly the most common, but there are other Intel and AMD processor models which experience the same crash signatures. I'll attach the data by CPU and by signature/CPU. I'll also run this for the Firefox 19.0 crash (bug 830531) because IIRC the distribution was different.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #28)
> How did you collect this data? I know you asked me about this I wasn't able
> to run that query yesterday, but I did run a query today which shows
> different results:

My search only included DocumentBinding::CreateInterfaceObjects at the top of the stack. I've spot-checked a few dozen reports from the other signatures you listed. In getNewType and JS_GetCompartmentPrincipals, reports from AMD family 20 all went through AddChild or CreateInterfaceObjects, and other CPUs didn't. There might be other crashes getting mixed in to those signatures. For ReparentStyleContext and ReResolveStyleContext, the stacks are quite scattered on both Intel and AMD processors. There may be several root causes there. Maybe CreateInterfaceObjects was just by luck a good filter, in that no other crashes managed to sneak in.

I'd be curious to see whether we can say the same about the 19.0 crash.
(In reply to David Major [:dmajor] from comment #31)
> I'd be curious to see whether we can say the same about the 19.0 crash.

From April 25 to April 30 (I admit they're not good dates for 19.0, but that's what I had handy), I see 562 hits for TlsGetValue in 19.0. 560 of those are AMD family 20, and my spot-checks all showed XPC_WN_Helper_NewResolve on the stack. The remaining two reports from other processors had different stacks.
Impressive analysis. Looking forward to this issue being handled when possible.
(In reply to David Major [:dmajor] from comment #22)
> Unfortunately, installing KB2818604 from Windows Update didn't stop the
> crashes. I don't have a good explanation. Maybe that patch was for something
> else on the errata sheet. But after using a kernel debugger to mimic AMD's
> BIOS workaround (don't try this at home), I don't crash anymore. Or at least
> I haven't crashed yet -- the repro is unreliable to begin with, so I want to
> give it a few more attempts.

Did you do those additional attempts?

Maybe we can supply a kernel module that applies this change? Extreme perhaps, but what else can we do? The maintenance service runs with administrator privileges so I assume we can do this.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #34)
> Did you do those additional attempts?
Yes. I gave it several attempts on Friday, and I let the news site self-refresh over the weekend. It hasn't hit the crash so far.

> Maybe we can supply a kernel module that applies this change? Extreme
> perhaps, but what else can we do? The maintenance service runs with
> administrator privileges so I assume we can do this.
The trouble with my debugger hack is that half of the time it hangs the machine. I'm not surprised -- it's probably pretty dangerous to mess with cache settings when the system is already running. I'm guessing that's why the document says it should be done by the BIOS at boot.
Comment on attachment 798856 [details]
amd-cpus-19.grouped.csv by CPU only

>AuthenticAMD family 20 model 2 stepping 0 | 2,294856
>AuthenticAMD family 20 model 1 stepping 0 | 2,4791
>AuthenticAMD family 20 model 1 stepping 0 | 1,163
>GenuineIntel family 6 model 23 stepping 10 | 2,55
>GenuineIntel family 6 model 15 stepping 13 | 2,36
>AuthenticAMD family 20 model 2 stepping 0 | 1,26
>GenuineIntel family 6 model 28 stepping 2 | 2,24
>GenuineIntel family 6 model 42 stepping 7 | 4,22

Given the fast drop-off after the "AuthenticAMD family 20" CPUs, the others might be crashes that just happen to be in the same function/signature but are unrelated to this specific issue.

BTW, any idea what those numbers after the pipe actually are?
I believe those are the number of cores.
Here's the equivalent data for 21.0b4 by graphics vendor instead of by CPU:

0x1002 (AMD),108430
0x0000 (unknown/bad data),306
0x10de (nvidia),195
0x8086 (intel),175
0x1039 (SIS),5
0x5333 (S3),5
0x1106 (VIA),5
0x300b (?),1
Depends on: 921569
Depends on: 921609
Keywords: topcrashtopcrash-win
Depends on: 945439
Blocks: 1011075
Blocks: 1131831
36 rc1 has this defect. We built a second rc before going live.
Depends on: 1155836
38b2 & 38b5 were affected too.
In 38.0b8, we also have that crash with this signature: https://crash-stats.mozilla.com/report/list?signature=nsDisplayItem%3A%3AZIndex%28%29
Blocks: 1160317
38.0b9 was also impacted.
Adding a dependency on bug 1156135. We may need to detect this CPU/BIOS combination and alert the user at runtime.
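
The CPU-model half of such a runtime check needs only CPUID, no driver. A minimal sketch using MSVC intrinsics (the function name is hypothetical; detecting whether the BIOS workaround is missing would still need the kernel-mode MSR/PCI reads discussed earlier):

  #include <intrin.h>
  #include <cstring>

  // True on AuthenticAMD family 0x14 (20), models 1 or 2 -- the Bobcat parts
  // implicated in this bug.
  static bool IsAmdFamily14hModel1or2()
  {
      int regs[4];
      __cpuid(regs, 0);
      char vendor[13];
      std::memcpy(vendor + 0, &regs[1], 4);  // EBX
      std::memcpy(vendor + 4, &regs[3], 4);  // EDX
      std::memcpy(vendor + 8, &regs[2], 4);  // ECX
      vendor[12] = '\0';
      if (std::strcmp(vendor, "AuthenticAMD") != 0)
          return false;

      __cpuid(regs, 1);
      unsigned eax = static_cast<unsigned>(regs[0]);
      unsigned baseFamily = (eax >> 8) & 0xf;
      unsigned baseModel = (eax >> 4) & 0xf;
      unsigned extFamily = (eax >> 20) & 0xff;
      unsigned extModel = (eax >> 16) & 0xf;
      // AMD: the extended fields only apply when the base family is 0xF.
      unsigned family = (baseFamily == 0xf) ? baseFamily + extFamily : baseFamily;
      unsigned model = (baseFamily == 0xf) ? (extModel << 4) + baseModel : baseModel;
      return family == 0x14 && (model == 1 || model == 2);
  }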
Blocks: 945439
Depends on: 1156135
No longer depends on: 945439
Bug 1155836 attempted to fix one of the major places where this happens.
(In reply to David Baron [:dbaron] ⏰UTC-7 from comment #50)
> Bug 1155836 attempted to fix one of the major places where this happens.

And FWIW, I think we have not seen it since then. That doesn't mean we can declare victory, but at least it looks like the frequency of those issues has decreased compared to what we saw in the 38.0 beta cycle.
I'm going to call this fixed by bug 1155836.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
David, you just rock! I am really impressed by your work!
(In reply to dmajor (away) from comment #22)
> http://support.amd.com/us/Processor_TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf

The URL for this is now:
http://support.amd.com/TechDocs/47534_14h_Mod_00h-0Fh_Rev_Guide.pdf

The full text of Erratum 688 is:

688 Processor May Cause Unpredictable Program Behavior Under Highly Specific Branch Conditions

Description
Under a highly specific and detailed set of internal timing conditions, the processor may incorrectly
update the branch status when a taken branch occurs where the first or second instruction after the
branch is an indirect call or jump. This may cause the processor to update the rIP (the instruction
pointer register) after a not-taken branch that ends on the last byte of an aligned quad-word such that
it appears the processor skips, and does not execute, one or more instructions. The new updated rIP
due to this erratum may not be at an instruction boundary.

Potential Effect on System
Unpredictable program behavior, possibly leading to a program error or system error. It is also
possible that the processor may hang or recognize an exception (for example, a #GP or #UD
exception), however AMD has not observed this effect.

Suggested Workaround
BIOS should set MSRC001_1021[14] = 1b and MSRC001_1021[3] = 1b. This workaround is
required only when bit 2 of Fixed Errata Status Register (D18F4x164[2]) = 0b.

Fix Planned
Yes
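
Reading the erratum together with the earlier caveat that the trigger depends on runtime behavior: the only piece that can be checked statically in a binary is whether a branch instruction ends on the last byte of an aligned quad-word; taken/not-taken state, the nearby indirect call or jump, and the internal timing conditions are all dynamic. A minimal sketch of that static predicate (hypothetical helper, not something the build system actually runs):

  #include <cstdint>
  #include <cstddef>

  // True if the final byte of an instruction at xul-relative offset `offset`
  // with length `len` lands on the last byte of an 8-byte-aligned quad-word --
  // the static precondition named by erratum 688. Everything else about the
  // trigger depends on runtime control flow and timing.
  static bool EndsOnAlignedQuadwordBoundary(uint32_t offset, std::size_t len)
  {
      return ((offset + len - 1) & 0x7) == 0x7;
  }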
So after debugging bug 1266626, which appears to be a form of this crash in build 3 for 46.0 (which we're not using; it shows up in crash-stats as 46.0b99 because of its use on the beta channel), I thought I'd look to see if we'd shipped other forms of this bug in release recently.

I did this by doing crash-stats queries with:
cpu_info=%5EAuthenticAMD+family+20+model+1&cpu_info=%5EAuthenticAMD+family+20+model+2
tacked on to see if anything interesting popped out.

So far the only interesting thing that I've found is that it appears we shipped a form of this bug that crashes in nsFrame::DisplayBorderBackgroundOutline in 43.0.2 and 43.0.3 (and also, older, 37.0.2).
In 47.0b8 this showed up again as crashes in mozilla::FramePropertyTable::GetInternal.
In 47.0b3 we had crashes in ValueToNameOrSymbolId and js::ValueToId<T>.
nsCSSOffsetState::InitOffsets seems like another variant of this signature, based on the 7-6 Nightly.
Blocks: 1312270
Blocks: 1331253
Blocks: 1316022
(In reply to David Major [:dmajor] from comment #22)
> Unfortunately, installing KB2818604 from Windows Update didn't stop the
> crashes. I don't have a good explanation. Maybe that patch was for something
> else on the errata sheet. But after using a kernel debugger to mimic AMD's
> BIOS workaround (don't try this at home), I don't crash anymore. Or at least
> I haven't crashed yet -- the repro is unreliable to begin with, so I want to
> give it a few more attempts.

Which I guess makes sense, considering the recommended workaround is setting the required disable bits in the IC_CFG MSR *only* after having checked the Fixed Errata Status Register in PCI configuration space. 
That's not something I'm sure you can ask the CPU alone to do, and definitely not in just a few lines of assembly. 

KB2818604 is just a DLL containing microcode for all AMD CPUs as of Q1 2013 (when the latest 0x5000029 and 0x5000119 revisions were released for the Bobcat ON-B0 and ON-C0 steppings respectively). 
According to the README change below, both of those revisions only bring a fix for erratum 784:
https://anonscm.debian.org/cgit/users/hmh/amd64-microcode.git/commit/microcode_amd.bin.README?id=9b4f1804855407f5ba2ce58ef428dfba226f3652

Your kernel-debugging trickery was also easily reproduced with msr-tools in this interesting thread: https://patchwork.kernel.org/patch/9390769/
And they didn't seem to hit any kind of instability.
Depends on: 1335925

Mirh, your patchwork link is dead. Did the content just move or is it gone?

Flags: needinfo?(mirh)
See Also: → 1746733