Open Bug 1606138 Opened 2 months ago Updated 1 month ago

Crash in [@ RetainedDisplayListBuilder::AttemptPartialUpdate]

Categories

(External Software Affecting Firefox :: Other, defect, major)

x86
Windows 7
defect
Not set
major

Tracking

(firefox72+ wontfix, firefox73- fix-optional, firefox74- fix-optional)

Tracking Status
firefox72 + wontfix
firefox73 - fix-optional
firefox74 - fix-optional

People

(Reporter: philipp, Unassigned)

Details

(Keywords: crash, regression, topcrash)

Crash Data

[Tracking Requested - why for this release]:

This bug is for crash report bp-7c5cd9a0-d09d-4272-b1d7-ba1aa0191227.

Top 10 frames of crashing thread:

0  @0x35d158 
1 xul.dll RetainedDisplayListBuilder::AttemptPartialUpdate layout/painting/RetainedDisplayListBuilder.cpp:1481
2 xul.dll nsLayoutUtils::PaintFrame layout/base/nsLayoutUtils.cpp:3956
3 xul.dll mozilla::PresShell::Paint layout/base/PresShell.cpp:6031
4 xul.dll void nsViewManager::ProcessPendingUpdatesPaint view/nsViewManager.cpp:461
5 xul.dll void nsViewManager::ProcessPendingUpdatesForView view/nsViewManager.cpp:396
6 xul.dll nsViewManager::ProcessPendingUpdates view/nsViewManager.cpp:1019
7 xul.dll void nsRefreshDriver::Tick layout/base/nsRefreshDriver.cpp:2176
8 xul.dll void mozilla::RefreshDriverTimer::TickRefreshDrivers layout/base/nsRefreshDriver.cpp:350
9 xul.dll void mozilla::RefreshDriverTimer::Tick layout/base/nsRefreshDriver.cpp:367

this crash signature is spiking up in 72.0b11 from users of the 32bit version of the browser on windows 7. this is currently accounting for 66% of content crashes on the latest beta (but hitting individual users repeatedly).

this far it's only occurring on 3 cpu models - Cpu info facet:
GenuineIntel family 6 model 23 stepping 10 154 88.00 %
GenuineIntel family 6 model 23 stepping 6 12 6.86 %
GenuineIntel family 6 model 15 stepping 11 9 5.14 %

We may want to call this a blocker, certainly worth fixing in RC if possible

Some crashing urls:
https://sts.karnataka.gov.in/SATS/#
https://xat.com/Trade
http://www.pcpndt.karnataka.gov.in/

Looking at the pushlog, I don't see anything Graphics or DisplayList related.
Maybe Bug 1605867? Or bug 1603714?
Bob, Aaron, what do you think?

Flags: needinfo?(matt.woodrow)
Flags: needinfo?(bobowencode)
Flags: needinfo?(aklotz)

Randell, if you're around, can you take a look to see if you can identify what may have regressed this?

Flags: needinfo?(rjesup)

My patch was backed out for that build.

Flags: needinfo?(aklotz)

(In reply to Liz Henry (:lizzard) from comment #3)
...

Maybe Bug 1605867? Or bug 1603714?
Bob, Aaron, what do you think?

Anything is possible with sandboxing, but I'd be very surprised if that were down to my patch, particularly with the win7, 32-bit, certain CPU model details.

Flags: needinfo?(bobowencode)
Crash Signature: [@ RetainedDisplayListBuilder::AttemptPartialUpdate] → [@ RetainedDisplayListBuilder::AttemptPartialUpdate] [@ nsLayoutUtils::PaintFrame ]
Crash Signature: [@ RetainedDisplayListBuilder::AttemptPartialUpdate] [@ nsLayoutUtils::PaintFrame ] → [@ RetainedDisplayListBuilder::AttemptPartialUpdate] [@ nsLayoutUtils::PaintFrame ] [@ mozilla::LauncherMain] [@ trunc | RetainedDisplayListBuilder::AttemptPartialUpdate] [@ SdbInitDatabase] [@ firefox.exe@0x2bff]

This has escalated in volume in 3 days and now has 18728 crashes/3351 installs. Marking as blocking.

Strong correlation to one CPU:
(84.55% in signature vs 25.15% overall) CPU Info = GenuineIntel family 6 model 23 stepping 10 [84.55% vs 32.87% if cpu_arch = x86]
(100.0% in signature vs 49.49% overall) moz_crash_reason = null

Glandium: This appears to be some sort of compiler/compiled-code change triggering a CPU bug, from the correlations and the lack of any obvious changes at the crash location. Perhaps something like the code's alignment to the page, or a branches' alignment, or a structure size change.
dmajor: I realize this isn't your responsibility, but perhaps you have some ideas what could be going wrong here? (note dmajor is on PTO with requests off, so not NI'ing)

lizzard: there's no obvious or even non-obvious cause from the changelogs; no changes anywhere near the crash location in a long time. Possibly just generating a b12 will make it go away - worth a try.
Note: glandium is out until Jan 7... Not sure who might be able to cover on this sort of thing with him and dmajor out. NIing erahm and sylveste for possible help here

Flags: needinfo?(sledru)
Flags: needinfo?(rjesup)
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(erahm)

(In reply to Randell Jesup [:jesup] (PTO thru 1/1, needinfo me) from comment #8)

lizzard: there's no obvious or even non-obvious cause from the changelogs; no changes anywhere near the crash location in a long time. Possibly just generating a b12 will make it go away - worth a try.

72 rc1 is building today and will be on the beta channel tomorrow, so fingers crossed for that.

I hope a rebuild will fix that tomorrow. If not, I will bother more people (I already pinged a few of them)

Flags: needinfo?(sledru)

Just stumbled into this bug: I had a look at the crashes and they're all coming from Penryn or Merom processors, i.e. the Core 2 Duo family. They're old and more importantly they are subject to software-based mitigations for Spectre/Meltdown in the Windows kernel. It might be worth keeping this in mind since it would happen almost only on those processors, all other non-ancient ones have microcode-based mitigations.

Looking into the crashes themselves I see this pattern coming up very often (but not always):

  • The crash reason is EXCEPTION_ACCESS_VIOLATION_EXEC, we're trying to execute something that's not executable
  • eax contains something like 0x00<num>c
  • The crashing address is 0x00<num>8

So we're trying to make an indirect call to something that's pointed by eax but before trying to run it we subtract 4 from the address.

Does this ring any bells? Is it likely to find an indirect call at the bottom of those stacks?

It's late now over here but tomorrow I'll try having a look at the stack data directly.

Just another detail I had missed: the crashing address is very close to what's in esp so it looks like we're trying to jump into the stack!? More precisely the crashing address is 244 bytes ahead of the stack pointer.

(In reply to Gabriele Svelto [:gsvelto] from comment #12)

Just another detail I had missed: the crashing address is very close to what's in esp so it looks like we're trying to jump into the stack!? More precisely the crashing address is 244 bytes ahead of the stack pointer.

I was just scanning through the code and saw you mentioned

  • eax contains something like 0x00<num>c
  • The crashing address is 0x00<num>8

At the end of xul!nsLayoutUtils::AddExtraBackgroundItems (which is what we appear to have returned from with the messed up instruction pointer) there is some code for the construction of the Maybe<TemporaryDisplayListBuilder> I think from [1]

xul!mozilla::Maybe<TemporaryDisplayListBuilder>::emplace<nsIFrame *&,nsDisplayListBuilderMode &,const bool &>:
11f9ecb0 55              push    ebp
11f9ecb1 89e5            mov     ebp,esp
11f9ecb3 57              push    edi
11f9ecb4 56              push    esi
11f9ecb5 8b4508          mov     eax,dword ptr [ebp+8]
11f9ecb8 89ce            mov     esi,ecx
11f9ecba 8b4d0c          mov     ecx,dword ptr [ebp+0Ch]
11f9ecbd 0fb639          movzx   edi,byte ptr [ecx]
11f9ecc0 0fb600          movzx   eax,byte ptr [eax]
11f9ecc3 89f1            mov     ecx,esi
11f9ecc5 6a00            push    0
11f9ecc7 57              push    edi
11f9ecc8 50              push    eax
11f9ecc9 ff32            push    dword ptr [edx]
11f9eccb e830aefafd      call    xul!nsDisplayListBuilder::nsDisplayListBuilder (0ff49b00)
11f9ecd0 8d86ac100000    lea     eax,[esi+10ACh]
11f9ecd6 c786a810000054bbfc13 mov dword ptr [esi+10A8h],offset xul!nsDisplayList::`vftable' (13fcbb54)

esi = eax - 10AC and the vftable would point to eax - 4
So, could that be what we are trying to call?

[1] https://searchfox.org/mozilla-central/rev/029d9d2477ef0232bb08db94696badddec4d5bda/layout/base/nsLayoutUtils.cpp#3770

Another data point: central, beta, and release are all using clang-9, on m-c we've cherry-picked a patch to help with some non-deterministic build outputs (bug 1597901).

Flags: needinfo?(erahm)

So far so good on the rc build...

Keywords: topcrash

(In reply to Julien Cristau [:jcristau] from comment #15)

So far so good on the rc build...

Do we know if anyone has attempted to reproduce this crash in the RC build? Matt is out on year end PTO so we shouldn't expect to hear from him until Monday at the earliest.

I can look if someone wants to send me a minidump.

I spent some time looking at the stack values from a minidump. I didn't really get anywhere. I can tell that we executed the prologue of nsLayoutUtils::AddExtraBackgroundItems, at least up to sub esp,78h. It's not clear whether we finished executing it (and reted to the wrong address) or whether the failure happened during the function. I don't see any obvious things like off-by-ones when popping saved registers that would lead to the current register state. It certainly seems like it could be a cpu-related problem, but I can't pinpoint exactly what.

(In reply to Thomas Elin [:relaas] from comment #16)

Do we know if anyone has attempted to reproduce this crash in the RC build?

I assume not; this only affects certain CPUs from roughly 2007.

Thanks to the holiday season I happen to have a C2D machine under my desk (left by a relative for a Windows update). I'll try and see if I can reproduce the issue locally. That being said we'd be surprisingly unlucky if we managed to find a not-yet-known defect in CPUs over 10 years old.

I don't have a C2D machine unfortunately, so thanks for offering to investigate further!

Flags: needinfo?(matt.woodrow)

I couldn't reproduce the issue on the C2D machine I have available no matter how hard I tried and then discovered why: it's running Windows 10. All these crashes are coming from Windows 7 which reinforces my belief that this is not a processor bug but rather a Windows kernel bug related to the meltdown/spectre mitigations that are required by these older processors.

I'm not familiar with this code but I couldn't find any obvious system calls in or around the call to AddExtraBackgroundItems(), could there be some? Any user/kernel transaction would be interesting. Windows 7 apparently does even font rendering in the kernel so that might be a potential culprit.

David do we have some contacts at Microsoft we could talk to to debug this? I'll try and see if I can setup Windows 7 on the machine I have in the hope of being able to reproduce the issue.

Flags: needinfo?(dmajor)

I dug out a copy of 32-bit Windows 7 and now I'm burning it on a DVD+RW to install it on a spare hard drive. Party like it's 2010.

Andrei, I don't suppose SV has a machine with one of those:

(In reply to Marcia Knous [:marcia - needinfo? me] from comment #7)

Strong correlation to one CPU:
(84.55% in signature vs 25.15% overall) CPU Info = GenuineIntel family 6 model 23 stepping 10 [84.55% vs 32.87% if cpu_arch = x86]

?

Flags: needinfo?(andrei.vaida)
Flags: needinfo?(mh+mozilla)

I successfully reproduced this crash on a C2D E6550 (family 6 model 15 stepping 11) running the 32-bit version of Firefox 72 beta 11 on a fully patched 32-bit Windows 7 installation. My testing points to this being most likely a Windows kernel bug so I suggest moving the crash to the "External software affecting Firefox" module (and informing Microsoft). Here's the nitty gritty details:

On this setup I can reproduce the crash by browsing to: https://xentry.daimler.com/home/

Scrolling the page or hovering the mouse over the elements causes the crash to happen almost immediately.

I then disabled Meltdown mitigations via these commands (run from an elevated prompt):

reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverride /t REG_DWORD /d 3 /f

reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverrideMask /t REG_DWORD /d 3 /f

I rebooted the machine for the changes to be applied then I used SpecuCheck to confirm that the mitigations were correctly disabled.

With the Meltdown mitigations disabled I couldn't reproduce the crash, no matter how hard I tried.

Great work Gabriele! stpeter, would you be the right person to hit up MS?

Flags: needinfo?(stpeter)
Flags: needinfo?(dmajor)
Flags: needinfo?(andrei.vaida)
Component: Web Painting → Other
Product: Core → External Software Affecting Firefox
Version: 72 Branch → unspecified

Note that this month's security patches are the last updates Microsoft is committing to ship to Windows 7, so it's not clear what MS will be willing to do here even if they do acknowledge that it's their bug.

Discussed during triage. We will continue tracking for 72 and see how this plays out.

I've reached out to our friends at Microsoft via our coordination list, cc'ing Randell.

Flags: needinfo?(stpeter)

Gabriele and dmajor have provided the info requested by microsoft; dmajor's analysis may be useful to them. Thanks peter

You need to log in before you can comment on or make changes to this bug.