Crash in [@ RetainedDisplayListBuilder::AttemptPartialUpdate]
Categories
(External Software Affecting Firefox :: Other, defect, P5)
Tracking
(firefox72+ wontfix, firefox73- wontfix, firefox74- wontfix)
People
(Reporter: philipp, Unassigned)
Details
(Keywords: crash, regression)
Crash Data
[Tracking Requested - why for this release]:
This bug is for crash report bp-7c5cd9a0-d09d-4272-b1d7-ba1aa0191227.
Top 10 frames of crashing thread:
0 @0x35d158
1 xul.dll RetainedDisplayListBuilder::AttemptPartialUpdate layout/painting/RetainedDisplayListBuilder.cpp:1481
2 xul.dll nsLayoutUtils::PaintFrame layout/base/nsLayoutUtils.cpp:3956
3 xul.dll mozilla::PresShell::Paint layout/base/PresShell.cpp:6031
4 xul.dll void nsViewManager::ProcessPendingUpdatesPaint view/nsViewManager.cpp:461
5 xul.dll void nsViewManager::ProcessPendingUpdatesForView view/nsViewManager.cpp:396
6 xul.dll nsViewManager::ProcessPendingUpdates view/nsViewManager.cpp:1019
7 xul.dll void nsRefreshDriver::Tick layout/base/nsRefreshDriver.cpp:2176
8 xul.dll void mozilla::RefreshDriverTimer::TickRefreshDrivers layout/base/nsRefreshDriver.cpp:350
9 xul.dll void mozilla::RefreshDriverTimer::Tick layout/base/nsRefreshDriver.cpp:367
this crash signature is spiking up in 72.0b11 from users of the 32bit version of the browser on windows 7. this is currently accounting for 66% of content crashes on the latest beta (but hitting individual users repeatedly).
this far it's only occurring on 3 cpu models - Cpu info facet:
GenuineIntel family 6 model 23 stepping 10 154 88.00 %
GenuineIntel family 6 model 23 stepping 6 12 6.86 %
GenuineIntel family 6 model 15 stepping 11 9 5.14 %
Comment 1•6 years ago
|
||
We may want to call this a blocker, certainly worth fixing in RC if possible
Comment 2•6 years ago
|
||
Here's the patches that went into beta 11: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_72_0b10_RELEASE&tochange=FIREFOX_72_0b11_RELEASE
Comment 3•6 years ago
|
||
Some crashing urls:
https://sts.karnataka.gov.in/SATS/#
https://xat.com/Trade
http://www.pcpndt.karnataka.gov.in/
Looking at the pushlog, I don't see anything Graphics or DisplayList related.
Maybe Bug 1605867? Or bug 1603714?
Bob, Aaron, what do you think?
Comment 4•6 years ago
|
||
Randell, if you're around, can you take a look to see if you can identify what may have regressed this?
Comment 6•6 years ago
|
||
(In reply to Liz Henry (:lizzard) from comment #3)
...
Maybe Bug 1605867? Or bug 1603714?
Bob, Aaron, what do you think?
Anything is possible with sandboxing, but I'd be very surprised if that were down to my patch, particularly with the win7, 32-bit, certain CPU model details.
Updated•6 years ago
|
| Reporter | ||
Updated•6 years ago
|
Comment 7•6 years ago
|
||
This has escalated in volume in 3 days and now has 18728 crashes/3351 installs. Marking as blocking.
Strong correlation to one CPU:
(84.55% in signature vs 25.15% overall) CPU Info = GenuineIntel family 6 model 23 stepping 10 [84.55% vs 32.87% if cpu_arch = x86]
(100.0% in signature vs 49.49% overall) moz_crash_reason = null
Updated•6 years ago
|
Comment 8•6 years ago
|
||
Glandium: This appears to be some sort of compiler/compiled-code change triggering a CPU bug, from the correlations and the lack of any obvious changes at the crash location. Perhaps something like the code's alignment to the page, or a branches' alignment, or a structure size change.
dmajor: I realize this isn't your responsibility, but perhaps you have some ideas what could be going wrong here? (note dmajor is on PTO with requests off, so not NI'ing)
lizzard: there's no obvious or even non-obvious cause from the changelogs; no changes anywhere near the crash location in a long time. Possibly just generating a b12 will make it go away - worth a try.
Note: glandium is out until Jan 7... Not sure who might be able to cover on this sort of thing with him and dmajor out. NIing erahm and sylveste for possible help here
Comment 9•6 years ago
|
||
(In reply to Randell Jesup [:jesup] (PTO thru 1/1, needinfo me) from comment #8)
lizzard: there's no obvious or even non-obvious cause from the changelogs; no changes anywhere near the crash location in a long time. Possibly just generating a b12 will make it go away - worth a try.
72 rc1 is building today and will be on the beta channel tomorrow, so fingers crossed for that.
Comment 10•6 years ago
|
||
I hope a rebuild will fix that tomorrow. If not, I will bother more people (I already pinged a few of them)
Comment 11•6 years ago
|
||
Just stumbled into this bug: I had a look at the crashes and they're all coming from Penryn or Merom processors, i.e. the Core 2 Duo family. They're old and more importantly they are subject to software-based mitigations for Spectre/Meltdown in the Windows kernel. It might be worth keeping this in mind since it would happen almost only on those processors, all other non-ancient ones have microcode-based mitigations.
Looking into the crashes themselves I see this pattern coming up very often (but not always):
- The crash reason is
EXCEPTION_ACCESS_VIOLATION_EXEC, we're trying to execute something that's not executable - eax contains something like
0x00<num>c - The crashing address is
0x00<num>8
So we're trying to make an indirect call to something that's pointed by eax but before trying to run it we subtract 4 from the address.
Does this ring any bells? Is it likely to find an indirect call at the bottom of those stacks?
It's late now over here but tomorrow I'll try having a look at the stack data directly.
Comment 12•6 years ago
|
||
Just another detail I had missed: the crashing address is very close to what's in esp so it looks like we're trying to jump into the stack!? More precisely the crashing address is 244 bytes ahead of the stack pointer.
Comment 13•6 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #12)
Just another detail I had missed: the crashing address is very close to what's in esp so it looks like we're trying to jump into the stack!? More precisely the crashing address is 244 bytes ahead of the stack pointer.
I was just scanning through the code and saw you mentioned
- eax contains something like 0x00<num>c
- The crashing address is 0x00<num>8
At the end of xul!nsLayoutUtils::AddExtraBackgroundItems (which is what we appear to have returned from with the messed up instruction pointer) there is some code for the construction of the Maybe<TemporaryDisplayListBuilder> I think from [1]
xul!mozilla::Maybe<TemporaryDisplayListBuilder>::emplace<nsIFrame *&,nsDisplayListBuilderMode &,const bool &>:
11f9ecb0 55 push ebp
11f9ecb1 89e5 mov ebp,esp
11f9ecb3 57 push edi
11f9ecb4 56 push esi
11f9ecb5 8b4508 mov eax,dword ptr [ebp+8]
11f9ecb8 89ce mov esi,ecx
11f9ecba 8b4d0c mov ecx,dword ptr [ebp+0Ch]
11f9ecbd 0fb639 movzx edi,byte ptr [ecx]
11f9ecc0 0fb600 movzx eax,byte ptr [eax]
11f9ecc3 89f1 mov ecx,esi
11f9ecc5 6a00 push 0
11f9ecc7 57 push edi
11f9ecc8 50 push eax
11f9ecc9 ff32 push dword ptr [edx]
11f9eccb e830aefafd call xul!nsDisplayListBuilder::nsDisplayListBuilder (0ff49b00)
11f9ecd0 8d86ac100000 lea eax,[esi+10ACh]
11f9ecd6 c786a810000054bbfc13 mov dword ptr [esi+10A8h],offset xul!nsDisplayList::`vftable' (13fcbb54)
esi = eax - 10AC and the vftable would point to eax - 4
So, could that be what we are trying to call?
Comment 14•6 years ago
|
||
Another data point: central, beta, and release are all using clang-9, on m-c we've cherry-picked a patch to help with some non-deterministic build outputs (bug 1597901).
Comment 16•6 years ago
|
||
(In reply to Julien Cristau [:jcristau] from comment #15)
So far so good on the rc build...
Do we know if anyone has attempted to reproduce this crash in the RC build? Matt is out on year end PTO so we shouldn't expect to hear from him until Monday at the earliest.
Comment 17•6 years ago
|
||
I can look if someone wants to send me a minidump.
Comment 18•6 years ago
|
||
I spent some time looking at the stack values from a minidump. I didn't really get anywhere. I can tell that we executed the prologue of nsLayoutUtils::AddExtraBackgroundItems, at least up to sub esp,78h. It's not clear whether we finished executing it (and reted to the wrong address) or whether the failure happened during the function. I don't see any obvious things like off-by-ones when popping saved registers that would lead to the current register state. It certainly seems like it could be a cpu-related problem, but I can't pinpoint exactly what.
Comment 19•6 years ago
|
||
(In reply to Thomas Elin [:relaas] from comment #16)
Do we know if anyone has attempted to reproduce this crash in the RC build?
I assume not; this only affects certain CPUs from roughly 2007.
Comment 20•6 years ago
|
||
Thanks to the holiday season I happen to have a C2D machine under my desk (left by a relative for a Windows update). I'll try and see if I can reproduce the issue locally. That being said we'd be surprisingly unlucky if we managed to find a not-yet-known defect in CPUs over 10 years old.
Comment 21•6 years ago
|
||
I don't have a C2D machine unfortunately, so thanks for offering to investigate further!
Comment 22•6 years ago
|
||
I couldn't reproduce the issue on the C2D machine I have available no matter how hard I tried and then discovered why: it's running Windows 10. All these crashes are coming from Windows 7 which reinforces my belief that this is not a processor bug but rather a Windows kernel bug related to the meltdown/spectre mitigations that are required by these older processors.
I'm not familiar with this code but I couldn't find any obvious system calls in or around the call to AddExtraBackgroundItems(), could there be some? Any user/kernel transaction would be interesting. Windows 7 apparently does even font rendering in the kernel so that might be a potential culprit.
David do we have some contacts at Microsoft we could talk to to debug this? I'll try and see if I can setup Windows 7 on the machine I have in the hope of being able to reproduce the issue.
Comment 23•6 years ago
|
||
I dug out a copy of 32-bit Windows 7 and now I'm burning it on a DVD+RW to install it on a spare hard drive. Party like it's 2010.
Comment 24•6 years ago
|
||
Andrei, I don't suppose SV has a machine with one of those:
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #7)
Strong correlation to one CPU:
(84.55% in signature vs 25.15% overall) CPU Info = GenuineIntel family 6 model 23 stepping 10 [84.55% vs 32.87% if cpu_arch = x86]
?
Updated•6 years ago
|
Comment 25•6 years ago
|
||
I successfully reproduced this crash on a C2D E6550 (family 6 model 15 stepping 11) running the 32-bit version of Firefox 72 beta 11 on a fully patched 32-bit Windows 7 installation. My testing points to this being most likely a Windows kernel bug so I suggest moving the crash to the "External software affecting Firefox" module (and informing Microsoft). Here's the nitty gritty details:
On this setup I can reproduce the crash by browsing to: https://xentry.daimler.com/home/
Scrolling the page or hovering the mouse over the elements causes the crash to happen almost immediately.
I then disabled Meltdown mitigations via these commands (run from an elevated prompt):
reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverride /t REG_DWORD /d 3 /f
reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverrideMask /t REG_DWORD /d 3 /f
I rebooted the machine for the changes to be applied then I used SpecuCheck to confirm that the mitigations were correctly disabled.
With the Meltdown mitigations disabled I couldn't reproduce the crash, no matter how hard I tried.
Comment 26•6 years ago
|
||
Great work Gabriele! stpeter, would you be the right person to hit up MS?
Updated•6 years ago
|
Comment 27•6 years ago
|
||
Note that this month's security patches are the last updates Microsoft is committing to ship to Windows 7, so it's not clear what MS will be willing to do here even if they do acknowledge that it's their bug.
Comment 28•6 years ago
|
||
Discussed during triage. We will continue tracking for 72 and see how this plays out.
Updated•6 years ago
|
Comment 29•6 years ago
|
||
I've reached out to our friends at Microsoft via our coordination list, cc'ing Randell.
Comment 30•6 years ago
|
||
Gabriele and dmajor have provided the info requested by microsoft; dmajor's analysis may be useful to them. Thanks peter
Updated•6 years ago
|
Comment 32•6 years ago
|
||
There's nothing actionable for us here, this is Microsoft's bug. Maybe Gabriele or Dmajor know is there's any outcome.
Updated•6 years ago
|
Comment 33•6 years ago
|
||
I haven't heard anything from Microsoft. There's not much we can do here and luckily it hasn't manifested itself anymore so maybe we should just close it.
Updated•5 years ago
|
Comment 34•3 years ago
|
||
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit auto_nag documentation.
Comment 35•3 years ago
|
||
The original problem here is gone, we still see crash volume because the signatures are fairly generic. Closing this.
Description
•