1606138 - Crash in [@ RetainedDisplayListBuilder::AttemptPartialUpdate]

Reporter

Description

•

6 years ago

[Tracking Requested - why for this release]:

This bug is for crash report bp-7c5cd9a0-d09d-4272-b1d7-ba1aa0191227.

Top 10 frames of crashing thread:

0  @0x35d158 
1 xul.dll RetainedDisplayListBuilder::AttemptPartialUpdate layout/painting/RetainedDisplayListBuilder.cpp:1481
2 xul.dll nsLayoutUtils::PaintFrame layout/base/nsLayoutUtils.cpp:3956
3 xul.dll mozilla::PresShell::Paint layout/base/PresShell.cpp:6031
4 xul.dll void nsViewManager::ProcessPendingUpdatesPaint view/nsViewManager.cpp:461
5 xul.dll void nsViewManager::ProcessPendingUpdatesForView view/nsViewManager.cpp:396
6 xul.dll nsViewManager::ProcessPendingUpdates view/nsViewManager.cpp:1019
7 xul.dll void nsRefreshDriver::Tick layout/base/nsRefreshDriver.cpp:2176
8 xul.dll void mozilla::RefreshDriverTimer::TickRefreshDrivers layout/base/nsRefreshDriver.cpp:350
9 xul.dll void mozilla::RefreshDriverTimer::Tick layout/base/nsRefreshDriver.cpp:367

this crash signature is spiking up in 72.0b11 from users of the 32bit version of the browser on windows 7. this is currently accounting for 66% of content crashes on the latest beta (but hitting individual users repeatedly).

this far it's only occurring on 3 cpu models - Cpu info facet:
GenuineIntel family 6 model 23 stepping 10 154 88.00 %
GenuineIntel family 6 model 23 stepping 6 12 6.86 %
GenuineIntel family 6 model 15 stepping 11 9 5.14 %

Liz Henry (:lizzard) (privacy/fingerprinting team)

Comment 1

•

6 years ago

We may want to call this a blocker, certainly worth fixing in RC if possible

tracking-firefox72: ? → +

Liz Henry (:lizzard) (privacy/fingerprinting team)

Comment 2

•

6 years ago

Here's the patches that went into beta 11: https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_72_0b10_RELEASE&tochange=FIREFOX_72_0b11_RELEASE

Liz Henry (:lizzard) (privacy/fingerprinting team)

Comment 3

•

6 years ago

Some crashing urls:
https://sts.karnataka.gov.in/SATS/#
https://xat.com/Trade
http://www.pcpndt.karnataka.gov.in/

Looking at the pushlog, I don't see anything Graphics or DisplayList related.
Maybe Bug 1605867? Or bug 1603714?
Bob, Aaron, what do you think?

Flags: needinfo?(matt.woodrow)

Flags: needinfo?(bobowencode)

Flags: needinfo?(aklotz)

Liz Henry (:lizzard) (privacy/fingerprinting team)

Comment 4

•

6 years ago

Randell, if you're around, can you take a look to see if you can identify what may have regressed this?

Flags: needinfo?(rjesup)

(No longer employed by Mozilla) Aaron Klotz

Comment 5

•

6 years ago

My patch was backed out for that build.

Flags: needinfo?(aklotz)

Bob Owen (:bobowen)

Comment 6

•

6 years ago

(In reply to Liz Henry (:lizzard) from comment #3)
...

Maybe Bug 1605867? Or bug 1603714?
Bob, Aaron, what do you think?

Anything is possible with sandboxing, but I'd be very surprised if that were down to my patch, particularly with the win7, 32-bit, certain CPU model details.

Flags: needinfo?(bobowencode)

Marcia Knous [:marcia]

Updated

•

6 years ago

Crash Signature: [@ RetainedDisplayListBuilder::AttemptPartialUpdate] → [@ RetainedDisplayListBuilder::AttemptPartialUpdate] [@ nsLayoutUtils::PaintFrame ]

[:philipp]

Reporter

Updated

•

6 years ago

Crash Signature: [@ RetainedDisplayListBuilder::AttemptPartialUpdate] [@ nsLayoutUtils::PaintFrame ] → [@ RetainedDisplayListBuilder::AttemptPartialUpdate] [@ nsLayoutUtils::PaintFrame ] [@ mozilla::LauncherMain] [@ trunc | RetainedDisplayListBuilder::AttemptPartialUpdate] [@ SdbInitDatabase] [@ firefox.exe@0x2bff]

Marcia Knous [:marcia]

Comment 7

•

6 years ago

This has escalated in volume in 3 days and now has 18728 crashes/3351 installs. Marking as blocking.

Strong correlation to one CPU:
(84.55% in signature vs 25.15% overall) CPU Info = GenuineIntel family 6 model 23 stepping 10 [84.55% vs 32.87% if cpu_arch = x86]
(100.0% in signature vs 49.49% overall) moz_crash_reason = null

tracking-firefox72: + → blocking

BugBot [:suhaib / :marco/ :calixte]

Updated

•

6 years ago

Severity: normal → major

Randell Jesup [:jesup] (needinfo me)

Comment 8

•

6 years ago

Glandium: This appears to be some sort of compiler/compiled-code change triggering a CPU bug, from the correlations and the lack of any obvious changes at the crash location. Perhaps something like the code's alignment to the page, or a branches' alignment, or a structure size change.
dmajor: I realize this isn't your responsibility, but perhaps you have some ideas what could be going wrong here? (note dmajor is on PTO with requests off, so not NI'ing)

lizzard: there's no obvious or even non-obvious cause from the changelogs; no changes anywhere near the crash location in a long time. Possibly just generating a b12 will make it go away - worth a try.
Note: glandium is out until Jan 7... Not sure who might be able to cover on this sort of thing with him and dmajor out. NIing erahm and sylveste for possible help here

Flags: needinfo?(sledru)

Flags: needinfo?(rjesup)

Flags: needinfo?(mh+mozilla)

Flags: needinfo?(erahm)

Julien Cristau [:jcristau]

Comment 9

•

6 years ago

(In reply to Randell Jesup [:jesup] (PTO thru 1/1, needinfo me) from comment #8)

lizzard: there's no obvious or even non-obvious cause from the changelogs; no changes anywhere near the crash location in a long time. Possibly just generating a b12 will make it go away - worth a try.

72 rc1 is building today and will be on the beta channel tomorrow, so fingers crossed for that.

Sylvestre Ledru [:Sylvestre]

Comment 10

•

6 years ago

I hope a rebuild will fix that tomorrow. If not, I will bother more people (I already pinged a few of them)

Flags: needinfo?(sledru)

Gabriele Svelto [:gsvelto]

Comment 11

•

6 years ago

Just stumbled into this bug: I had a look at the crashes and they're all coming from Penryn or Merom processors, i.e. the Core 2 Duo family. They're old and more importantly they are subject to software-based mitigations for Spectre/Meltdown in the Windows kernel. It might be worth keeping this in mind since it would happen almost only on those processors, all other non-ancient ones have microcode-based mitigations.

Looking into the crashes themselves I see this pattern coming up very often (but not always):

The crash reason is EXCEPTION_ACCESS_VIOLATION_EXEC, we're trying to execute something that's not executable
eax contains something like 0x00<num>c
The crashing address is 0x00<num>8

So we're trying to make an indirect call to something that's pointed by eax but before trying to run it we subtract 4 from the address.

Does this ring any bells? Is it likely to find an indirect call at the bottom of those stacks?

It's late now over here but tomorrow I'll try having a look at the stack data directly.

Gabriele Svelto [:gsvelto]

Comment 12

•

6 years ago

Just another detail I had missed: the crashing address is very close to what's in esp so it looks like we're trying to jump into the stack!? More precisely the crashing address is 244 bytes ahead of the stack pointer.

Bob Owen (:bobowen)

Comment 13

•

6 years ago

(In reply to Gabriele Svelto [:gsvelto] from comment #12)

Just another detail I had missed: the crashing address is very close to what's in esp so it looks like we're trying to jump into the stack!? More precisely the crashing address is 244 bytes ahead of the stack pointer.

I was just scanning through the code and saw you mentioned

eax contains something like 0x00<num>c
The crashing address is 0x00<num>8

At the end of xul!nsLayoutUtils::AddExtraBackgroundItems (which is what we appear to have returned from with the messed up instruction pointer) there is some code for the construction of the Maybe<TemporaryDisplayListBuilder> I think from [1]

xul!mozilla::Maybe<TemporaryDisplayListBuilder>::emplace<nsIFrame *&,nsDisplayListBuilderMode &,const bool &>:
11f9ecb0 55              push    ebp
11f9ecb1 89e5            mov     ebp,esp
11f9ecb3 57              push    edi
11f9ecb4 56              push    esi
11f9ecb5 8b4508          mov     eax,dword ptr [ebp+8]
11f9ecb8 89ce            mov     esi,ecx
11f9ecba 8b4d0c          mov     ecx,dword ptr [ebp+0Ch]
11f9ecbd 0fb639          movzx   edi,byte ptr [ecx]
11f9ecc0 0fb600          movzx   eax,byte ptr [eax]
11f9ecc3 89f1            mov     ecx,esi
11f9ecc5 6a00            push    0
11f9ecc7 57              push    edi
11f9ecc8 50              push    eax
11f9ecc9 ff32            push    dword ptr [edx]
11f9eccb e830aefafd      call    xul!nsDisplayListBuilder::nsDisplayListBuilder (0ff49b00)
11f9ecd0 8d86ac100000    lea     eax,[esi+10ACh]
11f9ecd6 c786a810000054bbfc13 mov dword ptr [esi+10A8h],offset xul!nsDisplayList::`vftable' (13fcbb54)

esi = eax - 10AC and the vftable would point to eax - 4
So, could that be what we are trying to call?

[1] https://searchfox.org/mozilla-central/rev/029d9d2477ef0232bb08db94696badddec4d5bda/layout/base/nsLayoutUtils.cpp#3770

Eric Rahm [:erahm]

Comment 14

•

6 years ago

Another data point: central, beta, and release are all using clang-9, on m-c we've cherry-picked a patch to help with some non-deterministic build outputs (bug 1597901).

Flags: needinfo?(erahm)

Julien Cristau [:jcristau]

Comment 15

•

6 years ago

So far so good on the rc build...

Keywords: topcrash

Thomas Elin [:relaas]

Comment 16

•

6 years ago

(In reply to Julien Cristau [:jcristau] from comment #15)

So far so good on the rc build...

Do we know if anyone has attempted to reproduce this crash in the RC build? Matt is out on year end PTO so we shouldn't expect to hear from him until Monday at the earliest.

(Away)

Comment 17

•

6 years ago

I can look if someone wants to send me a minidump.

(Away)

Comment 18

•

6 years ago

I spent some time looking at the stack values from a minidump. I didn't really get anywhere. I can tell that we executed the prologue of nsLayoutUtils::AddExtraBackgroundItems, at least up to sub esp,78h. It's not clear whether we finished executing it (and reted to the wrong address) or whether the failure happened during the function. I don't see any obvious things like off-by-ones when popping saved registers that would lead to the current register state. It certainly seems like it could be a cpu-related problem, but I can't pinpoint exactly what.

(Away)

Comment 19

•

6 years ago

(In reply to Thomas Elin [:relaas] from comment #16)

Do we know if anyone has attempted to reproduce this crash in the RC build?

I assume not; this only affects certain CPUs from roughly 2007.

Gabriele Svelto [:gsvelto]

Comment 20

•

6 years ago

Thanks to the holiday season I happen to have a C2D machine under my desk (left by a relative for a Windows update). I'll try and see if I can reproduce the issue locally. That being said we'd be surprisingly unlucky if we managed to find a not-yet-known defect in CPUs over 10 years old.

Matt Woodrow (:mattwoodrow)

Comment 21

•

6 years ago

I don't have a C2D machine unfortunately, so thanks for offering to investigate further!

Flags: needinfo?(matt.woodrow)

Gabriele Svelto [:gsvelto]

Comment 22

•

6 years ago

I couldn't reproduce the issue on the C2D machine I have available no matter how hard I tried and then discovered why: it's running Windows 10. All these crashes are coming from Windows 7 which reinforces my belief that this is not a processor bug but rather a Windows kernel bug related to the meltdown/spectre mitigations that are required by these older processors.

I'm not familiar with this code but I couldn't find any obvious system calls in or around the call to AddExtraBackgroundItems(), could there be some? Any user/kernel transaction would be interesting. Windows 7 apparently does even font rendering in the kernel so that might be a potential culprit.

David do we have some contacts at Microsoft we could talk to to debug this? I'll try and see if I can setup Windows 7 on the machine I have in the hope of being able to reproduce the issue.

Flags: needinfo?(dmajor)

Gabriele Svelto [:gsvelto]

Comment 23

•

6 years ago

I dug out a copy of 32-bit Windows 7 and now I'm burning it on a DVD+RW to install it on a spare hard drive. Party like it's 2010.

Julien Cristau [:jcristau]

Comment 24

•

6 years ago

Andrei, I don't suppose SV has a machine with one of those:

(In reply to Marcia Knous [:marcia - needinfo? me] from comment #7)

Strong correlation to one CPU:
(84.55% in signature vs 25.15% overall) CPU Info = GenuineIntel family 6 model 23 stepping 10 [84.55% vs 32.87% if cpu_arch = x86]

?

Flags: needinfo?(andrei.vaida)

Mike Hommey [:glandium]

Updated

•

6 years ago

Flags: needinfo?(mh+mozilla)

Gabriele Svelto [:gsvelto]

Comment 25

•

6 years ago

I successfully reproduced this crash on a C2D E6550 (family 6 model 15 stepping 11) running the 32-bit version of Firefox 72 beta 11 on a fully patched 32-bit Windows 7 installation. My testing points to this being most likely a Windows kernel bug so I suggest moving the crash to the "External software affecting Firefox" module (and informing Microsoft). Here's the nitty gritty details:

On this setup I can reproduce the crash by browsing to: https://xentry.daimler.com/home/

Scrolling the page or hovering the mouse over the elements causes the crash to happen almost immediately.

I then disabled Meltdown mitigations via these commands (run from an elevated prompt):

reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverride /t REG_DWORD /d 3 /f

reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverrideMask /t REG_DWORD /d 3 /f

I rebooted the machine for the changes to be applied then I used SpecuCheck to confirm that the mitigations were correctly disabled.

With the Meltdown mitigations disabled I couldn't reproduce the crash, no matter how hard I tried.

Randell Jesup [:jesup] (needinfo me)

Comment 26

•

6 years ago

Great work Gabriele! stpeter, would you be the right person to hit up MS?

Flags: needinfo?(stpeter)

Flags: needinfo?(dmajor)

Flags: needinfo?(andrei.vaida)

Randell Jesup [:jesup] (needinfo me)

Updated

•

6 years ago

Component: Web Painting → Other

Product: Core → External Software Affecting Firefox

Version: 72 Branch → unspecified

Ryan VanderMeulen (PTO, back 6-April)

Comment 27

•

6 years ago

Note that this month's security patches are the last updates Microsoft is committing to ship to Windows 7, so it's not clear what MS will be willing to do here even if they do acknowledge that it's their bug.

Marcia Knous [:marcia]

Comment 28

•

6 years ago

Discussed during triage. We will continue tracking for 72 and see how this plays out.

Marcia Knous [:marcia]

Updated

•

6 years ago

tracking-firefox72: blocking → +

Peter Saint-Andre [:stpeter]

Comment 29

•

6 years ago

I've reached out to our friends at Microsoft via our coordination list, cc'ing Randell.

Flags: needinfo?(stpeter)

Randell Jesup [:jesup] (needinfo me)

Comment 30

•

6 years ago

Gabriele and dmajor have provided the info requested by microsoft; dmajor's analysis may be useful to them. Thanks peter

Ryan VanderMeulen (PTO, back 6-April)

Updated

•

6 years ago

status-firefox72: affected → wontfix

status-firefox73: --- → fix-optional

status-firefox74: --- → fix-optional

tracking-firefox73: --- → -

tracking-firefox74: --- → -

Andrew Overholt [:overholt]

Comment 31

•

6 years ago

Is this a P3/Backlog kind of bug?

Flags: needinfo?(gpascutto)

Gian-Carlo Pascutto [:gcp]

Comment 32

•

6 years ago

There's nothing actionable for us here, this is Microsoft's bug. Maybe Gabriele or Dmajor know is there's any outcome.

Flags: needinfo?(gpascutto)

Gian-Carlo Pascutto [:gcp]

Updated

•

6 years ago

Priority: -- → P5

Gabriele Svelto [:gsvelto]

Comment 33

•

6 years ago

I haven't heard anything from Microsoft. There's not much we can do here and luckily it hasn't manifested itself anymore so maybe we should just close it.

Ryan VanderMeulen (PTO, back 6-April)

Updated

•

5 years ago

status-firefox73: fix-optional → wontfix

status-firefox74: fix-optional → wontfix

BugBot [:suhaib / :marco/ :calixte]

Comment 34

•

3 years ago

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

Gabriele Svelto [:gsvelto]

Comment 35

•

3 years ago

The original problem here is gone, we still see crash volume because the signatures are fairly generic. Closing this.

Status: NEW → RESOLVED

Closed: 3 years ago

Resolution: --- → WONTFIX