1294963 - Crash in mozilla::DisplayListClipState::AutoSaveRestore::AutoSaveRestore

Reporter

Description

•

9 years ago

[Tracking Requested - why for this release]: New crash that emerged in B3. This bug was filed from the Socorro interface and is report bp-d7c7980c-e494-4805-8111-061802160813. ============================================================= http://bit.ly/2b5Icy1. Reports all seem to have AMD graphic cards. Possible regression Bug 1281759?

Flags: needinfo?(jdemooij)

Ethan Lin[:ethlin]

Updated

•

9 years ago

Whiteboard: [gfx-noted]

Sylvestre Ledru [:Sylvestre]

Comment 1

•

9 years ago

[Tracking Requested - why for this release]: I guess you meant 49 ni liz if she wants to backout bug 1281759

status-firefox48: affected → ---

status-firefox49: --- → affected

tracking-firefox48: ? → ---

tracking-firefox49: --- → ?

Flags: needinfo?(lhenry)

Liz Henry (:lizzard) (relman/hg->git project)

Comment 2

•

9 years ago

Seems like a good idea. I asked jandem in bug 1281759.

tracking-firefox49: ? → +

Flags: needinfo?(lhenry)

Liz Henry (:lizzard) (relman/hg->git project)

Comment 3

•

9 years ago

#1 top crash on beta 3, with 3864 crashes in the last week. That's really bad and yes let's back out the patch that caused it!

Liz Henry (:lizzard) (relman/hg->git project)

Updated

•

9 years ago

tracking-firefox49: + → blocking

Milan Sreckovic [:milan] (needinfo for best results)

Updated

•

9 years ago

Blocks: amdbug

Component: Graphics → JavaScript Engine: JIT

Milan Sreckovic [:milan] (needinfo for best results)

Updated

•

9 years ago

Keywords: regression

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Comment 4

•

9 years ago

Why do you think this is a regresion from bug 1281759 rather than just yet another instance of bug 772330?

Flags: needinfo?(lhenry)

Jan de Mooij [:jandem]

Comment 5

•

9 years ago

I don't see how bug 1281759 is related. All it does is (1) set a flag if we're using CPU family X model Y (2) if that flag is set, insert some NOP instructions. It's pretty straight-forward. So I agree with comment 4 that this is likely another instance of bug 772330 (and bug 1281759 fixed another instance of this in JIT code). The one here seems very frequent though. If there's no fix and no other backout candidate we can try to back out bug 1281759 I guess, but I don't see how it's related and it won't tell us much because the bug may not show up in all builds.

Flags: needinfo?(jdemooij)

Andrew McCreight [:mccr8]

Comment 6

•

9 years ago

Did we see a corresponding decrease in the JIT-y variants of the AMD bug in that beta? Maybe you fixed those crashes but they all just ended up here instead?

Liz Henry (:lizzard) (relman/hg->git project)

Comment 7

•

9 years ago

Oh, ok. I think I see the issue, I was a bit hasty. So, if this was the AMD build issue, then our beta 4 build (without backing out) should not see this high crash rate. We should wait and see what things look like with beta 4, which should release a bit later today.

Flags: needinfo?(lhenry)

[:philipp]

Comment 8

•

9 years ago

(In reply to Andrew McCreight [:mccr8] from comment #6) > Did we see a corresponding decrease in the JIT-y variants of the AMD bug in > that beta? Maybe you fixed those crashes but they all just ended up here > instead? i don't think so: 49.0b2 had 500 crashes from the infamous "AuthenticAMD family 20" cpu family, whereas 49.0b3 has registered 4600 crashes (~4000 just for this bug's signature) from "AuthenticAMD family 20" cpus so far. so those [@ mozilla::DisplayListClipState::AutoSaveRestore::AutoSaveRestore] crashes seemed to occur in addition to the "normal" crashing level of that cpu family...

Milan Sreckovic [:milan] (needinfo for best results)

Comment 9

•

9 years ago

What are the code changes applied to beta between beta 2 and beta 3?

[:philipp]

Comment 10

•

9 years ago

https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_49_0b2_RELEASE&tochange=FIREFOX_49_0b3_RELEASE

Milan Sreckovic [:milan] (needinfo for best results)

Comment 11

•

9 years ago

Hopefully not bug 1287075, but right now my money is on that. I say hopefully, because that was a backout of a part of bug 12732250 that landed in May. Botond, maybe we need to tweak that backout?

Flags: needinfo?(botond)

Markus Stange [:mstange]

Comment 12

•

9 years ago

(In reply to Milan Sreckovic [:milan] from comment #11) > Hopefully not bug 1287075, but right now my money is on that. I say > hopefully, because that was a backout of a part of bug 12732250 that landed > in May. Botond, maybe we need to tweak that backout? It's not obvious that this was caused by bug 1287075. The change from that bug only touched FrameLayerBuilder.cpp and there are no callers to DisplayListClipState::AutoSaveRestore::AutoSaveRestore in that file, not even inlined. However, the code generated by FrameLayerBuilder.cpp may be close in proximity to the code for DisplayListClipState::AutoSaveRestore in the generated XUL binary. I don't think we can tweak something specific to that patch in order to work around this bug.

Flags: needinfo?(botond)

Markus Stange [:mstange]

Comment 13

•

9 years ago

Actually, since both FrameLayerBuilder.cpp and DisplayListClipState.cpp are in layout/base/, the files might have been unified into the same compilation unit, which would mean they'd end up very close in the binary. So you may be right about bug 1287075 being the regressor.

Botond Ballo [:botond]

Comment 14

•

9 years ago

(In reply to Markus Stange [:mstange] from comment #13) > Actually, since both FrameLayerBuilder.cpp and DisplayListClipState.cpp are > in layout/base/, the files might have been unified into the same compilation > unit, which would mean they'd end up very close in the binary. Why is them being close in the binary a reason for a change to one to cause a crash in the other? Is the theory that the actual problem is general memory corruption / some other kind of undefined behaviour, which just tends to manifest in the form of crashes in AutoSaveRestore (possibly among other things), and a change to code "close by" made it hit AutoSaveRestore more often?

Markus Stange [:mstange]

Comment 15

•

9 years ago

(In reply to Botond Ballo [:botond] from comment #14) > (In reply to Markus Stange [:mstange] from comment #13) > > Actually, since both FrameLayerBuilder.cpp and DisplayListClipState.cpp are > > in layout/base/, the files might have been unified into the same compilation > > unit, which would mean they'd end up very close in the binary. > > Why is them being close in the binary a reason for a change to one to cause > a crash in the other? Well, it's more of a guess, I wouldn't actually put money on this assertion. The crash is caused by a processor bug that, under some conditions, causes the program counter to jump slightly too far and end up in the middle of an instruction, so the processor will interpret the bytes at that location as a different instruction and crash sooner or later. See e.g. bug 772330 comment 12 for reference. So since this crash depends on the target location of a jump, moving code to a different offset can affect it. (This is why this processor bug was worked around in the JIT by adding nops around the critical location, see bug 1281759.) And my guess is that any code shifting caused by nearby code is less likely to be compensated for by e.g. nop alignment padding than shifting due to code that's further away. But that sounds really hand-wavey, so maybe it's not true at all.

Marcia Knous [:marcia]

Reporter

Comment 16

•

9 years ago

In early B4 data I am seeing a low volume crash (so far) in [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | mozilla::DisplayListClipState::AutoSaveRestore::~AutoSaveRestore].

[:philipp]

Comment 17

•

9 years ago

the signature here is gone with in 49.0b4 & "AuthenticAMD family 20" cpus are back to their normal crashing pattern so i'll mark this bug as resolved. curiously there seems to be a different cpu-specific spike, this time with a particular intel class in 49.0b4 again though (signatures starting with arena...): https://crash-stats.mozilla.com/search/?product=Firefox&process_type=browser&cpu_info=%5EGenuineIntel%20family%206%20model%2061%20stepping&version=49.0b4&_sort=-date&_facets=signature&_facets=user_comments&_facets=adapter_vendor_id&_facets=platform_pretty_version&_facets=cpu_info&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Status: NEW → RESOLVED

Closed: 9 years ago

status-firefox49: affected → unaffected

Resolution: --- → WORKSFORME

Calixte Denizet (:calixte)

Comment 18

•

9 years ago

49 is obviously affected but clearly no more so I set status to wontfix because we don't know how to fix it.

status-firefox49: unaffected → wontfix

Sylvestre Ledru [:Sylvestre]

Updated

•

9 years ago

Resolution: WORKSFORME → WONTFIX

David Bolter [:davidb] (NeedInfo me for attention)

Comment 19

•

9 years ago

Liz do you want this to continue blocking so that we keep an eye on it?

Flags: needinfo?(lhenry)

Liz Henry (:lizzard) (relman/hg->git project)

Comment 20

•

9 years ago

With the wontfix flag on status-firefox49, it shouldn't show up in triage unless the bug re-opens. So I like keeping the flag set to "blocking", just in case.

Flags: needinfo?(lhenry)