Closed Bug 1294963 Opened 8 years ago Closed 8 years ago

Crash in mozilla::DisplayListClipState::AutoSaveRestore::AutoSaveRestore

Categories

(Core :: JavaScript Engine: JIT, defect)

48 Branch
x86
Windows 8
defect
Not set
critical

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox49 blocking wontfix

People

(Reporter: marcia, Unassigned)

References

Details

(Keywords: crash, regression, Whiteboard: [gfx-noted])

Crash Data

[Tracking Requested - why for this release]: New crash that emerged in B3.

This bug was filed from the Socorro interface and is 
report bp-d7c7980c-e494-4805-8111-061802160813.
=============================================================

http://bit.ly/2b5Icy1. Reports all seem to have AMD graphic cards.

Possible regression Bug 1281759?
Flags: needinfo?(jdemooij)
Whiteboard: [gfx-noted]
[Tracking Requested - why for this release]:
I guess you meant 49
ni liz if she wants to backout bug 1281759
Flags: needinfo?(lhenry)
Seems like a good idea. I asked jandem in bug 1281759.
Flags: needinfo?(lhenry)
#1 top crash on beta 3, with 3864 crashes in the last week. That's really bad and yes let's back out the patch that caused it!
Blocks: amdbug
Component: Graphics → JavaScript Engine: JIT
Why do you think this is a regresion from bug 1281759 rather than just yet another instance of bug 772330?
Flags: needinfo?(lhenry)
I don't see how bug 1281759 is related. All it does is (1) set a flag if we're using CPU family X model Y (2) if that flag is set, insert some NOP instructions. It's pretty straight-forward.

So I agree with comment 4 that this is likely another instance of bug 772330 (and bug 1281759 fixed another instance of this in JIT code).

The one here seems very frequent though. If there's no fix and no other backout candidate we can try to back out bug 1281759 I guess, but I don't see how it's related and it won't tell us much because the bug may not show up in all builds.
Flags: needinfo?(jdemooij)
Did we see a corresponding decrease in the JIT-y variants of the AMD bug in that beta? Maybe you fixed those crashes but they all just ended up here instead?
Oh, ok. I think I see the issue, I was a bit hasty. So, if this was the AMD build issue, then our beta 4 build (without backing out) should not see this high crash rate. We should wait and see what things look like with beta 4, which should release a bit later today.
Flags: needinfo?(lhenry)
(In reply to Andrew McCreight [:mccr8] from comment #6)
> Did we see a corresponding decrease in the JIT-y variants of the AMD bug in
> that beta? Maybe you fixed those crashes but they all just ended up here
> instead?

i don't think so: 49.0b2 had 500 crashes from the infamous "AuthenticAMD family 20" cpu family, whereas 49.0b3 has registered 4600 crashes (~4000 just for this bug's signature) from "AuthenticAMD family 20" cpus so far. so those [@ mozilla::DisplayListClipState::AutoSaveRestore::AutoSaveRestore] crashes seemed to occur in addition to the "normal" crashing level of that cpu family...
What are the code changes applied to beta between beta 2 and beta 3?
Hopefully not bug 1287075, but right now my money is on that.  I say hopefully, because that was a backout of a part of bug 12732250 that landed in May.  Botond, maybe we need to tweak that backout?
Flags: needinfo?(botond)
(In reply to Milan Sreckovic [:milan] from comment #11)
> Hopefully not bug 1287075, but right now my money is on that.  I say
> hopefully, because that was a backout of a part of bug 12732250 that landed
> in May.  Botond, maybe we need to tweak that backout?

It's not obvious that this was caused by bug 1287075. The change from that bug only touched FrameLayerBuilder.cpp and there are no callers to DisplayListClipState::AutoSaveRestore::AutoSaveRestore in that file, not even inlined. However, the code generated by FrameLayerBuilder.cpp may be close in proximity to the code for DisplayListClipState::AutoSaveRestore in the generated XUL binary.
I don't think we can tweak something specific to that patch in order to work around this bug.
Flags: needinfo?(botond)
Actually, since both FrameLayerBuilder.cpp and DisplayListClipState.cpp are in layout/base/, the files might have been unified into the same compilation unit, which would mean they'd end up very close in the binary. So you may be right about bug 1287075 being the regressor.
(In reply to Markus Stange [:mstange] from comment #13)
> Actually, since both FrameLayerBuilder.cpp and DisplayListClipState.cpp are
> in layout/base/, the files might have been unified into the same compilation
> unit, which would mean they'd end up very close in the binary.

Why is them being close in the binary a reason for a change to one to cause a crash in the other?

Is the theory that the actual problem is general memory corruption / some other kind of undefined behaviour, which just tends to manifest in the form of crashes in AutoSaveRestore (possibly among other things), and a change to code "close by" made it hit AutoSaveRestore more often?
(In reply to Botond Ballo [:botond] from comment #14)
> (In reply to Markus Stange [:mstange] from comment #13)
> > Actually, since both FrameLayerBuilder.cpp and DisplayListClipState.cpp are
> > in layout/base/, the files might have been unified into the same compilation
> > unit, which would mean they'd end up very close in the binary.
> 
> Why is them being close in the binary a reason for a change to one to cause
> a crash in the other?

Well, it's more of a guess, I wouldn't actually put money on this assertion.

The crash is caused by a processor bug that, under some conditions, causes the program counter to jump slightly too far and end up in the middle of an instruction, so the processor will interpret the bytes at that location as a different instruction and crash sooner or later. See e.g. bug 772330 comment 12 for reference.

So since this crash depends on the target location of a jump, moving code to a different offset can affect it. (This is why this processor bug was worked around in the JIT by adding nops around the critical location, see bug 1281759.) And my guess is that any code shifting caused by nearby code is less likely to be compensated for by e.g. nop alignment padding than shifting due to code that's further away. But that sounds really hand-wavey, so maybe it's not true at all.
In early B4 data I am seeing a low volume crash (so far) in [@ arena_dalloc_small | je_free | nsTArray_base<T>::ShiftData<T> | mozilla::DisplayListClipState::AutoSaveRestore::~AutoSaveRestore].
the signature here is gone with in 49.0b4 & "AuthenticAMD family 20" cpus are back to their normal crashing pattern so i'll mark this bug as resolved.

curiously there seems to be a different cpu-specific spike, this time with a particular intel class in 49.0b4 again though (signatures starting with arena...): 
https://crash-stats.mozilla.com/search/?product=Firefox&process_type=browser&cpu_info=%5EGenuineIntel%20family%206%20model%2061%20stepping&version=49.0b4&_sort=-date&_facets=signature&_facets=user_comments&_facets=adapter_vendor_id&_facets=platform_pretty_version&_facets=cpu_info&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
49 is obviously affected but clearly no more so I set status to wontfix because we don't know how to fix it.
Resolution: WORKSFORME → WONTFIX
Liz do you want this to continue blocking so that we keep an eye on it?
Flags: needinfo?(lhenry)
With the wontfix flag on status-firefox49, it shouldn't show up in triage unless the bug re-opens. So I like keeping the flag set to "blocking", just in case.
Flags: needinfo?(lhenry)
You need to log in before you can comment on or make changes to this bug.