Closed Bug 1578722 Opened 5 years ago Closed 2 years ago

Crash in [@ SkScalerContext::AutoDescriptorGivenRecAndEffects] on Intel CPU family 6 model 122 stepping 1

Categories

(Core :: Graphics, defect, P2)

69 Branch
Desktop
Windows 10
defect

Tracking

RESOLVED WORKSFORME
Tracking Status
firefox69 --- wontfix

People

(Reporter: marcia, Unassigned)

Details

(Keywords: crash, regression, steps-wanted)

Crash Data

This bug is for crash report bp-3082dfe6-8d7f-4a3f-baec-c026d0190904.

Seen while looking at release crashes. Currently #4, with no bug associated with it: https://bit.ly/2jZ0rPf. Comments mention repeated crashing.

Top 10 frames of crashing thread:

0 xul.dll SkScalerContext::AutoDescriptorGivenRecAndEffects gfx/skia/skia/src/core/SkScalerContext.cpp:1130
1 xul.dll SkStrikeCache::FindOrCreateStrikeExclusive gfx/skia/skia/src/core/SkStrikeCache.cpp:190
2 xul.dll SkGlyphRunListPainter::drawForBitmapDevice gfx/skia/skia/src/core/SkGlyphRunPainter.cpp:225
3 xul.dll SkBitmapDevice::drawGlyphRunList gfx/skia/skia/src/core/SkBitmapDevice.cpp:541
4 xul.dll SkGlyphRunBuilder::drawTextBlob gfx/skia/skia/src/core/SkGlyphRun.cpp:232
5 xul.dll SkCanvas::onDrawTextBlob gfx/skia/skia/src/core/SkCanvas.cpp:2552
6 xul.dll SkCanvas::drawTextBlob gfx/skia/skia/src/core/SkCanvas.cpp:2573
7 xul.dll mozilla::gfx::DrawTargetSkia::DrawGlyphs gfx/2d/DrawTargetSkia.cpp:1391
8 xul.dll void mozilla::gfx::FillGlyphsCommand::ExecuteOnDT gfx/2d/DrawCommands.h:577
9 xul.dll mozilla::gfx::DrawTarget::DrawCapturedDT gfx/2d/DrawTarget.cpp:167

This crash was mentioned in the Channel meeting yesterday. These are entry-level Intel CPUs (Gemini Lake). Philipp noted we have had trouble with these in previous releases.

Hi Lee, this crash is spiking in 69.0 post-release. Can you please take a look?

Flags: needinfo?(lsalzman)

We have already had build-specific crash signatures spike in the past with this particular CPU, for example bug 1524257, bug 1544192 and bug 1553380.

Priority: -- → P2

Without a repro method, it is difficult to say what is going on here. The stack doesn't peg a specific line where the problem might be occurring, and there are a lot of different objects in play near that area in the stack, none of which looks overtly wrong or likely to be causing the crash. So the first step here would be to get some sort of initial lead on the cause, so that we can reproduce it.

Flags: needinfo?(lsalzman)

The comments aren't really useful in terms of getting any steps - they just mention repeated crashing. My guess is we would have to get a machine with this spec if we wanted to reproduce. Some correlations:

(100.0% in signature vs 01.83% overall) CPU Info = family 6 model 122 stepping 1
(97.87% in signature vs 01.57% overall) address = 0x5a
(100.0% in signature vs 07.92% overall) reason = EXCEPTION_ACCESS_VIOLATION_WRITE
(20.43% in signature vs 99.99% overall) graphics_startup_test = null
(33.23% in signature vs 00.95% overall) adapter_vendor_id = 0x00ba [61.24% vs 01.35% if process_type = content]
(95.43% in signature vs 41.98% overall) platform_pretty_version = Windows 10
(20.43% in signature vs 73.86% overall) app_init_dlls = null
(35.06% in signature vs 00.84% overall) adapter_device_id = 0x3185 [48.67% vs 01.13% if startup_crash = 0]
(30.79% in signature vs 00.56% overall) adapter_device_id = 0x3184 [46.76% vs 01.01% if adapter_vendor_id = 0x8086]
(100.0% in signature vs 61.66% overall) cpu_arch = amd64
(25.30% in signature vs 03.59% overall) bios_manufacturer = Insyde Corp. [42.13% vs 03.26% if process_type = content]
(95.12% in signature vs 47.78% overall) Module "wshbth.dll" = true [92.00% vs 57.85% if platform_version = 10.0.17134]
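One rough way to read the correlations above is to rank them by how over-represented each value is in this signature. The sketch below is only an illustration: the "lift" ratio (signature % divided by overall %) and the helper name `parse_correlation` are my own assumptions, not how crash-stats computes or orders its correlations. The sample lines are copied from the list above.

```python
import re

# Matches "(X% in signature vs Y% overall) field = value [optional condition]".
PATTERN = re.compile(
    r"^\(([\d.]+)% in signature vs ([\d.]+)% overall\) (.*?)(?: \[.*\])?$"
)

def parse_correlation(line):
    """Return (description, signature_pct, overall_pct, lift) for one line.

    "lift" is a hypothetical ranking metric: signature % / overall %.
    """
    match = PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized correlation line: {line!r}")
    sig_pct = float(match.group(1))
    overall_pct = float(match.group(2))
    return match.group(3), sig_pct, overall_pct, sig_pct / overall_pct

# A few lines taken verbatim from the correlations in this comment.
lines = [
    "(100.0% in signature vs 01.83% overall) CPU Info = family 6 model 122 stepping 1",
    "(97.87% in signature vs 01.57% overall) address = 0x5a",
    "(100.0% in signature vs 07.92% overall) reason = EXCEPTION_ACCESS_VIOLATION_WRITE",
    "(20.43% in signature vs 99.99% overall) graphics_startup_test = null",
    "(30.79% in signature vs 00.56% overall) adapter_device_id = 0x3184 [46.76% vs 01.01% if adapter_vendor_id = 0x8086]",
    "(100.0% in signature vs 61.66% overall) cpu_arch = amd64",
]

# Most over-represented values first.
ranked = sorted((parse_correlation(l) for l in lines), key=lambda r: r[3], reverse=True)
for description, sig_pct, overall_pct, lift in ranked:
    print(f"{lift:7.2f}x  {description}")
```

With this ranking, the crash address and the CPU model stand out far more than generic fields such as cpu_arch, which fits the guess that a machine with this spec is needed to reproduce.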

Looks like the 70.0b4 beta build is also affected

Summary: Crash in [@ SkScalerContext::AutoDescriptorGivenRecAndEffects] → Crash in [@ SkScalerContext::AutoDescriptorGivenRecAndEffects] on Intel CPU family 6 model 122 stepping 1
See Also: → 1553380

In case an affected user ends up reading this bug report: according to the Chrome thread and our stability data, switching to a 32-bit version of the browser might fix this crash pattern. You can get the 32-bit installer from https://www.mozilla.org/en-US/firefox/all/

Chrome landed a speculative workaround for this, not sure if it can apply here too: https://chromium.googlesource.com/v8/v8.git/+/10360127e8bcc4a683ca2f49c0459d548299551b

(In reply to Emilio Cobos Álvarez (:emilio) from comment #9)

> Chrome landed a speculative workaround for this, not sure if it can apply here too: https://chromium.googlesource.com/v8/v8.git/+/10360127e8bcc4a683ca2f49c0459d548299551b

Indeed, we're seeing the same failure mode in this signature. Looking at https://crash-stats.mozilla.org/report/index/52c7bc49-6030-4cca-a701-ac6980191007#tab-rawdump,

0:000> db xul+0x3f1639e-10 L20
0000000183f1638e cc cc 41 57 41 56 56 57-53 48 81 ec b0 00 00 00 ..AWAVVWSH......
0000000183f1639e 4d 89 c6 48 89 d6 48 89-cf 48 8b 05 8a fc 75 01 M..H..H..H....u.

Let's see what happens if the CPU makes the same "off by 16" mistake when crossing the 16-byte boundary, i.e. the bytes past the boundary are taken from 16 bytes earlier:

0:000> eb . 4d 89 41 57; u . L1
ntdll!LdrpDoDebuggerBreak+0x30:
00007ff8`4e4511dc 4d894157 mov qword ptr [r9+57h],r8

In that report, r9 == 3, so r9 + 57h == 0x5a, which matches the crash address in the description.
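The same stitching can be replayed outside the debugger. This is a minimal sketch under stated assumptions: the helper names `stitch_off_by_16` and `decode_mov_store_disp8` are hypothetical, the decoder only handles the single `REX.W 89 /r` form with an 8-bit displacement seen here, and the "off by 16" model is just the hypothesis from this comment, not a confirmed description of the hardware bug. The bytes, addresses and the r9 value are taken from the dump above.

```python
def stitch_off_by_16(dump, dump_base, ip, length):
    """Return `length` bytes starting at `ip`, where every byte at or past the
    next 16-byte boundary is replaced by the byte 16 positions earlier."""
    boundary = (ip | 0xF) + 1  # first address of the next 16-byte line
    out = bytearray()
    for addr in range(ip, ip + length):
        src = addr - 16 if addr >= boundary else addr
        out.append(dump[src - dump_base])
    return bytes(out)

def decode_mov_store_disp8(code):
    """Decode `REX.W 89 /r disp8` (mov [rm+disp8], reg); nothing else."""
    rex, opcode, modrm, disp = code[0], code[1], code[2], code[3]
    assert rex & 0xF0 == 0x40 and rex & 0x08, "expected a REX.W prefix"
    assert opcode == 0x89, "expected MOV r/m64, r64"
    assert modrm >> 6 == 0b01, "expected mod=01 (8-bit displacement)"
    assert modrm & 7 != 4, "SIB byte not handled in this sketch"
    reg = (((rex >> 2) & 1) << 3) | ((modrm >> 3) & 7)  # REX.R extends reg
    rm = ((rex & 1) << 3) | (modrm & 7)                 # REX.B extends rm
    disp8 = disp - 256 if disp >= 128 else disp         # sign-extend
    return reg, rm, disp8

# The 32 bytes shown by `db xul+0x3f1639e-10 L20` in the dump above.
dump_base = 0x183F1638E
dump = bytes.fromhex(
    "cc cc 41 57 41 56 56 57 53 48 81 ec b0 00 00 00"
    "4d 89 c6 48 89 d6 48 89 cf 48 8b 05 8a fc 75 01"
)
ip = 0x183F1639E  # the 3-byte `4d 89 c6` starts here and crosses 0x...3a0

stitched = stitch_off_by_16(dump, dump_base, ip, 4)
reg, rm, disp8 = decode_mov_store_disp8(stitched)

r9 = 3  # register value from the crash report
write_address = r9 + disp8

print(stitched.hex())                   # 4d894157
print(f"mov [r{rm}+{disp8:#x}], r{reg}")  # mov [r9+0x57], r8
print(hex(write_address))               # 0x5a
```

The correct bytes at that address decode to a register-to-register move, so the write to 0x5a only appears once the bytes past the boundary are swapped for the ones 16 bytes earlier.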

I went looking for crashes specific to this CPU and also found the same off-by-16 in style::properties::NonCustomPropertyId::allowed_in in 69.0.1.

Crash Signature: [@ SkScalerContext::AutoDescriptorGivenRecAndEffects] → [@ SkScalerContext::AutoDescriptorGivenRecAndEffects][@ style::properties::NonCustomPropertyId::allowed_in ]

Some maybe-relevant IRC discussion:

19:23 <dmajor> emilio: do you want to try landing chrome's cpu workarounds? I don't know how to word the attribute for the rust one.
20:02 <emilio> dmajor: sorry, was on a meeting. Hmm, not sure `#[repr(align)]` will work on functions...
20:05 <emilio> dmajor: nope, that doesn't seem to work... I'll poke a bit more
20:12 <emilio> dmajor: I guess that what the attribute does in clang is setting the `alignstack(N)`?
20:12 <emilio> dmajor: from http://llvm.org/docs/LangRef.html#function-attributes
20:13 <dmajor> emilio: I don't think it would be related to the stack
20:16 <emilio> dmajor: ah, true, it just emits the "align" attribute in the IR
20:16 <emilio> dmajor: (looking at https://godbolt.org/z/eU9sm8)
20:20 <emilio> dmajor: I don't see anything relevant in https://doc.rust-lang.org/reference/items/functions.html#attributes-on-functions
20:21 <emilio> dmajor: the closest I can see that we could use is https://doc.rust-lang.org/reference/abi.html#the-link_section-attribute, specifying a custom section that we know is well-aligned

Not sure how feasible / reasonable that would be...

Maybe it is easier to do this at the linker level for all functions? Otherwise it may become a game of whack-a-mole.

This CPU bug now affects 82.0.2.

I tried stitching the bytes together as in comment 10, and it doesn't seem to be the same off-by-16 failure mode this time (although I won't rule out the possibility that there are deeper layers of the hardware bug that we don't understand). I suspect that any active intervention we might try would be no more likely to succeed than just spinning a fresh build.

QA Whiteboard: qa-not-actionable
Severity: critical → S2

No crashes on crash stats, decreasing severity -> S3.

Severity: S2 → S3

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME