Closed Bug 1149318 Opened 5 years ago Closed 4 years ago

crash in SkFontMgr_DirectWrite::SkFontMgr_DirectWrite(IDWriteFactory*, IDWriteFontCollection*, wchar_t*, int)

Categories

(Core :: Graphics: Text, defect, critical)

x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla44
Tracking Status
firefox41 --- wontfix
firefox42 + fixed
firefox43 + fixed
firefox44 --- fixed

People

(Reporter: aklotz, Assigned: dmajor)

References

Details

(Keywords: crash, topcrash-win, Whiteboard: gfx-noted)

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-2c594b93-4ca0-4c98-ba6b-e6baf2150330.
=============================================================
Mentioned by a reddit user. See
http://www.reddit.com/r/firefox/comments/30tewy/strange_crash_always_happens_when_visiting_a/
From the reddit post, it looks like it might be a driver issue. Let's see if this still comes up after this user updates their drivers.
Whiteboard: gfx-noted
https://crash-stats.mozilla.com/report/index/ccadcc4a-46d6-49ff-9661-33e9a2150507

I have the latest nvidia drivers for my GPU and got this crash just now. My Intel GPU drivers are quite old however, and I believe Firefox was using the Intel GPU.

I will update my drivers after work and see if I get this crash again.
Have been seeing this on several different machines on Google Maps Street View.  I have HWA disabled (Firefox disables it automatically anyway), but hardware fonts enabled via gfx.font_rendering.directwrite.enabled.  Disabling hardware fonts prevents the crash but makes websites look terrible.

Crash 100% reproducable by visiting Google Maps and attempting to go into Street View.  On other browsers I notice that Street View now shows street names projected on the ground.  I do not recall seeing that feature at all prior to the crashing, so I suspect that's the cause.

https://crash-stats.mozilla.com/report/index/ae42bbdb-26fc-435a-97ae-61e0e2150519
I've been receiving reports from customers that text operations were not working on the website, so I decided to investigate. And to my surprise only 1 of the machines in the office was able to replicate the issue.

While investigating the issue I attempted to load https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API/Tutorial/Drawing_text and this caused the browser to crash. Upon submitting the crash report (https://crash-stats.mozilla.com/report/index/e284dc1a-dc6c-4002-91de-90e852150629) all canvas text operations seem to be working flawlessly.
Duplicate of this bug: 1185017
Assignee: nobody → jdaggett
I haven't experienced this issue for a long time and can visit all of the links in prior comments without Firefox crashing. I still have DirectWrite and HW accel on.

I updated my Intel GPU drivers to the latest version, which I believe fixed the problem in my case.
Breakdown of dwrite.dll versions for recent beta crashreports:

  15 6.1.7600.16385
   7 6.1.7600.16699
   3 6.1.7600.16972
   1 6.1.7600.20602
  11 6.1.7601.17514
   2 6.1.7601.18834
   1 6.2.9200.16384
  17 6.2.9200.16492
  11 6.2.9200.16571
  15 6.2.9200.17292
  68 6.2.9200.17461
  12 6.3.9600.16384
   1 6.3.9600.17111
   1 6.3.9600.17415
   2 6.3.9600.17795
   2 6.3.9600.17999

Whatever is going on here is basically spread evenly across all versions of DirectWrite.
This is canvas text related, based both on the codepath and on the comments here.

Examples:

https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API/Tutorial/Drawing_text
any Google street view

Comment 4 suggests this may occur in D2D disabled, DirectWrite manually enabled cases.
Keywords: testcase-wanted
Counts of the AdapterVendorId info from beta crashreports (41.0b). Doesn't seem to be specific to a single given driver/vendor. 0x8086 is Intel?
For recent beta crashreports, this is #39.
Keywords: topcrash-win
(In reply to John Daggett (:jtd) from comment #10)
> 0x8086 is Intel?

Can confirm 0x8086 is Intel, checked my laptop's About:Support page.
This is showing up in rather high volume on 42b1.

There's definitely some corruption going on; one of the parameters on the stack (I think |factory| but I'm not certain) has been clobbered and replaced by a value derived form the user's locale setting.

For example, en-US users tend to crash on address 0x006e0065 ('en'), and es-ES users tend to crash on 0x00730065 ('es'), etc.

The majority of crash reports have Intel driver 8.15.10.2697. John, is there anything we can do in terms of blocklisting etc. to avoid whatever codepath causes this corruption?
Flags: needinfo?(jdaggett)
[Tracking Requested - why for this release]: This is near the top of the crash scores on 42b1
Oh you know what - I didn't even realize that there's a |localeName| parameter involved in that constructor; that could be relevant!

Sounds like there's a level of indirection missing? The code was expecting a pointer to that string, but got the raw chars reinterpreted as a pointer?
> one of the parameters on the
> stack (I think |factory| but I'm not certain) has been clobbered
I was off by one; the bogus pointer value is actually where |fontCollection| was supposed to be. 

Here's a stack from a crash dump, showing the call between SkFontMgr_New_DirectWrite and SkFontMgr_DirectWrite::SkFontMgr_DirectWrite:

002e68dc  601e1ecd xul!SkFontMgr_New_DirectWrite+0x82 <-- return address
002e68e0  001c8b98 <-- factory
002e68e4  006e0065 <-- fontCollection
002e68e8  002e6908 <-- localeName
002e68ec  00000006 <-- localeNameLen
002e68f0  002e6be8
002e68f4  00000000
002e68f8  0083f020
002e68fc  774c03f3 
002e6900  006e0065 <-- 'e' 'n'
002e6904  0055002d <-- '-' 'U'
002e6908  00000053 <-- 'S' '\0'

(Note, no 'this' pointer because it's passed in the ecx register)

The |localeName| parameter actually points to 2e6908, which is the string "S", not "en-US"! Looking at the disassembly, I want to say that 0x2e6900 was actually intended to be the storage space for the caller's |sysFontCollection| variable.

My theory is: |localeNameStorage| was always meant to be 0x2e6908, and we did getUserDefaultLocaleNameProc(0x2e6908), but the function actually wrote to 0x2e6900 instead, clobbering the |sysFontCollection| variable.
Another possibility is that |localeNameStorage| was (still) meant to be 0x2e6908, but we accidentally called getUserDefaultLocaleNameProc(0x2e6900).

How can I make this function get executed? I'd like to step through it under a debugger.
(In reply to David Major [:dmajor] from comment #13)
> This is showing up in rather high volume on 42b1.
> 
> There's definitely some corruption going on; one of the parameters on the
> stack (I think |factory| but I'm not certain) has been clobbered and
> replaced by a value derived form the user's locale setting.
> 
> For example, en-US users tend to crash on address 0x006e0065 ('en'), and
> es-ES users tend to crash on 0x00730065 ('es'), etc.
> 
> The majority of crash reports have Intel driver 8.15.10.2697. John, is there
> anything we can do in terms of blocklisting etc. to avoid whatever codepath
> causes this corruption?

I don't really know. Since this is related to the use of Skia in 2D canvas this may be a better question for Bas and/or Jeff M. Last time I checked, the Skia trunk code in this area hadn't changed much but it would probably be a good idea to reach out to them to see if they are aware of this. Guessing Chrome may have hit this once they enabled the use of DirectWrite. Jeff/Bas, any ideas?
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(jdaggett)
Flags: needinfo?(bas)
(In reply to David Major [:dmajor] from comment #17)
> Another possibility is that |localeNameStorage| was (still) meant to be
> 0x2e6908, but we accidentally called getUserDefaultLocaleNameProc(0x2e6900).
> 
> How can I make this function get executed? I'd like to step through it under
> a debugger.

Google maps possibly?
(In reply to John Daggett (:jtd) from comment #19)
> (In reply to David Major [:dmajor] from comment #17)
> > Another possibility is that |localeNameStorage| was (still) meant to be
> > 0x2e6908, but we accidentally called getUserDefaultLocaleNameProc(0x2e6900).
> > 
> > How can I make this function get executed? I'd like to step through it under
> > a debugger.
> 
> Google maps possibly?

No luck on google maps, nor on agar.io which is the most commonly reported URL. Does the use of this codepath depend on my gfx hardware and/or prefs?
FWIW, I don't think we should be using Skia with DirectWrite. We're only supposed to be using dwrite with D2D (with fallback to cairo). So does anyone know why we're using dwrite with skia?
Flags: needinfo?(jmuizelaar)
I suspect the large increase here is caused by bug 1208833.
Depends on: 1208833
Flags: needinfo?(bas)
I got tired of trying to execute this function "for real", so I executed it artificially by setting my instruction pointer to SkFontMgr_New_DirectWrite.

The stack pointer gets corrupted after the call to getUserDefaultLocaleNameProc, due to a bad calling convention (missing WINAPI) on:
  typedef int (*SkGetUserDefaultLocaleNameProc)(LPWSTR, int);

As far as I can tell, _any_ machine that executes SkFontMgr_New_DirectWrite will crash; for some reason though, only certain machines are coming down this codepath to begin with.
The code is correct upstream: https://github.com/google/skia/blob/master/src/utils/win/SkDWrite.h#L52

How does one bring in the fix?
Flags: needinfo?(gwright)
[Tracking Requested - why for this release]: This long-standing crash is spiking on version 42. The fix should be trivial and extremely safe. I think we should fix on as many channels as possible, maybe even ESR?
Should just be as simple as applying the upstream patch and landing it. As it's already upstream, there's no need to keep track of it on our end.
Flags: needinfo?(gwright) → needinfo?(lsalzman)
(In reply to George Wright (:gw280) from comment #27)
> Should just be as simple as applying the upstream patch and landing it. As
> it's already upstream, there's no need to keep track of it on our end.

If it's already upstream and it is part of resolving this issue, then feel free to just attach the patch for checkin here.
Flags: needinfo?(lsalzman)
Copied the fix from upstream
Assignee: jdaggett → dmajor
Attachment #8668014 - Flags: review?(lsalzman)
Make us match upstream with WINAPI usage on the typedef, as requested by :dmajor
Attachment #8668015 - Flags: review?(gwright)
Oops. All yours! :)
Assignee: dmajor → lsalzman
Attachment #8668015 - Attachment is obsolete: true
Attachment #8668015 - Flags: review?(gwright)
Comment on attachment 8668014 [details] [diff] [review]
Fix the calling convention on SkGetUserDefaultLocaleNameProc

Already upstreamed, so no problem.
Attachment #8668014 - Flags: review?(lsalzman) → review+
Tagback! :) If you are the one investigating and verifying the issue, I'd rather you remain the assignee.
Assignee: lsalzman → dmajor
I've confirmed with a debugger that this patch fixes the issue. The code now finds the correct values on the stack.
Given that this has been in upstream for a while and we can confirm in a debugger, let's get this nominated for uplift very soon, so that it's in the beta the goes to build tomorrow.
Comment on attachment 8668014 [details] [diff] [review]
Fix the calling convention on SkGetUserDefaultLocaleNameProc

Approval Request Comment
[Feature/regressing bug #]: bug 1017113
[User impact if declined]: repeated crashes on certain machines
[Describe test coverage new/current, TreeHerder]: we can't test this in automation, but I have tested it under a debugger
[Risks and why]: Trivial patch, extremely safe, and necessary for correctness
[String/UUID change made/needed]: none
Attachment #8668014 - Flags: approval-mozilla-beta?
Attachment #8668014 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/d3f16a3d97a1
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
Comment on attachment 8668014 [details] [diff] [review]
Fix the calling convention on SkGetUserDefaultLocaleNameProc

Fix a crash, Should be in 42b3.
Attachment #8668014 - Flags: approval-mozilla-beta?
Attachment #8668014 - Flags: approval-mozilla-beta+
Attachment #8668014 - Flags: approval-mozilla-aurora?
Attachment #8668014 - Flags: approval-mozilla-aurora+
(In reply to Carsten Book [:Tomcat] from comment #40)
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> aurora&revision=3a47df4bf819

erm https://hg.mozilla.org/releases/mozilla-aurora/rev/3a47df4bf819 of course :)
Thanks David for all your work tracking this down!!
You need to log in before you can comment on or make changes to this bug.