Closed Bug 705287 Opened 13 years ago Closed 6 years ago

Hang in gfxDWriteFontList::DelayedInitFontList

Categories

(Core :: Graphics, defect)

x86
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED WONTFIX

People

(Reporter: scoobidiver, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: hang, Whiteboard: [Snappy:P1])

Crash Data

It's a new crash signature and is #73 top crasher in today's build.

The regression range is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3c8147998124&tochange=de483d897af4
The likely culprit is bug 627842.

Signature	chromehang | InitializeFontMetrics(OpenTypeFace const&, DWRITE_FONT_METRICS*, PhysicalFontFace const*)
UUID	d666d9b5-6687-435c-b471-ec5662111125
Date Processed	2011-11-25 04:55:53.367164
Uptime	51
Install Age	20.6 hours since version was first installed.
Install Time	2011-11-24 16:16:16
Product	Firefox
Version	11.0a1
Build ID	20111124031031
Release Channel	nightly
OS	Windows NT
OS Version	6.1.7601 Service Pack 1
Build Architecture	x86
Build Architecture Info	GenuineIntel family 6 model 23 stepping 6
Crash Reason	EXCEPTION_BREAKPOINT
Crash Address	0x6e3e193d
User Comments	
App Notes 	AdapterVendorID: 10de, AdapterDeviceID: 0640, AdapterSubsysID: 00000000, AdapterDriverVersion: 8.17.12.6099
D2D? D2D+
DWrite? DWrite+
xpcom_runtime_abort(###!!! ABORT: HangMonitor triggered: file e:/builds/moz2_slave/m-cen-w32-ntly/build/xpcom/threads/HangMonitor.cpp, line 111)
Processor Notes 	
EMCheckCompatibility	True

Thread 0
Frame 	Module 	Signature [Expand] 	Source
0 	DWrite.dll 	InitializeFontMetrics 	
1 	DWrite.dll 	FontCollectionBuilder::Builder::AddFontInternal 	
2 	DWrite.dll 	FontCollectionBuilder::Builder::AddFont 	
3 	DWrite.dll 	FontCollectionBuilder::Builder::AddFile 	
4 	DWrite.dll 	FontCollectionBuilder::FontCollectionBuilder 	
5 	DWrite.dll 	FontCollectionElement::AddToCacheImpl 	
6 	DWrite.dll 	CacheWriter::AddElement 	
7 	DWrite.dll 	ClientSideCacheContext::ClientLookup 	
8 	DWrite.dll 	ClientSideCacheContext::InitializeElementImpl 	
9 	DWrite.dll 	FontCollectionElement::FontCollectionElement 	
10 	DWrite.dll 	DWriteFontCollection::DWriteFontCollection 	
11 	DWrite.dll 	ComObject<DWriteFontCollection>::ComObject<DWriteFontCollection><unsigned __int64*,unsigned int,unsigned __int64,IntrusivePtr<ClientSideCacheContext>,DWriteFactory*,FontCollection> 	
12 	DWrite.dll 	InnerComObject<DWriteFactory,DWriteFontCollection>::InnerComObject<DWriteFactory,DWriteFontCollection><unsigned __int64*,unsigned int,unsigned __int64,IntrusivePtr<ClientSideCacheContext>,DWriteFactory*,FontCollection> 	
13 	DWrite.dll 	DWriteFactory::GetSystemFontCollectionInternal 	
14 	DWrite.dll 	DWriteFactory::GetSystemFontCollection 	
15 	xul.dll 	gfxDWriteFontList::DelayedInitFontList 	gfx/thebes/gfxDWriteFontList.cpp:764
16 	xul.dll 	gfxDWriteFontList::ResolveFontName 	gfx/thebes/gfxDWriteFontList.cpp:1145
17 	xul.dll 	gfxWindowsPlatform::ResolveFontName 	gfx/thebes/gfxWindowsPlatform.cpp:591
18 	xul.dll 	gfxFontGroup::ForEachFontInternal 	gfx/thebes/gfxFont.cpp:2373
19 		@0xe 	
20 		@0x65006c 	

More reports at:
https://crash-stats.mozilla.com/report/list?signature=chromehang%20|%20InitializeFontMetrics%28OpenTypeFace%20const%26%2C%20DWRITE_FONT_METRICS*%2C%20PhysicalFontFace%20const*%29
Crash Signature: [@ chromehang | InitializeFontMetrics(OpenTypeFace const&, DWRITE_FONT_METRICS*, PhysicalFontFace const*) ] → [@ chromehang | InitializeFontMetrics(OpenTypeFace const&, DWRITE_FONT_METRICS* PhysicalFontFace const*) ] [@ chromehang | `anonymous namespace''::TrueTypeFontMetricsBuilder::GetBlackBox(unsigned short) ] [@ chromehang | FontFileAnalyzer::IsPfm(unsigned…
This is probably exactly the same issue as bug 705258.
Blocks: hang-detector
No longer blocks: font-inflation
Summary: Crash in gfxDWriteFontList::DelayedInitFontList → Hang in gfxDWriteFontList::DelayedInitFontList
(In reply to Scoobidiver from comment #0)
> It's a new crash signature and is #73 top crasher in today's build.
> 
> The regression range is:
> http://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=3c8147998124&tochange=de483d897af4
> The likely culprit is bug 627842.

To be clear:  "chromehang | " is a new type of crash report, introduced by bug 429592, which landed in that range.
Some of the stacks in the dumps are incomplete but most of the stacks take place within the call to DWriteFactory::GetSystemFontCollection, the DirectWrite call that returns the object needed to look up names.

DirectWrite dll version:

Module|DWrite.dll|6.1.7600.16385
Module|DWrite.dll|6.1.7600.16385
Module|DWrite.dll|6.1.7600.16385
Module|DWrite.dll|6.1.7600.16385
Module|DWrite.dll|6.1.7600.16385
Module|DWrite.dll|6.1.7601.17514
Module|DWrite.dll|6.1.7601.17514
Module|DWrite.dll|6.1.7601.17563
Module|DWrite.dll|6.1.7601.17563
Module|DWrite.dll|6.1.7601.17563
Module|DWrite.dll|6.1.7601.17563
Module|DWrite.dll|6.1.7601.17563
Module|DWrite.dll|7.0.6002.18409

The delay problem in GetSystemFontCollection was fixed by Microsoft in KB2505438, which included DWrite.dll versions 6.1.7600.16763 and higher (non-SP1) and 6.1.7601.17563 and higher (SP1).  Windows 8 should't have this problem at all.

This data seems to indicate that the problem is not completely fixed, 7/14 cases have supposedly fixed versions of the DWrite.dll.
Keywords: crash, regressionhang
Scoobidiver, if we have a crash signature, we should leave the crash keyword on it or else Socorro will probably not find it any more.
Keywords: crash
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #4)
> Scoobidiver, if we have a crash signature, we should leave the crash keyword
> on it or else Socorro will probably not find it any more.
I don't think the crash keyword is required so that the bug shows up in Socorro. See bug 705761 for instance.
In addition, there are many crashes in Bugzilla with no crash keyword.
Yeah, I also don't think chromehang's should be tagged with 'crash' since that will directly affect crash-related metrics when they *aren't* crashes but hangs.
(In reply to Scoobidiver from comment #5)
> I don't think the crash keyword is required so that the bug shows up in
> Socorro. See bug 705761 for instance.

Interesting, then I may be wrong. OK, I'll probably need to rewrite some other of my tooling so that "crash" and "hang" are counted in it as "crash bugs", I guess, or my analysis of issues to track in CrashKill is wrong... But thanks for making me realize that.

> In addition, there are many crashes in Bugzilla with no crash keyword.

Crashes without the crash keyword are wrong, that needs to be changed.
Keywords: crash
Assignee: nobody → jdaggett
Whiteboard: [Snappy:P1]
Need to track down the DirectWrite DLL version associated with the outliers here.  There's a known MS bug with the timing of DirectWrite font cache startup than can cause this, but MS claims this is fixed.

Telemetry variable: DWRITEFONT_DELAYEDINITFONTLIST_COLLECT
Blocks: 859558
As mentioned in comment 3, this is fundamentally an OS-level bug that occurs because of a timing issue at *cold* startup between the system font cache and the dwrite system library.  Microsoft has said they fixed this previously so hopefully as users update their OS the problem will disappear altogether.  The only "solution" here is to basically notice when a long delay has occurred and effectively disable hardware acceleration.  Given the *very* small percentage of users that are affected by this, I don't think that's the right way to go but it's certainly a possibility.

Bug 706340 added gfxinfo to telemetry data.  So we should be able to use that to answer the question:

How many users of FF {23,24,25} hit a delay here of more that 2secs and what versions of DirectWrite are they using?

Vladan, could you look at the telemetry data for this?  Or point me at the docs on how to pull this out?

Telemetry item: DWRITEFONT_DELAYEDINITFONTLIST_COLLECT

The latest version of the telemetry metrics interface doesn't seem to have an export feature, that's kind of unfortunate.

http://telemetry.mozilla.org/#path=release/25/DWRITEFONT_DELAYEDINITFONTLIST_COLLECT
Flags: needinfo?(vdjeric)
Depends on: 863122
The chromehangs link I emailed you also contains info on hang stacks involving gfxDWriteFontList::DelayedInitFontList and the associated DWriteVersions.

I took a quick look at the top 5 hang signatures involving DelayedInitFontList in that link, and I found the following DWriteVersions (in various proportions):

6.1.7600.16385
6.1.7600.16699
6.1.7600.16763
6.1.7600.16961
6.1.7600.16972
6.1.7600.20655
6.1.7600.20710
6.1.7600.20781
6.1.7600.20830
6.1.7600.21162
6.1.7601.16562
6.1.7601.17105
6.1.7601.17514
6.1.7601.17563
6.1.7601.17776
6.1.7601.17789
6.1.7601.18126
6.1.7601.21935
6.2.9200.16433
6.2.9200.16492
6.2.9200.16571
6.2.9200.16581
7.0.6002.23097

Granted, the vast majority of the hangs reported in this sample are from DWrite versions before Microsoft's fix. Unfortunately, that means real users are still running into this issue.

If you'd like to work with the raw Telemetry data, you can do your analysis directly on our new Telemetry backend:

http://mreid-moz.github.io/blog/2013/11/06/current-state-of-telemetry-analysis/
http://jonasfj.dk/blog/2013/11/telemetry-rebooted-analysis-future/

Please come by the #telemetry channel for more details
Flags: needinfo?(vdjeric)
Latest results from Telemetry stats:
http://jsfiddle.net/69QNY/2/

For the past few months, using Firefox 24 release data:

The DirectWrite call GetSystemFontCollection() takes over 2.3secs
in roughly 2.1 million startups.  This is out of a total of 137
million DirectWrite startups (for comparison, 209 million GDI
startups).

data: release/24/DWRITEFONT_DELAYEDINITFONTLIST_COLLECT
gfxDWriteFontList::DelayedInitFontList GetSystemFontCollection (ms)

index	start	end	count	pct	cum pct
6	647	2324	     480054	0.35	98.45
7	2324	8350	1171404	0.85	99.31
8	8350	30000	 790817	0.58	99.88
9	>30000		     159719	0.12	100.00

total counts: 137304675

data: release/24/GDI_INITFONTLIST_TOTAL
gfxGDIFontList::InitFontList Total (ms)

index	start	end	count	pct	cum pct
6	647	2324	   1712605	0.82	99.56
7	2324	8350	696282	0.33	99.90
8	8350	30000	189563	0.09	99.99
9	>30000		     28550	0.01	100.00

total counts: 209074700

I think we need to figure out two things -- are these cold
startups (i.e. does the startup occur soon after a system
restart) and, for the long delay cases, how many of these are due
to large numbers of fonts on the system?

One "solution" would be to shunt users affected by consistent
long DirectWrite startup times to GDI by effectively disabling
hardware acceleration. This is a bit of a "nuclear solution", the
user would lose the benefits of DirectWrite and related
acceleration.
Distribution of number of fonts at startup, only for DirectWrite users:

data: release/24/DWRITEFONT_DELAYEDINITFONTLIST_COUNT
gfxDWriteFontList::DelayedInitFontList Font Family Count

index	start	end	count	pct	cum pct
0	0	1	0	0.00	0.00
1	1	3	24	0.00	0.00
2	3	10	60	0.00	0.00
3	10	32	3231	0.00	0.00
4	32	101	140744	0.10	0.10
5	101	319	123098407	89.65	89.76
6	319	1006	12889622	9.39	99.15
7	1006	3172	1076709	0.78	99.93
8	3172	10000	94342	0.07	100.00
9	>10000		1539	0.00	100.00

total counts: 137304678

So there's a *very* small percentage of users with more than 3000 fonts
on their system (roughly 0.07% of startups).
Updated telemetry stats don't show any significant change in the percentage of users affected by long times spent in DirectWrite system font collection initialization:

http://jsfiddle.net/eGvw5/4/

release/24 DWRITEFONT_DELAYEDINITFONTLIST_COLLECT
index	start	end	count	pct	cum pct
6	647	2324	427213	0.35	98.46
7	2324	8350	1048207	0.85	99.32
8	8350	30000	701801	0.57	99.89
9	>30000		134983	0.11	100.00

total counts: 122697422

release/27 DWRITEFONT_DELAYEDINITFONTLIST_COLLECT
index	start	end	count	pct	cum pct
6	647	2324	621027	0.37	98.44
7	2324	8350	1491087	0.88	99.32
8	8350	30000	963853	0.57	99.89
9	>30000		180538	0.11	100.00

total counts: 168563706

release/28 DWRITEFONT_DELAYEDINITFONTLIST_COLLECT
index	start	end	count	pct	cum pct
6	647	2324	36820	0.32	98.77
7	2324	8350	81229	0.71	99.48
8	8350	30000	50133	0.44	99.92
9	>30000		9719	0.08	100.00

total counts: 11472782

nightly/31 DWRITEFONT_DELAYEDINITFONTLIST_COLLECT
index	start	end	count	pct	cum pct
6	647	2324	10084	0.42	98.44
7	2324	8350	21144	0.89	99.32
8	8350	30000	13668	0.57	99.90
9	>30000		2487	0.10	100.00

total counts: 2388461

Given that system startup time was added to telemetry data at the end of January, it would be interesting to know the following data:

For the set of telemetry data where (DWRITEFONT_DELAYEDINITFONTLIST_COLLECT > 2324), what's the distribution of time since system startup? Are these all within 60secs of startup?

Additionally, for the set of telemetry data where (DWRITEFONT_DELAYEDINITFONTLIST_COLLECT > 2324) what's the distribution of DirectWrite library versions? Are there recent versions included in that list? This would be useful information to report back to Microsoft.  In the past they've assured me that this problem has been "fixed" in recent versions of the DirectWrite library.
Crash Signature: , PhysicalFontFace const*) ] [@ chromehang | `anonymous namespace''::TrueTypeFontMetricsBuilder::GetBlackBox(unsigned short) ] [@ chromehang | FontFileAnalyzer::IsPfm(unsigned __int64) ] → , PhysicalFontFace const*) ] [@ chromehang | `anonymous namespace''::TrueTypeFontMetricsBuilder::GetBlackBox(unsigned short) ] [@ chromehang | FontFileAnalyzer::IsPfm(unsigned __int64) ] [@ chromehang | InitializeFontMetrics ] [@ chromehang | `anonymo…
Assignee: jd.bugzilla → nobody
Closing because no crash reported since 12 weeks.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Closing because no crash reported since 12 weeks.
You need to log in before you can comment on or make changes to this bug.