Closed Bug 711656 Opened 13 years ago Closed 12 years ago

Firefox 11.0a1 Crash in TextStageManager::AddStagesForSubrect with Intel GMA X4500HD, 4500MHD, and HD Graphics up to 8.15.10.2141

Categories

(Core :: Graphics, defect)

11 Branch
All
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla13
Tracking Status
firefox10 --- unaffected
firefox11 + verified
firefox12 + fixed
firefox13 --- fixed
firefox-esr10 --- unaffected

People

(Reporter: marcia, Assigned: drs)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: crash, regression, topcrash, Whiteboard: [startupcrash][qa-] Keep Open!!)

Crash Data

Attachments

(8 files, 4 obsolete files)

5.55 KB, patch
joe
: review+
Details | Diff | Splinter Review
13.83 KB, patch
jrmuizel
: review+
Details | Diff | Splinter Review
4.54 KB, patch
bjacob
: review+
Details | Diff | Splinter Review
146.78 KB, patch
Details | Diff | Splinter Review
1.46 KB, patch
bas.schouten
: review+
Details | Diff | Splinter Review
40.84 KB, image/png
Details
45.42 KB, image/png
Details
29.90 KB, image/png
Details
Seen while looking at Windows crash stats.
https://crash-stats.mozilla.com/report/list?signature=__entry_from_strcat_in_strcpy. Crashes started in crash stats using the 2011120100 build, but there was a rather large spike on 2011121500 with 116 crashes.

Possible pushlog regression range:

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=cc94a16983b0&tochange=e9686560b98d

Frame 	Module 	Signature 	Source
0 	msvcrt.dll 	__entry_from_strcat_in_strcpy 	
1 	d2d1.dll 	TextStageManager::AddStagesForSubrect(IDWritePrivateGlyphRunAnalysis*,SubRectInfo const*) 	
2 	msvcrt.dll 	free 	
3 	DWrite.dll 	DWriteGlyphRunAnalysis::GetAlphaTextureBounds(DWRITE_TEXTURE_TYPE,tagRECT*) 	
4 	DWrite.dll 	GlyphRunAnalysis::GlyphRunAnalysis(ClientSideCacheContext&,FontFaceElement const&,DWRITE_FONT_SIMULATIONS,DWRITE_GLYPH_RUN const*,float,DWRITE_MATRIX const*,DWRITE_RENDERING_MODE,DWRITE_MEASURING_MODE,float,float) 	
5 	d2d1.dll 	HwGlyphRunRealizer::IssueRenderingCommands(BatchedBrush*)
I added related crash signatures that started in 11.0a1/20111215.
Crash Signature: [@ __entry_from_strcat_in_strcpy] [@ memcpy | TextStageManager::AddStagesForSubrect(IDWritePrivateGlyphRunAnalysis*, SubRectInfo const*)] [@ memset | TextStageManager::AddStagesForSubrect(IDWritePrivateGlyphRunAnalysis*, SubRectInfo const*)]
Summary: Firefox 11.0a1 Crash [@ __entry_from_strcat_in_strcpy ] → Firefox 11.0a1 Crash in TextStageManager::AddStagesForSubrect
The stacks here look fairly bogus to me, unfortunately.

Trying to guess what could be going on... the crashes I looked at are all EXCEPTION_ACCESS_VIOLATION_WRITE, and virtually all of them occured at a page boundary (the only exception I saw was at 0x20, which sounds like it's related to an unexpectedly-NULL pointer). For the rest of them, I'm guessing that someone is writing beyond an allocated buffer, perhaps corrupting something else that follows, but crashing when they reach the end of the writable page.

As a precaution, I'm marking this as security-sensitive, given that it looks like there's code writing to memory it shouldn't be touching.... if someone figures out the specific scenario where this happens, they might be able to control what gets corrupted, and hence arrive at some kind of exploit.

The dwrite-related addresses on the stack seem to imply involvement of font/text-rendering code, even though I don't actually believe the backtraces are accurate.

Most of the crashes are happening at startup, and one has a comment "facebook" .... Looking at the possible pushlog range suggested in comment #0, I'm concerned there might be a link with the system-font-fallback telemetry that was added in bug 705590 (though I don't actually see a problem in the patch there). Hitting font fallback on facebook (due to the other-language links) at startup is likely very common.
Group: core-security
Whiteboard: sg:critical?
I don't think this is something due to the Telemetry probe in system font fallback that landed 11/30, there are reports of this happening as far back as June but the problem seems to have grown more frequent around 12/15, as noted in comment 1.  The crashes after 12/15 all seem to happen very close to startup time (uptime = 1, 2, 3, 6, 7, etc.).

DirectWrite GlyphRunAnalysis objects are used in drawing.  Bas's patch for bug 702851 was landed around 12/15, maybe the patch there somehow tickled the code that causes this crash?
I pulled down 167 dumps from the past four weeks.  All of these seem to contain the following stack on one of the non-crashing threads:

0 	ntdll.dll 	ZwGetContextThread 	
1 	KERNELBASE.dll 	KERNELBASE.dll@0x6249f 	
2 	KERNELBASE.dll 	KERNELBASE.dll@0x518a 	
3 	kernel32.dll 	kernel32.dll@0x4addb 	
4 	dbghelp.dll 	Win32LiveSystemProvider::EnumModules 	
5 	KERNELBASE.dll 	KERNELBASE.dll@0x13240 	
6 	kernel32.dll 	kernel32.dll@0xe5eb 	
7 	ntdll.dll 	RtlAllocateMemoryBlockLookaside 	
8 	dbghelp.dll 	NtWin32LiveSystemProvider::EnumFunctionTables 	
9 	ntdll.dll 	RtlNtStatusToDosErrorNoTeb 	
10 	ntdll.dll 	RtlNtStatusToDosErrorNoTeb 	
11 	dbghelp.dll 	AllocMemory 	
12 	dbghelp.dll 	GenGetDebugRecord 	
13 	KERNELBASE.dll 	KERNELBASE.dll@0xc095 	
14 	dbghelp.dll 	GenAllocateModuleObject 	
15 	KERNELBASE.dll 	KERNELBASE.dll@0x1852 	
16 	kernel32.dll 	kernel32.dll@0x22a60 	
17 	dbghelp.dll 	GenGetAuxMemory 	
18 	dbghelp.dll 	GenGetProcessInfo 	
19 	ntdll.dll 	RtlDosApplyFileIsolationRedirection_Ustr 	

Anyone have a clue as to what this might be?  I don't see this in other crashdumps.
I'm guessing Bas would be the best person to track this down.
Assignee: nobody → bas.schouten
That doesn't ring any bells. Can you pull one of these dumps into Visual Studio or WinDBG and get a better stack for that thread?
The crash volume is now the same as before the spike that has decreased slowly.

I added the _VEC_memzero crash signature which has similar stack traces and spiked at the same time.
Crash Signature: [@ __entry_from_strcat_in_strcpy] [@ memcpy | TextStageManager::AddStagesForSubrect(IDWritePrivateGlyphRunAnalysis*, SubRectInfo const*)] [@ memset | TextStageManager::AddStagesForSubrect(IDWritePrivateGlyphRunAnalysis*, SubRectInfo const*)] → [@ __entry_from_strcat_in_strcpy] [@ memcpy | TextStageManager::AddStagesForSubrect(IDWritePrivateGlyphRunAnalysis*, SubRectInfo const*)] [@ memset | TextStageManager::AddStagesForSubrect(IDWritePrivateGlyphRunAnalysis* SubRectInfo const*)] [@ _VEC_mem…
If this now occurs at startup for some configurations, then we may have forced these users to not use Nightly, hence the drop in crashes.
Adding the top crash keyword - #8 on the trunk.
Keywords: topcrash
The _VEC_memzero crash signature is #6 top crasher in 11.0a2 and #30 in 12.0a1.

The first frames of the stack trace are:
Frame 	Module 	Signature [Expand] 	Source
0 	msvcrt.dll 	_VEC_memzero 	
1 	d2d1.dll 	TextStageManager::MapTextureTransferSurface 	
2 	d2d1.dll 	TextStageManager::AddStagesForSubrect 	
3 	d2d1.dll 	HwGlyphRunRealizer::IssueRenderingCommands 	
4 	d2d1.dll 	CHwSurfaceRenderTarget::DrawGlyphRunInternal 	
5 	d2d1.dll 	CCommand_DrawGlyphRun::Execute 	
6 	d2d1.dll 	CHwSurfaceRenderTarget::ProcessBatch 	
7 	d2d1.dll 	CBatchSerializer::FlushInternal 	
8 	d2d1.dll 	CBatchSerializer::FlushInternalAndTryIncreaseStorage 	
9 	d2d1.dll 	DrawingContext::SetTextAntialiasMode 	
10 	d2d1.dll 	D2DRenderTargetBase<ID2D1DCRenderTarget>::SetTextAntialiasMode 	
11 	xul.dll 	_cairo_dwrite_show_glyphs_on_d2d_surface 	gfx/cairo/cairo/src/cairo-d2d-surface.cpp:4099
12 	xul.dll 	_cairo_d2d_show_glyphs 	gfx/cairo/cairo/src/cairo-d2d-surface.cpp:4205
13 	xul.dll 	_cairo_surface_show_text_glyphs 	gfx/cairo/cairo/src/cairo-surface.c:2768
14 	xul.dll 	_cairo_gstate_show_text_glyphs 	gfx/cairo/cairo/src/cairo-gstate.c:1998
15 	xul.dll 	GlyphBuffer::Flush 	gfx/thebes/gfxFont.cpp:1135
16 	xul.dll 	gfxFont::Draw 	gfx/thebes/gfxFont.cpp:1328
17 	xul.dll 	gfxTextRun::Draw 	gfx/thebes/gfxFont.cpp:3679
18 	xul.dll 	nsTextFrame::DrawTextRun 	layout/generic/nsTextFrameThebes.cpp:5345
19 	xul.dll 	nsTextFrame::DrawText 	layout/generic/nsTextFrameThebes.cpp:5467
20 	xul.dll 	nsTextFrame::PaintText 	layout/generic/nsTextFrameThebes.cpp:5333
21 	xul.dll 	nsDisplayText::Paint 	layout/generic/nsTextFrameThebes.cpp:4177
22 	xul.dll 	mozilla::FrameLayerBuilder::DrawThebesLayer 	layout/base/FrameLayerBuilder.cpp:2120
23 	xul.dll 	mozilla::layers::ThebesLayerD3D10::DrawRegion 	gfx/layers/d3d10/ThebesLayerD3D10.cpp:406
24 	xul.dll 	mozilla::layers::ThebesLayerD3D10::Validate 	gfx/layers/d3d10/ThebesLayerD3D10.cpp:281
25 	xul.dll 	mozilla::layers::ContainerLayerD3D10::Validate 	gfx/layers/d3d10/ContainerLayerD3D10.cpp:380
26 	xul.dll 	mozilla::layers::ContainerLayerD3D10::Validate 	gfx/layers/d3d10/ContainerLayerD3D10.cpp:382
27 	xul.dll 	mozilla::layers::LayerManagerD3D10::Render 	gfx/layers/d3d10/LayerManagerD3D10.cpp:686
28 	xul.dll 	mozilla::layers::LayerManagerD3D10::EndTransaction 	gfx/layers/d3d10/LayerManagerD3D10.cpp:355
29 	xul.dll 	nsDisplayList::PaintForFrame 	layout/base/nsDisplayList.cpp:633
30 	xul.dll 	nsLayoutUtils::PaintFrame 	layout/base/nsLayoutUtils.cpp:1700

More reports at:
https://crash-stats.mozilla.com/report/list?signature=_VEC_memzero
An unknown bug added _VEC_memzero to the skiplist (see https://github.com/mozilla/socorro/blob/master/scripts/config/processorconfig.py.dist).

Comments for crashes not appearing on startup let me thinks it's a dupe of bug 635464:
"switched graphics card. (amd 6650->intel i5 built-in)"
"switched graphics card"

Startup crashes might be caused by an automatic GPU switch.
Crash Signature: SubRectInfo const*)] [@ _VEC_memzero] → SubRectInfo const*)] [@ _VEC_memzero] [@ _VEC_memzero | TextStageManager::MapTextureTransferSurface]
(In reply to Scoobidiver from comment #13)
> An unknown bug added _VEC_memzero to the skiplist

bug 715921 (as mentioned in the commit message for this line - https://github.com/mozilla/socorro/commit/ef1bfa72005612d560bd4feea7a1fe93bf8e8a88 ) as requested by :mats.
Depends on: 715921
Crash Signature: SubRectInfo const*)] [@ _VEC_memzero] [@ _VEC_memzero | TextStageManager::MapTextureTransferSurface] → SubRectInfo const*)] [@ _VEC_memzero] [@ _VEC_memzero | TextStageManager::MapTextureTransferSurface(D2D_RECT_U const&, unsigned char**, unsigned int*)]
93% of crashes happen within one minute.
It's #3 top crasher in 11.0b1.

I checked manually some crash reports and they all happens with Intel GPUs: 2e22 (GMA X4500HD), 2a42 (GMA 4500MHD), 0042 (desktop HD Graphics), 0046 (mobile HD Graphics).
When it's a single GPU, driver versions are up to 8.15.10.2141. The latest driver version is 8.15.10.2555 or 2622 depending on the device ID.
When it's a dual GPU with an ATI GPU, driver versions are up to 8.752.4.0.

If there was a downloaded blocklist, it should be implemented before the final release.
For new installations, blocklist.xml should be already up-to-date in the install folder which is not the case. Indeed, it's updated in the profile folder after the first startup.

More reports at:
https://crash-stats.mozilla.com/report/list?signature=_VEC_memzero%20|%20TextStageManager%3A%3AMapTextureTransferSurface%28D2D_RECT_U%20const%26%2C%20unsigned%20char**%2C%20unsigned%20int*%29
Blocks: 605779
Summary: Firefox 11.0a1 Crash in TextStageManager::AddStagesForSubrect → Firefox 11.0a1 Crash in TextStageManager::AddStagesForSubrect with Intel GMA X4500HD, 4500MHD, and HD Graphics up to 8.15.10.2141
Whiteboard: sg:critical? → sg:critical?[startupcrash]
Version: Trunk → 11 Branch
These Intel GPUs are already D2D-blocklisted in hard code for these driver versions.

bp-7c333271-3b23-4730-8ebd-0040e2120207:
User Comments	update to Firfox 11 beta 1 and it is crashing on my Windows 7 environment
App Notes 	
AdapterVendorID: 0x8086, AdapterDeviceID: 0x0046, AdapterSubsysID: 043f1028, AdapterDriverVersion: 8.15.10.2141
D2D? D2D+
DWrite? DWrite+
D3D10 Layers? D3D10 Layers+

According to that, I don't think users set gfx.direct2d.force-enabled to true.

I think it's a regression from bug 704710 that landed in 11.0a1/20111215 and that has canceled hard-coded blocklists.
Blocks: 704710
Correlations confirm that the D2D-blocklist for Intel GMA 4500 and HD is gone when Device and vendor IDs were converted to string:
    100% (633/633) vs.   8% (2348/30240) igd10umd32.dll
         35% (219/633) vs.   1% (242/30240) 8.15.10.2125
          1% (5/633) vs.   0% (5/30240) 8.15.10.2131
         64% (406/633) vs.   1% (410/30240) 8.15.10.2141
Crash Signature: SubRectInfo const*)] [@ _VEC_memzero] [@ _VEC_memzero | TextStageManager::MapTextureTransferSurface(D2D_RECT_U const&, unsigned char**, unsigned int*)] → SubRectInfo const*)] [@ _VEC_memzero] [@ _VEC_memzero | TextStageManager::MapTextureTransferSurface(D2D_RECT_U const&, unsigned char**, unsigned int*)] [@ msvcrt.dll@0x9cc6]
I was unable to repro this issue when I spoofed a few of the given driver versions and adapter/vendor combos. That is, they were blocked when I spoofed them. My best guess is that there's some kind of string case issue when comparing, for example "0x2A22" to "0x2a22", so I wrote this patch to fix that.
Attachment #599055 - Flags: review?(joe)
Try push: https://tbpl.mozilla.org/?tree=Try&rev=37b1a9ec0eb0

I'd appreciate it if someone having this issue or someone with access to a broken configuration could test this try build, once it's done.
Assignee: bas.schouten → bugzilla
Status: NEW → ASSIGNED
Comment on attachment 599055 [details] [diff] [review]
Patch v1.0, fix driver blocklist string comparison casing.

Review of attachment 599055 [details] [diff] [review]:
-----------------------------------------------------------------

Yes, this seems very likely to be the problem.
Attachment #599055 - Flags: review?(joe) → review+
By the way, many many thanks to Doug, who looked at this while the rest of the graphics team was up to its eyes in Fennec work.
Comment on attachment 599055 [details] [diff] [review]
Patch v1.0, fix driver blocklist string comparison casing.

[Triage Comment]
Attachment #599055 - Flags: approval-mozilla-beta+
Attachment #599055 - Flags: approval-mozilla-aurora+
Approving for landing as per Alex K.
http://hg.mozilla.org/mozilla-central/rev/7dcbce54a953
http://hg.mozilla.org/releases/mozilla-aurora/rev/da24c3793fa8

will land on beta as soon as my clone is done.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla13
Shouldn't this be kept open until we confirm that this patch actually eliminates the underlying crash?  I don't see any confirmation that someone reproduced the crash and then verified that the patch fixed the problem.  We seem to be guessing that this will solve the problem.
I agree with jtd, I was really shooting from the hip here. If this doesn't fix it, I'll probably need more data from someone who can repro the bug.
Target Milestone: mozilla13 → ---
(In reply to Jonathan Kew (:jfkthame) from comment #2)
> As a precaution, I'm marking this as security-sensitive, given that it looks
> like there's code writing to memory it shouldn't be touching.... if someone
> figures out the specific scenario where this happens, they might be able to
> control what gets corrupted, and hence arrive at some kind of exploit.
Is it still security sensitive because bug 590373 was not?
Whiteboard: sg:critical?[startupcrash] → [sg:critical][startupcrash]
(In reply to Scoobidiver from comment #29)
> The patch fixes nothing

I wouldn't be completely sure of that - there are no crashes with those signatures in builds from yesterday.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #30)
> I wouldn't be completely sure of that - there are no crashes with those
> signatures in builds from yesterday.
They are supposed to be fixed in 13.0a1 and 12.0a2/20120222. In addition, the Table and Graph tabs don't take into account crashes with the last build that is why you were mislead.
(In reply to Scoobidiver from comment #31)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #30)
> > I wouldn't be completely sure of that - there are no crashes with those
> > signatures in builds from yesterday.
> They are supposed to be fixed in 13.0a1 and 12.0a2/20120222. In addition,
> the Table and Graph tabs don't take into account crashes with the last build
> that is why you were mislead.

They take account anything up to the last full UTC day, at least once the reports have been generated for that day, which they have been for the 23rd, AFAIK, and we don't have crashes for those signatures in builds from the 23rd yet. Also, the Reports tab lists everything we have processed so far, and it also doesn't list any crashes for builds newer than the 22nd - on Aurora, that is. On trunk, I see 2 of the msvcrt ones, with "AdapterDeviceID: 0x0046", which wouldn't be affected by this patch anyhow.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #32)
> On trunk, I see 2 of the msvcrt ones, with "AdapterDeviceID: 0x0046", which
> wouldn't be affected by this patch anyhow.
0x0046 is the device ID for HD Graphics. Anyway, the D2D-blocklist introduced in bug 590373 is still not effective.
(In reply to Scoobidiver from comment #33)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #32)
> > On trunk, I see 2 of the msvcrt ones, with "AdapterDeviceID: 0x0046", which
> > wouldn't be affected by this patch anyhow.
> 0x0046 is the device ID for HD Graphics. Anyway, the D2D-blocklist
> introduced in bug 590373 is still not effective.

I don't understand what the patch there did block exactly, and how this is related to those msvcrt.dll@0x9cc6 signatures we seem to still be seeing. I don't think either of us two can't really be sure what is effective there and what isn't, we need the graphics team to assess that.
I can't repro this problem with any spoofed configuration, and there's so many things that could be causing this that it's really difficult to pinpoint it without a computer that has the problem.

I'm going to keep looking but any solutions I come up with are guesses at best for now. I think a good next step at this point is to open up this bug report to the public (instead of marking it as secure) so that maybe someone who can actually repro this can step forward and help. It seems like the security concerns were warranted until we found out it was due to a blocklisting problem rather than a (realistic) potential attack vector.
Let's have a look at crashes with build id >= 20120222 in the following list, and see what we can say about them:
https://crash-stats.mozilla.com/report/list?signature=_VEC_memzero%20|%20TextStageManager%3A%3AMapTextureTransferSurface%28D2D_RECT_U%20const%26%2C%20unsigned%20char**%2C%20unsigned%20int*%29

***

https://crash-stats.mozilla.com/report/index/d65ce81d-6ca8-478e-be74-85d7c2120225
(+ many others: this case is the vast majority)
AdapterVendorID: 0x8086, AdapterDeviceID: 0x0046, AdapterSubsysID: 10391462, AdapterDriverVersion: 8.15.10.2125
This driver is blacklisted. (Occurs with a few other versions too)

https://crash-stats.mozilla.com/report/index/5b00c309-dc24-4ad5-b964-b9cf62120226
AdapterVendorID: 0x8086, AdapterDeviceID: 0x0046, AdapterSubsysID: 1600103c, AdapterDriverVersion: 8.741.1.1000
anomaly: this is a AMD driver version on Intel vendor ID
way to check: 8.741 not in Intel range

https://crash-stats.mozilla.com/report/index/85483fe2-a6d3-490f-8370-6c3492120227
Cisco VPN
AdapterVendorID: 0x1002, AdapterDeviceID: 0x9480, AdapterSubsysID: 20561b0a, AdapterDriverVersion: 8.750.0.0
Has dual GPUs. GPU #2: AdapterVendorID2: 0x8086, AdapterDeviceID2: 0x0046, AdapterSubsysID2: 20561b0a, AdapterDriverVersion2: 8.750.0.0
Here we have two 'anomalies': the GPU#2 is again a Intel with AMD driver version, and "Cisco VPN" reminds of bug 435756

https://crash-stats.mozilla.com/report/index/a3fee3b3-30fe-4edf-80a0-d35e62120227
AdapterVendorID: 0x8086, AdapterDeviceID: 0x2e22, AdapterSubsysID: 02511025, AdapterDriverVersion: 8.15.10.2141
This driver is blacklisted.

https://crash-stats.mozilla.com/report/index/1ef8a7a4-e846-4637-9bd9-5657c2120226
Cisco VPN
AdapterVendorID: 0x8086, AdapterDeviceID: 0x0046, AdapterSubsysID: 1425103c, AdapterDriverVersion: 8.15.10.2202
This one is "fine"... it's exactly the minimum version required for this device and OS. Maybe we could up the version requirement by a notch? Or maybe we should really worry about this Cisco VPN thing.

***

Questions:
 1. is it plausible that all these users force-enabled or should we assume there is a bug in our blacklisting logic? Now I wish we were reporting on force-enabled prefs in crash reports :-/
 2. in dual GPU setups, is it normal to have a AMD driver version on a Intel device?

Other than that, note that *all* of these crashes that are not on blacklisted drivers, are on driver versions that are only slightly above the minimum required version, so they could be blacklisted by a tiny increase of the version requirement.
Group: core-security
Keywords: qawanted
Whiteboard: [sg:critical][startupcrash] → [startupcrash]
Cheng, it would really help if we could ask users who have given us their email address in crash stacks the contents of their about:support. Since this seems to be a startup crash, it might be necessary for those users to start up in safe mode, but their about:support will still give us the necessary information.
Benoit Jacob has agreed to 1) prepare a potential backout patch for Doug's string changes, just in case, and 2) create a patch to add the force-enabled state to our crash report annotations.
That is what Jeff originally asked for in bug 627464. This patch replaces the '?' by '!' for force-enabled features.
Attachment #600984 - Flags: review?(jmuizelaar)
(In reply to Joe Drew (:JOEDREW!) from comment #39)
> Benoit Jacob has agreed to 1) prepare a potential backout patch for Doug's
> string changes, just in case, and 2) create a patch to add the force-enabled
> state to our crash report annotations.

2) done; for 1), it turns out that Doug's patch (bug 704710) does not unapply cleanly at all. Following IRC discussion, we're going the alternate route of basically reimplementing the relevant subset of the old blacklisting code, alongside the new. By 'relevant part' I mean the part that deals with Intel 4500/HD devices on Windows. Doug said on IRC he would make the patch today.
Comment on attachment 600984 [details] [diff] [review]
report force-enabled features

Review of attachment 600984 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/src/gfxCrashReporterUtils.cpp
@@ -110,1 @@
>  

Why this change?
Attachment #600984 - Flags: review?(jmuizelaar) → review+
Because I didn't like that a lot. Nested feature initialization resulted in seemingly random line breaks. Adapting it to the present change would have made it even more awkward.
http://hg.mozilla.org/integration/mozilla-inbound/rev/5c1d6779a7d7
Whiteboard: [startupcrash] → [startupcrash] Keep Open!!
Comment on attachment 600984 [details] [diff] [review]
report force-enabled features

[Approval Request Comment]
This patch is very low-risk and can help understanding crashes.
Attachment #600984 - Flags: approval-mozilla-aurora?
This is a hack to hopefully fix this issue for the reported bad set of cards. It will not solve the problem in the correct way but will prevent us from having to back out the whole thing. It will also not deal with the issue if more cards than this specific set are not being blocked when they should. This is bad to land but probably necessary.
Attachment #601094 - Flags: review?(joe)
Comment on attachment 601094 [details] [diff] [review]
Patch v1.0, special case the entire Intel GMAX4500HD series.

Review of attachment 601094 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/windows/GfxInfo.cpp
@@ +916,5 @@
>  
> +  /** Bug 711656: HACK HACK HACK
> +   * Special case this, even though it's already in the blocklist. 
> +   * Note that we're getting the driver version twice because the first one
> +   * happens within the 'if (!aDriverInfo->Length())' check, but we don't want to

This needs to be inside the if (!aDriverInfo->Length()) check, though, because we don't want to evaluate the system blocklist on a blocklist download ping.

@@ +957,5 @@
> +         driverVersion < V(8,15,10,2202) ||
> +         mWindowsVersion == gfxWindowsPlatform::kWindows7 &&
> +         driverVersion < V(8,15,10,2202)) &&
> +        gVendorID == 0x8086 /* Intel */) {
> +      for (PRUint32 i = 0; i < sizeof(IntelGMAX4500HD)/sizeof(IntelGMAX4500HD[0]); i++) {

ArrayLength(IntelGMAX4500HD)
Attachment #601094 - Flags: review?(joe) → review+
Comment on attachment 600984 [details] [diff] [review]
report force-enabled features

[Triage Comment]
Low risk - approving for Aurora 12 in support of bug investigation.
Attachment #600984 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 601094 [details] [diff] [review]
Patch v1.0, special case the entire Intel GMAX4500HD series.

Review of attachment 601094 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/windows/GfxInfo.cpp
@@ +948,5 @@
> +      0x2a02, /* IntelGL960_1 */
> +      0x2a03, /* IntelGL960_2 */
> +      0x2a12, /* IntelGM965_1 */
> +      0x2a13 /* IntelGM965_2 */
> +    };

This is the wrong list of device id's! This is IntelGMAX3000, not IntelGMAX4500HD.
Attachment #601094 - Flags: review-
Good catch!
Attachment #601094 - Attachment is obsolete: true
Attachment #601155 - Flags: review?(bjacob)
Comment on attachment 601155 [details] [diff] [review]
Patch v1.1, special case the entire Intel GMAX4500HD series.

Review of attachment 601155 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/windows/GfxInfo.cpp
@@ +957,5 @@
> +
> +    if ((mWindowsVersion == gfxWindowsPlatform::kWindowsXP &&
> +         driverVersion < V(6,14,10,5284) ||
> +         mWindowsVersion == gfxWindowsPlatform::kWindowsVista &&
> +         driverVersion < V(8,15,10,2202) ||

This is correct C++ but will have compilers like GCC scream warnings: "suggest parentheses around && inside of ||"
Attachment #601155 - Flags: review?(bjacob) → review+
Landed and fixed the above comment.
https://hg.mozilla.org/mozilla-central/rev/7ce4d9b55863
Crash Signature: SubRectInfo const*)] [@ _VEC_memzero] [@ _VEC_memzero | TextStageManager::MapTextureTransferSurface(D2D_RECT_U const&, unsigned char**, unsigned int*)] [@ msvcrt.dll@0x9cc6] → SubRectInfo const*)] [@ _VEC_memzero | TextStageManager::MapTextureTransferSurface(D2D_RECT_U const&, unsigned char**, unsigned int*)] [@ msvcrt.dll@0x9cc6]
I received the about:support from one user who could reproduce at some point, but at the moment she couldn't reproduce anymore (in firefox 11) and her about:support showed 8.15.10.2125 correctly blacklisted. Maybe we have some uninitialized value somewhere?
(In reply to Benoit Jacob [:bjacob] from comment #60)
> Good news! On Firefox 13, there is no crash with build id > 20120301 !
First, there are crashes after the last patch landed (see comment 57). Then, if you look at the history, you can have up to 5 builds without crashes (startup crashes depend on new users).
Could it be a scheduling issue because the blocklisting check comes too late?
No, the blocklist check comes first and blocks (occurs in the main thread) the actual initialization of the graphics feature.
Can we add a downloaded blocklist?
It'll work if Firefox 10 users use Firefox at least during one day before upgrading to Firefox 11.
For new installations, it won't work.
Crash Signature: SubRectInfo const*)] [@ _VEC_memzero | TextStageManager::MapTextureTransferSurface(D2D_RECT_U const&, unsigned char**, unsigned int*)] [@ msvcrt.dll@0x9cc6] → SubRectInfo const*)] [@ _VEC_memzero | TextStageManager::MapTextureTransferSurface(D2D_RECT_U const&, unsigned char**, unsigned int*)] [@ msvcrt.dll@0x9cc6] [@ msvcrt.dll@0x15a45]
Hardware: x86 → All
The problem here is that our whole blacklisting mechanism has a bug. Adding a downloaded blacklist entry will almost certainly not solve it.
Bug 715401 is a very similar bug, with our blacklisting not working, and the people there have the Yahoo Toolbar.

Similarly, it is worth checking if the the present bug also has such a correlation with a particular add-on or module.
Many crash reports only list {972ce4c6-7e08-4474-a285-3208198ce6fd} under Extensions, and that means that they had no extension installed. So this crash is not related to any extension. I haven't checked Modules.
(In reply to Benoit Jacob [:bjacob] from comment #66)
> Similarly, it is worth checking if the the present bug also has such a
> correlation with a particular add-on or module.
There are no loaded extensions in startup crash reports.
Nevertheless, there are correlations per module (sometimes related to extensions):
* 11.0:
  msvcrt.dll@0x9cc6|EXCEPTION_ACCESS_VIOLATION_WRITE (193 crashes)
     32% (61/193) vs.  10% (3619/34679) GrooveShellExtensions.dll (MS Office Groove)
     23% (45/193) vs.   8% (2898/34679) aswProperty.dll (Avast! antivirus)
     12% (23/193) vs.   1% (224/34679) DpOFeedb.dll (DigitalPersona Pro)
     13% (26/193) vs.   3% (1055/34679) BtMmHook.dll (Bluetooth Software)
      8% (16/193) vs.   0% (31/34679) BGLsp.dll (BullGuard)
      8% (16/193) vs.   1% (264/34679) BabyFox.dll (Babylon Toolbar)

  _VEC_memzero | TextStageManager::MapTextureTransferSurface(D2D_RECT_U const&, unsigned char**, unsigned int*)|EXCEPTION_ACCESS_VIOLATION_WRITE (68 crashes)
     44% (30/68) vs.  10% (3619/34679) GrooveShellExtensions.dll (MS Office Groove)
     28% (19/68) vs.   8% (2898/34679) aswProperty.dll (Avast! antivirus)
     16% (11/68) vs.   2% (761/34679) DropboxExt.14.dll (DropBox)
Here,

   https://tbpl.mozilla.org/?tree=Try&rev=b9c53783cfa9

is a try push of mozilla-beta with the following changesets cleanly backed out:

undo-39ce0a82bb75
undo-fe937bac6e75
undo-dd2e6ce53715
undo-838e8168ea50
undo-49b8bec6d175
undo-8f7893e1c20c
undo-b9c1b8afb35a
undo-8c075fee9be4     # this is the patch from bug 704710 that we want to back out

the other changesets here are later patches affecting the same files, that I had to back out as well in order to have a clean backout of bug 704710. I could have kept them and done a non-clean backout by hand, but our goal right now is to minimize risk.

Note that it would be very difficult to do the same on mozilla-central, because Bug 679226 / cset e57e271bf328 completely restructured the widget/ directory: moved files around, and even changed indentation, meaning that we can't back stuff out of mozilla-central without a lot of manual work, meaning a high risk of mistake.
Here's my backout patch queue folded into one patch. Applies against current mozilla-beta. That's what's on try (previous comment).
Previous Try failed because I forgot to re-add some old xpcshell tests. New Try:
https://tbpl.mozilla.org/?tree=Try&rev=fe369816c08e
Backed out the patch here from mozilla-beta:
http://hg.mozilla.org/releases/mozilla-beta/rev/cf3911792b59

As well as 7 other patches, including Bug 704710.
http://hg.mozilla.org/releases/mozilla-beta/rev/ff5f2055aba4

(the other 6 csets are in between these two ones)
Benoit, thanks for this work! I hope it helps and we're not backing out (too many) newer blocks that are needed as well, I wouldn't like if this set of high-volume crashers would just be replaced by another set of high-volume crashers.
I guess you guys will work on a better solution that hopefully will even be available in time for 12?
Yes, of course we have to work on the real solution. The backout just buys us 6 weeks.
This bug was first reported the day after the patch in bug 699482 was landed, and that wasn't backed out. Maybe it was that which caused it.
The patch in bug 699482 inserted FEATURE_STATUS_UNKNOWN = 2; in the IDL. Previously we had: FEATURE_BLOCKED_DRIVER_VERSION = 2; We are currently (since bug 625160 landed, and until bug 653102 is fixed) caching these status under the gfx.blacklist.* preferences, which are plain integers like |2|. What could happen is that people had the value |2| in the preference, and suddenly the meaning of that cached value changed.
So, the good news is that if the theory in comment 78 is true, this can be fixed by just clearing the gfx.blacklist.* prefs.
This implements the minimal fix assuming the theory in comment 78 is true. Patch for mozilla-central.
Same, but for mozilla-beta (just different directory structure)
Argh!! The way the downloaded blacklisting is implemented, relies fundamentally on these prefs, not just as cache but as the only intermediate storage between the evaluation of the downloaded blacklist and the actual blacklisting decision.
Attachment #604659 - Attachment description: minimal mozilla-central patch to NOT use cached blacklisting decisions → minimal mozilla-central patch to NOT use cached blacklisting decisions (ARGH! breaks downloaded blacklist!)
Attachment #604660 - Attachment description: minimal mozilla-beta patch to NOT use cached blacklisting decisions → minimal mozilla-beta patch to NOT use cached blacklisting decisions (ARGH! breaks downloaded blacklist!)
Also note that if the theory in comment 78 is correct, backing out bug 699482 won't help, as now the people with prefs from after this patch will get wrong blacklisting decisions.
(In reply to Benoit Jacob [:bjacob] (On vacation until March 18) from comment #83)
> Also note that if the theory in comment 78 is correct, backing out bug
> 699482 won't help, as now the people with prefs from after this patch will
> get wrong blacklisting decisions.
It's true for people with version 11 and above. This patch didn't land in the Release channel. So the backout of bug 699482 in Beta could work for people who haven't switched between channels with the same profile, if your theory is right.
Got it! Here's what happened. As comment 78 says, bug 699482 changed the value of FEATURE_BLOCKED_DRIVER_VERSION so that users who had a cached FEATURE_BLOCKED_DRIVER_VERSION now get FEATURE_STATUS_UNKNOWN instead.

Normally that shouldn't cause real trouble because gfx features should block on anything else than FEATURE_NO_INFO (which means: no bad news. No news is good news).

But D2D initialization has a bug here, that is triggered by this change:

http://hg.mozilla.org/mozilla-central/file/b98d1ff2d051/gfx/thebes/gfxWindowsPlatform.cpp#l412

412 #ifdef CAIRO_HAS_D2D_SURFACE
413     bool d2dDisabled = false;
414     bool d2dForceEnabled = false;
415     bool d2dBlocked = false;
416 
417     nsCOMPtr<nsIGfxInfo> gfxInfo = do_GetService("@mozilla.org/gfx/info;1");
418     if (gfxInfo) {
419         PRInt32 status;
420         if (NS_SUCCEEDED(gfxInfo->GetFeatureStatus(nsIGfxInfo::FEATURE_DIRECT2D, &status))) {
421             if (status != nsIGfxInfo::FEATURE_NO_INFO) {
422                 d2dDisabled = true;
423                 if (status == nsIGfxInfo::FEATURE_BLOCKED_DRIVER_VERSION ||
424                     status == nsIGfxInfo::FEATURE_BLOCKED_DEVICE)
425                 {
426                     d2dBlocked = true;
427                 }
428             }
429         }
430     }
431 
432     d2dDisabled = Preferences::GetBool("gfx.direct2d.disabled", false);
433     d2dForceEnabled = Preferences::GetBool("gfx.direct2d.force-enabled", false);


When status == nsIGfxInfo::FEATURE_STATUS_UNKNOWN, at line 422 we set d2dDisabled=true, but we don't pass the if() condition at line 423-424 so we don't set d2dBlocked=true. Below at line 432 we overwrite d2dDisabled anyway, so there is a plain bug here: line 422 never has any effect.
Not asking for review, this is a trivial fix and I want to land it now to ask for beta approval.
Marking as depending on Bug 653102. I hope that this time we'll really prioritize fixing bug 653102.
Depends on: 653102
Comment on attachment 604666 [details] [diff] [review]
fix d2d blocklisting check

[Approval Request Comment]
Regression caused by (bug #): 699482
User impact if declined: crashiness. specifically, Direct2D/DirectWrite blacklisting is broken in Firefox 11+ for users upgrading from Firefox 10-
Testing completed (on m-c, etc.): Just landed on m-c
Risk to taking this patch (and alternatives if risky): No risk. Trivial bug fix.
String changes made by this patch: none
Attachment #604666 - Flags: approval-mozilla-beta?
Attachment #604666 - Flags: approval-mozilla-aurora?
Also, if my theory is correct, here are the STR for this bug:
1. use a Win7 machine where Direct2D should be blacklisted (use spoofing to simulate an old driver version if needed)
2. go to about:config, set gfx.blacklist.direct2d=2, restart browser, check that the pref is still there
3. go to about:support, you should see Direct2D not blacklisted anymore
(In reply to Benoit Jacob [:bjacob] (On vacation until March 18) from comment #86)
> Not asking for review, this is a trivial fix and I want to land it now to
> ask for beta approval.

You should ask for review even if you don't want to block on it - any patch that we land on beta could use a second set of eyes.
Comment on attachment 604666 [details] [diff] [review]
fix d2d blocklisting check

Alright, asking Bas for review per Gavin's suggestion.
Attachment #604666 - Flags: review?
Attachment #604666 - Flags: review? → review+
Attachment #604659 - Attachment is obsolete: true
Attachment #604660 - Attachment is obsolete: true
Verified bjacob's STR, though it was using a spoofed configuration. Doing it this way doesn't mean the STR may not work on real hardware, but it doesn't verify that this is actually the problem.
STR:

0) Start with an ATI Radeon HD 4200 (vendor ID 1002, device ID 9710, driver version 8.733.0.0)
1) Download/install FF10beta6: https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/10.0b6/win32/en-US/Firefox%20Setup%2010.0b6.exe
2) Ensure this is a clean new profile (firefox.exe -P, create new profile)
3) Check about:support for the value of "Direct2D Enabled" ("Blocked for your graphics...")
4) Check gfx.blacklist.direct2d in about:config (it's unset at this point)
5) Force a blocklist ping in the Error Console:

Components.classes["@mozilla.org/extensions/blocklist;1"].getService(Components.interfaces.nsITimerCallback).notify(null);

6) Check about:support for the value of "Direct2D Enabled" ("Blocked for your graphics...")
7) Check gfx.blacklist.direct2d in about:config (it's now set to 2)
8) Go to "About Mozilla Firefox" and apply the update to FF11beta7
9) Check about:support for the value of "Direct2D Enabled" ("true")
10) Check gfx.blacklist.direct2d in about:config (still set to 2)
STR 2 (no update, just using the same profile across 10/nightly):

0) Start with an ATI Radeon HD 4200 (vendor ID 1002, device ID 9710, driver version 8.733.0.0)
1) Download/install FF10.0.2
2) Ensure this is a clean new profile (firefox.exe -P, create new profile)
3) Check about:support for the value of "Direct2D Enabled" ("Blocked for your graphics...")
4) Check gfx.blacklist.direct2d in about:config (it's unset at this point)
5) Force a blocklist ping in the Error Console:

Components.classes["@mozilla.org/extensions/blocklist;1"].getService(Components.interfaces.nsITimerCallback).notify(null);

6) Check about:support for the value of "Direct2D Enabled" ("Blocked for your graphics...")
7) Check gfx.blacklist.direct2d in about:config (it's now set to 2)
8) Install the latest available Nightly (3fdc1c14a8ce) 
9) Check about:support for the value of "Direct2D Enabled" ("true")
10) Check gfx.blacklist.direct2d in about:config (still set to 2)



Steps to Verify (no update, just using the same profile across 10/nightly TB build):

0) Start with an ATI Radeon HD 4200 (vendor ID 1002, device ID 9710, driver version 8.733.0.0)
1) Download/install FF10.0.2
2) Ensure this is a clean new profile (firefox.exe -P, create new profile)
3) Check about:support for the value of "Direct2D Enabled" ("Blocked for your graphics...")
4) Check gfx.blacklist.direct2d in about:config (it's unset at this point)
5) Force a blocklist ping in the Error Console:

Components.classes["@mozilla.org/extensions/blocklist;1"].getService(Components.interfaces.nsITimerCallback).notify(null);

6) Check about:support for the value of "Direct2D Enabled" ("Blocked for your graphics...")
7) Check gfx.blacklist.direct2d in about:config (it's now set to 2)
8) Install the Nightly TB build with Benoit's patch (b9357da14ab8) 
9) Check about:support for the value of "Direct2D Enabled" ("false")
10) Check gfx.blacklist.direct2d in about:config (still set to 2)



Is #9 in the steps to verify expected to report back "false" instead of "Blocked.."?
Thanks for verifying this on real hardware!
Comment on attachment 604666 [details] [diff] [review]
fix d2d blocklisting check

[Triage Comment]
Please land on Aurora 12, Beta 11, and Release 11.
Attachment #604666 - Flags: approval-mozilla-release+
Attachment #604666 - Flags: approval-mozilla-beta?
Attachment #604666 - Flags: approval-mozilla-beta+
Attachment #604666 - Flags: approval-mozilla-aurora?
Attachment #604666 - Flags: approval-mozilla-aurora+
Blocks: 699482
No longer blocks: 704710
Verified that "Direct2D Enabled" in about:support correctly reports "false" using the STR in https://bugzilla.mozilla.org/show_bug.cgi?id=711656#c95 for Beta 8.
Noticing a difference with AzureBackend: direct 2d across FF 10 & 11 on Windows 7 64-bit:

- FF 10.0.2 makes no mention of AzureBackend: direct2d
- FF 11 Beta 8 indicates a mention of AzureBackend: direct2d

Why would these two be different?
Comparing about:support on Win 7 32-bit on FF 10.0.2 vs. FF 11 Beta 8, the values appear to identical. Could this be a Win 7 64-bit specific issue?
Adding results from Windows 7 64-bit VM at Alex's request...

Adapter Description - VMWare SVGA 3D (Microsoft Corporation - WDDM)
Driver Version - 7.14.1.1134

Firefox 10.0.2:
Direct2d Enabled - Blocked by your graphics card because of unresolved driver issues
gfx.blacklist.direct2d - does not exist
*ping blocklist*
Direct2d Enabled - Blocked by your graphics card because of unresolved driver issues
gfx.blacklist.direct2d - 3
*pave over with Firefox 11.0b8*
Direct2d Enabled - Blocked for your graphics driver version
gfx.blacklist.direct2d - 3
Also for comment 101, GPU Accelerated Windows shows 0 in beta 8, 0/1 for 10.0.2
The difference in GPU Accelerated Windows is bug 598339, and the additional adapter description for GPU #2 is bug 679110. The addition of AzureBackend=direct2d is of interest.
(In reply to Jason Smith from comment #101)
> - FF 11 Beta 8 indicates a mention of AzureBackend: direct2d
It was introduced in 11.0 by bug 702770 and the misalignment fixed in 12.0 by bug 707223.
I confirm that the AzureBackend line is nothing to worry about. It only says which Azure backend would be used regardless of blacklisting, it doesn't override blacklisting.
There is no blocklist on Windows 8 (not a regression). See https://crash-stats.mozilla.com/report/list?signature=msvcrt.dll%400x9890

The Intel hack should be removed from m-c.
Crash Signature: SubRectInfo const*)] [@ _VEC_memzero | TextStageManager::MapTextureTransferSurface(D2D_RECT_U const&, unsigned char**, unsigned int*)] [@ msvcrt.dll@0x9cc6] [@ msvcrt.dll@0x15a45] → SubRectInfo const*)] [@ _VEC_memzero | TextStageManager::MapTextureTransferSurface(D2D_RECT_U const&, unsigned char**, unsigned int*)] [@ msvcrt.dll@0x9cc6] [@ msvcrt.dll@0x15a45] [@ msvcrt.dll@0x9890]
Confirmed that the only crash here is still the one Windows 8 user running into [@ msvcrt.dll@0x9890 ]
There are two crashes in 11.0b8 and one crash in 12.0a2/20120311, but it's bug 635464.
The Windows 8 graphics blocklist is treated in bug 706908.

I let it opened in order to remove the Intel hack.
Blocks: 715401
Verified fixed in Firefox 11 (build#2) using the STR in comment#95.
There is one crash in 11.0 build 2 with Intel's driver version 8.15.10.2281 that is not in the blocklist: bp-02a4a511-afb9-445d-b62a-3cf1d2120314
Those crashes should be monitored carefully to see if the cutoff version should be shifted.
Backed out the Intel special case code:
https://hg.mozilla.org/mozilla-central/rev/0dcc716da095
I close it as fixed. For remaining issues (Windows 8, driver versions not in the blocklist), file a new bug.
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla13
I don't think qawanted is needed anymore -- please re-add if this is incorrect.

Alex, can you use your hardware to verify this is fixed with the latest Firefox 12 and 13 builds?
Keywords: qawanted
qa- RE comment 118. Alex, if you want QA to verify, please give someone the hardware (probably Marcia since Juan is on PTO).
Whiteboard: [startupcrash] Keep Open!! → [startupcrash][qa-] Keep Open!!
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #118)
> I don't think qawanted is needed anymore -- please re-add if this is
> incorrect.
> 
> Alex, can you use your hardware to verify this is fixed with the latest
> Firefox 12 and 13 builds?

I'll bring it in tomorrow - keep forgetting :)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: