Open Bug 737176 Opened 10 years ago Updated 10 years ago

Post-mortem on bug 711656 / bug 715401 i.e. how Direct2D driver blacklisting stopped working

Categories

(Core :: Graphics, defect)

defect
Not set
normal

Tracking

()

People

(Reporter: bjacob, Unassigned)

References

(Depends on 2 open bugs)

Details

*** What happened: the symptoms ***

A Windows top-crasher (bug 711656, bug 715401) nearly prevented us from shipping Firefox 11 on time. The crashes were related to Direct2D / DirectWrite. Most of the crash reports pointed to old, blacklisted driver versions. The questions was: why was blacklisting apparently not working for these people?

*** Explanation of the crash ***

GfxInfo is our blacklisting tool. For a given feature, it produces a feature status code whose value is an enum constant.

Since support for a downloadable blacklist was implemented in bug 625160, these feature status codes are cached as the gfx.blacklist.* preferences.

Gfx features normally only allow initialization when the status code is NO_INFO (numeric value 0). But Direct2D was doing something complicated/buggy here (see bug 711656 comment 85 for details) that made it very dependent on the particular numeric values of other constants.

In bug 699482, a new enumeration constant was inserted into this list, shifting the values of some other constants by +1. The NO_INFO constant remained at value 0, so most gfx features were unaffected, but Direct2D was affected.

The problems only occurred for people upgrading to Firefox 11 from Firefox 10, with the same profile. The scenario was:
 - User uses Firefox 10 with a blacklisted driver
 - Firefox 10 caches the blacklisting decision as a numeric preference. '2' means 'blocked driver version'
 - User upgrades to Firefox 11
 - Now the cached value '2' means 'unknown status'. Most Gfx features still correctly interprete this as 'should not use' but Direct2D only blocks when it finds a known enum value, so here Direct2D isn't blocked anymore ---> crashes.

*** What we could do now about this ***

1. Permanent state (here, the gfx.blacklist.* prefs) is very costly in terms of bug-prone-ness and complexity. See bug 653102 about no longer having permanent state here.

2. Non-default permanent state (i.e. non-default prefs) should always be added to crash reports. To avoid making this a constant startup overhead, we could simply add non-default prefs to crash reports when they are read. We'd then miss the prefs that haven't been read, but I'd contend that's a feature, because a pref that's never been read probably can't be the cause of a crash anyway.

3. APIs exposing more low-level details than the caller needs, are bug-prone. Here, the caller of the GfxInfo blacklisting check call gets back a detailed status code that they really never need. All what the caller wants is a boolean answer to the question: "is this feature OK to use?". The only exception is aboutSupport.js. We should add a new boolean getter call, and make all the C++ Gfx code use that, so that future people copying and pasting existing C++ code, would copy that.

*** Why this happened now ***

We've had plenty of hard/weird Gfx topcrashers, but we've never come so close to missing a release because of one. It used to be that all the Gfx team members spent a lot of time each week to look at their bugmail and crash-stats and focus on fixing topcrashers as a higher priority than feature development. But lately, we've all been so busy with new feature development (and initial bugfixing) that we've somewhat lost focus from the plain bugfixing i.e. our legacy workload. At least I know that I haven't at all paid as much attention to random Gfx bugs lately, as I used to last year. Maybe, another sign of that is that we haven't (AFAIK) done BugKill anymore lately.
Depends on: 711656, 715401
Let's keep this bug open until we've agreed on the "What to do about this now" (see above) and until that's implemented (block this bug).
Depends on: 653102
OS: Linux → All
Hardware: x86_64 → All
Depends on: 745868
You need to log in before you can comment on or make changes to this bug.