Closed Bug 1272808 Opened 5 years ago Closed 5 years ago

Expose webgl failure id histogram to telemetry

Categories

(Core :: Graphics: Layers, defect)

Unspecified
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox49 --- fixed

People

(Reporter: BenWa, Assigned: BenWa)

References

Details

(Whiteboard: gfx-noted)

Attachments

(1 file)

No description provided.
Whiteboard: gfx-noted
https://reviewboard.mozilla.org/r/52585/#review49718

::: gfx/thebes/gfxPrefs.h:75
(Diff revision 1)
>  class PreferenceAccessImpl;
>  class gfxPrefs;
>  class gfxPrefs final
>  {
>    friend class gfxWindowsPlatform;
> +  friend class gfxPlatform;

The previous one slipped through my review - I'd rather we remove both friends and make the name and default value methods public.  I'll make a patch.
OS: Unspecified → All
Comment on attachment 8752391 [details]
MozReview Request: Bug 1272808 - Expose context creation error to telemetry. r=jgilbert

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/52585/diff/1-2/
Attachment #8752391 - Attachment description: MozReview Request: Bug 1272808 - Expose WebGL context creation errors to telemetry. r=jgilbert → MozReview Request: Bgu 1272808 - Expose context creation error to telemetry. r=jgilbert
Assignee: nobody → bgirard
Comment on attachment 8752391 [details]
MozReview Request: Bug 1272808 - Expose context creation error to telemetry. r=jgilbert

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/52585/diff/2-3/
Attachment #8752391 - Attachment description: MozReview Request: Bgu 1272808 - Expose context creation error to telemetry. r=jgilbert → MozReview Request: Bug 1272808 - Expose context creation error to telemetry. r=jgilbert
Requesting approval per: https://wiki.mozilla.org/Firefox/Data_Collection#Requesting_Approval

I'm introducing a keyed histogram that will include how frequently various failure id 'FAILURE_*'[1] AND webglcontexterror are hit throughout the code base. We've got over 100 unique failure ID (including blocklist rule failure ids) and an unbounded number of webglcontexterror code (they include runtime arguments such as width/height and thus there's no limit). We could give each webglcontexterror a unique failure Id and bound the total number of histograms to a few hundred but these error code would be less descriptive.

[1] http://mxr.mozilla.org/mozilla-central/search?string=FEATURE_FAILURE&find=widget&findi=&filter=%5E%5B%5E%5C0%5D*%24&hitlimit=&tree=mozilla-central
Flags: needinfo?(benjamin)
Comment on attachment 8752391 [details]
MozReview Request: Bug 1272808 - Expose context creation error to telemetry. r=jgilbert

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/52585/diff/3-4/
Data questions:

How do you plan to monitor this data? I'm fairly certain that the default views available on t.m.o will not be helpful for monitoring this over time. Also, telemetry alerting won't be useful for this metric because AFAIK it doesn't support keyed histograms as all, and it also doesn't really alert properly on counters. And since you've marked this as expires-never, you need to have some plan for permanent monitoring. Unless you're certain that this will be necessary long-term, I'd encourage you to start out collecting it temporarily, explore the dataset using one-off spark notebooks, and later decide whether it's worth the engineering effort to build a permanent monitoring/alerting system.

In order to gauge privacy risk, I need a lot more detail about the key format/space. Comment 5 was a helpful start, and I think that the FAILURE_* bits are fine from a privacy perspective. The FEATURE_FAILURE_DL_BLACKLIST_<ID> bits are the IDs from the blocklist?

I need to know more about what data can be included with webglcontexterror. It is probably not acceptable to include runtime arguments like canvas size, at least in the main payload. If we need to record additional data like that, it may be necessary to send a special ping that's not associated with a particular user.

Actual review comments:

The description of this histogram needs to be improved. Please describe *when* this count is incremented. If multiple errors occur for the same webgl canvas, will it record this many times or just once (can that even happen?). Are some errors recorded only once per session, for example during first initialization? Also, describe the key format in detail, including the details from comment 5 and the extra details about webglcontexterror.
Flags: needinfo?(benjamin)
Flags: needinfo?(bgirard)
Let me know if the following answers are satisfactory and I'll make the request changes.

(In reply to Benjamin Smedberg  [:bsmedberg] from comment #7)
> Data questions:
> 
> How do you plan to monitor this data? I'm fairly certain that the default
> views available on t.m.o will not be helpful for monitoring this over time.

We are planning on manually analyzing this data. This will be done using any, or all, of the graphics telemetry dashboard[1], custom spark queries, or using the sql dashboard queries.

> Unless you're certain that this will be necessary long-term

I think we're completely certain that it will be long term. We need to track the reason why graphics features are not enable for all our population and this data will let us target where we want to focus on to make the most important. For instances some of our blacklist is too wide just to be safe. With this data we can focus on the blocklist rules that are blocking the most users unnecessarily.

This will be an iterative process that will continue over multiple releases and will allow us to answer precisely why our gfx feature usage is the way it is. This is a core topic at every gaphics meet-up we have and up to this point we've only ever hard rough data. This patch will give us precise answers and will let us watch the precise evolution as we use this data to relax rules and land new rules to deal with bugs.

> The FEATURE_FAILURE_DL_BLACKLIST_<ID> bits are the IDs from the blocklist?

Yes, they are from the blacklist IDs. The other ones are hardcoded in the code and will implicitly identify the hardware and driver version. Which is already precisely recorded in the telemetry ping header in much more details.

> It is probably not acceptable to include runtime arguments like canvas size, at least in the main payload.

In that case we can probably introduce failureId in the code alongside with webglcontexterror and just report these. We could turn the failures into FEATURE_FAILURE_WEBGL_OVERSIZED_CANVAS and similar.

> it may be necessary to send a special ping that's not associated with a particular user.

I think I'd rather just not use webglcontexterror and sanitize the failures to unique failure ids without runtime info. Does anyone disagree?

[1] https://github.com/dvander/moz-gfx-telemetry
Flags: needinfo?(bgirard) → needinfo?(benjamin)
With the sanitization of webglcontexterror this SGTM.

This pretty clearly should end up as an opt-out metric. Whether you do that immediately, or test the data and dashboarding first, please plan on making this opt-out eventually.

Ping me if you want advice or help with data engineering/analysis work. We have people who are eager to mentor more people to be familiar with the data processing pipeline.
Flags: needinfo?(benjamin)
We should definitely make it opt-out before this goes to beta.  I'm assuming we'll be able to get telemetry on how many people have opted out.
What does opt-out mean here specifically? I see telemetry has an option under Data Choices for 'Share additional data (i.e., Telemetry)' which I *think* must be checked for keyed histogram telemetry. What kind of opt-out option do we want here?
opt-out is collected by default from all Firefox users on all channels. It corresponds to the "Firefox Health Report" checkbox in prefs
opt-in is off by default in release builds, on by default in prerelease channels, and corresponds to what we have traditionally called "telemetry"

In this case I expect that the distribution of machine types and failures will vary significantly between beta and release, so for both exploratory and long-term operational monitoring you shouldn't rely on the beta population to understand what's going on.
Comment on attachment 8752391 [details]
MozReview Request: Bug 1272808 - Expose context creation error to telemetry. r=jgilbert

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/52585/diff/4-5/
Comment on attachment 8752391 [details]
MozReview Request: Bug 1272808 - Expose context creation error to telemetry. r=jgilbert

https://reviewboard.mozilla.org/r/52585/#review50482

Please match the existing formatting norms.

::: dom/canvas/WebGL2Context.cpp:99
(Diff revision 5)
>      gl::GLFeature::uniform_matrix_nonsquare,
>      gl::GLFeature::vertex_array_object
>  };
>  
>  bool
> -WebGLContext::InitWebGL2(nsACString* const out_failReason)
> +WebGLContext::InitWebGL2(nsACString* const out_failReason, nsACString& out_failureId)

`failureId` would probably be better named `failureType`.
`failureId` sounds like it's an enum/int.

::: dom/canvas/WebGLContext.h:1204
(Diff revision 5)
>      // WebGL 2 specifics (implemented in WebGL2Context.cpp)
>  public:
>      virtual bool IsWebGL2() const = 0;
>  
>  protected:
> -    bool InitWebGL2(nsACString* const out_failReason);
> +    bool InitWebGL2(nsACString* const out_failReason, nsACString& out_failureId);

Outparams should be pointers, here and elsewhere.

::: dom/canvas/WebGLContext.cpp:434
(Diff revision 5)
>   * caps. Finally, resize the backbuffer to an acceptable size given the
>   * requested size.
>   */
>  
>  static bool
> -IsFeatureInBlacklist(const nsCOMPtr<nsIGfxInfo>& gfxInfo, int32_t feature)
> +IsFeatureInBlacklist(const nsCOMPtr<nsIGfxInfo>& gfxInfo, int32_t feature, nsACString& failureId)

If `failureId` is an in-param, it should be const T&.
If an out-param, it should be a T* const.
Attachment #8752391 - Flags: review?(jgilbert) → review+
The rule I was told was to prefer T& if T* cannot be null, which is the case here.
(In reply to Benoit Girard (:BenWa) from comment #15)
> The rule I was told was to prefer T& if T* cannot be null, which is the case
> here.

The value of visibility of passing of out-params at the call site overrides this advantage.

Besides, we break this preference basically everywhere. Eventually this might be nice, but we don't follow it today.
(In reply to Jeff Gilbert [:jgilbert] from comment #14)
> ::: dom/canvas/WebGL2Context.cpp:99
> (Diff revision 5)
> >      gl::GLFeature::uniform_matrix_nonsquare,
> >      gl::GLFeature::vertex_array_object
> >  };
> >  
> >  bool
> > -WebGLContext::InitWebGL2(nsACString* const out_failReason)
> > +WebGLContext::InitWebGL2(nsACString* const out_failReason, nsACString& out_failureId)
> 
> `failureId` would probably be better named `failureType`.
> `failureId` sounds like it's an enum/int.
> 

I applied the other changes but left failureId because that's what it's called elsewhere in the code and it is an identifier. I don't think the confusion with an enum is a problem. In fact refactoring the IDs into an enum could be a follow-up clean up we could make to improve this.
Comment on attachment 8752391 [details]
MozReview Request: Bug 1272808 - Expose context creation error to telemetry. r=jgilbert

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/52585/diff/5-6/
Do you want/need to have another look at this patch or should I carry forward the SGTM? We no longer use webglcontexterror as keys and I've added more details to the histogram description.
Flags: needinfo?(benjamin)
I looked it over again, this looks good.
Flags: needinfo?(benjamin)
https://hg.mozilla.org/mozilla-central/rev/705b7365e3e5
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Summary: Expose context creation error to telemetry → Expose webgl failure id histogram to telemetry
Blocks: 1275987
Depends on: 1346203
You need to log in before you can comment on or make changes to this bug.