Open Bug 1261839 · Opened 8 years ago · Updated 2 years ago

add telemetry for service worker failures

Categories

(Core :: DOM: Service Workers, task, P2)

People

(Reporter: bkelly, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [tw-dom] btpp-fixlater DWS_NEXT)

Attachments

(1 file)

We have some basic telemetry at the moment, but we should add more. Specifically, we should track failures to intercept fetch requests. A bunch of things could be failing in the wild right now, and we would have no idea.
Whiteboard: [tw-dom] → [tw-dom] btpp-fixlater
Assignee: nobody → catalin.badea392
Attachment #8861924 - Flags: review?(gfritzsche) → review?(chutten)
Comment on attachment 8861924 [details] [diff] [review]
Add telemetry tracking for fetch interception failures.

Review of attachment 8861924 [details] [diff] [review]:
-----------------------------------------------------------------

All good on the Telemetry side. Requires data review (see https://wiki.mozilla.org/Firefox/Data_Collection )

::: dom/workers/ServiceWorkerPrivate.cpp
@@ +1701,5 @@
>                             aDocumentId, aIsReload, newWorkerCreated);
>    rv = r->Init();
>    if (NS_WARN_IF(NS_FAILED(rv))) {
> +    Telemetry::ScalarAdd(Telemetry::ScalarID::DOM_SERVICEWORKERS_INTERCEPTION_FAILURE,
> +      NS_LITERAL_STRING("Interception failed while initializing fetch event"), 1);

Strings this long are going to be difficult to display (see other keyed scalars like https://mzl.la/2q5T6eI for how they're displayed). If you can be more concise, I invite you to do so.

But if that display's not a big deal, then there's no problem with a longer string.
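
For illustration, a shorter enum-like key might look like the following sketch ("FetchEventInitFailed" is a hypothetical label; only the scalar ID and the ScalarAdd call come from the patch):

    #include "mozilla/Telemetry.h"

    // Sketch only: short, enum-like keys display better in aggregate
    // dashboards than full sentences. The label below is hypothetical,
    // not taken from the reviewed patch.
    Telemetry::ScalarAdd(
        Telemetry::ScalarID::DOM_SERVICEWORKERS_INTERCEPTION_FAILURE,
        NS_LITERAL_STRING("FetchEventInitFailed"), 1);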
Attachment #8861924 - Flags: review?(chutten) → review+
Comment on attachment 8861924 [details] [diff] [review]
Add telemetry tracking for fetch interception failures.

Data collection review - bsmedberg, could you please review this? Thanks!
Attachment #8861924 - Flags: feedback?(benjamin)
Comment on attachment 8861924 [details] [diff] [review]
Add telemetry tracking for fetch interception failures.

DATA REVIEW COMMENTS:

>diff --git a/toolkit/components/telemetry/Scalars.yaml b/toolkit/components/telemetry/Scalars.yaml

>+dom.serviceworkers:
>+  interception_failure:
>+    bug_numbers:
>+      - 1261839
>+    description: >
>+      Counts the different error situations encountered while intercepting fetch requests.
>+    expires: never

This data does not yet meet the risk/reward bar for permanent data collection. In order to promote this to permanent, you need:

* a long-term plan articulated for who is responsible for monitoring this data and how they're going to do it
* automated tests that ensure this metric continues to work properly

Typically for a metric like this you should start out collecting the data for 6 months/5 releases, and during that time you can prove that the collection is correct and come up with a better long-term monitoring plan. Then you can either make it permanent if appropriate or renew the expiration.

>+    keyed: true

If you are going to use a keyed scalar, then the histogram description should state exactly how the key is constructed. However, in this case I am convinced that a keyed scalar is not what you want. The best datatype for this would be a categorical histogram. It's self-documenting in that the enumeration is enforced in the code and shows up here as well. And a keyed scalar is quite a bit less efficient to process and aggregate in the data pipeline.
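
As a rough sketch of that suggestion (the histogram name and label below are hypothetical; a real probe would be declared with kind "categorical" and an explicit label list in Histograms.json, which generates a LABELS_* enum):

    #include "mozilla/Telemetry.h"

    // Sketch: with a categorical histogram, the generated LABELS_* enum
    // enforces the set of error reasons at compile time. Both the
    // histogram name and the label are hypothetical placeholders.
    Telemetry::AccumulateCategorical(
        Telemetry::LABELS_SERVICE_WORKER_FETCH_INTERCEPTION_ERROR::
            FetchEventInitFailed);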

>+    kind: uint
>+    notification_emails:
>+      - cbadea@mozilla.com
>+    release_channel_collection: opt-out
>+    record_in_processes:
>+      - 'content'

Until we drop the non-e10s config completely in 57, I think this has to be main and content processes.

I'm going to mark data-review- on this because I'd like to see it again based on comments here.

NON DATA REVIEW COMMENT:

You are recording the number of errors here, but are you recording the total number of interceptions? As usage of workers changes over time, you'd presumably need to have both the total count and the error counts in order to monitor error rates.
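
One way to address this point, sketched with hypothetical names: record every interception outcome, successes included, in the same categorical histogram, so the error rate can be derived as failures over total:

    #include "mozilla/Telemetry.h"

    // Sketch: count successes and failures together; the error rate is
    // then failed / (failed + succeeded). Histogram and label names are
    // hypothetical placeholders.
    void RecordInterception(bool aFailed) {
      Telemetry::AccumulateCategorical(
          aFailed
              ? Telemetry::LABELS_SERVICE_WORKER_FETCH_INTERCEPTION::Failed
              : Telemetry::LABELS_SERVICE_WORKER_FETCH_INTERCEPTION::Succeeded);
    }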
Attachment #8861924 - Flags: feedback?(benjamin) → feedback-
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> Comment on attachment 8861924 [details] [diff] [review]
> Add telemetry tracking for fetch interception failures.
> 
> DATA REVIEW COMMENTS:
> 
> >diff --git a/toolkit/components/telemetry/Scalars.yaml b/toolkit/components/telemetry/Scalars.yaml
> 
> >+dom.serviceworkers:
> >+  interception_failure:
> >+    bug_numbers:
> >+      - 1261839
> >+    description: >
> >+      Counts the different error situations encountered while intercepting fetch requests.
> >+    expires: never
> 
> This data does not yet meet the risk/reward bar for permanent data
> collection. In order to promote this to permanent, you need:
What are the risks here? This telemetry would give us a sense of whether
our SW infrastructure is failing in the wild, something we currently know
almost *nothing* about.

> * a long-term plan articulated for who is responsible for monitoring this
> data and how they're going to do it
The people responsible would be the team working on SW; we would use this
data as an indicator that changes to our SW codebase (or unrelated changes
in the tree) don't break service workers.

> * automated tests that ensure this metric continues to work properly
These are exceptional/unexpected error cases, and I'm having a hard time
figuring out how to test them. These codepaths can't easily be triggered
through our test framework, and writing functional tests for this component
just so I can trigger these probes in a test environment seems like too much.

What tests would you consider appropriate here?
 
> Typically for a metric like this you should start out collecting the data
> for 6 months/5 releases, and during that time you can prove that the
> collection is correct and come up with a better long-term monitoring plan.
> Then you can either make it permanent if appropriate or renew the expiration.
> 
> >+    keyed: true
> 
> If you are going to use a keyed scalar, then the histogram description
> should state exactly how the key is constructed. However in this case I am
> convinced that a keyed scalar is not what you want. The best datatype for
> this would be a categorical histogram. It's self-documenting in that the
> enumeration is enforced in the code and shows up here as well. And a keyed
> scalar is quite a bit less efficient to process and aggregate in the data
> pipeline.
I'll switch to categorical histograms.
> 
> >+    kind: uint
> >+    notification_emails:
> >+      - cbadea@mozilla.com
> >+    release_channel_collection: opt-out
> >+    record_in_processes:
> >+      - 'content'
> 
> Until we drop the non-e10s config completely in 57, I think this has to be
> main and content processes.
> 
> I'm going to mark data-review- on this because I'd like to see it again
> based on comments here.
> 
> NON DATA REVIEW COMMENT:
> 
> You are recording the number of errors here, but are you recording the total
> number of interceptions? As usage of workers changes over time, you'd
> presumably need to have both the total count and the error counts in order
> to monitor error rates.
You're right.
Flags: needinfo?(benjamin)
(In reply to Cătălin Badea (:catalinb) from comment #5)

> > This data does not yet meet the risk/reward bar for permanent data
> > collection. In order to promote this to permanent, you need:
> What are the risks here? This telemetry would give us a sense of whether
> our SW infrastructure is failing in the wild, something we currently know
> almost *nothing* about.

Every piece of data we collect represents some risk. Keyed histograms are naturally more risky because we can enforce fewer privacy properties on them and may accidentally collect private data in the key. Switching to a categorical histogram makes that risk go away.

But overall until you have experience with the data, it's hard to know how much it's going to have long-term value. So the usual best practice is to start out by collecting for 6 months and then decide whether to make something permanent.


> The people responsible would be the team working on SW; we would use this
> data as an indicator that changes to our SW codebase (or unrelated changes
> in the tree) don't break service workers.

Yes but how exactly? By manually looking at telemetry.mozilla.org dashboards? Or do you have another less manual plan?

> > * automated tests that ensure this metric continue to work properly
> These are exceptional/unexpected error cases, and I'm having a hard time
> figuring out how to test them.
> These codepaths can't easily be triggered through our test framework

Oh, I didn't realize that web content couldn't already trigger these conditions. Are you saying that these are conditions that shouldn't *ever* happen?

If these are really things that should never happen, then maybe an automated test isn't necessary, although it still opens you up to the risk of accidentally breaking the telemetry as code gets refactored.
Flags: needinfo?(benjamin)
Priority: -- → P2
We probably still want this, but the patch needs to be updated.
Assignee: catalin.badea392 → nobody
Flags: needinfo?(mdaly)
Andrew, can you update this patch?
Flags: needinfo?(mdaly) → needinfo?(bugmail)
Removing need-info and adding to DWS_NEXT
Flags: needinfo?(bugmail)
Whiteboard: [tw-dom] btpp-fixlater → [tw-dom] btpp-fixlater DWS_NEXT
Type: defect → task
Severity: normal → S3