Bug 1261839 - add telemetry for service worker failures
Status: NEW (open). Opened 9 years ago; updated 2 years ago.
Product/Component: Core :: DOM: Service Workers (task, P2)
People: Reporter: bkelly; Assignee: Unassigned
References: Blocks 1 open bug
Whiteboard: [tw-dom] btpp-fixlater DWS_NEXT
Attachments (1 file)
patch, 4.27 KB - chutten: review+; benjamin: feedback- (Details | Diff | Splinter Review)
We have some basic telemetry at the moment, but we should add some more. Specifically, we should be tracking failures to intercept. We could be having a bunch of things failing in the wild now, but have no idea.
Updated 9 years ago
Whiteboard: [tw-dom] → [tw-dom] btpp-fixlater
Comment 1 - 8 years ago
Attachment #8861924 - Flags: review?(gfritzsche)
Updated 8 years ago
Assignee: nobody → catalin.badea392
Updated 8 years ago
Attachment #8861924 - Flags: review?(gfritzsche) → review?(chutten)
Comment 2 - 8 years ago
Comment on attachment 8861924 [details] [diff] [review]
Add telemetry tracking for fetch interception failures.
Review of attachment 8861924 [details] [diff] [review]:
-----------------------------------------------------------------
All good on the Telemetry side. Requires data review (see https://wiki.mozilla.org/Firefox/Data_Collection )
::: dom/workers/ServiceWorkerPrivate.cpp
@@ +1701,5 @@
> aDocumentId, aIsReload, newWorkerCreated);
> rv = r->Init();
> if (NS_WARN_IF(NS_FAILED(rv))) {
> + Telemetry::ScalarAdd(Telemetry::ScalarID::DOM_SERVICEWORKERS_INTERCEPTION_FAILURE,
> + NS_LITERAL_STRING("Interception failed while initializing fetch event"), 1);
Strings this long are going to be difficult to display (see other keyed scalars like https://mzl.la/2q5T6eI for how they're displayed). If you can be more concise, I invite you to do so.
But if that display's not a big deal, then there's no problem with a longer string.
Attachment #8861924 - Flags: review?(chutten) → review+
Comment 3 - 8 years ago
Comment on attachment 8861924 [details] [diff] [review]
Add telemetry tracking for fetch interception failures.
Data collection review - bsmedberg, could you please review this? Thanks!
Attachment #8861924 - Flags: feedback?(benjamin)
Comment 4 - 8 years ago
Comment on attachment 8861924 [details] [diff] [review]
Add telemetry tracking for fetch interception failures.
DATA REVIEW COMMENTS:
>diff --git a/toolkit/components/telemetry/Scalars.yaml b/toolkit/components/telemetry/Scalars.yaml
>+dom.serviceworkers:
>+ interception_failure:
>+ bug_numbers:
>+ - 1261839
>+ description: >
>+ Counts the different error situations encountered while intercepting fetch requests.
>+ expires: never
This data does not yet meet the risk/reward bar for permanent data collection. In order to promote this to permanent, you need:
* a long-term plan articulated for who is responsible for monitoring this data and how they're going to do it
* automated tests that ensure this metric continues to work properly
Typically for a metric like this you should start out collecting the data for 6 months/5 releases, and during that time you can prove that the collection is correct and come up with a better long-term monitoring plan. Then you can either make it permanent if appropriate or renew the expiration.
>+ keyed: true
If you are going to use a keyed scalar, then the histogram description should state exactly how the key is constructed. However in this case I am convinced that a keyed scalar is not what you want. The best datatype for this would be a categorical histogram. It's self-documenting in that the enumeration is enforced in the code and show up here as well. And a keyed scalar is quite a bit less efficient to process and aggregate in the data pipeline.
>+ kind: uint
>+ notification_emails:
>+ - cbadea@mozilla.com
>+ release_channel_collection: opt-out
>+ record_in_processes:
>+ - 'content'
Until we drop the non-e10s config completely in 57, I think this has to be main and content processes.
I'm going to mark data-review- on this because I'd like to see it again based on comments here.
NON DATA REVIEW COMMENT:
You are recording the number of errors here, but are you recording the total number of interceptions? As usage of workers changes over time, you'd presumably need to have both the total count and the error counts in order to monitor error rates.
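To make the categorical-histogram suggestion above concrete, a Histograms.json entry along these lines might fit; the histogram name, labels, and expiry version here are illustrative guesses, not taken from the patch:

```json
"SERVICE_WORKER_FETCH_INTERCEPTION_FAILURE": {
  "record_in_processes": ["main", "content"],
  "alert_emails": ["cbadea@mozilla.com"],
  "expires_in_version": "60",
  "kind": "categorical",
  "labels": ["FetchEventInit", "OpenChannel", "WorkerSpawn"],
  "bug_numbers": [1261839],
  "description": "Failure reasons hit while intercepting fetch requests with a service worker."
}
```

Because the label set is part of the definition, the enumeration is enforced at compile time on the C++ side, which is the self-documenting property the review describes.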
Attachment #8861924 - Flags: feedback?(benjamin) → feedback-
Comment 5 - 8 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> Comment on attachment 8861924 [details] [diff] [review]
> Add telemetry tracking for fetch interception failures.
>
> DATA REVIEW COMMENTS:
>
> >diff --git a/toolkit/components/telemetry/Scalars.yaml b/toolkit/components/telemetry/Scalars.yaml
>
> >+dom.serviceworkers:
> >+ interception_failure:
> >+ bug_numbers:
> >+ - 1261839
> >+ description: >
> >+ Counts the different error situations encountered while intercepting fetch requests.
> >+ expires: never
>
> This data does not yet meet the risk/reward bar for permanent data
> collection. In order to promote this to permanent, you need:
What are the risks here? This telemetry would give us a sense of whether
our SW infrastructure is failing in the wild, of which we currently know
almost *nothing* at the moment.
> * a long-term plan articulated for who is responsible for monitoring this
> data and how they're going to do it
The people responsible would be the team working on SW; we would use this data
as an indicator that changes to our SW codebase (or unrelated changes in the tree)
don't break service workers.
> * automated tests that ensure this metric continues to work properly
These are exceptional/unexpected error cases, I'm having a hard time figuring out
how to test this.
These codepaths can't easily be triggered through our test framework, and writing
functional tests for this component so that I can trigger these probes in a test
environment seems too much.
What tests would you consider appropriate here?
> Typically for a metric like this you should start out collecting the data
> for 6 months/5 releases, and during that time you can prove that the
> collection is correct and come up with a better long-term monitoring plan.
> Then you can either make it permanent if appropriate or renew the expiration.
>
> >+ keyed: true
>
> If you are going to use a keyed scalar, then the histogram description
> should state exactly how the key is constructed. However in this case I am
> convinced that a keyed scalar is not what you want. The best datatype for
> this would be a categorical histogram. It's self-documenting in that the
> enumeration is enforced in the code and show up here as well. And a keyed
> scalar is quite a bit less efficient to process and aggregate in the data
> pipeline.
I'll switch to categorical histograms.
>
> >+ kind: uint
> >+ notification_emails:
> >+ - cbadea@mozilla.com
> >+ release_channel_collection: opt-out
> >+ record_in_processes:
> >+ - 'content'
>
> Until we drop the non-e10s config completely in 57, I think this has to be
> main and content processes.
>
> I'm going to mark data-review- on this because I'd like to see it again
> based on comments here.
>
> NON DATA REVIEW COMMENT:
>
> You are recording the number of errors here, but are you recording the total
> number of interceptions? As usage of workers changes over time, you'd
> presumably need to have both the total count and the error counts in order
> to monitor error rates.
You're right.
Flags: needinfo?(benjamin)
Comment 6 - 8 years ago
(In reply to Cătălin Badea (:catalinb) from comment #5)
> > This data does not yet meet the risk/reward bar for permanent data
> > collection. In order to promote this to permanent, you need:
> What are the risks here? This telemetry would give us a sense of whether
> our SW infrastructure is failing in the wild, of which we currently know
> almost *nothing* at the moment.
Every piece of data we collect represents some risk. Keyed histograms are naturally more risky because we can enforce fewer privacy properties about them or accidentally collect private data in the key. Switching to a categorical histogram makes that risk go away.
But overall until you have experience with the data, it's hard to know how much it's going to have long-term value. So the usual best practice is to start out by collecting for 6 months and then decide whether to make something permanent.
> The people responsible would be the team work on SW, we would use this data
> as an indicator that changes to our SW codebase (or unrelated changes in the
> tree)
> don't break service workers.
Yes but how exactly? By manually looking at telemetry.mozilla.org dashboards? Or do you have another less manual plan?
> > * automated tests that ensure this metric continue to work properly
> These are exceptional/unexpected error cases, I'm having a hard time
> figuring out
> how to test this.
> These codepaths can't easily be triggered through our test framework
Oh, I didn't realize that web content couldn't already trigger these conditions. Are you saying that these are conditions that shouldn't *ever* happen?
If these are really things that should never happen, then maybe an automated test isn't necessary. Although it still opens you up to risk of accidentally breaking the telemetry as code gets refactored.
Flags: needinfo?(benjamin)
Updated 7 years ago
Priority: -- → P2
Comment 7 - 7 years ago
We probably still want this, but the patch needs to be updated.
Assignee: catalin.badea392 → nobody
Flags: needinfo?(mdaly)
Comment 8 - 7 years ago
Andrew, can you update this patch?
Flags: needinfo?(mdaly) → needinfo?(bugmail)
Updated 6 years ago
Whiteboard: [tw-dom] btpp-fixlater → [tw-dom] btpp-fixlater DWS_NEXT
Updated 5 years ago
Type: defect → task
Updated 2 years ago
Severity: normal → S3