Open Bug 1261839 · Opened 8 years ago · Updated 2 years ago

add telemetry for service worker failures

Categories

(Core :: DOM: Service Workers, task, P2)

People

(Reporter: bkelly, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [tw-dom] btpp-fixlater DWS_NEXT)

Attachments

(1 file)

We have some basic telemetry at the moment, but we should add more. Specifically, we should track failures to intercept fetch requests. A bunch of things could be failing in the wild right now, and we would have no idea.
Whiteboard: [tw-dom] → [tw-dom] btpp-fixlater
Assignee: nobody → catalin.badea392
Attachment #8861924 - Flags: review?(gfritzsche) → review?(chutten)
Comment on attachment 8861924 [details] [diff] [review]
Add telemetry tracking for fetch interception failures.

Review of attachment 8861924 [details] [diff] [review]:
-----------------------------------------------------------------

All good on the Telemetry side. Requires data review (see https://wiki.mozilla.org/Firefox/Data_Collection )

::: dom/workers/ServiceWorkerPrivate.cpp
@@ +1701,5 @@
>                             aDocumentId, aIsReload, newWorkerCreated);
>    rv = r->Init();
>    if (NS_WARN_IF(NS_FAILED(rv))) {
> +    Telemetry::ScalarAdd(Telemetry::ScalarID::DOM_SERVICEWORKERS_INTERCEPTION_FAILURE,
> +      NS_LITERAL_STRING("Interception failed while initializing fetch event"), 1);

Strings this long are going to be difficult to display (see other keyed scalars like https://mzl.la/2q5T6eI for how they're displayed). If you can be more concise, I invite you to do so.

But if that display's not a big deal, then there's no problem with a longer string.
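
For illustration, a shorter enum-like key might look like the following sketch ("FetchEventInitFailed" is a hypothetical label; only the scalar ID and the ScalarAdd call come from the patch):

    #include "mozilla/Telemetry.h"

    // Sketch only: short, enum-like keys display better in aggregate
    // dashboards than full sentences. The label below is hypothetical,
    // not taken from the reviewed patch.
    Telemetry::ScalarAdd(
        Telemetry::ScalarID::DOM_SERVICEWORKERS_INTERCEPTION_FAILURE,
        NS_LITERAL_STRING("FetchEventInitFailed"), 1);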
Attachment #8861924 - Flags: review?(chutten) → review+
Comment on attachment 8861924 [details] [diff] [review]
Add telemetry tracking for fetch interception failures.

Data collection review - bsmedberg, could you please review this? Thanks!
Attachment #8861924 - Flags: feedback?(benjamin)
Comment on attachment 8861924 [details] [diff] [review]
Add telemetry tracking for fetch interception failures.

DATA REVIEW COMMENTS:

>diff --git a/toolkit/components/telemetry/Scalars.yaml b/toolkit/components/telemetry/Scalars.yaml

>+dom.serviceworkers:
>+  interception_failure:
>+    bug_numbers:
>+      - 1261839
>+    description: >
>+      Counts the different error situations encountered while intercepting fetch requests.
>+    expires: never

This data does not yet meet the risk/reward bar for permanent data collection. In order to promote this to permanent, you need:

* a long-term plan articulated for who is responsible for monitoring this data and how they're going to do it
* automated tests that ensure this metric continues to work properly

Typically for a metric like this you should start out collecting the data for 6 months/5 releases, and during that time you can prove that the collection is correct and come up with a better long-term monitoring plan. Then you can either make it permanent if appropriate or renew the expiration.

>+    keyed: true

If you are going to use a keyed scalar, then the histogram description should state exactly how the key is constructed. However, in this case I am convinced that a keyed scalar is not what you want. The best datatype for this would be a categorical histogram. It's self-documenting in that the enumeration is enforced in the code and shows up here as well. And a keyed scalar is quite a bit less efficient to process and aggregate in the data pipeline.
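
As a rough sketch of that suggestion (the histogram name and label below are hypothetical; a real probe would be declared with kind "categorical" and an explicit label list in Histograms.json, which generates a LABELS_* enum):

    #include "mozilla/Telemetry.h"

    // Sketch: with a categorical histogram, the generated LABELS_* enum
    // enforces the set of error reasons at compile time. Both the
    // histogram name and the label are hypothetical placeholders.
    Telemetry::AccumulateCategorical(
        Telemetry::LABELS_SERVICE_WORKER_FETCH_INTERCEPTION_ERROR::
            FetchEventInitFailed);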

>+    kind: uint
>+    notification_emails:
>+      - cbadea@mozilla.com
>+    release_channel_collection: opt-out
>+    record_in_processes:
>+      - 'content'

Until we drop the non-e10s config completely in 57, I think this has to be main and content processes.

I'm going to mark data-review- on this because I'd like to see it again based on comments here.

NON DATA REVIEW COMMENT:

You are recording the number of errors here, but are you recording the total number of interceptions? As usage of workers changes over time, you'd presumably need to have both the total count and the error counts in order to monitor error rates.
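
One way to address this point, sketched with hypothetical names: record every interception outcome, successes included, in the same categorical histogram, so the error rate can be derived as failures over total:

    #include "mozilla/Telemetry.h"

    // Sketch: count successes and failures together; the error rate is
    // then failed / (failed + succeeded). Histogram and label names are
    // hypothetical placeholders.
    void RecordInterception(bool aFailed) {
      Telemetry::AccumulateCategorical(
          aFailed
              ? Telemetry::LABELS_SERVICE_WORKER_FETCH_INTERCEPTION::Failed
              : Telemetry::LABELS_SERVICE_WORKER_FETCH_INTERCEPTION::Succeeded);
    }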
Attachment #8861924 - Flags: feedback?(benjamin) → feedback-
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> Comment on attachment 8861924 [details] [diff] [review]
> Add telemetry tracking for fetch interception failures.
> 
> DATA REVIEW COMMENTS:
> 
> >diff --git a/toolkit/components/telemetry/Scalars.yaml b/toolkit/components/telemetry/Scalars.yaml
> 
> >+dom.serviceworkers:
> >+  interception_failure:
> >+    bug_numbers:
> >+      - 1261839
> >+    description: >
> >+      Counts the different error situations encountered while intercepting fetch requests.
> >+    expires: never
> 
> This data does not yet meet the risk/reward bar for permanent data
> collection. In order to promote this to permanent, you need:
What are the risks here? This telemetry would give us a sense of whether
our SW infrastructure is failing in the wild, something we currently know
almost *nothing* about.

> * a long-term plan articulated for who is responsible for monitoring this
> data and how they're going to do it
The people responsible would be the team working on SW; we would use this
data as an indicator that changes to our SW codebase (or unrelated changes
in the tree) don't break service workers.

> * automated tests that ensure this metric continues to work properly
These are exceptional/unexpected error cases, and I'm having a hard time
figuring out how to test them. These codepaths can't easily be triggered
through our test framework, and writing functional tests for this component
just so I can trigger these probes in a test environment seems like too much.

What tests would you consider appropriate here?
 
> Typically for a metric like this you should start out collecting the data
> for 6 months/5 releases, and during that time you can prove that the
> collection is correct and come up with a better long-term monitoring plan.
> Then you can either make it permanent if appropriate or renew the expiration.
> 
> >+    keyed: true
> 
> If you are going to use a keyed scalar, then the histogram description
> should state exactly how the key is constructed. However in this case I am
> convinced that a keyed scalar is not what you want. The best datatype for
> this would be a categorical histogram. It's self-documenting in that the
> enumeration is enforced in the code and shows up here as well. And a keyed
> scalar is quite a bit less efficient to process and aggregate in the data
> pipeline.
I'll switch to categorical histograms.
> 
> >+    kind: uint
> >+    notification_emails:
> >+      - cbadea@mozilla.com
> >+    release_channel_collection: opt-out
> >+    record_in_processes:
> >+      - 'content'
> 
> Until we drop the non-e10s config completely in 57, I think this has to be
> main and content processes.
> 
> I'm going to mark data-review- on this because I'd like to see it again
> based on comments here.
> 
> NON DATA REVIEW COMMENT:
> 
> You are recording the number of errors here, but are you recording the total
> number of interceptions? As usage of workers changes over time, you'd
> presumably need to have both the total count and the error counts in order
> to monitor error rates.
You're right.
Flags: needinfo?(benjamin)
(In reply to Cătălin Badea (:catalinb) from comment #5)

> > This data does not yet meet the risk/reward bar for permanent data
> > collection. In order to promote this to permanent, you need:
> What are the risks here? This telemetry would give us a sense of whether
> our SW infrastructure is failing in the wild, something we currently know
> almost *nothing* about.

Every piece of data we collect represents some risk. Keyed histograms are naturally more risky because we can enforce fewer privacy properties on them and may accidentally collect private data in the key. Switching to a categorical histogram makes that risk go away.

But overall until you have experience with the data, it's hard to know how much it's going to have long-term value. So the usual best practice is to start out by collecting for 6 months and then decide whether to make something permanent.


> The people responsible would be the team working on SW; we would use this
> data as an indicator that changes to our SW codebase (or unrelated changes
> in the tree) don't break service workers.

Yes but how exactly? By manually looking at telemetry.mozilla.org dashboards? Or do you have another less manual plan?

> > * automated tests that ensure this metric continue to work properly
> These are exceptional/unexpected error cases, and I'm having a hard time
> figuring out how to test them.
> These codepaths can't easily be triggered through our test framework

Oh, I didn't realize that web content couldn't already trigger these conditions. Are you saying that these are conditions that shouldn't *ever* happen?

If these are really things that should never happen, then maybe an automated test isn't necessary, although it still opens you up to the risk of accidentally breaking the telemetry as code gets refactored.
Flags: needinfo?(benjamin)
Priority: -- → P2
We probably still want this, but the patch needs to be updated.
Assignee: catalin.badea392 → nobody
Flags: needinfo?(mdaly)
Andrew, can you update this patch?
Flags: needinfo?(mdaly) → needinfo?(bugmail)
Removing need-info and adding to DWS_NEXT
Flags: needinfo?(bugmail)
Whiteboard: [tw-dom] btpp-fixlater → [tw-dom] btpp-fixlater DWS_NEXT
Type: defect → task
Severity: normal → S3