1359610 - Add telemetry for fetch interception duration

Assignee

Description

•

7 years ago

Measure how long it takes to handle fetch events and report it through telemetry histograms.

Cătălin Badea (:catalinb)

Assignee

Comment 1

•

7 years ago

Attached patch Remove or update obsolote service worker telemetry probes. — Details — Splinter Review

Cătălin Badea (:catalinb)

Assignee

Comment 2

•

7 years ago

Attached patch Add telemetry for fetch interception. (obsolete) — Details — Splinter Review

Cătălin Badea (:catalinb)

Assignee

Comment 3

•

7 years ago

Comment on attachment 8861634 [details] [diff] [review]
Add telemetry for fetch interception.

Georg, could you please review this?

Attachment #8861634 - Flags: review?(gfritzsche)

Cătălin Badea (:catalinb)

Assignee

Comment 4

•

7 years ago

Ben, could you check the timestamps are correct and whether is there something else I should measure on fetch interception?

Flags: needinfo?(bkelly)

Cătălin Badea (:catalinb)

Assignee

Updated

•

7 years ago

Attachment #8861633 - Flags: review?(gfritzsche)

Georg Fritzsche [:gfritzsche]

Comment 5

•

7 years ago

Comment on attachment 8861634 [details] [diff] [review]
Add telemetry for fetch interception.

Review of attachment 8861634 [details] [diff] [review]:
-----------------------------------------------------------------

Chris, can you take a look?

Attachment #8861634 - Flags: review?(gfritzsche) → review?(chutten)

Georg Fritzsche [:gfritzsche]

Updated

•

7 years ago

Attachment #8861633 - Flags: review?(gfritzsche) → review?(chutten)

Chris H-C :chutten

Comment 6

•

7 years ago

Comment on attachment 8861633 [details] [diff] [review]
Remove or update obsolote service worker telemetry probes.

Review of attachment 8861633 [details] [diff] [review]:
-----------------------------------------------------------------

Code is good, use of telemetry checks out. Requires data collection review.

::: toolkit/components/telemetry/Histograms.json
@@ +9608,5 @@
>      "n_buckets": 20,
>      "description": "Tracking how ServiceWorkerRegistrar loads data before the first content is shown. File bugs in Core::DOM in case of a Telemetry regression."
>    },
>    "SERVICE_WORKER_REQUEST_PASSTHROUGH": {
> +    "expires_in_version": "never",

You will require data-collection review for these changes (see https://wiki.mozilla.org/Firefox/Data_Collection ). Almost certainly you will be recommended to drop this from "never" to a full five-version cycle (so, "61").

Attachment #8861633 - Flags: review?(chutten) → review+

Chris H-C :chutten

Comment 7

•

7 years ago

Comment on attachment 8861634 [details] [diff] [review]
Add telemetry for fetch interception.

Review of attachment 8861634 [details] [diff] [review]:
-----------------------------------------------------------------

Can you give me a sketch of why these three measures are important to record, and what actions you will take based on how they come in from real users and how they change over time?

Also: this patch will also require a data collection review.

::: dom/workers/ServiceWorkerEvents.cpp
@@ +113,5 @@
>    // TODO: When bug 1204254 is implemented, this time marker should be moved to
>    // the point where the body of the network request is complete.
>    mChannel->SetHandleFetchEventEnd(TimeStamp::Now());
> +
> +  mChannel->SetFinishResponseEnd(TimeStamp::Now());

TimeStamp::Now() can occasionally show up in profiles. If these call sites are on hot paths, and you aren't relying on there being different times reported to the two SetX() functions, maybe you should load Now() into a temporary so you aren't calling it any more often than before.

::: toolkit/components/telemetry/Histograms.json
@@ +9607,5 @@
>      "high": 5000,
>      "n_buckets": 20,
>      "description": "Tracking how ServiceWorkerRegistrar loads data before the first content is shown. File bugs in Core::DOM in case of a Telemetry regression."
>    },
> +  "SERVICE_WORKER_FETCH_INTERCEPTION_DURATION": {

If in milliseconds (which these are) it is common to suffix probes with "_MS"

@@ +9615,5 @@
> +    "high": 4000,
> +    "n_buckets": 20,
> +    "releaseChannelCollection": "opt-out",
> +    "alert_emails": ["cbadea@mozilla.com"],
> +    "description": "Time delta between when a network request is intercepted and the service worker provides a response. This includes the dispatch time from MT to worker thread and back."

Specify the units.

@@ +9625,5 @@
> +    "high": 3000,
> +    "n_buckets": 20,
> +    "releaseChannelCollection": "opt-out",
> +    "alert_emails": ["cbadea@mozilla.com"],
> +    "description": "Measured between when the fetch event is dispatched by the Service Worker and before we execute the event listeners."

You should prepend "Time (in ms) " to this description.

@@ +9636,5 @@
> +    "high": 5000,
> +    "n_buckets": 20,
> +    "releaseChannelCollection": "opt-out",
> +    "alert_emails": ["cbadea@mozilla.com"],
> +    "description": "Measured between when the respondWith promise resolves and when we provide the response through the intercepted channel (FinishSynthesizedResponse/ResetInterception)."

You should prepend "Time (in ms) " to this description

Attachment #8861634 - Flags: review?(chutten) → review-

Cătălin Badea (:catalinb)

Assignee

Comment 8

•

7 years ago

(In reply to Chris H-C :chutten from comment #7)
> Comment on attachment 8861634 [details] [diff] [review]
> Add telemetry for fetch interception.
> 
> Review of attachment 8861634 [details] [diff] [review]:
> -----------------------------------------------------------------
Thanks for the review! I actually discovered that I need more probes added.
Will flag you when I have the patch ready.
 
> Can you give me a sketch of why these three measures are important to
> record, and what actions you will take based on how they come in from real
> users and how they change over time?

First, this is helpful to get a sense of the overhead using service workers
adds, fetch interception in particular, on which we currently have no data.

Next, our SW infrastructure is due to having a considerable refactoring which
will most likely bring changes in performance, having telemetry will help us
measure the impact and work on improving SW perf.

Improving SW performance will be a continuous effort spanning more
than 5 releases and this is not in any way sensitive user data,
thus I think having these probes indefinitely is best.

Ben Kelly [:bkelly, not reviewing]

Comment 9

•

7 years ago

Comment on attachment 8861634 [details] [diff] [review]
Add telemetry for fetch interception.

Review of attachment 8861634 [details] [diff] [review]:
-----------------------------------------------------------------

Looks mostly good to me.  I think I'd like to have an explicit "this is how long it takes to spawn a service worker" probe.  Maybe that is in a different bug, though?

::: toolkit/components/telemetry/Histograms.json
@@ +9615,5 @@
> +    "high": 4000,
> +    "n_buckets": 20,
> +    "releaseChannelCollection": "opt-out",
> +    "alert_emails": ["cbadea@mozilla.com"],
> +    "description": "Time delta between when a network request is intercepted and the service worker provides a response. This includes the dispatch time from MT to worker thread and back."

So this is includes the following?

1) Possible SW thread startup
2) Worker thread event dispatch
3) Async time until promise passed to respondWith() completes
4) Time to synthesize the response (including the body?)

And in the future this would include IPC round trips to/from parent and service worker process?

Is that right?

@@ +9625,5 @@
> +    "high": 3000,
> +    "n_buckets": 20,
> +    "releaseChannelCollection": "opt-out",
> +    "alert_emails": ["cbadea@mozilla.com"],
> +    "description": "Measured between when the fetch event is dispatched by the Service Worker and before we execute the event listeners."

Does this just measure time between starting to dispatch the event and the event handler starting to execute on the worker thread?  So purely a measure of the worker thread event queue backlog?

@@ +9636,5 @@
> +    "high": 5000,
> +    "n_buckets": 20,
> +    "releaseChannelCollection": "opt-out",
> +    "alert_emails": ["cbadea@mozilla.com"],
> +    "description": "Measured between when the respondWith promise resolves and when we provide the response through the intercepted channel (FinishSynthesizedResponse/ResetInterception)."

Why combine synthesizing and resetting in a single probe?  Those seem like pretty different cases to me.

Can we split these out to:

1) Time to synthesize Response (headers, but body maybe not available)
2) Time to complete the body of a synthesized Response
3) Time to reset the channel

Today we do (1) and (2) at the same time because we block synthesizing until the body is fully there.  We should fix that, though, so being able to distinguish would be useful.  See bug 1204254.

Also, in the future do you expect this to include IPC time to send data from service worker process to content process?

Attachment #8861634 - Flags: feedback+

Ben Kelly [:bkelly, not reviewing]

Comment 10

•

7 years ago

(In reply to Cătălin Badea (:catalinb) from comment #8) 
> > Can you give me a sketch of why these three measures are important to
> > record, and what actions you will take based on how they come in from real
> > users and how they change over time?
> 
> First, this is helpful to get a sense of the overhead using service workers
> adds, fetch interception in particular, on which we currently have no data.
> 
> Next, our SW infrastructure is due to having a considerable refactoring which
> will most likely bring changes in performance, having telemetry will help us
> measure the impact and work on improving SW perf.
> 
> Improving SW performance will be a continuous effort spanning more
> than 5 releases and this is not in any way sensitive user data,
> thus I think having these probes indefinitely is best.

To add to this, we really need this data because right now we are flying blind.  We are being pinged by large sites that want to use service workers but are running into performance problems.  Right now we have zero knowledge about how our changes effect performance in the wild.  In addition, we have no way to detect regressions that might occur from unrelated changes in the tree.

Unfortunately our experience is that service workers can be subject to racy behavior, unusual network configuration, etc.  Real sites easily hit paths and timing conditions that we don't trigger regularly in our automation.  We are adding talos tests, but live data is going to be critical for improving and maintaining this feature over the long haul.

Chris, please let us know if you need additional information here and we'll put it together.  Thanks.

Flags: needinfo?(bkelly) → needinfo?(chutten)

Chris H-C :chutten

Comment 11

•

7 years ago

Nope, your explanations are complete enough for me, thank you.

Flags: needinfo?(chutten)

Cătălin Badea (:catalinb)

Assignee

Comment 12

•

7 years ago

Attached patch Add telemetry for fetch interception. (obsolete) — Details — Splinter Review

Chris, could you please review this?

This addresses Ben's comments as well.
I wanted to differentiate between navigation and resource loads, so I converted
the histograms to keyed histograms. Let me know if there is a better approach.

Attachment #8865453 - Flags: review?(chutten)

Chris H-C :chutten

Comment 13

•

7 years ago

Comment on attachment 8865453 [details] [diff] [review]
Add telemetry for fetch interception.

Review of attachment 8865453 [details] [diff] [review]:
-----------------------------------------------------------------

The Telemetry use is acceptable (see note about keyed histograms being somewhat harder to deal with, though nothing insurmountable).

The "high" values may be over-generous, though.

This change requires Data Collection Review: https://wiki.mozilla.org/Firefox/Data_Collection

::: netwerk/protocol/http/InterceptedChannel.cpp
@@ +180,5 @@
> +
> +    Telemetry::HistogramID id = (mSynthesizedOrReset == Synthesized) ?
> +      Telemetry::SERVICE_WORKER_FETCH_EVENT_FINISH_SYNTHESIZED_RESPONSE_MS :
> +      Telemetry::SERVICE_WORKER_FETCH_EVENT_CHANNEL_RESET_MS;
> +    Telemetry::Accumulate(id, navigationOrSubresource,

Using keyed histograms for just two keys seems a little bit like overkill. Also, keyed histograms are not supported by our sometimes-useful automated histogram change alerter (alerts.telemetry.mozilla.org). It is an acceptable choice, but it has that downside (and some small perf and related downsides)

Doubling the probes would be the alternative.

::: toolkit/components/telemetry/Histograms.json
@@ +9611,5 @@
> +  "SERVICE_WORKER_FETCH_INTERCEPTION_DURATION_MS": {
> +    "expires_in_version": "never",
> +    "kind": "exponential",
> +    "bug_numbers": [1359610],
> +    "high": 6000,

Are these typical values for interceptions you see in test environments? 6s sounds like an awfully-long time. Some of the others sound overly-generous as well.

A point to note is that there is an "and over" bucket: if something comes in over the "high" value we do still record it. We just don't know how _much_ over.

Attachment #8865453 - Flags: review?(chutten) → review+

Cătălin Badea (:catalinb)

Assignee

Comment 14

•

7 years ago

(In reply to Chris H-C :chutten from comment #13)
> Comment on attachment 8865453 [details] [diff] [review]
> Add telemetry for fetch interception.
> 
> Review of attachment 8865453 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> The Telemetry use is acceptable (see note about keyed histograms being
> somewhat harder to deal with, though nothing insurmountable).
> 
> The "high" values may be over-generous, though.
> 
> This change requires Data Collection Review:
> https://wiki.mozilla.org/Firefox/Data_Collection
> 
> ::: netwerk/protocol/http/InterceptedChannel.cpp
> @@ +180,5 @@
> > +
> > +    Telemetry::HistogramID id = (mSynthesizedOrReset == Synthesized) ?
> > +      Telemetry::SERVICE_WORKER_FETCH_EVENT_FINISH_SYNTHESIZED_RESPONSE_MS :
> > +      Telemetry::SERVICE_WORKER_FETCH_EVENT_CHANNEL_RESET_MS;
> > +    Telemetry::Accumulate(id, navigationOrSubresource,
> 
> Using keyed histograms for just two keys seems a little bit like overkill.
> Also, keyed histograms are not supported by our sometimes-useful automated
> histogram change alerter (alerts.telemetry.mozilla.org). It is an acceptable
> choice, but it has that downside (and some small perf and related downsides)
> 
> Doubling the probes would be the alternative.
> 
> ::: toolkit/components/telemetry/Histograms.json
> @@ +9611,5 @@
> > +  "SERVICE_WORKER_FETCH_INTERCEPTION_DURATION_MS": {
> > +    "expires_in_version": "never",
> > +    "kind": "exponential",
> > +    "bug_numbers": [1359610],
> > +    "high": 6000,
> 
> Are these typical values for interceptions you see in test environments? 6s
> sounds like an awfully-long time. Some of the others sound overly-generous
> as well.
The FETCH_INTERCEPTION actually depends on the size of the response body, since
we don't currently support streaming the body. Thus, we'll definitely see high values
on this one. Once we can stream responses, this probe should be constrained to a more
limited range as it will not include the time needed to download the body - I'll mention
this in the histogram description.

The other histograms, yes I was too generous on the high value.

Cătălin Badea (:catalinb)

Assignee

Comment 15

•

7 years ago

Comment on attachment 8861633 [details] [diff] [review]
Remove or update obsolote service worker telemetry probes.

Could you please do the data collection review for these patches? Thanks!

Attachment #8861633 - Flags: feedback?(liuche)

Cătălin Badea (:catalinb)

Assignee

Updated

•

7 years ago

Attachment #8861634 - Attachment is obsolete: true

Cătălin Badea (:catalinb)

Assignee

Updated

•

7 years ago

Attachment #8865453 - Flags: feedback?(liuche)

Cătălin Badea (:catalinb)

Assignee

Comment 16

•

7 years ago

Attached patch Add telemetry for fetch interception. — Details — Splinter Review

Using smaller "high" values, dropped expiration to 61.

Attachment #8867673 - Flags: feedback?(benjamin)

Cătălin Badea (:catalinb)

Assignee

Updated

•

7 years ago

Attachment #8865453 - Attachment is obsolete: true

Attachment #8865453 - Flags: feedback?(liuche)

Benjamin Smedberg

Comment 17

•

7 years ago

Comment on attachment 8867673 [details] [diff] [review]
Add telemetry for fetch interception.

I ask everyone to have a real person in the alert_emails in addition to a list, so that I know who's responsible. Please add the appropriate person (probably yourself) to alert_emails.

data-r=me with that change (I did not review the code)

Attachment #8867673 - Flags: feedback?(benjamin) → feedback+

Pulsebot

Comment 18

•

7 years ago

Pushed by catalin.badea392@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9c82613f6f92
Add telemetry for fetch interception. r=chutten data-r=bsmedberg

Cătălin Badea (:catalinb)

Assignee

Updated

•

7 years ago

Keywords: leave-open

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 19

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/9c82613f6f92

Chenxia Liu [:liuche]

Updated

•

7 years ago

Attachment #8861633 - Flags: feedback?(liuche)

Hsin-Yi Tsai (she/her) [:hsinyi]

Updated

•

7 years ago

Priority: -- → P2

Marion Daly [:mdaly]

Comment 20

•

6 years ago

Catalin: there's a patch sitting on this but this bug is still open, give it a look?

Flags: needinfo?(catalin.badea392)

Cătălin Badea (:catalinb)

Assignee

Comment 21

•

6 years ago

We still have those expired probes (referenced in the patch) and they should be removed. We could close this bug since the main issue was addressed and remove the expired probes in a different bug.

Assignee: catalin.badea392 → nobody

Flags: needinfo?(catalin.badea392)

Marion Daly [:mdaly]

Updated

•

6 years ago

Flags: needinfo?(bugmail)

Marion Daly [:mdaly]

Comment 22

•

6 years ago

removing need info and moving to DWS_NEXT

Flags: needinfo?(bugmail)

Marion Daly [:mdaly]

Updated

•

6 years ago

Whiteboard: DWS_NEXT

BugBot [:suhaib / :marco/ :calixte]

Comment 23

•

5 years ago

The leave-open keyword is there and there is no activity for 6 months.
:overholt, maybe it's time to close this bug?

Flags: needinfo?(overholt)

Andrew Overholt [:overholt]

Comment 24

•

5 years ago

(In reply to Cătălin Badea (:catalinb) from comment #21)

We still have those expired probes (referenced in the patch) and they should
be removed. We could close this bug since the main issue was addressed and
remove the expired probes in a different bug.

I filed bug 1528302 to track removing them.

Status: NEW → RESOLVED

Closed: 5 years ago

Flags: needinfo?(overholt)

Keywords: leave-open

Resolution: --- → FIXED

BugBot [:suhaib / :marco/ :calixte]

Updated

•

5 years ago

Assignee: nobody → catalin.badea392

Remove or update obsolote service worker telemetry probes. 7 years ago Cătălin Badea (:catalinb) 10.96 KB, patch	chutten : review+	Details \| Diff \| Splinter Review
Add telemetry for fetch interception. 7 years ago Cătălin Badea (:catalinb) 10.37 KB, patch	chutten : review- bkelly : feedback+	Details \| Diff \| Splinter Review
Add telemetry for fetch interception. 7 years ago Cătălin Badea (:catalinb) 13.87 KB, patch	chutten : review+	Details \| Diff \| Splinter Review
Add telemetry for fetch interception. 7 years ago Cătălin Badea (:catalinb) 13.94 KB, patch	benjamin : feedback+	Details \| Diff \| Splinter Review