Closed Bug 1359610 Opened 7 years ago Closed 5 years ago

Add telemetry for fetch interception duration

Categories

(Core :: DOM: Service Workers, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED

People

(Reporter: catalinb, Assigned: catalinb)

References

(Blocks 1 open bug)

Details

(Whiteboard: DWS_NEXT)

Attachments

(2 files, 2 obsolete files)

Measure how long it takes to handle fetch events and report it through telemetry histograms.
Comment on attachment 8861634 [details] [diff] [review]
Add telemetry for fetch interception.

Georg, could you please review this?
Attachment #8861634 - Flags: review?(gfritzsche)
Ben, could you check the timestamps are correct and whether is there something else I should measure on fetch interception?
Flags: needinfo?(bkelly)
Attachment #8861633 - Flags: review?(gfritzsche)
Comment on attachment 8861634 [details] [diff] [review]
Add telemetry for fetch interception.

Review of attachment 8861634 [details] [diff] [review]:
-----------------------------------------------------------------

Chris, can you take a look?
Attachment #8861634 - Flags: review?(gfritzsche) → review?(chutten)
Attachment #8861633 - Flags: review?(gfritzsche) → review?(chutten)
Comment on attachment 8861633 [details] [diff] [review]
Remove or update obsolote service worker telemetry probes.

Review of attachment 8861633 [details] [diff] [review]:
-----------------------------------------------------------------

Code is good, use of telemetry checks out. Requires data collection review.

::: toolkit/components/telemetry/Histograms.json
@@ +9608,5 @@
>      "n_buckets": 20,
>      "description": "Tracking how ServiceWorkerRegistrar loads data before the first content is shown. File bugs in Core::DOM in case of a Telemetry regression."
>    },
>    "SERVICE_WORKER_REQUEST_PASSTHROUGH": {
> +    "expires_in_version": "never",

You will require data-collection review for these changes (see https://wiki.mozilla.org/Firefox/Data_Collection ). Almost certainly you will be recommended to drop this from "never" to a full five-version cycle (so, "61").
Attachment #8861633 - Flags: review?(chutten) → review+
Comment on attachment 8861634 [details] [diff] [review]
Add telemetry for fetch interception.

Review of attachment 8861634 [details] [diff] [review]:
-----------------------------------------------------------------

Can you give me a sketch of why these three measures are important to record, and what actions you will take based on how they come in from real users and how they change over time?

Also: this patch will also require a data collection review.

::: dom/workers/ServiceWorkerEvents.cpp
@@ +113,5 @@
>    // TODO: When bug 1204254 is implemented, this time marker should be moved to
>    // the point where the body of the network request is complete.
>    mChannel->SetHandleFetchEventEnd(TimeStamp::Now());
> +
> +  mChannel->SetFinishResponseEnd(TimeStamp::Now());

TimeStamp::Now() can occasionally show up in profiles. If these call sites are on hot paths, and you aren't relying on there being different times reported to the two SetX() functions, maybe you should load Now() into a temporary so you aren't calling it any more often than before.

::: toolkit/components/telemetry/Histograms.json
@@ +9607,5 @@
>      "high": 5000,
>      "n_buckets": 20,
>      "description": "Tracking how ServiceWorkerRegistrar loads data before the first content is shown. File bugs in Core::DOM in case of a Telemetry regression."
>    },
> +  "SERVICE_WORKER_FETCH_INTERCEPTION_DURATION": {

If in milliseconds (which these are) it is common to suffix probes with "_MS"

@@ +9615,5 @@
> +    "high": 4000,
> +    "n_buckets": 20,
> +    "releaseChannelCollection": "opt-out",
> +    "alert_emails": ["cbadea@mozilla.com"],
> +    "description": "Time delta between when a network request is intercepted and the service worker provides a response. This includes the dispatch time from MT to worker thread and back."

Specify the units.

@@ +9625,5 @@
> +    "high": 3000,
> +    "n_buckets": 20,
> +    "releaseChannelCollection": "opt-out",
> +    "alert_emails": ["cbadea@mozilla.com"],
> +    "description": "Measured between when the fetch event is dispatched by the Service Worker and before we execute the event listeners."

You should prepend "Time (in ms) " to this description.

@@ +9636,5 @@
> +    "high": 5000,
> +    "n_buckets": 20,
> +    "releaseChannelCollection": "opt-out",
> +    "alert_emails": ["cbadea@mozilla.com"],
> +    "description": "Measured between when the respondWith promise resolves and when we provide the response through the intercepted channel (FinishSynthesizedResponse/ResetInterception)."

You should prepend "Time (in ms) " to this description
Attachment #8861634 - Flags: review?(chutten) → review-
(In reply to Chris H-C :chutten from comment #7)
> Comment on attachment 8861634 [details] [diff] [review]
> Add telemetry for fetch interception.
> 
> Review of attachment 8861634 [details] [diff] [review]:
> -----------------------------------------------------------------
Thanks for the review! I actually discovered that I need more probes added.
Will flag you when I have the patch ready.
 
> Can you give me a sketch of why these three measures are important to
> record, and what actions you will take based on how they come in from real
> users and how they change over time?

First, this is helpful to get a sense of the overhead using service workers
adds, fetch interception in particular, on which we currently have no data.

Next, our SW infrastructure is due to having a considerable refactoring which
will most likely bring changes in performance, having telemetry will help us
measure the impact and work on improving SW perf.

Improving SW performance will be a continuous effort spanning more
than 5 releases and this is not in any way sensitive user data,
thus I think having these probes indefinitely is best.
Comment on attachment 8861634 [details] [diff] [review]
Add telemetry for fetch interception.

Review of attachment 8861634 [details] [diff] [review]:
-----------------------------------------------------------------

Looks mostly good to me.  I think I'd like to have an explicit "this is how long it takes to spawn a service worker" probe.  Maybe that is in a different bug, though?

::: toolkit/components/telemetry/Histograms.json
@@ +9615,5 @@
> +    "high": 4000,
> +    "n_buckets": 20,
> +    "releaseChannelCollection": "opt-out",
> +    "alert_emails": ["cbadea@mozilla.com"],
> +    "description": "Time delta between when a network request is intercepted and the service worker provides a response. This includes the dispatch time from MT to worker thread and back."

So this is includes the following?

1) Possible SW thread startup
2) Worker thread event dispatch
3) Async time until promise passed to respondWith() completes
4) Time to synthesize the response (including the body?)

And in the future this would include IPC round trips to/from parent and service worker process?

Is that right?

@@ +9625,5 @@
> +    "high": 3000,
> +    "n_buckets": 20,
> +    "releaseChannelCollection": "opt-out",
> +    "alert_emails": ["cbadea@mozilla.com"],
> +    "description": "Measured between when the fetch event is dispatched by the Service Worker and before we execute the event listeners."

Does this just measure time between starting to dispatch the event and the event handler starting to execute on the worker thread?  So purely a measure of the worker thread event queue backlog?

@@ +9636,5 @@
> +    "high": 5000,
> +    "n_buckets": 20,
> +    "releaseChannelCollection": "opt-out",
> +    "alert_emails": ["cbadea@mozilla.com"],
> +    "description": "Measured between when the respondWith promise resolves and when we provide the response through the intercepted channel (FinishSynthesizedResponse/ResetInterception)."

Why combine synthesizing and resetting in a single probe?  Those seem like pretty different cases to me.

Can we split these out to:

1) Time to synthesize Response (headers, but body maybe not available)
2) Time to complete the body of a synthesized Response
3) Time to reset the channel

Today we do (1) and (2) at the same time because we block synthesizing until the body is fully there.  We should fix that, though, so being able to distinguish would be useful.  See bug 1204254.

Also, in the future do you expect this to include IPC time to send data from service worker process to content process?
Attachment #8861634 - Flags: feedback+
(In reply to Cătălin Badea (:catalinb) from comment #8) 
> > Can you give me a sketch of why these three measures are important to
> > record, and what actions you will take based on how they come in from real
> > users and how they change over time?
> 
> First, this is helpful to get a sense of the overhead using service workers
> adds, fetch interception in particular, on which we currently have no data.
> 
> Next, our SW infrastructure is due to having a considerable refactoring which
> will most likely bring changes in performance, having telemetry will help us
> measure the impact and work on improving SW perf.
> 
> Improving SW performance will be a continuous effort spanning more
> than 5 releases and this is not in any way sensitive user data,
> thus I think having these probes indefinitely is best.

To add to this, we really need this data because right now we are flying blind.  We are being pinged by large sites that want to use service workers but are running into performance problems.  Right now we have zero knowledge about how our changes effect performance in the wild.  In addition, we have no way to detect regressions that might occur from unrelated changes in the tree.

Unfortunately our experience is that service workers can be subject to racy behavior, unusual network configuration, etc.  Real sites easily hit paths and timing conditions that we don't trigger regularly in our automation.  We are adding talos tests, but live data is going to be critical for improving and maintaining this feature over the long haul.

Chris, please let us know if you need additional information here and we'll put it together.  Thanks.
Flags: needinfo?(bkelly) → needinfo?(chutten)
Nope, your explanations are complete enough for me, thank you.
Flags: needinfo?(chutten)
Chris, could you please review this?

This addresses Ben's comments as well.
I wanted to differentiate between navigation and resource loads, so I converted
the histograms to keyed histograms. Let me know if there is a better approach.
Attachment #8865453 - Flags: review?(chutten)
Comment on attachment 8865453 [details] [diff] [review]
Add telemetry for fetch interception.

Review of attachment 8865453 [details] [diff] [review]:
-----------------------------------------------------------------

The Telemetry use is acceptable (see note about keyed histograms being somewhat harder to deal with, though nothing insurmountable).

The "high" values may be over-generous, though.

This change requires Data Collection Review: https://wiki.mozilla.org/Firefox/Data_Collection

::: netwerk/protocol/http/InterceptedChannel.cpp
@@ +180,5 @@
> +
> +    Telemetry::HistogramID id = (mSynthesizedOrReset == Synthesized) ?
> +      Telemetry::SERVICE_WORKER_FETCH_EVENT_FINISH_SYNTHESIZED_RESPONSE_MS :
> +      Telemetry::SERVICE_WORKER_FETCH_EVENT_CHANNEL_RESET_MS;
> +    Telemetry::Accumulate(id, navigationOrSubresource,

Using keyed histograms for just two keys seems a little bit like overkill. Also, keyed histograms are not supported by our sometimes-useful automated histogram change alerter (alerts.telemetry.mozilla.org). It is an acceptable choice, but it has that downside (and some small perf and related downsides)

Doubling the probes would be the alternative.

::: toolkit/components/telemetry/Histograms.json
@@ +9611,5 @@
> +  "SERVICE_WORKER_FETCH_INTERCEPTION_DURATION_MS": {
> +    "expires_in_version": "never",
> +    "kind": "exponential",
> +    "bug_numbers": [1359610],
> +    "high": 6000,

Are these typical values for interceptions you see in test environments? 6s sounds like an awfully-long time. Some of the others sound overly-generous as well.

A point to note is that there is an "and over" bucket: if something comes in over the "high" value we do still record it. We just don't know how _much_ over.
Attachment #8865453 - Flags: review?(chutten) → review+
(In reply to Chris H-C :chutten from comment #13)
> Comment on attachment 8865453 [details] [diff] [review]
> Add telemetry for fetch interception.
> 
> Review of attachment 8865453 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> The Telemetry use is acceptable (see note about keyed histograms being
> somewhat harder to deal with, though nothing insurmountable).
> 
> The "high" values may be over-generous, though.
> 
> This change requires Data Collection Review:
> https://wiki.mozilla.org/Firefox/Data_Collection
> 
> ::: netwerk/protocol/http/InterceptedChannel.cpp
> @@ +180,5 @@
> > +
> > +    Telemetry::HistogramID id = (mSynthesizedOrReset == Synthesized) ?
> > +      Telemetry::SERVICE_WORKER_FETCH_EVENT_FINISH_SYNTHESIZED_RESPONSE_MS :
> > +      Telemetry::SERVICE_WORKER_FETCH_EVENT_CHANNEL_RESET_MS;
> > +    Telemetry::Accumulate(id, navigationOrSubresource,
> 
> Using keyed histograms for just two keys seems a little bit like overkill.
> Also, keyed histograms are not supported by our sometimes-useful automated
> histogram change alerter (alerts.telemetry.mozilla.org). It is an acceptable
> choice, but it has that downside (and some small perf and related downsides)
> 
> Doubling the probes would be the alternative.
> 
> ::: toolkit/components/telemetry/Histograms.json
> @@ +9611,5 @@
> > +  "SERVICE_WORKER_FETCH_INTERCEPTION_DURATION_MS": {
> > +    "expires_in_version": "never",
> > +    "kind": "exponential",
> > +    "bug_numbers": [1359610],
> > +    "high": 6000,
> 
> Are these typical values for interceptions you see in test environments? 6s
> sounds like an awfully-long time. Some of the others sound overly-generous
> as well.
The FETCH_INTERCEPTION actually depends on the size of the response body, since
we don't currently support streaming the body. Thus, we'll definitely see high values
on this one. Once we can stream responses, this probe should be constrained to a more
limited range as it will not include the time needed to download the body - I'll mention
this in the histogram description.

The other histograms, yes I was too generous on the high value.
Comment on attachment 8861633 [details] [diff] [review]
Remove or update obsolote service worker telemetry probes.

Could you please do the data collection review for these patches? Thanks!
Attachment #8861633 - Flags: feedback?(liuche)
Attachment #8861634 - Attachment is obsolete: true
Attachment #8865453 - Flags: feedback?(liuche)
Using smaller "high" values, dropped expiration to 61.
Attachment #8867673 - Flags: feedback?(benjamin)
Attachment #8865453 - Attachment is obsolete: true
Attachment #8865453 - Flags: feedback?(liuche)
Comment on attachment 8867673 [details] [diff] [review]
Add telemetry for fetch interception.

I ask everyone to have a real person in the alert_emails in addition to a list, so that I know who's responsible. Please add the appropriate person (probably yourself) to alert_emails.

data-r=me with that change (I did not review the code)
Attachment #8867673 - Flags: feedback?(benjamin) → feedback+
Pushed by catalin.badea392@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9c82613f6f92
Add telemetry for fetch interception. r=chutten data-r=bsmedberg
Keywords: leave-open
Attachment #8861633 - Flags: feedback?(liuche)
Priority: -- → P2
Catalin: there's a patch sitting on this but this bug is still open, give it a look?
Flags: needinfo?(catalin.badea392)
We still have those expired probes (referenced in the patch) and they should be removed. We could close this bug since the main issue was addressed and remove the expired probes in a different bug.
Assignee: catalin.badea392 → nobody
Flags: needinfo?(catalin.badea392)
Flags: needinfo?(bugmail)
removing need info and moving to DWS_NEXT
Flags: needinfo?(bugmail)
Whiteboard: DWS_NEXT

The leave-open keyword is there and there is no activity for 6 months.
:overholt, maybe it's time to close this bug?

Flags: needinfo?(overholt)

(In reply to Cătălin Badea (:catalinb) from comment #21)

We still have those expired probes (referenced in the patch) and they should
be removed. We could close this bug since the main issue was addressed and
remove the expired probes in a different bug.

I filed bug 1528302 to track removing them.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(overholt)
Keywords: leave-open
Resolution: --- → FIXED
Assignee: nobody → catalin.badea392
You need to log in before you can comment on or make changes to this bug.