Get telemetry for off-main-thread delivery success/fail

RESOLVED FIXED in Firefox 56

Status

()

enhancement
RESOLVED FIXED
2 years ago
10 months ago

People

(Reporter: jduell.mcbugs, Assigned: schien)

Tracking

unspecified
mozilla56
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox56 fixed)

Details

(Whiteboard: [necko-active][PBg-HTTP-M3])

Attachments

(2 attachments, 1 obsolete attachment)

We've had a bunch of cases where we've broken off-main thread delivery without noticing (bug 1157561, possibly bug 1357678).

We opened bug 1326299 to try to detect this in our test harness. 

Another option is to add some telemetry that reports what % of requests to do off-main delivery wind up succeeding/failing--that way telemetry alerts can notify us if this regresses.  This may be easier to implement than bug 1326299, and it's worth doing anyway, since we'd actually like to know what's going on in our user base here.

SC tells me that we've landed off-main delivery in the child. It doesn't improve performance yet--that will take PBackgroundHttp--but it does mean that we can still get meaningful telemetry here.
We need to sort this out before we land PBackgroundHttp
Blocks: PBg-HTTP
I'm going to give this one to SC
Flags: needinfo?(schien)
Whiteboard: [necko-next]
So, we can simply add an individual probe to each OnDataAvail impl that expects to be redirected off the main thread, right?
I think it's much simpler to just record success/fail in channel.retargetTo.  No need to hook lots of OnDataAvailable impls.
Comment on attachment 8860239 [details]
Bug 1357682 - telemetry for monitoring OMT success rate in HttpChannelChild.

https://reviewboard.mozilla.org/r/132242/#review135104

Note to Bsmedberg: the reason we want this probe to be permanent is because it's currently the only warning we're going to have if we break off-main thread delivery for the HTML/imglib parsers, etc.  We've managed to silently break it for long stretches several times now (anything that attaches a non-threadsafe listener to the channel's listener chain will break it, so it's not simple a matter of necko folks being more careful--it's hard to audit).  So we're hoping we can keep this around as our canary.  Addons may also start breaking off-main delivery (by design: if we allow them to modify reply bytes we've got to hit the main thread-see bug 1255894, and we'd like to keep an eye on that indefinitely)
Attachment #8860239 - Flags: review?(jduell.mcbugs) → review+
Comment on attachment 8860239 [details]
Bug 1357682 - telemetry for monitoring OMT success rate in HttpChannelChild.

https://reviewboard.mozilla.org/r/132242/#review135116

::: toolkit/components/telemetry/Histograms.json:2171
(Diff revision 1)
> +    "alert_emails": ["necko@mozilla.com"],
> +    "bug_numbers": [1357682],
> +    "expires_in_version": "never",
> +    "kind": "enumerated",
> +    "n_values": 4,
> +    "description": "Stats about sussess rate of HTTP OMT request. (0=success, 1=success but request OMT to main thread, 2=failed since listener doesn't support, 3=failed since listener chain doesn't support)"

s/sussess/success/
Discussed with @jduell, we'd like to record this telemetry as well in case any popular addon breaking the OMT mechanism. 

@bsmedberg, is adding |releaseChannelCollection: "opt-in"| necessary for achieving that? I've checked https://wiki.mozilla.org/Firefox/Data_Collection but didn't find document for that.
Flags: needinfo?(schien) → needinfo?(benjamin)
I'm not concerned about collecting this data. I am concerned that I don't see how this data is going to map to actions.

Some questions:
* who is going to monitor this data long-term?
* Is there also automation testing for this? If image requests in the absence of addons should always use the OMT path, can we assert that using an automation framework?
* Is there anything other than an addon which would make it ok for these to be main thread? If so can we focus this on the addon case, and record the actual addon ID that is causing this?

The opt-in/opt-out question is mainly about whether you need to collect this data from release users. opt-in is the default and that collects from prerelease only (nightly-beta). opt-out gives you a lot more coverage for release users, which may be necessary for addon correlations but requires more careful design of how you aggregate/monitor the data.
Flags: needinfo?(schien)
Flags: needinfo?(jduell.mcbugs)
Flags: needinfo?(benjamin)
Comment on attachment 8860239 [details]
Bug 1357682 - telemetry for monitoring OMT success rate in HttpChannelChild.

https://reviewboard.mozilla.org/r/132242/#review136290

::: toolkit/components/telemetry/Histograms.json:2169
(Diff revision 1)
>    },
> +  "HTTP_CHILD_OMT_STATS": {
> +    "alert_emails": ["necko@mozilla.com"],
> +    "bug_numbers": [1357682],
> +    "expires_in_version": "never",
> +    "kind": "enumerated",

This should use a categorical histogram rather than enumerated: that would also avoid having magic numbers above.
Trying to answer these questions by myself first, @mayhemer you may correct me if anything wrong.

(In reply to Benjamin Smedberg [:bsmedberg] from comment #9)
> I'm not concerned about collecting this data. I am concerned that I don't
> see how this data is going to map to actions.
> 
> Some questions:
> * who is going to monitor this data long-term?
Necko team will monitor this datax

> * Is there also automation testing for this? If image requests in the
> absence of addons should always use the OMT path, can we assert that using
> an automation framework?
By the natural of nsITraceableChannel interface that HTTPChannel provides it'll be hard to make a strong assertion in automation framework. It really depends on how our code base use it, right now devtools have two usecases (network monitor / throttling) that will disable the OMT.
In addition bug 1357678 indicates our HTTP implementation might disabled OMT under certain scenario as well, so we cannot turn on such kind of assertion before these issues are fixed.
Bug 1326299 is filed for investigating how we can 

> * Is there anything other than an addon which would make it ok for these to
> be main thread? If so can we focus this on the addon case, and record the
> actual addon ID that is causing this?
Agreed that monitoring addon is something worth to do, however this telemetry is more about monitoring if our own code breaks OMT.

As for monitoring addon that breaks OMT, I guess it can be done in WebRequest web extension API after we deprecate other types of addon. 
> 
> The opt-in/opt-out question is mainly about whether you need to collect this
> data from release users. opt-in is the default and that collects from
> prerelease only (nightly-beta). opt-out gives you a lot more coverage for
> release users, which may be necessary for addon correlations but requires
> more careful design of how you aggregate/monitor the data.
For the purpose of monitoring if our new code breaks OMT, I think opt-in should be enough.
Flags: needinfo?(schien) → needinfo?(honzab.moz)
Whiteboard: [necko-next] → [necko-next][PBg-HTTP-M2]
Assignee: nobody → schien
Whiteboard: [necko-next][PBg-HTTP-M2] → [necko-active][PBg-HTTP-M2]
Is there anything else that could break OMT after we reach the 'LABELS_HTTP_CHILD_OMT_STATS::success' point?

If you want to check our code then probably an assertion would be more suitable.  But I can hardly thing of a reliable condition when the decoders or parsers should run off the main thread.  It's all so generic that finding one might be impossible.

OTOH, have you tried?  We depend on few bugs to fix first (converters for instance), but having a hard MOZ_ASSET(!MainThread) would be good to try anyway.
Flags: needinfo?(honzab.moz)
> > Some questions:
> > * who is going to monitor this data long-term?
> Necko team will monitor this datax

I'm sorry, I need more specifics.
* Is somebody going to build an automated alerting system that sends email to the necko team?
* Or is somebody in charge of looking at a dashboard periodically? If so, which person exactly is taking on that responsibility?
* Do you have specific thresholds or techniques you're going to use to determine whether there was a regression in Firefox versus say, a popular addon changing their behavior or more people using devtools?

As before, I don't have a problem with you collecting this data, but I need the plan for how you're going to use the data to be more concrete and less fuzzy.

In general, for expires: never measurements we ask that there be an automated test to make sure that the probes don't break by accident. Is that possible for this measurement?

If you'd like to start out experimenting with this and avoid writing up an analysis/monitoring plan, you can make it expire in 61 now and come back with a more definite plan later.
Flags: needinfo?(schien)
Attachment #8860239 - Flags: review?(benjamin)
Whiteboard: [necko-active][PBg-HTTP-M2] → [necko-active][PBg-HTTP-M3]
Sorry for the late response.

After a long thinking process, I would like to take the first version as an experiment to understand how HTTP OMT is being used in user environment.

I will record the success and fail reason along with relevant channel attributes, e.g. mime type, so that we can have further information on what source of HTTP request is not utilizing the OMT mechanism. We can then remove more road blockers before we are able to add strong assertion in automated tests.
Flags: needinfo?(schien)
(In reply to Honza Bambas (:mayhemer) from comment #13)
> 
> OTOH, have you tried?  We depend on few bugs to fix first (converters for
> instance), but having a hard MOZ_ASSET(!MainThread) would be good to try
> anyway.

Try MOZ_ASSERT for the retargetability of html5 parser and image decoder.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a54e5284e1c8132b0b1cfef70683eb703abfb993
From the try server result, there are just too many combination that can break the HTTP OMT.
I have to say that we are still too far away from using assertion to check the HTTP OMT availability.
Attachment #8860239 - Attachment is obsolete: true
I'd like to use this telemetry as an experiment to figure out the OMT success rate in user environment.
I use content policy as the key to separate the success rate between different use cases. Based on my local experiment, this telemetry can easily tells that our JS script loader, CSS loader, and XHR are not leverage HTTP OMT.
Comment on attachment 8885220 [details]
Bug 1357682 - Part 2, add telemetry for HTTPChannelChild OMT success/fail rate and reason.  data-r=bsmedberg

https://reviewboard.mozilla.org/r/156102/#review161208
Attachment #8885220 - Flags: review?(mcmanus) → review+
Comment on attachment 8885220 [details]
Bug 1357682 - Part 2, add telemetry for HTTPChannelChild OMT success/fail rate and reason.  data-r=bsmedberg

https://reviewboard.mozilla.org/r/156102/#review161394

data-r=me
Attachment #8885220 - Flags: review?(benjamin) → review+
Comment on attachment 8885219 [details]
Bug 1357682 - Part 1, helper function for keyed categorical histogram.

https://reviewboard.mozilla.org/r/156100/#review161600

::: toolkit/components/telemetry/Telemetry.h:116
(Diff revision 1)
>    Accumulate(static_cast<HistogramID>(CategoricalLabelId<E>::value),
>               static_cast<uint32_t>(enumValue));
>  };
>  
>  /**
>   * Adds sample to a categorical histogram defined in TelemetryHistogramEnums.h

Nit: "... to a keyed categorical histogram"
Attachment #8885219 - Flags: review?(gfritzsche) → review+
Alessio, are we able to test `AccumulateCategorical()` from our C++ tests?
Should we file a mentored bug on this?
Flags: needinfo?(alessio.placitelli)
(In reply to Georg Fritzsche [:gfritzsche] from comment #23)
> Alessio, are we able to test `AccumulateCategorical()` from our C++ tests?
> Should we file a mentored bug on this?

We should be able to do that, there's already a mentored bug filed for this: bug 1351274.
Flags: needinfo?(alessio.placitelli)
Pushed by schien@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/44e0ece6ec05
Part 1, helper function for keyed categorical histogram. r=gfritzsche
https://hg.mozilla.org/integration/autoland/rev/e60029998612
Part 2, add telemetry for HTTPChannelChild OMT success/fail rate and reason. r=bsmedberg,mcmanus data-r=bsmedberg
https://hg.mozilla.org/mozilla-central/rev/44e0ece6ec05
https://hg.mozilla.org/mozilla-central/rev/e60029998612
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Flags: needinfo?(jduell.mcbugs)
You need to log in before you can comment on or make changes to this bug.