1357682 - Get telemetry for off-main-thread delivery success/fail

Reporter

Description

•

8 years ago

We've had a bunch of cases where we've broken off-main thread delivery without noticing (bug 1157561, possibly bug 1357678). We opened bug 1326299 to try to detect this in our test harness. Another option is to add some telemetry that reports what % of requests to do off-main delivery wind up succeeding/failing--that way telemetry alerts can notify us if this regresses. This may be easier to implement than bug 1326299, and it's worth doing anyway, since we'd actually like to know what's going on in our user base here. SC tells me that we've landed off-main delivery in the child. It doesn't improve performance yet--that will take PBackgroundHttp--but it does mean that we can still get meaningful telemetry here.

Jason Duell

Reporter

Comment 1

•

8 years ago

We need to sort this out before we land PBackgroundHttp

Blocks: PBg-HTTP

Jason Duell

Reporter

Comment 2

•

8 years ago

I'm going to give this one to SC

Flags: needinfo?(schien)

Dragana Damjanovic [:dragana]

Updated

•

8 years ago

Whiteboard: [necko-next]

Honza Bambas (:mayhemer)

Comment 3

•

8 years ago

So, we can simply add an individual probe to each OnDataAvail impl that expects to be redirected off the main thread, right?

Jason Duell

Reporter

Comment 4

•

8 years ago

I think it's much simpler to just record success/fail in channel.retargetTo. No need to hook lots of OnDataAvailable impls.

Comment hidden (mozreview-request)

Jason Duell

Reporter

Comment 6

•

8 years ago

mozreview-review

Comment on attachment 8860239 [details] Bug 1357682 - telemetry for monitoring OMT success rate in HttpChannelChild. https://reviewboard.mozilla.org/r/132242/#review135104 Note to Bsmedberg: the reason we want this probe to be permanent is because it's currently the only warning we're going to have if we break off-main thread delivery for the HTML/imglib parsers, etc. We've managed to silently break it for long stretches several times now (anything that attaches a non-threadsafe listener to the channel's listener chain will break it, so it's not simple a matter of necko folks being more careful--it's hard to audit). So we're hoping we can keep this around as our canary. Addons may also start breaking off-main delivery (by design: if we allow them to modify reply bytes we've got to hit the main thread-see bug 1255894, and we'd like to keep an eye on that indefinitely)

Attachment #8860239 - Flags: review?(jduell.mcbugs) → review+

Jason Duell

Reporter

Comment 7

•

8 years ago

mozreview-review

Comment on attachment 8860239 [details] Bug 1357682 - telemetry for monitoring OMT success rate in HttpChannelChild. https://reviewboard.mozilla.org/r/132242/#review135116 ::: toolkit/components/telemetry/Histograms.json:2171 (Diff revision 1) > + "alert_emails": ["necko@mozilla.com"], > + "bug_numbers": [1357682], > + "expires_in_version": "never", > + "kind": "enumerated", > + "n_values": 4, > + "description": "Stats about sussess rate of HTTP OMT request. (0=success, 1=success but request OMT to main thread, 2=failed since listener doesn't support, 3=failed since listener chain doesn't support)" s/sussess/success/

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Comment 8

•

8 years ago

Discussed with @jduell, we'd like to record this telemetry as well in case any popular addon breaking the OMT mechanism. @bsmedberg, is adding |releaseChannelCollection: "opt-in"| necessary for achieving that? I've checked https://wiki.mozilla.org/Firefox/Data_Collection but didn't find document for that.

Flags: needinfo?(schien) → needinfo?(benjamin)

Benjamin Smedberg

Comment 9

•

8 years ago

I'm not concerned about collecting this data. I am concerned that I don't see how this data is going to map to actions. Some questions: * who is going to monitor this data long-term? * Is there also automation testing for this? If image requests in the absence of addons should always use the OMT path, can we assert that using an automation framework? * Is there anything other than an addon which would make it ok for these to be main thread? If so can we focus this on the addon case, and record the actual addon ID that is causing this? The opt-in/opt-out question is mainly about whether you need to collect this data from release users. opt-in is the default and that collects from prerelease only (nightly-beta). opt-out gives you a lot more coverage for release users, which may be necessary for addon correlations but requires more careful design of how you aggregate/monitor the data.

Flags: needinfo?(schien)

Flags: needinfo?(jduell.mcbugs)

Flags: needinfo?(benjamin)

Benjamin Smedberg

Comment 10

•

8 years ago

mozreview-review

Comment on attachment 8860239 [details] Bug 1357682 - telemetry for monitoring OMT success rate in HttpChannelChild. https://reviewboard.mozilla.org/r/132242/#review136290 ::: toolkit/components/telemetry/Histograms.json:2169 (Diff revision 1) > }, > + "HTTP_CHILD_OMT_STATS": { > + "alert_emails": ["necko@mozilla.com"], > + "bug_numbers": [1357682], > + "expires_in_version": "never", > + "kind": "enumerated", This should use a categorical histogram rather than enumerated: that would also avoid having magic numbers above.

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Comment 11

•

8 years ago

Trying to answer these questions by myself first, @mayhemer you may correct me if anything wrong. (In reply to Benjamin Smedberg [:bsmedberg] from comment #9) > I'm not concerned about collecting this data. I am concerned that I don't > see how this data is going to map to actions. > > Some questions: > * who is going to monitor this data long-term? Necko team will monitor this datax > * Is there also automation testing for this? If image requests in the > absence of addons should always use the OMT path, can we assert that using > an automation framework? By the natural of nsITraceableChannel interface that HTTPChannel provides it'll be hard to make a strong assertion in automation framework. It really depends on how our code base use it, right now devtools have two usecases (network monitor / throttling) that will disable the OMT. In addition bug 1357678 indicates our HTTP implementation might disabled OMT under certain scenario as well, so we cannot turn on such kind of assertion before these issues are fixed. Bug 1326299 is filed for investigating how we can > * Is there anything other than an addon which would make it ok for these to > be main thread? If so can we focus this on the addon case, and record the > actual addon ID that is causing this? Agreed that monitoring addon is something worth to do, however this telemetry is more about monitoring if our own code breaks OMT. As for monitoring addon that breaks OMT, I guess it can be done in WebRequest web extension API after we deprecate other types of addon. > > The opt-in/opt-out question is mainly about whether you need to collect this > data from release users. opt-in is the default and that collects from > prerelease only (nightly-beta). opt-out gives you a lot more coverage for > release users, which may be necessary for addon correlations but requires > more careful design of how you aggregate/monitor the data. For the purpose of monitoring if our new code breaks OMT, I think opt-in should be enough.

Flags: needinfo?(schien) → needinfo?(honzab.moz)

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Updated

•

8 years ago

Whiteboard: [necko-next] → [necko-next][PBg-HTTP-M2]

Comment hidden (mozreview-request)

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Updated

•

8 years ago

Assignee: nobody → schien

Shian-Yow Wu [:swu]

Updated

•

8 years ago

Whiteboard: [necko-next][PBg-HTTP-M2] → [necko-active][PBg-HTTP-M2]

Honza Bambas (:mayhemer)

Comment 13

•

8 years ago

Is there anything else that could break OMT after we reach the 'LABELS_HTTP_CHILD_OMT_STATS::success' point? If you want to check our code then probably an assertion would be more suitable. But I can hardly thing of a reliable condition when the decoders or parsers should run off the main thread. It's all so generic that finding one might be impossible. OTOH, have you tried? We depend on few bugs to fix first (converters for instance), but having a hard MOZ_ASSET(!MainThread) would be good to try anyway.

Flags: needinfo?(honzab.moz)

Benjamin Smedberg

Comment 14

•

8 years ago

> > Some questions: > > * who is going to monitor this data long-term? > Necko team will monitor this datax I'm sorry, I need more specifics. * Is somebody going to build an automated alerting system that sends email to the necko team? * Or is somebody in charge of looking at a dashboard periodically? If so, which person exactly is taking on that responsibility? * Do you have specific thresholds or techniques you're going to use to determine whether there was a regression in Firefox versus say, a popular addon changing their behavior or more people using devtools? As before, I don't have a problem with you collecting this data, but I need the plan for how you're going to use the data to be more concrete and less fuzzy. In general, for expires: never measurements we ask that there be an automated test to make sure that the probes don't break by accident. Is that possible for this measurement? If you'd like to start out experimenting with this and avoid writing up an analysis/monitoring plan, you can make it expire in 61 now and come back with a more definite plan later.

Benjamin Smedberg

Updated

•

8 years ago

Flags: needinfo?(schien)

Benjamin Smedberg

Updated

•

8 years ago

Attachment #8860239 - Flags: review?(benjamin)

Francis Lee [:frlee]

Updated

•

8 years ago

Whiteboard: [necko-active][PBg-HTTP-M2] → [necko-active][PBg-HTTP-M3]

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Comment 15

•

7 years ago

Sorry for the late response. After a long thinking process, I would like to take the first version as an experiment to understand how HTTP OMT is being used in user environment. I will record the success and fail reason along with relevant channel attributes, e.g. mime type, so that we can have further information on what source of HTTP request is not utilizing the OMT mechanism. We can then remove more road blockers before we are able to add strong assertion in automated tests.

Flags: needinfo?(schien)

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Comment 16

•

7 years ago

(In reply to Honza Bambas (:mayhemer) from comment #13) > > OTOH, have you tried? We depend on few bugs to fix first (converters for > instance), but having a hard MOZ_ASSET(!MainThread) would be good to try > anyway. Try MOZ_ASSERT for the retargetability of html5 parser and image decoder. https://treeherder.mozilla.org/#/jobs?repo=try&revision=a54e5284e1c8132b0b1cfef70683eb703abfb993 From the try server result, there are just too many combination that can break the HTTP OMT. I have to say that we are still too far away from using assertion to check the HTTP OMT availability.

Comment hidden (mozreview-request)

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Updated

•

7 years ago

Attachment #8860239 - Attachment is obsolete: true

Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz)

Assignee

Comment 19

•

7 years ago

I'd like to use this telemetry as an experiment to figure out the OMT success rate in user environment. I use content policy as the key to separate the success rate between different use cases. Based on my local experiment, this telemetry can easily tells that our JS script loader, CSS loader, and XHR are not leverage HTTP OMT.

Patrick McManus [:mcmanus]

Comment 20

•

7 years ago

mozreview-review

Comment on attachment 8885220 [details] Bug 1357682 - Part 2, add telemetry for HTTPChannelChild OMT success/fail rate and reason. data-r=bsmedberg https://reviewboard.mozilla.org/r/156102/#review161208

Attachment #8885220 - Flags: review?(mcmanus) → review+

Benjamin Smedberg

Comment 21

•

7 years ago

mozreview-review

Comment on attachment 8885220 [details] Bug 1357682 - Part 2, add telemetry for HTTPChannelChild OMT success/fail rate and reason. data-r=bsmedberg https://reviewboard.mozilla.org/r/156102/#review161394 data-r=me

Attachment #8885220 - Flags: review?(benjamin) → review+

Georg Fritzsche [:gfritzsche]

Comment 22

•

7 years ago

mozreview-review

Comment on attachment 8885219 [details] Bug 1357682 - Part 1, helper function for keyed categorical histogram. https://reviewboard.mozilla.org/r/156100/#review161600 ::: toolkit/components/telemetry/Telemetry.h:116 (Diff revision 1) > Accumulate(static_cast<HistogramID>(CategoricalLabelId<E>::value), > static_cast<uint32_t>(enumValue)); > }; > > /** > * Adds sample to a categorical histogram defined in TelemetryHistogramEnums.h Nit: "... to a keyed categorical histogram"

Attachment #8885219 - Flags: review?(gfritzsche) → review+

Georg Fritzsche [:gfritzsche]

Comment 23

•

7 years ago

Alessio, are we able to test `AccumulateCategorical()` from our C++ tests? Should we file a mentored bug on this?

Flags: needinfo?(alessio.placitelli)

Alessio Placitelli [:Dexter]

Comment 24

•

7 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #23) > Alessio, are we able to test `AccumulateCategorical()` from our C++ tests? > Should we file a mentored bug on this? We should be able to do that, there's already a mentored bug filed for this: bug 1351274.

Flags: needinfo?(alessio.placitelli)

Comment hidden (mozreview-request)

Pulsebot

Comment 27

•

7 years ago

Pushed by schien@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/44e0ece6ec05 Part 1, helper function for keyed categorical histogram. r=gfritzsche https://hg.mozilla.org/integration/autoland/rev/e60029998612 Part 2, add telemetry for HTTPChannelChild OMT success/fail rate and reason. r=bsmedberg,mcmanus data-r=bsmedberg

Ryan VanderMeulen [:RyanVM]

Comment 28

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/44e0ece6ec05 https://hg.mozilla.org/mozilla-central/rev/e60029998612

Status: NEW → RESOLVED

Closed: 7 years ago

status-firefox56: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla56

Jason Duell

Reporter

Updated

•

6 years ago

Flags: needinfo?(jduell.mcbugs)

Bug 1357682 - telemetry for monitoring OMT success rate in HttpChannelChild. 8 years ago Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) 59 bytes, text/x-review-board-request	jduell.mcbugs : review+	Details
Bug 1357682 - Part 1, helper function for keyed categorical histogram. 7 years ago Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) 59 bytes, text/x-review-board-request	gfritzsche : review+	Details
Bug 1357682 - Part 2, add telemetry for HTTPChannelChild OMT success/fail rate and reason. data-r=bsmedberg 7 years ago Shih-Chiang Chien [:schien] (UTC+8) (use ni? plz) 59 bytes, text/x-review-board-request	mcmanus : review+	Details