Closed Bug 1497365 Opened 7 years ago Closed 7 years ago

Prio-encoded telemetry runs after histograms are cleared following subsession

Categories

(Toolkit :: Telemetry, defect, P1)

defect

Tracking

()

VERIFIED FIXED
mozilla64
Tracking Status
firefox64 --- verified

People

(Reporter: amiyaguchi, Assigned: rhelmer)

References

Details

Attachments

(3 files)

The values for prio aggregates look like they are being under-counted. > prio_control prio_observed prio_diff prio_null_data prio_invalid > build_id > 20181007220217 [3, 0, 0] [0, 0, 0] [3, 0, 0] 0 0 > 20181007100131 [21952, 2001, 1] [13627, 1282, 0] [8325, 719, 1] 0 0 > 20181007044220 [15, 1, 0] [7, 0, 0] [8, 1, 0] 0 0 > 20181006220113 [29935, 3805, 3] [18229, 2616, 1] [11706, 1189, 2] 0 0 > 20181006100304 [20079, 4912, 2] [11407, 3099, 0] [8672, 1813, 2] 0 0 > 20181006014322 [12, 2, 0] [10, 1, 0] [2, 1, 0] 0 0 > 20181005220146 [6907, 2945, 1] [3122, 1695, 0] [3785, 1250, 1] 0 0 This is consistent across operating systems. The difference between the control and observed values should be 0. The code to generate this table can be found in bug 1465252 or (https://github.com/acmiyaguchi/python-libprio/pull/23).
:rhelmer implemented the prio encoder on the client-side. Maybe he can take a look.
Flags: needinfo?(rhelmer)
Anthony and I are working through it now, thanks for the heads-up though! To help troubleshoot, I've created a simpler stand-alone bit of privileged JS (intended to be run from xpcshell) to use as a test case for this: https://gist.github.com/rhelmer/84a1782b10344e52038908cda4d6fb3f This generates all possible permutations of the three bits we are interested in, and produces separate JSON samples separated by newline: https://gist.github.com/rhelmer/76afe512434c8bafa19b3701bb529faf
Flags: needinfo?(rhelmer)
(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #0) > The values for prio aggregates look like they are being under-counted. > > > prio_control prio_observed prio_diff prio_null_data prio_invalid > > build_id > > 20181007220217 [3, 0, 0] [0, 0, 0] [3, 0, 0] 0 0 > > 20181007100131 [21952, 2001, 1] [13627, 1282, 0] [8325, 719, 1] 0 0 > > 20181007044220 [15, 1, 0] [7, 0, 0] [8, 1, 0] 0 0 > > 20181006220113 [29935, 3805, 3] [18229, 2616, 1] [11706, 1189, 2] 0 0 > > 20181006100304 [20079, 4912, 2] [11407, 3099, 0] [8672, 1813, 2] 0 0 > > 20181006014322 [12, 2, 0] [10, 1, 0] [2, 1, 0] 0 0 > > 20181005220146 [6907, 2945, 1] [3122, 1695, 0] [3785, 1250, 1] 0 0 > > This is consistent across operating systems. The difference between the > control and observed values should be 0. The code to generate this table can > be found in bug 1465252 or > (https://github.com/acmiyaguchi/python-libprio/pull/23). Here is what I see when using the same analysis tool (https://github.com/acmiyaguchi/python-libprio/blob/validation/examples/browser-validation/main.py) but using the data from comment 2: prio_control prio_observed prio_diff prio_null_data prio_invalid count build_id 1234 [400, 400, 400] [400, 400, 400] [0, 0, 0] 0 0 800
It looks like the aggregates aren't being collected when the subsession-split reason is "environment-change" or "daily". It works as expected on shutdown. This seems like unintended behavior. This should either be documented clearly, or attached to all of the reasons correctly.
(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #5) > It looks like the aggregates aren't being collected when the > subsession-split reason is "environment-change" or "daily". It works as > expected on shutdown. > > This seems like unintended behavior. This should either be documented > clearly, or attached to all of the reasons correctly. So it looks like Prio is working correctly for "shutdown" reason codes, but it looks like `clearSubsession` is set to `true` for the other types which is passed to https://searchfox.org/mozilla-central/rev/80ac71c1c54af788b32e851192dfd2de2ec18e18/toolkit/components/telemetry/pings/TelemetrySession.jsm#741 It looks like this code causes histograms to be cleared when snapshot is called, which happens before `this._prioEncode()` in that function. Could you suggest how we should deal with this? Thanks!
Flags: needinfo?(jrediger)
Flags: needinfo?(gfritzsche)
Summary: Prio-encoded telemetry is under-counted in final aggregates → Prio-encoded telemetry runs after histograms are cleared following subsession
Right, this doesn't work out. You'll need to pass `measurements` or the `payloadObj` to `_prioEncode()` to work from the "data collected until now" here. https://searchfox.org/mozilla-central/rev/80ac71c1c54af788b32e851192dfd2de2ec18e18/toolkit/components/telemetry/pings/TelemetrySession.jsm#1175 Can we line up some test runs through typical scenarios for this? Is it possible to add test coverage here?
Flags: needinfo?(gfritzsche)
Flags: needinfo?(jrediger)
(In reply to Georg Fritzsche [:gfritzsche] from comment #7) > Right, this doesn't work out. > You'll need to pass `measurements` or the `payloadObj` to `_prioEncode()` to > work from the "data collected until now" here. > https://searchfox.org/mozilla-central/rev/ > 80ac71c1c54af788b32e851192dfd2de2ec18e18/toolkit/components/telemetry/pings/ > TelemetrySession.jsm#1175 Thanks! I'll work up a patch for this. > Can we line up some test runs through typical scenarios for this? > Is it possible to add test coverage here? I think testing that the prio key is not null as the existing tests do will not be quite enough, since the problem is that it's running after histograms are cleared, so it gets called fine and the `prio` key is populated, but it's too late and all histograms are 0. Maybe it'd be best if I provide a way to mock `PrioEncoder.encode()`, then after subsession splits we can ensure that it was passed the correct data? For reference we do have a very complete test for `PrioEncoder.encode()` itself, which does the decryption and aggregate comparison and all that in a similar way to the Telemetry analysis tools: https://searchfox.org/mozilla-central/source/dom/prio/test/gtest/TestPrioEncoder.cpp#49 We *could* expose enough of those test methods to JS to do a complete Prio encode+decode+aggregation from Telemetry tests.. this might be worth doing for other reasons too, but I'd like to get something in place sooner so we can make it in time for the Beta merge :) If you're interested in this I could file a followup bug though. Oh, while I have you here - are pings with the aborted-session reason code worth keeping, or should we filter these out in analysis? The Prio measurements for shutdown are spot-on, but they are a tiny bit off for aborted-session, and we're not too sure why yet.
Flags: needinfo?(gfritzsche)
(In reply to Robert Helmer [:rhelmer] from comment #8) > > Can we line up some test runs through typical scenarios for this? > > Is it possible to add test coverage here? > > I think testing that the prio key is not null as the existing tests do will > not be quite enough, since the problem is that it's running after histograms > are cleared, so it gets called fine and the `prio` key is populated, but > it's too late and all histograms are 0. > > Maybe it'd be best if I provide a way to mock `PrioEncoder.encode()`, then > after subsession splits we can ensure that it was passed the correct data? Starting now with a mock sounds good (e.g. just the identity?). Maybe a follow-up on better testing if that's realistic to happen. A good mocking example should be `Policy.now()`: https://searchfox.org/mozilla-central/rev/80ac71c1c54af788b32e851192dfd2de2ec18e18/toolkit/components/telemetry/pings/TelemetrySession.jsm#121 > Oh, while I have you here - are pings with the aborted-session reason code > worth keeping, or should we filter these out in analysis? The Prio > measurements for shutdown are spot-on, but they are a tiny bit off for > aborted-session, and we're not too sure why yet. chutten, do you have input here?
Flags: needinfo?(gfritzsche) → needinfo?(chutten)
(In reply to Georg Fritzsche [:gfritzsche] from comment #9) > > Oh, while I have you here - are pings with the aborted-session reason code > > worth keeping, or should we filter these out in analysis? The Prio > > measurements for shutdown are spot-on, but they are a tiny bit off for > > aborted-session, and we're not too sure why yet. > > chutten, do you have input here? Not much more than just highlighting what aborted-session means: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/main-ping.html It's saved every 5min to disk and if shutdown goes wrong we send after next startup. I'm not sure what "tiny bit off" means, but perhaps there's something in its unique behaviour that accounts for it. Could you explain a bit about what "tiny bit off" means in this instance?
Flags: needinfo?(chutten)
(In reply to Chris H-C :chutten from comment #10) > (In reply to Georg Fritzsche [:gfritzsche] from comment #9) > > > Oh, while I have you here - are pings with the aborted-session reason code > > > worth keeping, or should we filter these out in analysis? The Prio > > > measurements for shutdown are spot-on, but they are a tiny bit off for > > > aborted-session, and we're not too sure why yet. > > > > chutten, do you have input here? > > Not much more than just highlighting what aborted-session means: > https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/ > telemetry/data/main-ping.html > > It's saved every 5min to disk and if shutdown goes wrong we send after next > startup. I'm not sure what "tiny bit off" means, but perhaps there's > something in its unique behaviour that accounts for it. Could you explain a > bit about what "tiny bit off" means in this instance? :amiyaguchi broke down the results we're seeing by reason in attachment 9015743 [details] - you can see that the difference between prio_control and prio_observed cols (reflected in prio_diff) are pretty close for aborted-session, completely zeroed out for daily/environment-change (will fix in this bug!), and shutdown matches exactly.
Assignee: nobody → rhelmer
Status: NEW → ASSIGNED
*reads a bunch of code* Ohhh, I see a mechanism where this could happen. In some instances we reschedule pings if pings would be sent or generated too close together. (see TelemetrySession's reschedulePings). In that case we save the environment-change payload as the aborted-session payload. Since we have a problem with environment-change, that could result in the undercounting we're seeing. If this is the case we can expect the environment-change payload fix to take care of aborted-session as well.
Set priority so it shows up in our bug tracking overview (P1=it's being worked on).
Priority: -- → P1
Thanks for reviewing! We're a bit stuck on verifying that Prio-encoded Telemetry totally works from the analysis side, until this is in... I'd like to make the beta merge for sure, but if this looks correct to you and there's any part we can defer (like ideas for more complete testing etc) I'd appreciate it. I am 100% fine with making sure that this has sufficient test coverage to ensure we're handling subsessions correctly of course.
Flags: needinfo?(jrediger)
I'll give it a look today. Sorry for the delay.
Flags: needinfo?(jrediger)
Pushed by rhelmer@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/45a7ec8b7021 send Prio-encoded data for subsessions r=janerik
(In reply to Jan-Erik Rediger [:janerik] from comment #16) > I'll give it a look today. Sorry for the delay. No worries, thanks for jumping on it I appreciate it!
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Status: RESOLVED → VERIFIED
Updating flags based on Comment 20.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: