Closed Bug 1497365 Opened 7 years ago Closed 7 years ago

Prio-encoded telemetry runs after histograms are cleared following subsession

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla64

Tracking Flags:

Tracking

Status

firefox64

---

verified

People

(Reporter: amiyaguchi, Assigned: rhelmer)

References

Details

Attachments

(3 files)

prio-aggregates by payload.info.reason 7 years ago Anthony Miyaguchi [:amiyaguchi] 4.50 KB, text/plain		Details
Bug 1497365 - send Prio-encoded data for subsessions r?gfritzsche,janerik 7 years ago Robert Helmer [:rhelmer] 46 bytes, text/x-phabricator-request		Details \| Review
verified-fixed prio-aggregates by payload.info.reason 7 years ago Anthony Miyaguchi [:amiyaguchi] 3.05 KB, text/plain		Details

Anthony Miyaguchi [:amiyaguchi]

Reporter

Description

•

7 years ago

The values for prio aggregates look like they are being under-counted. > prio_control prio_observed prio_diff prio_null_data prio_invalid > build_id > 20181007220217 [3, 0, 0] [0, 0, 0] [3, 0, 0] 0 0 > 20181007100131 [21952, 2001, 1] [13627, 1282, 0] [8325, 719, 1] 0 0 > 20181007044220 [15, 1, 0] [7, 0, 0] [8, 1, 0] 0 0 > 20181006220113 [29935, 3805, 3] [18229, 2616, 1] [11706, 1189, 2] 0 0 > 20181006100304 [20079, 4912, 2] [11407, 3099, 0] [8672, 1813, 2] 0 0 > 20181006014322 [12, 2, 0] [10, 1, 0] [2, 1, 0] 0 0 > 20181005220146 [6907, 2945, 1] [3122, 1695, 0] [3785, 1250, 1] 0 0 This is consistent across operating systems. The difference between the control and observed values should be 0. The code to generate this table can be found in bug 1465252 or (https://github.com/acmiyaguchi/python-libprio/pull/23).

Jan-Erik Rediger [:janerik]

Comment 1

•

7 years ago

:rhelmer implemented the prio encoder on the client-side. Maybe he can take a look.

Flags: needinfo?(rhelmer)

Robert Helmer [:rhelmer]

Assignee

Comment 2

•

7 years ago

Anthony and I are working through it now, thanks for the heads-up though! To help troubleshoot, I've created a simpler stand-alone bit of privileged JS (intended to be run from xpcshell) to use as a test case for this: https://gist.github.com/rhelmer/84a1782b10344e52038908cda4d6fb3f This generates all possible permutations of the three bits we are interested in, and produces separate JSON samples separated by newline: https://gist.github.com/rhelmer/76afe512434c8bafa19b3701bb529faf

Flags: needinfo?(rhelmer)

Robert Helmer [:rhelmer]

Assignee

Comment 3

•

7 years ago

(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #0) > The values for prio aggregates look like they are being under-counted. > > > prio_control prio_observed prio_diff prio_null_data prio_invalid > > build_id > > 20181007220217 [3, 0, 0] [0, 0, 0] [3, 0, 0] 0 0 > > 20181007100131 [21952, 2001, 1] [13627, 1282, 0] [8325, 719, 1] 0 0 > > 20181007044220 [15, 1, 0] [7, 0, 0] [8, 1, 0] 0 0 > > 20181006220113 [29935, 3805, 3] [18229, 2616, 1] [11706, 1189, 2] 0 0 > > 20181006100304 [20079, 4912, 2] [11407, 3099, 0] [8672, 1813, 2] 0 0 > > 20181006014322 [12, 2, 0] [10, 1, 0] [2, 1, 0] 0 0 > > 20181005220146 [6907, 2945, 1] [3122, 1695, 0] [3785, 1250, 1] 0 0 > > This is consistent across operating systems. The difference between the > control and observed values should be 0. The code to generate this table can > be found in bug 1465252 or > (https://github.com/acmiyaguchi/python-libprio/pull/23). Here is what I see when using the same analysis tool (https://github.com/acmiyaguchi/python-libprio/blob/validation/examples/browser-validation/main.py) but using the data from comment 2: prio_control prio_observed prio_diff prio_null_data prio_invalid count build_id 1234 [400, 400, 400] [400, 400, 400] [0, 0, 0] 0 0 800

Anthony Miyaguchi [:amiyaguchi]

Reporter

Comment 4

•

7 years ago

Attached file prio-aggregates by payload.info.reason — Details

Anthony Miyaguchi [:amiyaguchi]

Reporter

Comment 5

•

7 years ago

It looks like the aggregates aren't being collected when the subsession-split reason is "environment-change" or "daily". It works as expected on shutdown. This seems like unintended behavior. This should either be documented clearly, or attached to all of the reasons correctly.

Robert Helmer [:rhelmer]

Assignee

Comment 6

•

7 years ago

(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #5) > It looks like the aggregates aren't being collected when the > subsession-split reason is "environment-change" or "daily". It works as > expected on shutdown. > > This seems like unintended behavior. This should either be documented > clearly, or attached to all of the reasons correctly. So it looks like Prio is working correctly for "shutdown" reason codes, but it looks like `clearSubsession` is set to `true` for the other types which is passed to https://searchfox.org/mozilla-central/rev/80ac71c1c54af788b32e851192dfd2de2ec18e18/toolkit/components/telemetry/pings/TelemetrySession.jsm#741 It looks like this code causes histograms to be cleared when snapshot is called, which happens before `this._prioEncode()` in that function. Could you suggest how we should deal with this? Thanks!

Flags: needinfo?(jrediger)

Flags: needinfo?(gfritzsche)

Robert Helmer [:rhelmer]

Assignee

Updated

•

7 years ago

Summary: Prio-encoded telemetry is under-counted in final aggregates → Prio-encoded telemetry runs after histograms are cleared following subsession

Georg Fritzsche [:gfritzsche]

Comment 7

•

7 years ago

Right, this doesn't work out. You'll need to pass `measurements` or the `payloadObj` to `_prioEncode()` to work from the "data collected until now" here. https://searchfox.org/mozilla-central/rev/80ac71c1c54af788b32e851192dfd2de2ec18e18/toolkit/components/telemetry/pings/TelemetrySession.jsm#1175 Can we line up some test runs through typical scenarios for this? Is it possible to add test coverage here?

Flags: needinfo?(gfritzsche)

Jan-Erik Rediger [:janerik]

Updated

•

7 years ago

Flags: needinfo?(jrediger)

Robert Helmer [:rhelmer]

Assignee

Comment 8

•

7 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #7) > Right, this doesn't work out. > You'll need to pass `measurements` or the `payloadObj` to `_prioEncode()` to > work from the "data collected until now" here. > https://searchfox.org/mozilla-central/rev/ > 80ac71c1c54af788b32e851192dfd2de2ec18e18/toolkit/components/telemetry/pings/ > TelemetrySession.jsm#1175 Thanks! I'll work up a patch for this. > Can we line up some test runs through typical scenarios for this? > Is it possible to add test coverage here? I think testing that the prio key is not null as the existing tests do will not be quite enough, since the problem is that it's running after histograms are cleared, so it gets called fine and the `prio` key is populated, but it's too late and all histograms are 0. Maybe it'd be best if I provide a way to mock `PrioEncoder.encode()`, then after subsession splits we can ensure that it was passed the correct data? For reference we do have a very complete test for `PrioEncoder.encode()` itself, which does the decryption and aggregate comparison and all that in a similar way to the Telemetry analysis tools: https://searchfox.org/mozilla-central/source/dom/prio/test/gtest/TestPrioEncoder.cpp#49 We *could* expose enough of those test methods to JS to do a complete Prio encode+decode+aggregation from Telemetry tests.. this might be worth doing for other reasons too, but I'd like to get something in place sooner so we can make it in time for the Beta merge :) If you're interested in this I could file a followup bug though. Oh, while I have you here - are pings with the aborted-session reason code worth keeping, or should we filter these out in analysis? The Prio measurements for shutdown are spot-on, but they are a tiny bit off for aborted-session, and we're not too sure why yet.

Flags: needinfo?(gfritzsche)

Georg Fritzsche [:gfritzsche]

Comment 9

•

7 years ago

(In reply to Robert Helmer [:rhelmer] from comment #8) > > Can we line up some test runs through typical scenarios for this? > > Is it possible to add test coverage here? > > I think testing that the prio key is not null as the existing tests do will > not be quite enough, since the problem is that it's running after histograms > are cleared, so it gets called fine and the `prio` key is populated, but > it's too late and all histograms are 0. > > Maybe it'd be best if I provide a way to mock `PrioEncoder.encode()`, then > after subsession splits we can ensure that it was passed the correct data? Starting now with a mock sounds good (e.g. just the identity?). Maybe a follow-up on better testing if that's realistic to happen. A good mocking example should be `Policy.now()`: https://searchfox.org/mozilla-central/rev/80ac71c1c54af788b32e851192dfd2de2ec18e18/toolkit/components/telemetry/pings/TelemetrySession.jsm#121 > Oh, while I have you here - are pings with the aborted-session reason code > worth keeping, or should we filter these out in analysis? The Prio > measurements for shutdown are spot-on, but they are a tiny bit off for > aborted-session, and we're not too sure why yet. chutten, do you have input here?

Flags: needinfo?(gfritzsche) → needinfo?(chutten)

Chris H-C :chutten

Comment 10

•

7 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #9) > > Oh, while I have you here - are pings with the aborted-session reason code > > worth keeping, or should we filter these out in analysis? The Prio > > measurements for shutdown are spot-on, but they are a tiny bit off for > > aborted-session, and we're not too sure why yet. > > chutten, do you have input here? Not much more than just highlighting what aborted-session means: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/main-ping.html It's saved every 5min to disk and if shutdown goes wrong we send after next startup. I'm not sure what "tiny bit off" means, but perhaps there's something in its unique behaviour that accounts for it. Could you explain a bit about what "tiny bit off" means in this instance?

Flags: needinfo?(chutten)

Robert Helmer [:rhelmer]

Assignee

Comment 11

•

7 years ago

(In reply to Chris H-C :chutten from comment #10) > (In reply to Georg Fritzsche [:gfritzsche] from comment #9) > > > Oh, while I have you here - are pings with the aborted-session reason code > > > worth keeping, or should we filter these out in analysis? The Prio > > > measurements for shutdown are spot-on, but they are a tiny bit off for > > > aborted-session, and we're not too sure why yet. > > > > chutten, do you have input here? > > Not much more than just highlighting what aborted-session means: > https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/ > telemetry/data/main-ping.html > > It's saved every 5min to disk and if shutdown goes wrong we send after next > startup. I'm not sure what "tiny bit off" means, but perhaps there's > something in its unique behaviour that accounts for it. Could you explain a > bit about what "tiny bit off" means in this instance? :amiyaguchi broke down the results we're seeing by reason in attachment 9015743 [details] - you can see that the difference between prio_control and prio_observed cols (reflected in prio_diff) are pretty close for aborted-session, completely zeroed out for daily/environment-change (will fix in this bug!), and shutdown matches exactly.

Robert Helmer [:rhelmer]

Assignee

Updated

•

7 years ago

Assignee: nobody → rhelmer

Status: NEW → ASSIGNED

Chris H-C :chutten

Comment 12

•

7 years ago

*reads a bunch of code* Ohhh, I see a mechanism where this could happen. In some instances we reschedule pings if pings would be sent or generated too close together. (see TelemetrySession's reschedulePings). In that case we save the environment-change payload as the aborted-session payload. Since we have a problem with environment-change, that could result in the undercounting we're seeing. If this is the case we can expect the environment-change payload fix to take care of aborted-session as well.

Robert Helmer [:rhelmer]

Assignee

Comment 13

•

7 years ago

Attached file Bug 1497365 - send Prio-encoded data for subsessions r?gfritzsche,janerik — Details

Jan-Erik Rediger [:janerik]

Comment 14

•

7 years ago

Set priority so it shows up in our bug tracking overview (P1=it's being worked on).

Priority: -- → P1

Robert Helmer [:rhelmer]

Assignee

Comment 15

•

7 years ago

Thanks for reviewing! We're a bit stuck on verifying that Prio-encoded Telemetry totally works from the analysis side, until this is in... I'd like to make the beta merge for sure, but if this looks correct to you and there's any part we can defer (like ideas for more complete testing etc) I'd appreciate it. I am 100% fine with making sure that this has sufficient test coverage to ensure we're handling subsessions correctly of course.

Flags: needinfo?(jrediger)

Jan-Erik Rediger [:janerik]

Comment 16

•

7 years ago

I'll give it a look today. Sorry for the delay.

Flags: needinfo?(jrediger)

Pulsebot

Comment 17

•

7 years ago

Pushed by rhelmer@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/45a7ec8b7021 send Prio-encoded data for subsessions r=janerik

Robert Helmer [:rhelmer]

Assignee

Comment 18

•

7 years ago

(In reply to Jan-Erik Rediger [:janerik] from comment #16) > I'll give it a look today. Sorry for the delay. No worries, thanks for jumping on it I appreciate it!

Andrei Ciure[:aciure]

Comment 19

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/45a7ec8b7021

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

status-firefox64: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla64

Anthony Miyaguchi [:amiyaguchi]

Reporter

Comment 20

•

7 years ago

Attached file verified-fixed prio-aggregates by payload.info.reason — Details

Anthony Miyaguchi [:amiyaguchi]

Reporter

Updated

•

7 years ago

Status: RESOLVED → VERIFIED