Closed
Bug 1497365
Opened 7 years ago
Closed 7 years ago
Prio-encoded telemetry runs after histograms are cleared following subsession
Categories
(Toolkit :: Telemetry, defect, P1)
Toolkit
Telemetry
Tracking
()
VERIFIED
FIXED
mozilla64
Tracking | Status | |
---|---|---|
firefox64 | --- | verified |
People
(Reporter: amiyaguchi, Assigned: rhelmer)
References
Details
Attachments
(3 files)
The values for prio aggregates look like they are being under-counted.
> prio_control prio_observed prio_diff prio_null_data prio_invalid
> build_id
> 20181007220217 [3, 0, 0] [0, 0, 0] [3, 0, 0] 0 0
> 20181007100131 [21952, 2001, 1] [13627, 1282, 0] [8325, 719, 1] 0 0
> 20181007044220 [15, 1, 0] [7, 0, 0] [8, 1, 0] 0 0
> 20181006220113 [29935, 3805, 3] [18229, 2616, 1] [11706, 1189, 2] 0 0
> 20181006100304 [20079, 4912, 2] [11407, 3099, 0] [8672, 1813, 2] 0 0
> 20181006014322 [12, 2, 0] [10, 1, 0] [2, 1, 0] 0 0
> 20181005220146 [6907, 2945, 1] [3122, 1695, 0] [3785, 1250, 1] 0 0
This is consistent across operating systems. The difference between the control and observed values should be 0. The code to generate this table can be found in bug 1465252 or (https://github.com/acmiyaguchi/python-libprio/pull/23).
Comment 1•7 years ago
|
||
:rhelmer implemented the prio encoder on the client-side. Maybe he can take a look.
Flags: needinfo?(rhelmer)
Assignee | ||
Comment 2•7 years ago
|
||
Anthony and I are working through it now, thanks for the heads-up though!
To help troubleshoot, I've created a simpler stand-alone bit of privileged JS (intended to be run from xpcshell) to use as a test case for this:
https://gist.github.com/rhelmer/84a1782b10344e52038908cda4d6fb3f
This generates all possible permutations of the three bits we are interested in, and produces separate JSON samples separated by newline:
https://gist.github.com/rhelmer/76afe512434c8bafa19b3701bb529faf
Flags: needinfo?(rhelmer)
Assignee | ||
Comment 3•7 years ago
|
||
(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #0)
> The values for prio aggregates look like they are being under-counted.
>
> > prio_control prio_observed prio_diff prio_null_data prio_invalid
> > build_id
> > 20181007220217 [3, 0, 0] [0, 0, 0] [3, 0, 0] 0 0
> > 20181007100131 [21952, 2001, 1] [13627, 1282, 0] [8325, 719, 1] 0 0
> > 20181007044220 [15, 1, 0] [7, 0, 0] [8, 1, 0] 0 0
> > 20181006220113 [29935, 3805, 3] [18229, 2616, 1] [11706, 1189, 2] 0 0
> > 20181006100304 [20079, 4912, 2] [11407, 3099, 0] [8672, 1813, 2] 0 0
> > 20181006014322 [12, 2, 0] [10, 1, 0] [2, 1, 0] 0 0
> > 20181005220146 [6907, 2945, 1] [3122, 1695, 0] [3785, 1250, 1] 0 0
>
> This is consistent across operating systems. The difference between the
> control and observed values should be 0. The code to generate this table can
> be found in bug 1465252 or
> (https://github.com/acmiyaguchi/python-libprio/pull/23).
Here is what I see when using the same analysis tool (https://github.com/acmiyaguchi/python-libprio/blob/validation/examples/browser-validation/main.py) but using the data from comment 2:
prio_control prio_observed prio_diff prio_null_data prio_invalid count
build_id
1234 [400, 400, 400] [400, 400, 400] [0, 0, 0] 0 0 800
Reporter | ||
Comment 4•7 years ago
|
||
Reporter | ||
Comment 5•7 years ago
|
||
It looks like the aggregates aren't being collected when the subsession-split reason is "environment-change" or "daily". It works as expected on shutdown.
This seems like unintended behavior. This should either be documented clearly, or attached to all of the reasons correctly.
Assignee | ||
Comment 6•7 years ago
|
||
(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #5)
> It looks like the aggregates aren't being collected when the
> subsession-split reason is "environment-change" or "daily". It works as
> expected on shutdown.
>
> This seems like unintended behavior. This should either be documented
> clearly, or attached to all of the reasons correctly.
So it looks like Prio is working correctly for "shutdown" reason codes, but it looks like `clearSubsession` is set to `true` for the other types which is passed to https://searchfox.org/mozilla-central/rev/80ac71c1c54af788b32e851192dfd2de2ec18e18/toolkit/components/telemetry/pings/TelemetrySession.jsm#741
It looks like this code causes histograms to be cleared when snapshot is called, which happens before `this._prioEncode()` in that function.
Could you suggest how we should deal with this? Thanks!
Flags: needinfo?(jrediger)
Flags: needinfo?(gfritzsche)
Assignee | ||
Updated•7 years ago
|
Summary: Prio-encoded telemetry is under-counted in final aggregates → Prio-encoded telemetry runs after histograms are cleared following subsession
Comment 7•7 years ago
|
||
Right, this doesn't work out.
You'll need to pass `measurements` or the `payloadObj` to `_prioEncode()` to work from the "data collected until now" here.
https://searchfox.org/mozilla-central/rev/80ac71c1c54af788b32e851192dfd2de2ec18e18/toolkit/components/telemetry/pings/TelemetrySession.jsm#1175
Can we line up some test runs through typical scenarios for this?
Is it possible to add test coverage here?
Flags: needinfo?(gfritzsche)
Updated•7 years ago
|
Flags: needinfo?(jrediger)
Assignee | ||
Comment 8•7 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #7)
> Right, this doesn't work out.
> You'll need to pass `measurements` or the `payloadObj` to `_prioEncode()` to
> work from the "data collected until now" here.
> https://searchfox.org/mozilla-central/rev/
> 80ac71c1c54af788b32e851192dfd2de2ec18e18/toolkit/components/telemetry/pings/
> TelemetrySession.jsm#1175
Thanks! I'll work up a patch for this.
> Can we line up some test runs through typical scenarios for this?
> Is it possible to add test coverage here?
I think testing that the prio key is not null as the existing tests do will not be quite enough, since the problem is that it's running after histograms are cleared, so it gets called fine and the `prio` key is populated, but it's too late and all histograms are 0.
Maybe it'd be best if I provide a way to mock `PrioEncoder.encode()`, then after subsession splits we can ensure that it was passed the correct data? For reference we do have a very complete test for `PrioEncoder.encode()` itself, which does the decryption and aggregate comparison and all that in a similar way to the Telemetry analysis tools:
https://searchfox.org/mozilla-central/source/dom/prio/test/gtest/TestPrioEncoder.cpp#49
We *could* expose enough of those test methods to JS to do a complete Prio encode+decode+aggregation from Telemetry tests.. this might be worth doing for other reasons too, but I'd like to get something in place sooner so we can make it in time for the Beta merge :) If you're interested in this I could file a followup bug though.
Oh, while I have you here - are pings with the aborted-session reason code worth keeping, or should we filter these out in analysis? The Prio measurements for shutdown are spot-on, but they are a tiny bit off for aborted-session, and we're not too sure why yet.
Flags: needinfo?(gfritzsche)
Comment 9•7 years ago
|
||
(In reply to Robert Helmer [:rhelmer] from comment #8)
> > Can we line up some test runs through typical scenarios for this?
> > Is it possible to add test coverage here?
>
> I think testing that the prio key is not null as the existing tests do will
> not be quite enough, since the problem is that it's running after histograms
> are cleared, so it gets called fine and the `prio` key is populated, but
> it's too late and all histograms are 0.
>
> Maybe it'd be best if I provide a way to mock `PrioEncoder.encode()`, then
> after subsession splits we can ensure that it was passed the correct data?
Starting now with a mock sounds good (e.g. just the identity?).
Maybe a follow-up on better testing if that's realistic to happen.
A good mocking example should be `Policy.now()`:
https://searchfox.org/mozilla-central/rev/80ac71c1c54af788b32e851192dfd2de2ec18e18/toolkit/components/telemetry/pings/TelemetrySession.jsm#121
> Oh, while I have you here - are pings with the aborted-session reason code
> worth keeping, or should we filter these out in analysis? The Prio
> measurements for shutdown are spot-on, but they are a tiny bit off for
> aborted-session, and we're not too sure why yet.
chutten, do you have input here?
Flags: needinfo?(gfritzsche) → needinfo?(chutten)
Comment 10•7 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> > Oh, while I have you here - are pings with the aborted-session reason code
> > worth keeping, or should we filter these out in analysis? The Prio
> > measurements for shutdown are spot-on, but they are a tiny bit off for
> > aborted-session, and we're not too sure why yet.
>
> chutten, do you have input here?
Not much more than just highlighting what aborted-session means: https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/data/main-ping.html
It's saved every 5min to disk and if shutdown goes wrong we send after next startup. I'm not sure what "tiny bit off" means, but perhaps there's something in its unique behaviour that accounts for it. Could you explain a bit about what "tiny bit off" means in this instance?
Flags: needinfo?(chutten)
Assignee | ||
Comment 11•7 years ago
|
||
(In reply to Chris H-C :chutten from comment #10)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> > > Oh, while I have you here - are pings with the aborted-session reason code
> > > worth keeping, or should we filter these out in analysis? The Prio
> > > measurements for shutdown are spot-on, but they are a tiny bit off for
> > > aborted-session, and we're not too sure why yet.
> >
> > chutten, do you have input here?
>
> Not much more than just highlighting what aborted-session means:
> https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/
> telemetry/data/main-ping.html
>
> It's saved every 5min to disk and if shutdown goes wrong we send after next
> startup. I'm not sure what "tiny bit off" means, but perhaps there's
> something in its unique behaviour that accounts for it. Could you explain a
> bit about what "tiny bit off" means in this instance?
:amiyaguchi broke down the results we're seeing by reason in attachment 9015743 [details] - you can see that the difference between prio_control and prio_observed cols (reflected in prio_diff) are pretty close for aborted-session, completely zeroed out for daily/environment-change (will fix in this bug!), and shutdown matches exactly.
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → rhelmer
Status: NEW → ASSIGNED
Comment 12•7 years ago
|
||
*reads a bunch of code*
Ohhh, I see a mechanism where this could happen. In some instances we reschedule pings if pings would be sent or generated too close together. (see TelemetrySession's reschedulePings). In that case we save the environment-change payload as the aborted-session payload. Since we have a problem with environment-change, that could result in the undercounting we're seeing.
If this is the case we can expect the environment-change payload fix to take care of aborted-session as well.
Assignee | ||
Comment 13•7 years ago
|
||
Comment 14•7 years ago
|
||
Set priority so it shows up in our bug tracking overview (P1=it's being worked on).
Priority: -- → P1
Assignee | ||
Comment 15•7 years ago
|
||
Thanks for reviewing! We're a bit stuck on verifying that Prio-encoded Telemetry totally works from the analysis side, until this is in... I'd like to make the beta merge for sure, but if this looks correct to you and there's any part we can defer (like ideas for more complete testing etc) I'd appreciate it.
I am 100% fine with making sure that this has sufficient test coverage to ensure we're handling subsessions correctly of course.
Flags: needinfo?(jrediger)
Comment 17•7 years ago
|
||
Pushed by rhelmer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/45a7ec8b7021
send Prio-encoded data for subsessions r=janerik
Assignee | ||
Comment 18•7 years ago
|
||
(In reply to Jan-Erik Rediger [:janerik] from comment #16)
> I'll give it a look today. Sorry for the delay.
No worries, thanks for jumping on it I appreciate it!
Comment 19•7 years ago
|
||
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox64:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Reporter | ||
Comment 20•7 years ago
|
||
Reporter | ||
Updated•7 years ago
|
Status: RESOLVED → VERIFIED
Comment 21•7 years ago
|
||
Updating flags based on Comment 20.
You need to log in
before you can comment on or make changes to this bug.
Description
•