Bug 1240522 - Generate low-latency "ADI per channel" data
Status: NEW
Product: Cloud Services
Classification: Client Software
Component: Metrics: Pipeline
P3 normal
Assigned To: Nobody; OK to take it and work on it
Depends on: 1250897
Blocks: 1146863 1246675
Reported: 2016-01-18 07:16 PST by Georg Fritzsche [:gfritzsche]
Modified: 2016-10-31 09:26 PDT
Points: 3


Attachments
Example data (1.25 KB, text/plain)
2016-01-18 07:42 PST, Georg Fritzsche [:gfritzsche]

User Story
From mail thread:

> My goal is relatively simple. I would like to have a clear picture of the usage of
> Firefox on the beta and release channels per version.
> 
> For example, I would like to have:
> * 42.0 - XXX users
> * 43.0 - YY users
> * 41.0.1 - ZZ users
> * 40.0 - 2 users
> 
> same for beta
> (42.0b9, 43.0b4, etc).
> This will help us when we start new builds for the partial generation
> (the binary diff between two versions).
> 
> I need an update of the numbers every hour (less often if that is expensive)
> and if the data could be fresh (~ 1 day), this would be perfect.

...
> I don't need any graph. I am just interested in the current numbers.
Description Georg Fritzsche [:gfritzsche] 2016-01-18 07:16:00 PST
Release management wants to see ADI/uptake per channel in a low-latency dashboard.
Comment 1 Georg Fritzsche [:gfritzsche] 2016-01-18 07:42:07 PST
Created attachment 8709057 [details]
Example data

Example of required data (without the desired beta version information above).
Comment 2 Mike Trinkala [:trink] 2016-01-26 08:42:48 PST
We can do this analysis in real time with a HyperLogLog per partition, if ~99% accuracy is acceptable, and we can easily refresh the output once a minute.
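
For reference, a minimal Python sketch of the HyperLogLog idea, with one estimator per (channel, version) partition. This is not the Heka/Lua code used in the pipeline; the class, the `observe` helper, the SHA-1 hashing, and the choice of p=14 (roughly 0.8% standard error) are all assumptions made for illustration.

```python
import hashlib
import math
from collections import defaultdict


class HyperLogLog:
    """Minimal HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=14):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant; this form is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Use the first 8 bytes of SHA-1 as a 64-bit hash of the item.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                 # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit within the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


# One estimator per partition; the (channel, version) key is an assumption.
partitions = defaultdict(HyperLogLog)


def observe(channel, version, client_id):
    partitions[(channel, version)].add(client_id)
```

Each partition only needs its register array, so memory stays constant no matter how many clients report, and seeing the same client id twice does not change the estimate.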
Comment 3 Georg Fritzsche [:gfritzsche] 2016-02-01 09:29:20 PST
Some questions that came up here and affect implementation:

What accuracy is required for the data? (see comment 2)

What ADI definition should we apply here exactly?
Count of unique clients seen on a channel in the last 24h (i.e. rolling 24h window)?
Or per calendar day?

There are subtleties here for how to account for clients:
(1) "we received Telemetry data from client now" vs.
(2) "that data was generated three days ago"
... but i think we can safely go with (1) for this use-case.
Comment 4 Sylvestre Ledru [:sylvestre] 2016-02-01 09:32:51 PST
> What accuracy is required for the data? (see comment 2)
I would be happy with a 90% accuracy :)

> What ADI definition should we apply here exactly?
What version of Firefox users have on their system?

> Count of unique clients seen on a channel in the last 24h (i.e. rolling 24h
> window)?
This one

> Or per calendar day?
> 
> There are subtleties here for how to account for clients:
> (1) "we received Telemetry data from client now" vs.
> (2) "that data was generated three days ago"
> ... but i think we can safely go with (1) for this use-case.
OK, I trust you :)
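
The definition settled on here (unique clients per channel in a rolling 24h window, keyed on when the ping was received) could look roughly like the sketch below. It counts exactly rather than with HyperLogLog to keep the windowing logic visible; the class name, method names, and timestamps in seconds are assumptions.

```python
from collections import defaultdict

WINDOW_SECONDS = 24 * 60 * 60


class RollingActiveClients:
    """Exact rolling-window count of unique clients per (channel, version)."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        # (channel, version) -> {client_id: last time a ping was received}
        self.last_seen = defaultdict(dict)

    def record(self, channel, version, client_id, received_at):
        # Option (1) from the discussion: use the server receive time,
        # not the time the data was generated on the client.
        seen = self.last_seen[(channel, version)]
        seen[client_id] = max(received_at, seen.get(client_id, received_at))

    def counts(self, now):
        cutoff = now - self.window
        result = {}
        for key, seen in self.last_seen.items():
            # Drop clients whose latest ping has fallen out of the window.
            for client_id in [c for c, t in seen.items() if t <= cutoff]:
                del seen[client_id]
            if seen:
                result[key] = len(seen)
        return result
```

One simplification to be aware of: a client that updates within the window is counted under both its old and its new version until the old entry expires.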
Comment 5 Georg Fritzsche [:gfritzsche] 2016-02-01 09:40:50 PST
(In reply to Mike Trinkala [:trink] from comment #2)
> We can do this analysis real time with a hyperloglog per partition if ~99%
> accuracy is acceptable and we can easily refresh the output once a minute.

Trink, it sounds like this is the ideal way forward then?
Do we have examples for that? Any other pointers?
Comment 6 Mike Trinkala [:trink] 2016-02-01 20:16:29 PST
This can be stripped down (since you don't need the graphs or the daily, weekly, and monthly rollups), and it uses HyperLogLog for ADI:

https://github.com/mozilla-services/data-pipeline/blob/b17a11805ae3666f5938a62d815204fc81c595f9/heka/sandbox/filters/firefox_active_instances.lua
Comment 7 Ben Hearsum (:bhearsum) 2016-02-09 05:42:23 PST
This looks very useful for bug 1246675 as well, which needs real-time ADI.
Comment 8 Georg Fritzsche [:gfritzsche] 2016-02-09 08:12:21 PST
I am actively working on this, but got stuck with the data-pipeline project not building locally on OS X.
This part is now sorted: https://github.com/mozilla-services/data-pipeline/pull/187
... so i can finally move on to prototyping this locally.
Comment 9 Georg Fritzsche [:gfritzsche] 2016-02-17 06:56:00 PST
(In reply to Ben Hearsum (:bhearsum) from comment #7)
> This looks very useful for bug 1246675 as well, which needs real-time ADI.

What maximum latency does this require? 1h, 5min, 1min, ...?
Comment 10 Ben Hearsum (:bhearsum) 2016-02-17 07:30:22 PST
(In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> (In reply to Ben Hearsum (:bhearsum) from comment #7)
> > This looks very useful for bug 1246675 as well, which needs real-time ADI.
> 
> What maximum latency does this require? 1h, 5min, 1min, ...?

15min, if possible (obviously the lower the better though).

This is based on a current rough uptake rate of ~800,000 installs/hour on the release channel (~200,000/15min), which gives us about a 1% margin of error when trying to hit 20,000,000 installs.
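
The arithmetic behind that estimate, spelled out as a small sketch (the function name is made up; the numbers are the ones quoted above):

```python
def uptake_per_refresh(installs_per_hour, refresh_minutes, target_installs):
    """Installs that can arrive between two refreshes, and that amount
    as a fraction of the install target we are trying to hit."""
    per_refresh = installs_per_hour * refresh_minutes / 60
    return per_refresh, per_refresh / target_installs


per_refresh, margin = uptake_per_refresh(800_000, 15, 20_000_000)
print(f"{per_refresh:,.0f} installs per refresh, {margin:.0%} of target")
# prints "200,000 installs per refresh, 1% of target"
```

With a 60 minute refresh the same calculation gives 4% of the target, which is why a lower latency is preferable here.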
Comment 11 Georg Fritzsche [:gfritzsche] 2016-02-18 06:52:56 PST
(In reply to Ben Hearsum (:bhearsum) from comment #10)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> > (In reply to Ben Hearsum (:bhearsum) from comment #7)
> > > This looks very useful for bug 1246675 as well, which needs real-time ADI.
> > 
> > What maximum latency does this require? 1h, 5min, 1min, ...?
> 
> 15min, if possible (obviously the lower the better though).
> 
> This is based on a current rough uptake rate of ~800,000 installs/hour on
> the release channel (~200,000/15min), which gives us about a 1% margin of
> error when trying to hit 20,000,000 installs.

Ok, for release-throttling you are hit by another factor:
After a fresh install or update, we currently don't send out a ping immediately.
Bug 1120370 & bug 1120372 are about sending out pings immediately in these cases.
Until we have those, we have an additional error margin from the reporting latency (for which we could run an analysis job to find the average/95th percentile/...).
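
The analysis job mentioned above could boil down to something like the following sketch. It assumes we already have, per client, the delay between the update or install and the time the server received the first ping from the new version; the function name, the nearest-rank percentile method, and the sample values are made up for illustration and are not real measurements.

```python
import statistics


def latency_summary(latencies_hours):
    """Return (mean, 95th percentile) of reporting latencies in hours."""
    ordered = sorted(latencies_hours)
    mean = statistics.mean(ordered)
    # Nearest-rank percentile: smallest value with at least 95% of the
    # samples at or below it. Integer arithmetic avoids float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return mean, ordered[rank - 1]


# Made-up sample latencies in hours, purely to show the call.
sample = [0.5, 2, 3, 8, 14, 20, 26]
mean_latency, p95_latency = latency_summary(sample)
```

The 95th percentile is the more useful number for throttling, since it bounds how long we would have to wait before most updated clients have reported.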
Comment 12 Ben Hearsum (:bhearsum) 2016-02-18 07:05:58 PST
(In reply to Georg Fritzsche [:gfritzsche] from comment #11)
> (In reply to Ben Hearsum (:bhearsum) from comment #10)
> > (In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> > > (In reply to Ben Hearsum (:bhearsum) from comment #7)
> > > > This looks very useful for bug 1246675 as well, which needs real-time ADI.
> > > 
> > > What maximum latency does this require? 1h, 5min, 1min, ...?
> > 
> > 15min, if possible (obviously the lower the better though).
> > 
> > This is based on a current rough uptake rate of ~800,000 installs/hour on
> > the release channel (~200,000/15min), which gives us about a 1% margin of
> > error when trying to hit 20,000,000 installs.
> 
> Ok, for release-throttling you are hit by another factor:
> After a fresh install or update, we currently don't send out a ping
> immediately.
> Bug 1120370 & bug 1120372 are about sending out pings immediately in these
> cases.
> Until we have those, we have an additional error margin from the reporting
> latency (for which we could run an analysis job to find the average/95th
> percentile/...).

Hm, I'm surprised to hear this. When I was doing some initial poking I was under the impression that Telemetry's UPDATE_STATE_CODE_COMPLETE_STARTUP value is sent by default from all users on the release channel - and the dashboards seem to show that. As I understand it, that value wouldn't cover new installs, but it would be sent after users restart after applying an update. I could be wrong about that, though.
Comment 13 Benjamin Smedberg [:bsmedberg] 2016-02-18 08:29:43 PST
That histogram is recorded immediately, but it's typically not going to be sent to Mozilla in the telemetry ping until either:

* the next local midnight
* the user shuts down and restarts the browser

There is inherently a pretty large latency associated with this, so it's not something that can drive realtime dashboards. Which is why bug 1120370 and bug 1120372 exist, so that we can do this in realtime.
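
To make the latency concrete, here is a small sketch of the two triggers listed above. It is a simplified model, not the actual Telemetry client logic; the function name and the use of naive local datetimes are assumptions.

```python
from datetime import datetime, timedelta


def earliest_send_time(recorded_at, restart_at=None):
    """Earliest time a recorded value would typically reach the server:
    the next local midnight, or a browser restart if that happens first."""
    next_midnight = datetime.combine(
        recorded_at.date() + timedelta(days=1), datetime.min.time()
    )
    if restart_at is not None and recorded_at <= restart_at < next_midnight:
        return restart_at
    return next_midnight


recorded = datetime(2016, 2, 18, 9, 0)
latency = earliest_send_time(recorded) - recorded
print(latency)  # prints "15:00:00"
```

A value recorded at 09:00 local time with no restart that day is not sent for 15 hours under this model, which is far above the 15 minute target discussed earlier.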

FWIW, most of this data already exists in the daily rollups at https://analysis-output.telemetry.mozilla.org/stability-rollups/2016/20160216-active-daily.csv.gz except I removed the buildid facet because it made the dataset larger and wasn't necessary for my purposes. Building a periscope version of this would be pretty straightforward.
Comment 14 Ben Hearsum (:bhearsum) 2016-02-18 08:47:29 PST
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #13)
> That histogram is recorded immediately, but it's typically not going to be
> sent to Mozilla in the telemetry ping until either:
> 
> * the next local midnight
> * the user shuts down and restarts the browser
> 
> There is inherently a pretty large latency associated with this, so it's not
> something that can drive realtime dashboards. Which is why bug 1120370 and
> bug 1120372 exist, so that we can do this in realtime.

Would UPDATE_STATE_CODE_COMPLETE_STAGE be more appropriate for this (which I assume is sent immediately after staging the update), or should we just wait for the new pings?
Comment 15 Georg Fritzsche [:gfritzsche] 2016-02-18 08:50:53 PST
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #13)
> FWIW, most of this data already exists in the daily rollups at
> https://analysis-output.telemetry.mozilla.org/stability-rollups/2016/20160216-active-daily.csv.gz
> except I removed the buildid facet because it made the dataset larger and
> wasn't necessary for my purposes. Building a periscope version of this
> would be pretty straightforward.

But this is a rollup that is only generated daily, right?
Not a rolling window, and not suitable for low-latency needs?
Comment 16 Benjamin Smedberg [:bsmedberg] 2016-02-18 09:06:56 PST
Daily, correct. There is no data that would let us drive a low-latency dashboard like this.
Comment 17 Georg Fritzsche [:gfritzsche] 2016-02-18 10:33:19 PST
(In reply to Ben Hearsum (:bhearsum) from comment #14)
> (In reply to Benjamin Smedberg  [:bsmedberg] from comment #13)
> > That histogram is recorded immediately, but it's typically not going to be
> > sent to Mozilla in the telemetry ping until either:
> > 
> > * the next local midnight
> > * the user shuts down and restarts the browser
> > 
> > There is inherently a pretty large latency associated with this, so it's not
> > something that can drive realtime dashboards. Which is why bug 1120370 and
> > bug 1120372 exist, so that we can do this in realtime.
> 
> Would UPDATE_STATE_CODE_COMPLETE_STAGE be more appropriate for this (which I
> assume is sent immediately after staging the update), or should we just wait
> for the new pings?

This doesn't tell us which version a client updated to and we don't receive that without lag either.
We should do the new ping types; let's talk timelines by mail or on the bugs about those pings.
Comment 18 Georg Fritzsche [:gfritzsche] 2016-02-23 09:28:04 PST
PR for the pipeline additions:
https://github.com/mozilla-services/data-pipeline/pull/190
Comment 19 Ben Hearsum (:bhearsum) 2016-02-23 09:37:38 PST
(In reply to Georg Fritzsche [:gfritzsche] from comment #18)
> PR for the pipeline additions:
> https://github.com/mozilla-services/data-pipeline/pull/190

I'm a bit confused: comment #16 says that we don't have data that would make this possible... but this PR seems to say otherwise. What am I missing?
Comment 20 Georg Fritzsche [:gfritzsche] 2016-02-23 09:41:48 PST
(In reply to Ben Hearsum (:bhearsum) from comment #19)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #18)
> > PR for the pipeline additions:
> > https://github.com/mozilla-services/data-pipeline/pull/190
> 
> I'm a bit confused comment #16 says that we don't have data that would make
> this possible...but this PR seems to say otherwise. What am I missing?

We do have data, but it is not sent from the client quickly enough to cover your needs.
The work here is a first step to your requirements too.
We would also need the mentioned bug 1120370 and bug 1120372 to happen client-side, then we can make the code from this bug take those pings into account too to get lower latency.
Comment 21 Georg Fritzsche [:gfritzsche] 2016-02-25 03:23:22 PST
As it turns out, this does not provide the versions in the form "45.0b5" etc., so beta builds can't be told apart.
We don't have this information in the Telemetry data yet.
Other use-cases do a buildid to version number lookup, but i don't think this works well with Heka filters: we would have to regularly update that data from some location with some mechanism.

Per bug 1250897 Benjamin doesn't mind adding that data to the Telemetry data and it's relatively easy to do.
This seems the preferred path forward here, as this is much easier to handle in the filter.

I will have to push a follow-up to the PR once we have the client design for this.
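
For context, the buildid-to-version lookup that other use-cases rely on amounts to something like this sketch. The table contents, the build IDs, and the function name are invented for illustration; the point is only that the table has to be refreshed for every new build, which is the part that fits poorly into a filter.

```python
# Hypothetical lookup table; these build IDs are made up.
BUILDID_TO_VERSION = {
    "20160222000000": "45.0b5",
    "20160225000000": "45.0b6",
}


def display_version(version, buildid, lookup):
    """Map a ping's (version, buildid) to a display version like '45.0b5'.

    Falls back to the plain version string when the build ID is unknown,
    which is what happens whenever the table has not been updated yet.
    """
    return lookup.get(buildid, version)


print(display_version("45.0", "20160222000000", BUILDID_TO_VERSION))  # prints "45.0b5"
print(display_version("45.0", "20160229000000", BUILDID_TO_VERSION))  # prints "45.0"
```

With a stale table, a newer beta build silently collapses back into "45.0", so having the client send the display version directly (bug 1250897) avoids the problem altogether.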
Comment 22 Ben Hearsum (:bhearsum) 2016-04-25 07:25:45 PDT
(In reply to Georg Fritzsche [:gfritzsche] from comment #21)
> As it turns out, this does not provide the versions in the form "45.0b5"
> etc., so beta builds can't be told apart.
> We don't have this information in the Telemetry data yet.
> Other use-cases do a buildid to version number lookup, but i don't think
> this works well with Heka filters: we would have to regularly update that
> data from some location with some mechanism.
> 
> Per bug 1250897 Benjamin doesn't mind adding that data to the Telemetry data
> and it's relatively easy to do.
> This seems the preferred path forward here, as this is much easier to handle
> in the filter.
> 
> I will have to push a follow-up to the PR once we have the client design for
> this.

Any update here, Georg? For my use case in bug 1246675, I don't need "beta" channel data.
Comment 23 Georg Fritzsche [:gfritzsche] 2016-04-25 07:29:53 PDT
Sorry for the missing updates here. This first lagged from waiting on bug 1250897.
More recently there were concerns off-bug about addressing other (related) use-cases from one single setup; currently i am waiting on an update on that before we can pick this up again.
Comment 24 Ben Hearsum (:bhearsum) 2016-04-27 07:10:12 PDT
(In reply to Georg Fritzsche [:gfritzsche] from comment #23)
> Sorry for the missing updates here. This first lagged from waiting on bug
> 1250897.
> More recently there were concerns off-bug about addressing other (related)
> use-cases from one single setup; currently i am waiting on an update on that
> before we can pick this up again.

Does the unassigning mean this has been deprioritized, or is that just a symptom of waiting on the aforementioned update?
Comment 25 Georg Fritzsche [:gfritzsche] 2016-04-27 08:24:08 PDT
Usually we only assign ourselves to bugs we are actively working on.
I'm waiting for an update, so i'm updating the bug state to match that.
Comment 26 Nick Thomas [:nthomas] 2016-06-06 02:42:53 PDT
Any news on that update?
Comment 27 Ben Hearsum (:bhearsum) 2016-07-22 06:32:10 PDT
(In reply to Nick Thomas [:nthomas] from comment #26)
> Any news on that update ?
Comment 28 Georg Fritzsche [:gfritzsche] 2016-07-22 07:16:46 PDT
Katie, did we schedule this?
Comment 29 Katie Parlante 2016-10-13 15:09:33 PDT
FWIW, we're putting this back on the front-burner for Q4, though I can't guarantee that it will be completed this quarter.
Comment 30 Ben Hearsum (:bhearsum) 2016-10-31 09:26:42 PDT
(In reply to Katie Parlante from comment #29)
> FWIW, we're putting this back on the front-burner for Q4, though I can't
> guarantee that it will be completed this quarter.

Thanks for the update Katie, it helps with planning!
