Closed Bug 1240522 Opened 8 years ago Closed 7 years ago

Generate low-latency "ADI per channel" data

Categories

(Data Platform and Tools :: General, defect, P3)

defect
Points:
3

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gfritzsche, Unassigned)

References

Details

User Story

From mail thread:

> My goal is relatively simple. I would like to have a clear picture of the usage of
> Firefox on the beta and release channel per version.
> 
> For example, I would like to have:
> * 42.0 - XXX users
> * 43.0 - YY users
> * 41.0.1 - ZZ users
> * 40.0 - 2 users
> 
> same for beta
> (42.0b9, 43.0b4, etc).
> This will help us when we start new builds for the partial generation
> (the binary diff between two versions).
> 
> I need an update of the numbers every hour (or less often if it is expensive)
> and if the data could be fresh (~ 1 day), this would be perfect.

...
> I don't need any graph. I am just interested in the current numbers.

Attachments

(1 file)

Release management wants to see ADI/uptake per channel in a low-latency dashboard.
User Story: (updated)
Attached file Example data
Example of required data (without the desired beta version information above).
We can do this analysis in real time with one HyperLogLog per partition, provided ~99% accuracy is acceptable, and we can easily refresh the output once a minute.
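For readers unfamiliar with the approach, here is a minimal Python sketch of the idea: one HyperLogLog per (channel, version) partition, fed with client IDs. This is not the pipeline's code. The class, the function name `estimate_per_partition`, the SHA-1 hashing, and the register count (p=14, for a standard error of roughly 0.8%) are all assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=14):
        self.p = p                       # number of index bits (assumed value)
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash of the item: first 8 bytes of its SHA-1 digest.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                 # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return int(round(estimate))


def estimate_per_partition(pings):
    """pings: iterable of (channel, version, client_id) tuples."""
    sketches = {}
    for channel, version, client_id in pings:
        key = (channel, version)
        sketches.setdefault(key, HyperLogLog()).add(client_id)
    return {key: hll.count() for key, hll in sketches.items()}
```

Each partition costs a fixed number of registers no matter how many clients it sees, and duplicate pings from the same client do not change the estimate. That is what would make a once-a-minute refresh cheap.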
Assignee: nobody → gfritzsche
Some questions that came up here and affect implementation:

What accuracy is required for the data? (see comment 2)

What ADI definition should we apply here exactly?
Count of unique clients seen on a channel in the last 24h (i.e. rolling 24h window)?
Or per calendar day?

There are subtleties here for how to account for clients:
(1) "we received Telemetry data from client now" vs.
(2) "that data was generated three days ago"
... but I think we can safely go with (1) for this use-case.
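To make the rolling-window definition with option (1) concrete, here is a small exact-count sketch in Python. It assumes each ping is a hypothetical `(received_at, channel, client_id)` tuple and counts a client if the server received any ping from it in the last 24 hours, regardless of when the data was generated on the client.

```python
from datetime import datetime, timedelta


def rolling_adi(pings, now, window=timedelta(hours=24)):
    """Count unique clients per channel by ping *receive* time.

    pings: iterable of (received_at, channel, client_id) tuples.
    """
    cutoff = now - window
    clients = {}
    for received_at, channel, client_id in pings:
        # Option (1): only the server-side receive timestamp matters.
        if cutoff < received_at <= now:
            clients.setdefault(channel, set()).add(client_id)
    return {channel: len(ids) for channel, ids in clients.items()}
```

A per-calendar-day definition would instead bucket `received_at` by date, so the count would reset at midnight rather than slide continuously.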
Flags: needinfo?(sledru)
> What accuracy is required for the data? (see comment 2)
I would be happy with a 90% accuracy :)

> What ADI definition should we apply here exactly?
What version of Firefox users have on their system?

> Count of unique clients seen on a channel in the last 24h (i.e. rolling 24h
> window)?
This one

> Or per calendar day?
> 
> There are subtleties here for how to account for clients:
> (1) "we received Telemetry data from client now" vs.
> (2) "that data was generated three days ago"
> ... but I think we can safely go with (1) for this use-case.
OK, I trust you :)
Flags: needinfo?(sledru)
(In reply to Mike Trinkala [:trink] from comment #2)
> We can do this analysis real time with a hyperloglog per partition if ~99%
> accuracy is acceptable and we can easily refresh the output once a minute.

Trink, it sounds like this is the ideal way forward then?
Do we have examples for that? Any other pointers?
Flags: needinfo?(mtrinkala)
This can be stripped down (since you don't need the graphs or the daily, weekly, and monthly rollups), and it uses HyperLogLog for ADI:

https://github.com/mozilla-services/data-pipeline/blob/b17a11805ae3666f5938a62d815204fc81c595f9/heka/sandbox/filters/firefox_active_instances.lua
Flags: needinfo?(mtrinkala)
This looks very useful for bug 1246675 as well, which needs real-time ADI.
Blocks: 1246675
I am actively working on this, but got stuck with the data-pipeline project not building locally on OS X.
This part is now sorted: https://github.com/mozilla-services/data-pipeline/pull/187
... so I can finally move on to prototyping this locally.
(In reply to Ben Hearsum (:bhearsum) from comment #7)
> This looks very useful for bug 1246675 as well, which needs real-time ADI.

What maximum latency does this require? 1h, 5min, 1min, ...?
Flags: needinfo?(bhearsum)
(In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> (In reply to Ben Hearsum (:bhearsum) from comment #7)
> > This looks very useful for bug 1246675 as well, which needs real-time ADI.
> 
> What maximum latency does this require? 1h, 5min, 1min, ...?

15min, if possible (obviously the lower the better though).

This is based on a current rough uptake rate of ~800,000 installs/hour on the release channel (~200,000/15min), which gives us about a 1% margin of error when trying to hit 20,000,000 installs.
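The arithmetic behind that margin can be checked with a short Python sketch. The function name is made up, and the sketch assumes the uptake rate stays constant over the latency window, which the thread only states as a rough figure.

```python
INSTALLS_PER_HOUR = 800_000      # rough release-channel uptake rate from above
TARGET_INSTALLS = 20_000_000     # install count we are trying to hit


def overshoot_margin(latency_minutes):
    """Fraction of the target that can arrive unseen during the latency."""
    installs_during_latency = INSTALLS_PER_HOUR * latency_minutes / 60
    return installs_during_latency / TARGET_INSTALLS


# 15 minutes of latency is ~200,000 installs, i.e. ~1% of the target.
margin_15min = overshoot_margin(15)
# A 1h latency would quadruple the margin to ~4%.
margin_60min = overshoot_margin(60)
```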
Flags: needinfo?(bhearsum)
(In reply to Ben Hearsum (:bhearsum) from comment #10)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> > (In reply to Ben Hearsum (:bhearsum) from comment #7)
> > > This looks very useful for bug 1246675 as well, which needs real-time ADI.
> > 
> > What maximum latency does this require? 1h, 5min, 1min, ...?
> 
> 15min, if possible (obviously the lower the better though).
> 
> This is based on a current rough uptake rate of ~800,000 installs/hour on
> the release channel (~200,000/15min), which gives us about a 1% margin of
> error when trying to hit 20,000,000 installs.

Ok, for release-throttling you are hit by another factor:
After a fresh install or update, we currently don't send out a ping immediately.
Bug 1120370 & bug 1120372 are about sending out pings immediately in these cases.
Until we have those, we have an additional error margin from the reporting latency (for which we could run an analysis job to find the average/95th percentile/...).
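As a rough idea of what such an analysis job could compute, here is a Python sketch that summarizes reporting latency from pairs of timestamps. The input shape (`created_at`, `received_at`) and the function names are assumptions. A real job would read these values from the stored pings.

```python
import math
import statistics
from datetime import datetime, timedelta


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(pings):
    """pings: iterable of (created_at, received_at) datetime pairs."""
    # Reporting latency in hours: time between data creation and receipt.
    latencies = [(received - created).total_seconds() / 3600
                 for created, received in pings]
    return {
        "mean_hours": statistics.mean(latencies),
        "p95_hours": nearest_rank_percentile(latencies, 95),
    }
```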
Summary: Implement low-latency "ADI per channel" dashboard → Generate low-latency "ADI per channel" data
(In reply to Georg Fritzsche [:gfritzsche] from comment #11)
> (In reply to Ben Hearsum (:bhearsum) from comment #10)
> > (In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> > > (In reply to Ben Hearsum (:bhearsum) from comment #7)
> > > > This looks very useful for bug 1246675 as well, which needs real-time ADI.
> > > 
> > > What maximum latency does this require? 1h, 5min, 1min, ...?
> > 
> > 15min, if possible (obviously the lower the better though).
> > 
> > This is based on a current rough uptake rate of ~800,000 installs/hour on
> > the release channel (~200,000/15min), which gives us about a 1% margin of
> > error when trying to hit 20,000,000 installs.
> 
> Ok, for release-throttling you are hit by another factor:
> After a fresh install or update, we currently don't send out a ping
> immediately.
> Bug 1120370 & bug 1120372 are about sending out pings immediately in these
> cases.
> Until we have those, we have an additional error margin from the reporting
> latency (for which we could run an analysis job to find the average/95th
> percentile/...).

Hm, I'm surprised to hear this. When I was doing some initial poking I was under the impression that Telemetry's UPDATE_STATE_CODE_COMPLETE_STARTUP value is sent by default from all users on the release channel, and the dashboards seem to show that. As I understand it, that value wouldn't cover new installs, but it would be sent after users restart after applying an update. I could be wrong about that, though.
That histogram is recorded immediately, but it's typically not going to be sent to Mozilla in the telemetry ping until either:

* the next local midnight
* the user shuts down and restarts the browser

There is inherently a pretty large latency associated with this, so it's not something that can drive realtime dashboards. Which is why bug 1120370 and bug 1120372 exist, so that we can do this in realtime.

FWIW, most of this data already exists in the daily rollups at https://analysis-output.telemetry.mozilla.org/stability-rollups/2016/20160216-active-daily.csv.gz except I removed the buildid facet because it made the dataset larger and wasn't necessary for my purposes. Building a periscope version of this would be pretty straightforward.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #13)
> That histogram is recorded immediately, but it's typically not going to be
> sent to Mozilla in the telemetry ping until either:
> 
> * the next local midnight
> * the user shuts down and restarts the browser
> 
> There is inherently a pretty large latency associated with this, so it's not
> something that can drive realtime dashboards. Which is why bug 1120370 and
> bug 1120372 exist, so that we can do this in realtime.

Would UPDATE_STATE_CODE_COMPLETE_STAGE be more appropriate for this (which I assume is sent immediately after staging the update), or should we just wait for the new pings?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #13)
> FWIW, most of this data already exists in the daily rollups at
> https://analysis-output.telemetry.mozilla.org/stability-rollups/2016/
> 20160216-active-daily.csv.gz except I removed the buildid facet because it
> made the dataset larger and wasn't necessary for my purposes. Building a
> periscope version of this would be pretty straightforward.

But this is a rollup that is only generated daily, right?
So it is not a rolling window, and not suitable for low-latency needs?
Daily, correct. There is no data that would let us drive a low-latency dashboard like this.
(In reply to Ben Hearsum (:bhearsum) from comment #14)
> (In reply to Benjamin Smedberg  [:bsmedberg] from comment #13)
> > That histogram is recorded immediately, but it's typically not going to be
> > sent to Mozilla in the telemetry ping until either:
> > 
> > * the next local midnight
> > * the user shuts down and restarts the browser
> > 
> > There is inherently a pretty large latency associated with this, so it's not
> > something that can drive realtime dashboards. Which is why bug 1120370 and
> > bug 1120372 exist, so that we can do this in realtime.
> 
> Would UPDATE_STATE_CODE_COMPLETE_STAGE be more appropriate for this (which I
> assume is sent immediately after staging the update), or should we just wait
> for the new pings?

This doesn't tell us which version a client updated to, and we don't receive that without lag either.
We should do the new ping types; let's talk timelines by mail or on the bugs about those pings.
(In reply to Georg Fritzsche [:gfritzsche] from comment #18)
> PR for the pipeline additions:
> https://github.com/mozilla-services/data-pipeline/pull/190

I'm a bit confused: comment #16 says that we don't have data that would make this possible, but this PR seems to say otherwise. What am I missing?
(In reply to Ben Hearsum (:bhearsum) from comment #19)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #18)
> > PR for the pipeline additions:
> > https://github.com/mozilla-services/data-pipeline/pull/190
> 
> I'm a bit confused comment #16 says that we don't have data that would make
> this possible...but this PR seems to say otherwise. What am I missing?

We do have data, but the client does not send it soon enough to cover your needs.
The work here is a first step towards your requirements too.
We would also need the mentioned bug 1120370 and bug 1120372 to happen client-side; then we can make the code from this bug take those pings into account too, to get lower latency.
Depends on: 1250897
As it turns out, this does not provide the versions in the form "45.0b5" etc., so beta builds can't be told apart.
We don't have this information in the Telemetry data yet.
Other use-cases do a buildid to version number lookup, but I don't think this works well with Heka filters: we would have to regularly update that data from some location with some mechanism.
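For illustration, such a lookup amounts to something like the Python sketch below. The build IDs in the table are made up, and `display_version` is a hypothetical helper. The point is that the table has to be kept current by something outside the filter.

```python
# Hypothetical lookup table; these build IDs are invented for illustration.
BUILDID_TO_VERSION = {
    "20160201000000": "45.0b1",
    "20160204000000": "45.0b2",
}


def display_version(build_id, app_version, table=BUILDID_TO_VERSION):
    """Return the beta display version, falling back to the plain version."""
    # Unknown build IDs (e.g. a beta shipped after the last table refresh)
    # collapse back to "45.0", so those betas cannot be told apart.
    return table.get(build_id, app_version)
```

Any build missing from the table silently falls back to the undifferentiated version, which is the maintenance burden described above.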

Per bug 1250897 Benjamin doesn't mind adding that data to the Telemetry data and it's relatively easy to do.
This seems the preferred path forward here, as this is much easier to handle in the filter.

I will have to push a follow-up to the PR once we have the client design for this.
Priority: P1 → P2
(In reply to Georg Fritzsche [:gfritzsche] from comment #21)
> As it turns out, this does not provide the versions in the form "45.0b5"
> etc., so beta builds can't be told apart.
> We don't have this information in the Telemetry data yet.
> Other use-cases do a buildid to version number lookup, but I don't think
> this works well with Heka filters: we would have to regularly update that
> data from some location with some mechanism.
> 
> Per bug 1250897 Benjamin doesn't mind adding that data to the Telemetry data
> and it's relatively easy to do.
> This seems the preferred path forward here, as this is much easier to handle
> in the filter.
> 
> I will have to push a follow-up to the PR once we have the client design for
> this.

Any update here, Georg? For my use case in bug 1246675, I don't need "beta" channel data.
Flags: needinfo?(gfritzsche)
Sorry for the missing updates here. This was first delayed by waiting on bug 1250897.
More recently there were concerns off-bug about addressing other (related) use-cases from one single setup; currently I am waiting on an update on that before we can pick this up again.
Flags: needinfo?(gfritzsche)
Assignee: gfritzsche → nobody
(In reply to Georg Fritzsche [:gfritzsche] from comment #23)
> Sorry for the missing updates here. This first lagged from waiting on bug
> 1250897.
> More recently there were concerns off-bug about addressing other (related)
> use-cases from one single setup; currently i am waiting on an update on that
> before we can pick this up again.

Does the unassigning mean this has been deprioritized, or is that just a symptom of waiting on the aforementioned update?
Flags: needinfo?(gfritzsche)
Usually we only assign ourselves to bugs we are actively working on.
I'm waiting for an update, so I'm updating the bug state to match that.
Flags: needinfo?(gfritzsche)
Any news on that update?
(In reply to Nick Thomas [:nthomas] from comment #26)
> Any news on that update ?
Flags: needinfo?(gfritzsche)
Katie, did we schedule this?
Flags: needinfo?(gfritzsche) → needinfo?(kparlante)
Priority: P2 → P3
FWIW, we're putting this back on the front-burner for Q4, though I can't guarantee that it will be completed this quarter.
Flags: needinfo?(kparlante)
Whiteboard: [measurement:client]
(In reply to Katie Parlante from comment #29)
> FWIW, we're putting this back on the front-burner for Q4, though I can't
> guarantee that it will be completed this quarter.

Thanks for the update Katie, it helps with planning!
Is this plan still valid? Should we set bug 1120370 and bug 1120372 as blockers?
Component: Metrics: Pipeline → General
Product: Cloud Services → Data Platform and Tools
I think this is no longer relevant, and will be tackled as part of the Mission Control project.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
