Closed
Bug 1240522
Opened 9 years ago
Closed 7 years ago
Generate low-latency "ADI per channel" data
Categories
(Data Platform and Tools :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: gfritzsche, Unassigned)
References
Details
User Story
From mail thread:
> My goal is relatively simple. I would like to have a clear picture of the usage of
> Firefox on the beta and release channel per version.
>
> For example, I would like to have:
> * 42.0 - XXX users
> * 43.0 - YY users
> * 41.0.1 - ZZ users
> * 40.0 - 2 users
>
> same for beta
> (42.0b9, 43.0b4, etc).
> This will help us when we start new builds for the partial generation
> (the binary diff between two versions).
>
> I need an update of the number every hour (more if it is expensive)
> and if the data could be fresh (~ 1 day), this would be perfect.
...
> I don't need any graph. I am just interested in the current numbers.
Attachments
(1 file)
1.25 KB, text/plain
Release management wants to see ADI/uptake per channel in a low-latency dashboard.
Reporter | ||
Updated•9 years ago
|
User Story: (updated)
Reporter | ||
Comment 1•9 years ago
|
||
Example of required data (without the desired beta version information above).
Comment 2•9 years ago
|
||
We can do this analysis in real time with a HyperLogLog per partition if ~99% accuracy is acceptable, and we can easily refresh the output once a minute.
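For readers unfamiliar with the technique, the following is a minimal Python sketch of a HyperLogLog kept per partition. It is not the Heka filter code: the class, the `estimate_per_partition` helper, and the choice of a (channel, version) tuple as the partition key are all assumptions made for illustration. With 2^14 registers the typical relative error is roughly 1.04/sqrt(16384), i.e. about 0.8%, which is in line with the ~99% accuracy mentioned above.

```python
import hashlib
import math
from collections import defaultdict


class HyperLogLog:
    """Minimal HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=14):
        self.p = p                        # number of bits used as register index
        self.m = 1 << p                   # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant; this formula is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Take a 64-bit hash of the item.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return int(round(estimate))


def estimate_per_partition(records):
    """records: iterable of (channel, version, client_id) tuples (hypothetical shape)."""
    partitions = defaultdict(HyperLogLog)
    for channel, version, client_id in records:
        partitions[(channel, version)].add(client_id)
    return {key: hll.count() for key, hll in partitions.items()}
```

Each partition costs a fixed number of registers regardless of how many clients it sees, which is what makes refreshing the output once a minute cheap compared to keeping exact sets of client IDs.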
Reporter | ||
Updated•9 years ago
|
Assignee: nobody → gfritzsche
Reporter | ||
Comment 3•9 years ago
|
||
Some questions that came up here and affect implementation:
* What accuracy is required for the data? (see comment 2)
* What ADI definition should we apply here exactly?
  * Count of unique clients seen on a channel in the last 24h (i.e. rolling 24h window)?
  * Or per calendar day?

There are subtleties here for how to account for clients:
(1) "we received Telemetry data from client now" vs.
(2) "that data was generated three days ago"
... but i think we can safely go with (1) for this use-case.
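To make the rolling-window option concrete, here is a minimal Python sketch that counts unique clients per channel over the last 24 hours, keyed on the time the data was received (option 1). The class name, method names, and hourly bucketing are assumptions for illustration; exact sets are used here for clarity, where a production filter would more likely use an approximate counter. Because buckets are hourly, the window covers the current partial hour plus the previous 23 full hours, so it only approximates a true rolling 24h window.

```python
from collections import defaultdict

HOUR = 3600
WINDOW_HOURS = 24


class RollingActiveClients:
    """Unique clients per channel over an (approximately) rolling 24h window."""

    def __init__(self):
        # hour bucket -> channel -> set of client ids
        self.buckets = defaultdict(lambda: defaultdict(set))

    def record(self, received_ts, channel, client_id):
        # Bucket by server receive time, not by when the client generated the data.
        self.buckets[received_ts // HOUR][channel].add(client_id)

    def active(self, now_ts, channel):
        current = now_ts // HOUR
        seen = set()
        for hour in range(current - WINDOW_HOURS + 1, current + 1):
            if hour in self.buckets:
                seen |= self.buckets[hour].get(channel, set())
        return len(seen)

    def expire(self, now_ts):
        # Drop buckets that can no longer fall inside the window.
        cutoff = now_ts // HOUR - WINDOW_HOURS
        for hour in [h for h in self.buckets if h <= cutoff]:
            del self.buckets[hour]
```

A per-calendar-day definition would instead key buckets on the date and report a single bucket, which is simpler but only yields a complete number once the day is over.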
Flags: needinfo?(sledru)
Comment 4•9 years ago
|
||
> What accuracy is required for the data? (see comment 2)
I would be happy with a 90% accuracy :)
> What ADI definition should we apply here exactly?
What version of Firefox users have on their system?
> Count of unique clients seen on a channel in the last 24h (i.e. rolling 24h
> window)?
This one
> Or per calendar day?
>
> There are subtleties here for how to account for clients:
> (1) "we received Telemetry data from client now" vs.
> (2) "that data was generated three days ago"
> ... but i think we can safely go with (1) for this use-case.
OK, I trust you :)
Flags: needinfo?(sledru)
Reporter | ||
Comment 5•9 years ago
|
||
(In reply to Mike Trinkala [:trink] from comment #2)
> We can do this analysis real time with a hyperloglog per partition if ~99%
> accuracy is acceptable and we can easily refresh the output once a minute.
Trink, it sounds like this is the ideal way forward then?
Do we have examples for that? Any other pointers?
Flags: needinfo?(mtrinkala)
Comment 6•9 years ago
|
||
This can be stripped down (since you don't need the graphs or the daily, weekly, and monthly rollups), and it uses HyperLogLog for ADI:
https://github.com/mozilla-services/data-pipeline/blob/b17a11805ae3666f5938a62d815204fc81c595f9/heka/sandbox/filters/firefox_active_instances.lua
Flags: needinfo?(mtrinkala)
Comment 7•9 years ago
|
||
This looks very useful for bug 1246675 as well, which needs real-time ADI.
Blocks: 1246675
Reporter | ||
Comment 8•9 years ago
|
||
I am actively working on this, but got stuck with the data-pipeline project not building locally on OS X.
This part is now sorted: https://github.com/mozilla-services/data-pipeline/pull/187
... so i can finally move on to prototyping this locally.
Reporter | ||
Comment 9•9 years ago
|
||
(In reply to Ben Hearsum (:bhearsum) from comment #7)
> This looks very useful for bug 1246675 as well, which needs real-time ADI.
What maximum latency does this require? 1h, 5min, 1min, ...?
Flags: needinfo?(bhearsum)
Comment 10•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> (In reply to Ben Hearsum (:bhearsum) from comment #7)
> > This looks very useful for bug 1246675 as well, which needs real-time ADI.
>
> What maximum latency does this require? 1h, 5min, 1min, ...?
15min, if possible (obviously the lower the better though).
This is based on a current rough uptake rate of ~800,000 installs/hour on the release channel (~200,000/15min), which gives us about a 1% margin of error when trying to hit 20,000,000 installs.
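The arithmetic behind that 1% figure can be checked with a short Python sketch. The function name and parameters are made up for illustration; the numbers are the ones quoted in this comment.

```python
def overshoot_margin(installs_per_hour, latency_minutes, target_installs):
    """Fraction of the target that can arrive unseen during one latency window."""
    installs_during_latency = installs_per_hour * latency_minutes / 60
    return installs_during_latency / target_installs


# ~800,000 installs/hour, 15 minute latency, 20,000,000 install target
margin = overshoot_margin(800_000, 15, 20_000_000)
print(f"{margin:.1%}")  # prints 1.0%
```

At a 1 hour latency the same calculation gives 4%, which is why a lower latency is preferred for throttling.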
Flags: needinfo?(bhearsum)
Reporter | ||
Comment 11•9 years ago
|
||
(In reply to Ben Hearsum (:bhearsum) from comment #10)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> > (In reply to Ben Hearsum (:bhearsum) from comment #7)
> > > This looks very useful for bug 1246675 as well, which needs real-time ADI.
> >
> > What maximum latency does this require? 1h, 5min, 1min, ...?
>
> 15min, if possible (obviously the lower the better though).
>
> This is based on a current rough uptake rate of ~800,000 installs/hour on
> the release channel (~200,000/15min), which gives us about a 1% margin of
> error when trying to hit 20,000,000 installs.
Ok, for release-throttling you are hit by another factor:
After a fresh install or update, we currently don't send out a ping immediately.
Bug 1120370 & bug 1120372 are about sending out pings immediately in these cases.
Until we have those, we have an additional error margin from the reporting latency (for which we could run an analysis job to find the average/95th percentile/...).
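A minimal Python sketch of what such an analysis job could compute is shown below. The ping shape is an assumption: `created_ts` and `received_ts` are hypothetical field names standing in for the client-side creation time and the server-side receive time, both in seconds.

```python
from statistics import mean


def latency_summary(pings):
    """Return (mean, p95) of reporting latency in seconds.

    Each ping is a dict with the hypothetical keys 'created_ts' and 'received_ts'.
    """
    latencies = sorted(p["received_ts"] - p["created_ts"] for p in pings)
    if not latencies:
        raise ValueError("no pings to analyze")
    n = len(latencies)
    # Nearest-rank percentile: ceil(0.95 * n), computed with integer arithmetic.
    rank = (95 * n + 99) // 100
    return mean(latencies), latencies[rank - 1]
```

Client clocks can be skewed, so a real job would probably need to filter out negative or implausibly large latencies before summarizing.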
Reporter | ||
Updated•9 years ago
|
Summary: Implement low-latency "ADI per channel" dashboard → Generate low-latency "ADI per channel" data
Comment 12•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #11)
> (In reply to Ben Hearsum (:bhearsum) from comment #10)
> > (In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> > > (In reply to Ben Hearsum (:bhearsum) from comment #7)
> > > > This looks very useful for bug 1246675 as well, which needs real-time ADI.
> > >
> > > What maximum latency does this require? 1h, 5min, 1min, ...?
> >
> > 15min, if possible (obviously the lower the better though).
> >
> > This is based on a current rough uptake rate of ~800,000 installs/hour on
> > the release channel (~200,000/15min), which gives us about a 1% margin of
> > error when trying to hit 20,000,000 installs.
>
> Ok, for release-throttling you are hit by another factor:
> After a fresh install or update, we currently don't send out a ping
> immediately.
> Bug 1120370 & bug 1120372 are about sending out pings immediately in these
> cases.
> Until we have those, we have an additional error margin from the reporting
> latency (for which we could run an analysis job to find the average/95th
> percentile/...).
Hm, I'm surprised to hear this. When I was doing some initial poking I was under the impression that Telemetry's UPDATE_STATE_CODE_COMPLETE_STARTUP value is sent by default from all users on the release channel, and the dashboards seem to show that. As I understand it, that value wouldn't cover new installs, but it would be sent after users restart after applying an update. I could be wrong about that, though.
Comment 13•9 years ago
|
||
That histogram is recorded immediately, but it's typically not going to be sent to Mozilla in the telemetry ping until either:
* the next local midnight
* the user shuts down and restarts the browser
There is inherently a pretty large latency associated with this, so it's not something that can drive realtime dashboards. Which is why bug 1120370 and bug 1120372 exist, so that we can do this in realtime.
FWIW, most of this data already exists in the daily rollups at https://analysis-output.telemetry.mozilla.org/stability-rollups/2016/20160216-active-daily.csv.gz except that I removed the buildid facet because it made the dataset larger and wasn't necessary for my purposes. Building a periscope version of this would be pretty straightforward.
Comment 14•9 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #13)
> That histogram is recorded immediately, but it's typically not going to be
> sent to Mozilla in the telemetry ping until either:
>
> * the next local midnight
> * the user shuts down and restarts the browser
>
> There is inherently a pretty large latency associated with this, so it's not
> something that can drive realtime dashboards. Which is why bug 1120370 and
> bug 1120372 exist, so that we can do this in realtime.
Would UPDATE_STATE_CODE_COMPLETE_STAGE be more appropriate for this (which I assume is sent immediately after staging the update), or should we just wait for the new pings?
Reporter | ||
Comment 15•9 years ago
|
||
(In reply to Benjamin Smedberg [:bsmedberg] from comment #13)
> FWIW, most of this data already exists in the daily rollups at
> https://analysis-output.telemetry.mozilla.org/stability-rollups/2016/
> 20160216-active-daily.csv.gz except I removed the buildid facet because it
> made the dataset larger and wasn't necessary for my purposes. Building a
> periscope version of this would be pretty straightforward.
But this is a rollup that is only generated daily, right?
So it is not a rolling window, and not suitable for low-latency needs?
Comment 16•9 years ago
|
||
Daily, correct. There is no data that would let us drive a low-latency dashboard like this.
Reporter | ||
Comment 17•9 years ago
|
||
(In reply to Ben Hearsum (:bhearsum) from comment #14)
> (In reply to Benjamin Smedberg [:bsmedberg] from comment #13)
> > That histogram is recorded immediately, but it's typically not going to be
> > sent to Mozilla in the telemetry ping until either:
> >
> > * the next local midnight
> > * the user shuts down and restarts the browser
> >
> > There is inherently a pretty large latency associated with this, so it's not
> > something that can drive realtime dashboards. Which is why bug 1120370 and
> > bug 1120372 exist, so that we can do this in realtime.
>
> Would UPDATE_STATE_CODE_COMPLETE_STAGE be more appropriate for this (which I
> assume is sent immediately after staging the update), or should we just wait
> for the new pings?
This doesn't tell us which version a client updated to and we don't receive that without lag either.
We should do the new ping types; let's talk timelines by mail or on the bugs about those pings.
Reporter | ||
Comment 18•9 years ago
|
||
PR for the pipeline additions:
https://github.com/mozilla-services/data-pipeline/pull/190
Comment 19•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #18)
> PR for the pipeline additions:
> https://github.com/mozilla-services/data-pipeline/pull/190
I'm a bit confused: comment #16 says that we don't have data that would make this possible, but this PR seems to say otherwise. What am I missing?
Reporter | ||
Comment 20•9 years ago
|
||
(In reply to Ben Hearsum (:bhearsum) from comment #19)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #18)
> > PR for the pipeline additions:
> > https://github.com/mozilla-services/data-pipeline/pull/190
>
> I'm a bit confused comment #16 says that we don't have data that would make
> this possible...but this PR seems to say otherwise. What am I missing?
We do have data, but the client does not send it quickly enough to cover your needs.
The work here is a first step to your requirements too.
We would also need the mentioned bug 1120370 and bug 1120372 to happen client-side, then we can make the code from this bug take those pings into account too to get lower latency.
Reporter | ||
Comment 21•9 years ago
|
||
As it turns out, this does not provide the versions in the form "45.0b5" etc., so beta builds can't be told apart.
We don't have this information in the Telemetry data yet.
Other use-cases do a buildid to version number lookup, but i don't think this works well with Heka filters: we would have to regularly update that data from some location with some mechanism.
Per bug 1250897 Benjamin doesn't mind adding that data to the Telemetry data and it's relatively easy to do.
This seems the preferred path forward here, as this is much easier to handle in the filter.
I will have to push a follow-up to the PR once we have the client design for this.
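For context, the buildid-to-version lookup that other use-cases rely on amounts to something like the following Python sketch. The table contents, build IDs, and function name are entirely hypothetical; the point is only that the table has to be fetched and refreshed from somewhere, which is the part that fits poorly into a Heka filter.

```python
# Hypothetical table: these build IDs and versions are made up for illustration.
# A real table would have to be refreshed regularly from an external source.
BUILDID_TO_VERSION = {
    ("beta", "20160101000000"): "45.0b5",
    ("beta", "20160108000000"): "45.0b6",
}


def display_version(channel, build_id, app_version):
    """Resolve a display version, falling back to the plain app version."""
    return BUILDID_TO_VERSION.get((channel, build_id), app_version)
```

With the display version included directly in the Telemetry data, the filter could read it as a plain field and this lookup would not be needed.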
Reporter | ||
Updated•9 years ago
|
Priority: P1 → P2
Comment 22•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #21)
> As it turns out, this does not provide the versions in the form "45.0b5"
> etc., so beta builds can't be told apart.
> We don't have this information in the Telemetry data yet.
> Other use-cases do a buildid to version number lookup, but i don't think
> this works well with Heka filters: we would have to regularly update that
> data from some location with some mechanism.
>
> Per bug 1250897 Benjamin doesn't mind adding that data to the Telemetry data
> and it's relatively easy to do.
> This seems the preferred path forward here, as this is much easier to handle
> in the filter.
>
> I will have to push a follow-up to the PR once we have the client design for
> this.
Any update here, Georg? For my use case in bug 1246675, I don't need "beta" channel data.
Flags: needinfo?(gfritzsche)
Reporter | ||
Comment 23•9 years ago
|
||
Sorry for the missing updates here. This was first delayed by waiting on bug 1250897.
More recently there were concerns off-bug about addressing other (related) use-cases from one single setup; currently i am waiting on an update on that before we can pick this up again.
Flags: needinfo?(gfritzsche)
Reporter | ||
Updated•9 years ago
|
Assignee: gfritzsche → nobody
Comment 24•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #23)
> Sorry for the missing updates here. This first lagged from waiting on bug
> 1250897.
> More recently there were concerns off-bug about addressing other (related)
> use-cases from one single setup; currently i am waiting on an update on that
> before we can pick this up again.
Does the unassigning mean this has been deprioritized, or is that just a symptom of waiting on the aforementioned update?
Flags: needinfo?(gfritzsche)
Reporter | ||
Comment 25•9 years ago
|
||
Usually we only assign ourselves to bugs we are actively working on.
I'm waiting for an update, so i'm updating the bug state to match that.
Flags: needinfo?(gfritzsche)
Comment 26•8 years ago
|
||
Any news on that update?
Comment 27•8 years ago
|
||
(In reply to Nick Thomas [:nthomas] from comment #26)
> Any news on that update?
Flags: needinfo?(gfritzsche)
Reporter | ||
Comment 28•8 years ago
|
||
Katie, did we schedule this?
Flags: needinfo?(gfritzsche) → needinfo?(kparlante)
Priority: P2 → P3
Comment 29•8 years ago
|
||
FWIW, we're putting this back on the front-burner for Q4, though I can't guarantee that it will be completed this quarter.
Flags: needinfo?(kparlante)
Reporter | ||
Updated•8 years ago
|
Whiteboard: [measurement:client]
Comment 30•8 years ago
|
||
(In reply to Katie Parlante from comment #29)
> FWIW, we're putting this back on the front-burner for Q4, though I can't
> guarantee that it will be completed this quarter.
Thanks for the update Katie, it helps with planning!
Comment 31•8 years ago
|
||
Is this plan still valid? Should we set bug 1120370 and bug 1120372 as blockers?
Updated•7 years ago
|
Component: Metrics: Pipeline → General
Product: Cloud Services → Data Platform and Tools
Comment 32•7 years ago
|
||
I think this is no longer relevant, and will be tackled as part of the Mission Control project.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX