Closed Bug 1017269 Opened 10 years ago Closed 8 years ago

upload latency is at least 2 days

Categories

(Toolkit :: Telemetry, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: mmc, Unassigned)

Details

Attachments

(1 file)

20.27 KB, application/force-download
After watching the telemetry dashboard for pinning for about 2 weeks, I notice that submissions by build date take several days (4-5) to stabilize, even for Nightly users. According to bug 630880, it seems that Nightly users should upload every 2 hours. Is this really working as intended, or do Nightly users really take that long to update on average?
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #0)
> After watching the telemetry dashboard for pinning for about 2 weeks, I
> notice that submissions by build date take several days (4-5) to stabilize,
> even for Nightly users. According to bug 630880, it seems that Nightly users
> should upload every 2 hours. Is this really working as intended, or do
> Nightly users really take that long to update on average?

Yes, they take that long to stabilize. Telemetry only does rollups within a 24-hour window, so that can add a maximum of a day of lag.

John's team might be able to provide you with an adoption curve based on fhr data.

The only thing we can do in telemetry is tweak time to first signal.
Saptarshi has a nice summary report on Nightly daily build adoption. TL;DR is that it takes longer than you might expect. I've cc'd him here.
Attached file growth.pdf
Indeed, it takes time for users to update. Across all builds in March 2014, the median time to update was 3 days. The percentiles are below:

Percentile   Days to Update
      0.00           0.0000
      0.05           0.7500
      0.10           1.0000
      0.15           1.0400
      0.20           1.2500
      0.25           1.5000
      0.30           1.7300
      0.35           2.0000
      0.40           2.1900
      0.45           2.5000
      0.50           3.0000
      0.55           3.5000
      0.60           4.0000
      0.65           5.0000
      0.70           6.0000
      0.75           7.8125
      0.80          10.0000
      0.85          14.0000
      0.90          18.5000
      0.95          27.0000

Attached is a PDF of growth rates of 30 builds in March. Each line
corresponds to the growth rate (fitted) of a build. The red line is
the mean.  The graph plots the proportion of total profiles ever on
that build vs number of days since release.

As you can see from the red line, 50% growth is reached in 3 days and
80% in ~ 10 days.
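
To estimate what share of a build's eventual users have updated after a given number of days, the percentile table can be interpolated. This is a minimal Python sketch, assuming that linear interpolation between the listed percentiles is acceptable. The name `fraction_updated` is hypothetical, and the numbers are copied from the table in this comment.

```python
from bisect import bisect_right

# (percentile, days to update) pairs copied from the table in this comment
# (builds from March 2014).
PERCENTILES = [
    (0.00, 0.0), (0.05, 0.75), (0.10, 1.0), (0.15, 1.04), (0.20, 1.25),
    (0.25, 1.5), (0.30, 1.73), (0.35, 2.0), (0.40, 2.19), (0.45, 2.5),
    (0.50, 3.0), (0.55, 3.5), (0.60, 4.0), (0.65, 5.0), (0.70, 6.0),
    (0.75, 7.8125), (0.80, 10.0), (0.85, 14.0), (0.90, 18.5), (0.95, 27.0),
]
DAYS = [d for _, d in PERCENTILES]


def fraction_updated(days):
    """Estimate the share of a build's eventual users who updated within `days`.

    Assumption: linear interpolation between table rows. The table stops at
    the 95th percentile, so the result is capped at 0.95 beyond 27 days.
    """
    if days <= 0:
        return 0.0
    if days >= DAYS[-1]:
        return PERCENTILES[-1][0]
    hi = bisect_right(DAYS, days)  # index of the first row with more days
    p0, d0 = PERCENTILES[hi - 1]
    p1, d1 = PERCENTILES[hi]
    return p0 + (p1 - p0) * (days - d0) / (d1 - d0)


for d in (1, 3, 10):
    print(f"day {d}: {fraction_updated(d):.0%} of eventual users have updated")
```

At 3 and 10 days the interpolation lands exactly on the 50% and 80% rows quoted above, so it can be used to scale dashboard counts by build date.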

This is how long it takes for users to update to a build. Not every
user updates to every build. After that you have to wait for users to
use their browser. Telemetry sends once a day at most.

Hope it helps,
Saptarshi
Thanks, Saptarshi! That's really interesting/horrifying :) Where did this data come from? I'd like to plot it against my dashboard so we understand what percent of users are contributing by build date so far. I have access to peach.
So I guess there are 3 components to latency:
- Update time
- Telemetry rollup time, which I think is set to 2 hours for Nightly users if bug 630880 is still in effect
- Telemetry aggregator time, which seems to be about a day.

The aggregator delay means that even for users who update immediately, or for submissions by calendar date instead of build date, dashboards are consistently 2 days behind. Taras, can this be improved?
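
The three components can be added up as a rough latency budget. This is only a sketch in Python: the `Stage` class and its field names are hypothetical, and the figures are the estimates quoted in this thread (median update time from comment 4, a client-side wait between 2 hours and a day, about a day of aggregation), not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    best_days: float     # user who updates and uploads as early as possible
    typical_days: float  # median update time, upper-end waits from this thread


# Assumed figures, taken from comments in this bug.
STAGES = [
    Stage("update to the build", 0.0, 3.0),                # median from comment 4
    Stage("client-side wait before upload", 2 / 24, 1.0),  # 2 hours if bug 630880 applies, else up to a day
    Stage("server-side aggregation", 1.0, 1.0),            # "about a day"
]


def total_days(stages, field):
    """Sum one latency column across all stages."""
    return sum(getattr(stage, field) for stage in stages)


print(f"best case:    {total_days(STAGES, 'best_days'):.1f} days")
print(f"typical case: {total_days(STAGES, 'typical_days'):.1f} days")
```

Under these assumptions the typical total is 5 days, in line with the 4-5 days it takes submissions by build date to stabilize, and the aggregation stage alone keeps the best case above one day.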
Flags: needinfo?(taras.mozilla)
Monica,
1) Update time
2) Telemetry rollup time, which I think is set to 2 hours for Nightly users if bug 630880 is still in effect
3) Telemetry aggregator time, which seems to be about a day.

1) is outside of my control.
2) I have no idea what this has to do with telemetry.
3)
a) We were waiting for a use case to make our dashboards closer to real time. I think a 30-60 min time from submission to JSON aggregation is reasonable and achievable. To do this across the board, we need to wait for a production version of our upcoming task scheduling system (http://docs.taskcluster.net/) so we can port telemetry to it. This is at least 2 months out.
b) If rollup delay is critical for you, we can special-case your path through the code and give you the 30-60 min latency. This is hard to do in the general case, but it's easy to put in specific hacks.

We can also tweak telemetry on the client side to not wait for idle-daily if the build date is within 1 day of the current date. This will give you a bigger (but biased) early signal.
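
The client-side rule could look roughly like the Python sketch below. It is not Firefox code: `should_send_immediately` and `max_age` are hypothetical names, and it assumes the build ID is a 14-digit `YYYYMMDDHHMMSS` timestamp.

```python
from datetime import datetime, timedelta

# Assumption: the build ID is a 14-digit timestamp such as "20140528030202".
BUILD_ID_FORMAT = "%Y%m%d%H%M%S"


def should_send_immediately(build_id, now, max_age=timedelta(days=1)):
    """Return True when the build is fresh enough to skip the daily wait.

    Hypothetical helper that only illustrates the rule described above.
    The absolute difference is used so that small clock skew on the
    client does not suppress the early send.
    """
    build_date = datetime.strptime(build_id, BUILD_ID_FORMAT)
    return abs(now - build_date) <= max_age


now = datetime(2014, 5, 28, 9, 0, 0)
print(should_send_immediately("20140528030202", now))  # build from the same day
print(should_send_immediately("20140525030202", now))  # build three days old
```

Only users who pick up a build quickly would take the early path, which is why the resulting signal is biased.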

Before we commit to doing anything here, you'd have to describe a solid use case to justify switching gears on this.
Flags: needinfo?(taras.mozilla)
The use case is using telemetry to respond to outages. A 2-day delay basically means that by the time the telemetry dashboard knows about mistakes, users will have already escalated through Bugzilla. By then, the only thing dashboards can do is verify that there was a problem.
(In reply to [:mmc] Monica Chew (please use needinfo) from comment #8)
> The use case is using telemetry to respond to outages.
Do you mean telemetry outages in general? Or using telemetry to infer outages in other services?

For the former, we already have monitoring in place for submission rates, etc.
I mean using telemetry to monitor for outages in other services.
Summary: upload latency seems long → upload latency is at least 2 days
@Saptarshi: Would you be willing to redo that growth data in comment 4 for nightly and/or aurora builds from around May 25 to Jun 25?

Specifically we're looking to see if there was an impact from bug 1003159 landing on June 6 on nightly, and a week later on aurora. Theoretically that should have increased the uptake.
Flags: needinfo?(sguha)
Nothing happened in this bug in a while.
We now have much improved latency for Telemetry with "main" pings (which are uploaded immediately, except at shutdown etc.).
Bug 1120370 & bug 1120372 will improve Telemetry latency after updates & new installs.

Let's take other future latency improvements to a new bug driven by the current needs.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(sguha)
Resolution: --- → WORKSFORME