Telemetry aggregator should use main pings instead of saved_session ones.

RESOLVED FIXED

Status

Product: Cloud Services
Component: Metrics: Pipeline
Priority: P1
Severity: normal
Status: RESOLVED FIXED
Reported: 10 months ago
Last modified: 4 months ago

People

(Reporter: rvitillo, Assigned: frank)

Tracking

(Depends on: 1 bug, Blocks: 2 bugs)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

Blocks: 1287481

Updated

9 months ago
Points: --- → 3
Priority: -- → P2
Blocks: 1281208
(Assignee)

Updated

6 months ago
Assignee: nobody → fbertsch
Priority: P2 → P1
(Assignee)

Comment 1

6 months ago
Using main pings took more than twice as long: 11.4 hours, versus 5.1 hours with saved-session pings. I'm going to rerun it on double the nodes.
(Assignee)

Comment 2

5 months ago
I just ran it on a 12 node cluster, and it took 7.28 hours.
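For reference, the scaling observed here is sublinear: doubling the nodes did not halve the runtime. A back-of-the-envelope sketch using the numbers from the two comments above (the estimator assumes ideal linear scaling, which is illustrative only):

```python
# Naive linear-scaling estimate: runtime shrinks proportionally with node count.
# Baseline numbers come from the comments above; the model itself is a sketch.
def estimate_hours(baseline_hours, baseline_nodes, new_nodes):
    """Predict runtime under ideal (linear) scaling."""
    return baseline_hours * baseline_nodes / new_nodes

# Main-ping run took 11.4 hours on 6 nodes; ideal scaling predicts ~5.7 hours
# on 12 nodes, but the observed run took 7.28 hours, i.e. sublinear scaling.
predicted = estimate_hours(11.4, 6, 12)
observed = 7.28
print(round(predicted, 2))   # 5.7
print(observed > predicted)  # True
```

This is why simply adding nodes has diminishing returns here; part of the job (e.g. database writes, discussed in comment 7) does not parallelize.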
(Assignee)

Comment 3

5 months ago
We've talked about using a subset of clients to reduce the aggregation time. At this time, my plan is the following:

1. Configure our current setup to use main pings. This may require database configuration changes, and it is still to be determined whether it produces the same data (it should). Unfortunately, this will more than double the cost of aggregation.

2. Create a bug for using a sample of clients. Our goal is to understand trends, so we don't need all the raw data. However, there will be issues incorporating sampled data into our current data: not only will counts be off, but raw histograms will differ. We can consider methods for handling this, for example weighting historical data differently.
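For step 2, Telemetry's standard client sampling keys off a hash of the client id. A minimal sketch of sample selection plus a compensating weight (this assumes the crc32-mod-100 `sample_id` scheme used elsewhere in the pipeline; the weighting function is illustrative, not a settled design):

```python
import zlib

def sample_id(client_id):
    """Deterministic 0-99 bucket derived from the client id (crc32 mod 100)."""
    return zlib.crc32(client_id.encode("utf-8")) % 100

def in_sample(client_id, pct=10):
    """Keep clients whose bucket falls in the first `pct` buckets (a pct% sample)."""
    return sample_id(client_id) < pct

def weighted_count(raw_count, pct=10):
    """Scale a sampled count back up for comparison with unsampled history."""
    return raw_count * 100 / pct

# Synthetic ids only; a 10% sample of 1000 clients keeps roughly 100 of them.
clients = ["client-%d" % i for i in range(1000)]
kept = [c for c in clients if in_sample(c, pct=10)]
print(weighted_count(len(kept), pct=10))
```

Because the bucket is a deterministic hash of the client id, the same clients stay in the sample across days, which keeps trends comparable over time.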

Comment 4

5 months ago
Which population are you processing? Since saved-session pings were opt-in telemetry only, are you now processing all of release instead of just the opt-in subset?

I think that we need to step back and design aggregation and sampling together.
(Assignee)

Comment 5

5 months ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> Which population are you processing? Since saved-session pings were opt-in
> telemetry only, are you now processing all of release instead of just the
> opt-in subset?
> 
> I think that we need to step back and design aggregation and sampling
> together.

I will be processing just the opt-in data. The goal is first to reproduce what we have with main pings. Next steps can be designing sampling or opt-out data.
(Assignee)

Comment 6

5 months ago
Just ran a comparison of saved-session versus main pings. I compared a single day's (2016-11-14) build-id aggregate results.

Main Ping: 18548 Distinct Sets of Dimensional Values (e.g. nightly, US, Windows, etc.)
Saved-Session: 17602 Distinct Sets of Dimensional Values
15305 of those sets of dimensions were shared between Main and Saved-Session; note that the aggregates matched exactly in these cases.
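The overlap figures above reduce to plain set operations over the distinct dimension tuples; a minimal sketch (the tuples here are made-up placeholders, not the real aggregate dimensions):

```python
# Compare the distinct dimension sets produced by the two ping types.
# Each tuple stands in for a set of dimensional values (channel, country, os, ...).
main_dims = {("nightly", "US", "Windows"),
             ("beta", "DE", "Linux"),
             ("release", "US", "Darwin")}
saved_session_dims = {("nightly", "US", "Windows"),
                      ("beta", "DE", "Linux"),
                      ("release", "FR", "Windows")}

shared = main_dims & saved_session_dims      # seen by both pipelines
main_only = main_dims - saved_session_dims   # only in main pings
ss_only = saved_session_dims - main_dims     # only in saved-session pings

print(len(shared), len(main_only), len(ss_only))  # 2 1 1
```

The real comparison would additionally check that the histogram aggregates agree within each shared dimension set, as described above.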

The total count of histograms was roughly the same per ping.

I think these results are good enough to continue with the switch. On a side note, when filtering to just opt-in histograms, processing took ~4.5 hours on a 6-node cluster, which is no increase over the time with saved-session pings.

The next step is to run against the test database that I have set up; if that is satisfactory as well (performance-wise), we can deploy the changes.
(Assignee)

Comment 7

5 months ago
Just ran an end-to-end test with the database. It took 420 minutes in total, 20% of which was spent in the database. This was on a 6-node cluster, so while it will take a bit longer, I think we should be good to deploy on a cluster of the same size.

For 20161112, we had 16716 build-id aggregates and 2861 submission-date aggregates.
(Assignee)

Comment 8

5 months ago
This change has been deployed, and the telemetry-aggregates db snapshot from 2016-11-23 is being preserved, in case it causes any mishaps. For now, we can mark this as complete.
Status: NEW → RESOLVED
Last Resolved: 5 months ago
Resolution: --- → FIXED
(Reporter)

Comment 9

5 months ago
I am reopening this bug, as the change caused several failures: at first it was crashing when parsing some invalid JSON, and once I fixed that, it started timing out. I am currently back-filling data.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 10

5 months ago
There was a theory that it ran more quickly because 20161112 was a Saturday. I'm running 20161113 right now on a 7-node cluster.
(Assignee)

Comment 11

5 months ago
It finished in <7 hours (the cluster was actually 9 nodes), but with an error I haven't seen before:

DataError: unsupported Unicode escape sequence
DETAIL:  \u0000 cannot be converted to text.
CONTEXT:  JSON data, line 1: ...[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":...
COPY staging_build_id_beta_50_20161104, line 84332, column dimensions: "{"metric":"[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":"\u0001\u0000\u0000\u0000\u7000\ub82c\u0..."

This is a Postgres error.
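Postgres text and JSON columns cannot store the NUL code point (`\u0000`), so labels like the one above have to be scrubbed before the COPY. A minimal sketch of the idea (the `clean_label` helper here is hypothetical, not the actual fix that landed in the PR):

```python
import json

def clean_label(s):
    """Strip NUL characters, which Postgres text/JSON columns reject."""
    return s.replace("\u0000", "")

# A row shaped like the failing one above, with a label containing NULs.
row = {"metric": "[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS",
       "label": "\u0001\u0000\u0000\u0000"}
row["label"] = clean_label(row["label"])

payload = json.dumps(row)
print("\\u0000" in payload)  # False: safe to COPY into a Postgres JSON column
```

Other control characters (like `\u0001` above) are legal in Postgres JSON once escaped; only `\u0000` is categorically rejected, which is why a targeted strip is enough here.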
(Assignee)

Comment 12

5 months ago
(In reply to Frank Bertsch [:frank] from comment #11)
> It finished in <7 hours (the cluster was actually 9 nodes), but with an
> error I haven't seen before:
> 
> DataError: unsupported Unicode escape sequence
> DETAIL:  \u0000 cannot be converted to text.
> CONTEXT:  JSON data, line 1:
> ...[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":...
> COPY staging_build_id_beta_50_20161104, line 84332, column dimensions:
> "{"metric":"[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":
> "\u0001\u0000\u0000\u0000\u7000\ub82c\u0..."
> 
> This is a postgres error.

Note that this killed the job, so the time is unreliable (it might not have processed everything).
(Assignee)

Comment 13

5 months ago
Finished running 11/14. No errors, it took 7.2 hours on a 9-node cluster. Next steps are as follows:

1. Backtrack and figure out what caused the error above
2. Extend the Airflow job to include a longer timeout in addition to more machines
(Assignee)

Comment 14

5 months ago
Just submitted a PR to fix the Postgres error. Once we merge that, I'll re-PR the main pings patch.
(Assignee)

Comment 15

4 months ago
There was some concern that increasing the number of machines would make the DB extremely slow to respond to requests. I have confirmed that the increase in tasks is not overloading the machine and that the DB is still reachable. In addition, we can implement some CloudWatch alarms or monitors to ensure that the DB is not overloaded.
(Assignee)

Comment 16

4 months ago
The latest run took 6.5 hours on 10 nodes. With the bump in nodes I think we should be good to go, and we'll keep an eye on the job for the first few days (in addition to the monitoring I mentioned in the previous comment).

Comment 17

4 months ago
Created attachment 8818929 [details] [review]
[telemetry-airflow] fbertsch:update_aggregates_job > mozilla:master
(Assignee)

Updated

4 months ago
Status: REOPENED → RESOLVED
Last Resolved: 4 months ago
Resolution: --- → FIXED

Comment 18

4 months ago
So, uh, we forgot that Fennec doesn't send "main" pings: https://mzl.la/2hYI4mi

This means that we have no aggregates for Firefox Mobile after December 14, 2016.
Depends on: 1329228