Closed
Bug 1286868
Opened 8 years ago
Closed 7 years ago
Telemetry aggregator should use main pings instead of saved_session ones.
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Cloud Services Graveyard
Metrics: Pipeline
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rvitillo, Assigned: frank)
References
Details
Attachments
(1 file)
No description provided.
Updated•8 years ago
Points: --- → 3
Priority: -- → P2
Assignee
Updated•8 years ago
Assignee: nobody → fbertsch
Priority: P2 → P1
Assignee
Comment 1•8 years ago
It took more than 2x as long to use main pings: 11.4 hours, versus 5.1 hours using saved-session pings. I'm going to run it on double the nodes.
Assignee
Comment 2•8 years ago
I just ran it on a 12 node cluster, and it took 7.28 hours.
Assignee
Comment 3•8 years ago
We've talked about using a subset of clients to reduce the aggregation time. At this time, my plan is the following:

1. Configure our current setup to use main pings. This may include database configuration, and it is still to be determined whether it results in the same data (which it should). Unfortunately, this is going to increase aggregation costs drastically (more than double).

2. Create a bug for using a sample of clients. Our goal is to understand trends, and as such we don't need all the raw data. However, there will be issues incorporating sampled data into our current data: not only will counts be off, but raw histograms will be different. We can consider methods for handling this - weighting historical data differently, for example.
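For illustration, step 2 could use a stable per-client hash so the same clients stay in the sample across runs; the function names and the 10% default here are hypothetical, not part of any actual patch:

```python
import hashlib

def in_sample(client_id: str, sample_percent: int = 10) -> bool:
    """Deterministically map a client to one of 100 buckets and keep it
    if the bucket falls below the sample percentage. The same client id
    always lands in the same bucket, so samples are stable across runs."""
    bucket = int(hashlib.sha256(client_id.encode("utf-8")).hexdigest(), 16) % 100
    return bucket < sample_percent

def weight_for(sample_percent: int) -> float:
    """Scale factor to re-weight sampled counts back toward population level
    (this addresses counts being off; histogram shape differences remain)."""
    return 100.0 / sample_percent
```

A 10% sample would then be re-weighted by 10x when mixed with historical unsampled aggregates.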
Comment 4•8 years ago
Which population are you processing? Since saved-session pings were opt-in telemetry only, are you now processing all of release instead of just the opt-in subset? I think that we need to step back and design aggregation and sampling together.
Assignee
Comment 5•8 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> Which population are you processing? Since saved-session pings were opt-in
> telemetry only, are you now processing all of release instead of just the
> opt-in subset?
>
> I think that we need to step back and design aggregation and sampling
> together.

I will be processing just the opt-in data. The goal is first to reproduce what we have with main pings. Next steps can be designing sampling or opt-out data.
Assignee
Comment 6•8 years ago
Just ran a comparison of saved-session versus main pings. I compared a single day's (2016-11-14) build-id aggregate results.

Main ping: 18548 distinct sets of dimensional values (e.g. nightly, US, Windows, etc.)
Saved-session: 17602 distinct sets of dimensional values

15305 of those sets of dimensions were shared between main and saved-session; note that aggregates matched exactly in these cases. Total counts of histograms were roughly the same per ping. I think these results are good enough to continue with switching over.

On a side note, when filtering to just opt-in histograms, processing took ~4.5 hours on a 6-node cluster, which is no increase over the time with saved-session pings. Next steps are to run on the test database that I have set up, and if that is satisfactory as well (performance-wise) we can deploy the changes.
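The comparison above boils down to set operations over dimension tuples. A minimal sketch of that check, assuming each side is a mapping from a dimension tuple to its aggregated histogram (names are illustrative, not the aggregator's actual API):

```python
def compare_dimension_sets(main_aggs: dict, saved_session_aggs: dict) -> dict:
    """Compare the distinct dimension tuples produced by each ping type.

    Each input maps a tuple of dimension values (channel, country, OS, ...)
    to its aggregated histogram for one day."""
    main_keys = set(main_aggs)
    ss_keys = set(saved_session_aggs)
    shared = main_keys & ss_keys
    # Count how many shared dimension sets have identical aggregates.
    exact_matches = sum(1 for k in shared if main_aggs[k] == saved_session_aggs[k])
    return {
        "main_only": len(main_keys - ss_keys),
        "saved_session_only": len(ss_keys - main_keys),
        "shared": len(shared),
        "exact_matches": exact_matches,
    }
```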
Assignee
Comment 7•8 years ago
Just ran an end-to-end with the database. It took 420 minutes in total, 20% of which was time spent in the database. This was on a 6-node cluster, so while it will take a bit longer, I think we should be good to go with deploying this on the same size. For 20161112, we had 16716 build-id aggregates and 2861 submission-date aggregates.
Assignee
Comment 8•8 years ago
This change has been deployed, and the telemetry-aggregates db snapshot from 2016-11-23 is being preserved, in case it causes any mishaps. For now, we can mark this as complete.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Reporter
Comment 9•8 years ago
I am reopening this bug as the change caused several failures: at first it was crashing when parsing some invalid JSON, and once I fixed that it started timing out. I am currently back-filling data.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 10•8 years ago
There was a theory that it ran more quickly because 20161112 was a Saturday. I'm running for 20161113 right now on a 7-node cluster.
Assignee
Comment 11•8 years ago
It finished in <7 hours (the cluster was actually 9 nodes), but with an error I haven't seen before:

DataError: unsupported Unicode escape sequence
DETAIL: \u0000 cannot be converted to text.
CONTEXT: JSON data, line 1: ...[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":...
COPY staging_build_id_beta_50_20161104, line 84332, column dimensions: "{"metric":"[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":"\u0001\u0000\u0000\u0000\u7000\ub82c\u0..."

This is a Postgres error.
Assignee
Comment 12•8 years ago
(In reply to Frank Bertsch [:frank] from comment #11)
> It finished in <7 hours (the cluster was actually 9 nodes), but with an
> error I haven't seen before:
>
> DataError: unsupported Unicode escape sequence
> DETAIL: \u0000 cannot be converted to text.
> CONTEXT: JSON data, line 1:
> ...[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":...
> COPY staging_build_id_beta_50_20161104, line 84332, column dimensions:
> "{"metric":"[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":
> "\u0001\u0000\u0000\u0000\u7000\ub82c\u0..."
>
> This is a postgres error.

Note that this killed the job, so the time is unreliable (it might not have processed everything).
Assignee
Comment 13•8 years ago
Finished running 11/14. No errors; it took 7.2 hours on a 9-node cluster. Next steps are as follows:
1. Backtrack and figure out what caused the error above
2. Extend the Airflow job to include a longer timeout in addition to more machines
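Step 2 amounts to a small Airflow task-config change. A hypothetical sketch of the kind of keyword arguments involved; the names, 12-hour value, and retry count are assumptions, not the real DAG:

```python
from datetime import timedelta

# Hypothetical task arguments for the aggregator job: a longer timeout
# (the job was timing out under the old limit) plus one retry.
AGGREGATOR_TASK_KWARGS = {
    "execution_timeout": timedelta(hours=12),
    "retries": 1,
}
```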
Assignee
Comment 14•8 years ago
Just submitted a PR to fix the Postgres error. Once we merge that, I'll re-PR the main pings patch.
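The PR itself isn't shown here, but the core of any fix for that DataError is stripping NUL characters, since Postgres text and jsonb values cannot contain U+0000. A minimal sketch (function name is an assumption):

```python
def sanitize_label(label: str) -> str:
    """Drop NUL characters before COPYing aggregate rows into Postgres.

    Postgres rejects \u0000 in text/jsonb with "unsupported Unicode
    escape sequence", which is exactly the error seen in comment 11."""
    return label.replace("\u0000", "")
```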
Assignee
Comment 15•8 years ago
There was some concern that upping the number of machines would make the DB extremely slow to respond to requests. I have confirmed that the increase in tasks is not overloading the machine, and the DB is still reachable. In addition, we can implement some CloudWatch alarms or monitors to ensure that the DB is not overloaded.
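As an illustration of the kind of CloudWatch alarm meant here, the parameters below describe an RDS CPU alarm; the alarm name, instance identifier, and thresholds are all assumptions, and the actual `boto3` call is left commented out:

```python
# Hypothetical CloudWatch alarm for the aggregates database's CPU.
# The instance identifier and thresholds are assumptions for illustration.
ALARM_PARAMS = {
    "AlarmName": "telemetry-aggregates-db-high-cpu",
    "Namespace": "AWS/RDS",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": "telemetry-aggregates"}],
    "Statistic": "Average",
    "Period": 300,               # evaluate 5-minute averages
    "EvaluationPeriods": 3,      # alarm after 15 minutes above threshold
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
}

# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**ALARM_PARAMS)
```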
Assignee
Comment 16•8 years ago
Latest run took 6.5 hours on 10 nodes. With the bump in nodes I think we should be good to go, and we'll keep an eye on the job for the first few days (in addition to some monitoring I mentioned in the previous comment).
Comment 17•8 years ago
Assignee
Updated•7 years ago
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → FIXED
Comment 18•7 years ago
So, uh, we forgot that Fennec doesn't send "main" pings: https://mzl.la/2hYI4mi This means that we have no aggregates for Firefox Mobile from after December 14, 2016.
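A fix would have to keep reading saved-session pings for Fennec while using main pings everywhere else. A minimal sketch of that discrimination, assuming app name is the right key (function name is hypothetical):

```python
def doc_type_for(app_name: str) -> str:
    """Fennec (Firefox for Android) never sends "main" pings, so its
    aggregates must still come from "saved_session" pings; desktop
    Firefox uses "main" pings after the switch."""
    return "saved_session" if app_name == "Fennec" else "main"
```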
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard