Closed Bug 1286868 Opened 8 years ago Closed 7 years ago

Telemetry aggregator should use main pings instead of saved_session ones.

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: frank)

References

Details

Attachments

(1 file)

No description provided.
Points: --- → 3
Priority: -- → P2
Assignee: nobody → fbertsch
Priority: P2 → P1
Using main pings took more than 2x as long: 11.4 hours, versus 5.1 hours with saved-session pings. I'm going to run it on double the nodes.
I just ran it on a 12 node cluster, and it took 7.28 hours.
We've talked about using a subset of clients to reduce the aggregation time. At this time, my plan is the following:

1. Configure our current setup to use main pings. This may include database configuration changes, and we still need to determine whether it produces the same data (which it should). Unfortunately, this is going to increase aggregation costs drastically (more than double).

2. Create a bug for using a sample of clients. Our goal is to understand trends, and as such we don't need all the raw data. However, there will be issues incorporating sampled data into our current data: not only will counts be off, but raw histograms will be different. We can consider methods for handling this, for example weighting historical data differently (see the sketch below).
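
For reference, a minimal sketch of what stable client sampling could look like (hypothetical names; assumes pings are dicts with a top-level "clientId" in a PySpark RDD, and a 1% bucket chosen purely for illustration):

# Hypothetical sketch of stable client sampling. The bucket count and cutoff
# are illustrative; the real sample rate would have to be tuned against
# aggregate accuracy.
from binascii import crc32

SAMPLE_BUCKETS = 100   # partition clients into 100 stable buckets
SAMPLE_CUTOFF = 1      # keep buckets [0, SAMPLE_CUTOFF), i.e. ~1% of clients

def in_sample(ping):
    client_id = ping.get("clientId")
    if client_id is None:
        return False
    # crc32 of the clientId gives a stable bucket, so a client is either
    # always in the sample or always out of it across days.
    return crc32(client_id.encode("utf-8")) % SAMPLE_BUCKETS < SAMPLE_CUTOFF

sampled_pings = pings.filter(in_sample)  # pings: RDD of main-ping dicts

Because the bucket is derived from the clientId itself, the same clients stay in the sample day over day, which keeps trend lines comparable even though absolute counts are scaled down.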
Which population are you processing? Since saved-session pings were opt-in telemetry only, are you now processing all of release instead of just the opt-in subset?

I think that we need to step back and design aggregation and sampling together.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> Which population are you processing? Since saved-session pings were opt-in
> telemetry only, are you now processing all of release instead of just the
> opt-in subset?
> 
> I think that we need to step back and design aggregation and sampling
> together.

I will be processing just the opt-in data. The goal is first to reproduce what we have, using main pings. Next steps can be designing sampling or adding opt-out data.
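
As a rough illustration of that restriction (hypothetical field access; the real job loads pings through the pipeline's dataset tooling rather than a hand-rolled filter like this):

# Hypothetical sketch: keep only "main" pings from clients with extended
# (opt-in) telemetry enabled, using field names from the main-ping schema.
def is_opt_in_main_ping(ping):
    settings = ping.get("environment", {}).get("settings", {})
    return ping.get("type") == "main" and settings.get("telemetryEnabled", False)

opt_in_pings = pings.filter(is_opt_in_main_ping)  # pings: RDD of ping dicts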
Just ran a comparison of saved-session versus main pings. I compared a single day's (2016-11-14) build-id aggregate results.

Main Ping: 18548 Distinct Sets of Dimensional Values (e.g. nightly, US, Windows, etc.)
Saved-Session: 17602 Distinct Sets of Dimensional Values
15305 of those sets of dimensions were shared between Main and Saved-Session; note that aggregates matched exactly in these cases.
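
For context, the comparison itself boils down to set operations over the dimension tuples; a sketch, assuming each day's aggregates are loaded as {dimension_tuple: aggregate} dicts for both ping types (names are hypothetical):

# Hypothetical sketch of the build-id aggregate comparison for one day.
main_dims = set(main_aggregates)        # e.g. ("nightly", "US", "Windows_NT", ...)
saved_dims = set(saved_session_aggregates)

shared = main_dims & saved_dims
only_main = main_dims - saved_dims
only_saved = saved_dims - main_dims

exact_matches = sum(
    1 for dims in shared
    if main_aggregates[dims] == saved_session_aggregates[dims]
)
print(len(main_dims), len(saved_dims), len(shared), exact_matches)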

Total counts of histograms were roughly the same per ping.

I think these results are good enough to continue with switching over. On a side note, when filtering to just opt-in histograms, processing took ~4.5 hours on a 6 node cluster, which is no increase over the time with saved-session pings.

Next steps are to run against the test database that I have set up; if that is also satisfactory performance-wise, we can deploy the changes.
Just ran an end-to-end with the database. It took 420 minutes in total, 20% of which was spent in the database. This was on a 6-node cluster, so while it will take a bit longer, I think we should be good to go with deploying this on a cluster of the same size.

For 20161112, we had 16716 build-id aggregates and 2861 submission-date aggregates.
This change has been deployed, and the telemetry-aggregates DB snapshot from 2016-11-23 is being preserved in case the change causes any mishaps. For now, we can mark this as complete.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
I am reopening this bug as the change caused several failures: at first it was crashing while parsing some invalid JSON, and once I fixed that it started timing out. I am currently back-filling data.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
There was a theory that it ran more quickly because 20161112 was a Saturday. I'm running 20161113 right now on a 7-node cluster.
It finished in <7 hours (the cluster was actually 9 nodes), but with an error I haven't seen before:

DataError: unsupported Unicode escape sequence
DETAIL:  \u0000 cannot be converted to text.
CONTEXT:  JSON data, line 1: ...[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":...
COPY staging_build_id_beta_50_20161104, line 84332, column dimensions: "{"metric":"[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":"\u0001\u0000\u0000\u0000\u7000\ub82c\u0..."

This is a postgres error.
(In reply to Frank Bertsch [:frank] from comment #11)
> It finished in <7 hours (the cluster was actually 9 nodes), but with an
> error I haven't seen before:
> 
> DataError: unsupported Unicode escape sequence
> DETAIL:  \u0000 cannot be converted to text.
> CONTEXT:  JSON data, line 1:
> ...[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":...
> COPY staging_build_id_beta_50_20161104, line 84332, column dimensions:
> "{"metric":"[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":
> "\u0001\u0000\u0000\u0000\u7000\ub82c\u0..."
> 
> This is a postgres error.

Note that this killed the job, so the time is unreliable (it might not have processed everything).
Finished running 11/14. No errors; it took 7.2 hours on a 9-node cluster. Next steps are as follows:

1. Backtrack and figure out what caused the error above
2. Extend the Airflow job to use a longer timeout in addition to more machines (see the sketch below)
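
A minimal sketch of what the longer timeout could look like in Airflow (the DAG id, task id, and command are placeholders; only execution_timeout is the point here):

# Hypothetical sketch: the aggregator task with a longer execution_timeout.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    "telemetry_aggregates_example",      # placeholder DAG id
    start_date=datetime(2016, 11, 1),
    schedule_interval="@daily",
)

aggregate = BashOperator(
    task_id="run_aggregator",
    bash_command="run-aggregator.sh {{ ds_nodash }}",  # placeholder command
    execution_timeout=timedelta(hours=12),             # more headroom than before
    dag=dag,
)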
Just submitted a PR to fix the Postgres error. Once we merge that, I'll re-PR the main pings patch.
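
For reference, the failure comes from Postgres refusing the NUL escape (\u0000) inside JSON text; a sketch of one way to sanitize labels before they reach the COPY (not necessarily what the PR does):

# Hypothetical sketch: strip NUL characters from histogram labels before the
# dimensions JSON is handed to Postgres, which cannot store \u0000 in text.
def sanitize_label(label):
    if isinstance(label, bytes):
        label = label.decode("utf-8", errors="replace")
    return label.replace("\u0000", "")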
There was some concern that upping the number of machines would make the DB extremely slow to respond to requests. I have confirmed that the increase in tasks is not overloading the machine, and the DB is still reachable. In addition, we can add some CloudWatch alarms or monitors to ensure that the DB is not overloaded.
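
A sketch of what such an alarm could look like with boto3 (the instance identifier, threshold, and SNS topic ARN are placeholders, not our actual setup):

# Hypothetical sketch: alarm when the aggregates DB instance sustains high CPU.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

cloudwatch.put_metric_alarm(
    AlarmName="telemetry-aggregates-db-high-cpu",
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "telemetry-aggregates"}],
    Statistic="Average",
    Period=300,                # 5-minute periods
    EvaluationPeriods=3,       # sustained for 15 minutes
    Threshold=85.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-west-2:123456789012:pipeline-alerts"],
)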
Latest run took 6.5 hours on 10 nodes. With the bump in nodes I think we should be good to go, and we'll keep an eye on the job for the first few days (in addition to the monitoring I mentioned in the previous comment).
Status: REOPENED → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → FIXED
So, uh, we forgot that Fennec doesn't send "main" pings: https://mzl.la/2hYI4mi

This means that we have no aggregates for Firefox Mobile after December 14, 2016.
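
One possible shape for the follow-up (tracked in the dependent bug; this is a sketch, not the actual fix) is to fall back to saved-session pings for Fennec while keeping main pings for desktop:

# Hypothetical sketch: Fennec does not send main pings, so keep consuming its
# saved-session pings and use main pings for everything else.
def should_aggregate(ping):
    app_name = ping.get("application", {}).get("name")
    if app_name == "Fennec":
        return ping.get("type") == "saved-session"
    return ping.get("type") == "main"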
Depends on: 1329228
Product: Cloud Services → Cloud Services Graveyard