Telemetry aggregator should use main pings instead of saved_session ones.

RESOLVED FIXED

Status

Product: Cloud Services
Component: Metrics: Pipeline
Priority: P1
Severity: normal
Status: RESOLVED FIXED
Reported: 10 months ago
Last modified: 4 months ago

People

(Reporter: rvitillo, Assigned: frank)

Tracking

(Depends on: 1 bug, Blocks: 2 bugs)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

Blocks: 1287481

Updated

9 months ago
Points: --- → 3
Priority: -- → P2
Blocks: 1281208
(Assignee)

Updated

6 months ago
Assignee: nobody → fbertsch
Priority: P2 → P1
(Assignee)

Comment 1

6 months ago
Using main pings took more than twice as long: 11.4 hours, versus 5.1 hours with saved-session pings. I'm going to rerun it on double the nodes.
(Assignee)

Comment 2

5 months ago
I just ran it on a 12 node cluster, and it took 7.28 hours.
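For reference, the scaling observed here is sublinear: doubling the nodes did not halve the runtime. A back-of-the-envelope sketch using the numbers from the two comments above (the estimator assumes ideal linear scaling, which is illustrative only):

```python
# Naive linear-scaling estimate: runtime shrinks proportionally with node count.
# Baseline numbers come from the comments above; the model itself is a sketch.
def estimate_hours(baseline_hours, baseline_nodes, new_nodes):
    """Predict runtime under ideal (linear) scaling."""
    return baseline_hours * baseline_nodes / new_nodes

# Main-ping run took 11.4 hours on 6 nodes; ideal scaling predicts ~5.7 hours
# on 12 nodes, but the observed run took 7.28 hours, i.e. sublinear scaling.
predicted = estimate_hours(11.4, 6, 12)
observed = 7.28
print(round(predicted, 2))   # 5.7
print(observed > predicted)  # True
```

This is why simply adding nodes has diminishing returns here; part of the job (e.g. database writes, discussed in comment 7) does not parallelize.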
(Assignee)

Comment 3

5 months ago
We've talked about using a subset of clients to reduce the aggregation time. At this time, my plan is the following:

1. Configure our current setup to use main pings. This may require database configuration changes, and it is still to be determined whether it produces the same data (it should). Unfortunately, this will more than double the cost of aggregation.

2. Create a bug for using a sample of clients. Our goal is to understand trends, so we don't need all the raw data. However, there will be issues incorporating sampled data into our current data: not only will counts be off, but raw histograms will differ. We can consider methods for handling this, for example weighting historical data differently.
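For step 2, Telemetry's standard client sampling keys off a hash of the client id. A minimal sketch of sample selection plus a compensating weight (this assumes the crc32-mod-100 `sample_id` scheme used elsewhere in the pipeline; the weighting function is illustrative, not a settled design):

```python
import zlib

def sample_id(client_id):
    """Deterministic 0-99 bucket derived from the client id (crc32 mod 100)."""
    return zlib.crc32(client_id.encode("utf-8")) % 100

def in_sample(client_id, pct=10):
    """Keep clients whose bucket falls in the first `pct` buckets (a pct% sample)."""
    return sample_id(client_id) < pct

def weighted_count(raw_count, pct=10):
    """Scale a sampled count back up for comparison with unsampled history."""
    return raw_count * 100 / pct

# Synthetic ids only; a 10% sample of 1000 clients keeps roughly 100 of them.
clients = ["client-%d" % i for i in range(1000)]
kept = [c for c in clients if in_sample(c, pct=10)]
print(weighted_count(len(kept), pct=10))
```

Because the bucket is a deterministic hash of the client id, the same clients stay in the sample across days, which keeps trends comparable over time.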

Comment 4

5 months ago
Which population are you processing? Since saved-session pings were opt-in telemetry only, are you now processing all of release instead of just the opt-in subset?

I think that we need to step back and design aggregation and sampling together.
(Assignee)

Comment 5

5 months ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #4)
> Which population are you processing? Since saved-session pings were opt-in
> telemetry only, are you now processing all of release instead of just the
> opt-in subset?
> 
> I think that we need to step back and design aggregation and sampling
> together.

I will be processing just the opt-in data. The goal is first to reproduce what we have with main pings. Next steps can be designing sampling or opt-out data.
(Assignee)

Comment 6

5 months ago
Just ran a comparison of saved-session versus main pings. I compared a single day's (2016-11-14) build-id aggregate results.

Main Ping: 18548 Distinct Sets of Dimensional Values (e.g. nightly, US, Windows, etc.)
Saved-Session: 17602 Distinct Sets of Dimensional Values
15305 of those sets of dimensions were shared between Main and Saved-Session; note that the aggregates matched exactly in these cases.
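The overlap figures above reduce to plain set operations over the distinct dimension tuples; a minimal sketch (the tuples here are made-up placeholders, not the real aggregate dimensions):

```python
# Compare the distinct dimension sets produced by the two ping types.
# Each tuple stands in for a set of dimensional values (channel, country, os, ...).
main_dims = {("nightly", "US", "Windows"),
             ("beta", "DE", "Linux"),
             ("release", "US", "Darwin")}
saved_session_dims = {("nightly", "US", "Windows"),
                      ("beta", "DE", "Linux"),
                      ("release", "FR", "Windows")}

shared = main_dims & saved_session_dims      # seen by both pipelines
main_only = main_dims - saved_session_dims   # only in main pings
ss_only = saved_session_dims - main_dims     # only in saved-session pings

print(len(shared), len(main_only), len(ss_only))  # 2 1 1
```

The real comparison would additionally check that the histogram aggregates agree within each shared dimension set, as described above.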

The total count of histograms was roughly the same per ping.

I think these results are good enough to continue with the switch. On a side note, when filtering to just opt-in histograms, processing took ~4.5 hours on a 6-node cluster, which is no increase over the time with saved-session pings.

The next step is to run against the test database that I have set up; if that is satisfactory as well (performance-wise), we can deploy the changes.
(Assignee)

Comment 7

5 months ago
Just ran an end-to-end test with the database. It took 420 minutes in total, 20% of which was spent in the database. This was on a 6-node cluster, so while it will take a bit longer, I think we should be good to deploy on a cluster of the same size.

For 20161112, we had 16716 build-id aggregates and 2861 submission-date aggregates.
(Assignee)

Comment 8

5 months ago
This change has been deployed, and the telemetry-aggregates db snapshot from 2016-11-23 is being preserved, in case it causes any mishaps. For now, we can mark this as complete.
Status: NEW → RESOLVED
Last Resolved: 5 months ago
Resolution: --- → FIXED
(Reporter)

Comment 9

5 months ago
I am reopening this bug, as the change caused several failures: at first it was crashing when parsing some invalid JSON, and once I fixed that, it started timing out. I am currently back-filling data.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 10

5 months ago
There was a theory that it ran more quickly because 20161112 was a Saturday. I'm running 20161113 right now on a 7-node cluster.
(Assignee)

Comment 11

5 months ago
It finished in <7 hours (the cluster was actually 9 nodes), but with an error I haven't seen before:

DataError: unsupported Unicode escape sequence
DETAIL:  \u0000 cannot be converted to text.
CONTEXT:  JSON data, line 1: ...[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":...
COPY staging_build_id_beta_50_20161104, line 84332, column dimensions: "{"metric":"[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":"\u0001\u0000\u0000\u0000\u7000\ub82c\u0..."

This is a Postgres error.
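Postgres text and JSON columns cannot store the NUL code point (`\u0000`), so labels like the one above have to be scrubbed before the COPY. A minimal sketch of the idea (the `clean_label` helper here is hypothetical, not the actual fix that landed in the PR):

```python
import json

def clean_label(s):
    """Strip NUL characters, which Postgres text/JSON columns reject."""
    return s.replace("\u0000", "")

# A row shaped like the failing one above, with a label containing NULs.
row = {"metric": "[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS",
       "label": "\u0001\u0000\u0000\u0000"}
row["label"] = clean_label(row["label"])

payload = json.dumps(row)
print("\\u0000" in payload)  # False: safe to COPY into a Postgres JSON column
```

Other control characters (like `\u0001` above) are legal in Postgres JSON once escaped; only `\u0000` is categorically rejected, which is why a targeted strip is enough here.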
(Assignee)

Comment 12

5 months ago
(In reply to Frank Bertsch [:frank] from comment #11)
> It finished in <7 hours (the cluster was actually 9 nodes), but with an
> error I haven't seen before:
> 
> DataError: unsupported Unicode escape sequence
> DETAIL:  \u0000 cannot be converted to text.
> CONTEXT:  JSON data, line 1:
> ...[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":...
> COPY staging_build_id_beta_50_20161104, line 84332, column dimensions:
> "{"metric":"[[COUNT]]_JS_TELEMETRY_ADDON_EXCEPTIONS","label":
> "\u0001\u0000\u0000\u0000\u7000\ub82c\u0..."
> 
> This is a postgres error.

Note that this killed the job, so the time is unreliable (it might not have processed everything).
(Assignee)

Comment 13

5 months ago
Finished running 11/14. No errors, it took 7.2 hours on a 9-node cluster. Next steps are as follows:

1. Backtrack and figure out what caused the error above
2. Extend the Airflow job to include a longer timeout in addition to more machines
(Assignee)

Comment 14

5 months ago
Just submitted a PR to fix the Postgres error. Once we merge that, I'll re-PR the main pings patch.
(Assignee)

Comment 15

4 months ago
There was some concern that increasing the number of machines would make the DB extremely slow to respond to requests. I have confirmed that the increase in tasks is not overloading the machine and that the DB is still reachable. In addition, we can implement some CloudWatch alarms or monitors to ensure that the DB is not overloaded.
(Assignee)

Comment 16

4 months ago
The latest run took 6.5 hours on 10 nodes. With the bump in nodes I think we should be good to go, and we'll keep an eye on the job for the first few days (in addition to the monitoring I mentioned in the previous comment).

Comment 17

4 months ago
Created attachment 8818929 [details] [review]
[telemetry-airflow] fbertsch:update_aggregates_job > mozilla:master
(Assignee)

Updated

4 months ago
Status: REOPENED → RESOLVED
Last Resolved: 4 months ago
Resolution: --- → FIXED

Comment 18

4 months ago
So, uh, we forgot that Fennec doesn't send "main" pings: https://mzl.la/2hYI4mi

This means that we have no aggregates for Firefox Mobile after December 14, 2016.
Depends on: 1329228