Bug 1472621 (Closed): Opened 6 years ago, Closed 6 years ago

TMO aggregator has been failing since June 30, 2018

Categories

(Data Platform and Tools :: General, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: klukas)

Details

Looks like the data on TMO is stuck at last Friday. Airflow seems to confirm that the job has been failing since then.
Hey, :klukas, any ideas?
Flags: needinfo?(jklukas)
Note: I wonder if this is related to bug 1472627
If bug 1472627 was caused by the Events.yaml problem, then I don't think it's likely related. The aggregator doesn't care about anything that isn't a histogram, scalar, or simpleMeasurement.
Are there any failure logs anywhere to look at?
Note that the Airflow job is configured with "depends_on_past", so the one failure on 6/30 has prevented subsequent runs from being scheduled. I'm taking a look at logs to see if I can understand why 6/30 failed.
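For context, a minimal sketch of that scheduling behavior, assuming an Airflow 1.x-style DAG; the DAG id, owner, schedule, and task below are illustrative, not the actual telemetry-airflow definition:

# Illustrative sketch only, not the real telemetry-airflow DAG:
# with depends_on_past=True, Airflow will not schedule a day's task until the
# previous day's instance succeeded, so one failure blocks all later runs.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "telemetry",                  # hypothetical owner
    "depends_on_past": True,               # the 6/30 failure holds back 7/1, 7/2, ...
    "start_date": datetime(2018, 6, 1),
    "retries": 2,
    "retry_delay": timedelta(minutes=30),
}

dag = DAG("example_aggregate_view",        # hypothetical DAG id
          default_args=default_args,
          schedule_interval="@daily")

aggregate = BashOperator(
    task_id="aggregate",
    bash_command="echo 'run the aggregation job here'",  # placeholder command
    dag=dag,
)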
Assignee: nobody → jklukas
Flags: needinfo?(jklukas)
Priority: -- → P1
Found a log that looks to show the failure [0]:

DataError: value "10141356896210918027" is out of range for type bigint

So this may be due to particulars of the data rather than any code regression. I'll try to reproduce on the DB directly.

[0] https://telemetry-airflow.s3-us-west-2.amazonaws.com/logs/frank%40mozilla.com/Telemetry Aggregate View/j-25D1GA2RQ3WW5/node/i-0162dc5ee3444fde9/applications/spark/spark.log.gz
I don't particularly know what to expect for values in these histograms, but it does appear that build_id_beta_62_20180625 contains at least one particularly large value:

> select max(x) from (select dimensions, unnest(histogram) as x from build_id_beta_62_20180625) as s;
         max
---------------------
 7594948312643864240

which is about 80% the size of the max allowable bigint in postgres [1], so the problem looks to be that the aggregation job is trying to add another large value to this already massive integer. This may need to wait until :frank is back to help wade through the logic and understand if something is out of whack causing these huge values to appear.

How urgent is this problem? Is it feasible to wait until early next week?

[1] https://www.postgresql.org/docs/10/static/datatype-numeric.html
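A quick arithmetic check of the figures above (plain Python, no external dependencies; BIGINT_MAX is the documented Postgres bigint ceiling, and the two large values are taken from the query above and the Airflow log):

BIGINT_MAX = 2**63 - 1             # 9223372036854775807, Postgres bigint upper bound
observed = 7594948312643864240     # max(x) from build_id_beta_62_20180625
failing = 10141356896210918027     # value from the DataError in the Airflow log

print(observed / BIGINT_MAX)       # ~0.82, i.e. roughly 80% of the bigint range
print(failing > BIGINT_MAX)        # True: the aggregated sum no longer fits in a bigint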
Priority: P1 → P2
Here's the specific metric containing the huge value:

select * from build_id_beta_62_20180625 where histogram @> ARRAY[7594948312643864240];

-[ RECORD 1 ]-------------------------------------------------------------------------
dimensions | {"os": "Windows_NT", "child": "true", "label": "dl", "metric": "CONTENT_SMALL_PAINT_PHASE_WEIGHT", "osVersion": "6.1", "application": "Firefox", "architecture": "x86"}
histogram  | {4228796587,281492766,1605874755,5794721423,12484094600,12637154836,9152820930,3940763966,3918114875,3833815464,7594948312643864240,3713482}
It's a keyed metric, so I expected it to have a "#keyname" suffix or something. "paint phases in content" sounds like something that might legitimately have a large value in it. Can we perform a truncating addition to keep it within representable values? I'm not a fan of how it'll unevenly truncate, but I don't see many other choices.
Truncating addition may be the only reasonable option. We'll get frank's take and attack this early next week.
Flags: needinfo?(fbertsch)
Priority: P2 → P1
Discussed with Frank this morning and we settled on truncating, but logging what the value would have been so we have some way of knowing the magnitude of truncation. Posted a PR: https://github.com/mozilla/python_mozaggregator/pull/99
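For reference, a hedged sketch of the truncate-and-log approach described above; the function and variable names are illustrative and not necessarily what the PR actually uses:

import logging

logger = logging.getLogger(__name__)
BIGINT_MAX = 2**63 - 1  # largest value Postgres can store in a bigint column

def add_histograms_clamped(left, right, metric_name="unknown"):
    """Element-wise sum of two histogram arrays, clamped to the bigint range."""
    summed = []
    for a, b in zip(left, right):
        total = a + b
        if total > BIGINT_MAX:
            # Log what the value would have been so the magnitude of the
            # truncation is recoverable later.
            logger.warning("Truncating %s: %d exceeds bigint max by %d",
                           metric_name, total, total - BIGINT_MAX)
            total = BIGINT_MAX
        summed.append(total)
    return summed

# The failing value from the log (7594948312643864240 + 2546408583567053787
# = 10141356896210918027) would now be clamped instead of raising a DataError:
print(add_histograms_clamped([7594948312643864240], [2546408583567053787],
                             metric_name="CONTENT_SMALL_PAINT_PHASE_WEIGHT"))
# -> [9223372036854775807]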
Flags: needinfo?(fbertsch)
PR is merged and I've cleared Airflow state for the failed job. It's running again and we should know in ~5 hours whether it completes successfully. Assuming it does, the subsequent days' jobs will start running as well and we should be caught up to present by EOD tomorrow.
When this is resolved we'll need a PR to telemetry-dashboard to comment out the error notice on the Measurement Dashboard.
We've worked through 4 days so far, at about 5 hours per job. 20180704 is currently processing, so we likely won't be completely up to date until Friday. Makes sense to keep the error notice until then.
Airflow is all green for this job again! We can remove the error notice.
Posted a PR to remove the error notice: https://github.com/mozilla/telemetry-dashboard/pull/568 Can :chutten or someone else review, and then we can close this out?
PR 568 merged & deployed!
Thanks Jeff and Jan-Erik!
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Telemetry Aggregation Service → General