Closed Bug 1472621
Opened 6 years ago
Closed 6 years ago

TMO aggregator has been failing since 30th of June 2018

Categories: Data Platform and Tools :: General, defect, P1
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: Dexter, Assigned: klukas

Reporter
Description • 6 years ago
Looks like the data on TMO is stuck at last Friday. Airflow seems to confirm that the job has been failing since then.
Reporter
Comment 2 • 6 years ago
Note: I wonder if this is related to bug 1472627
Comment 3 • 6 years ago
If bug 1472627 was caused by the Events.yaml problem, then I don't think it's likely to be related. The aggregator doesn't care about non-histogram/non-scalar/non-simpleMeasurements data.
Comment 4 • 6 years ago
Are there any failure logs anywhere to look at?
Assignee
Comment 5 • 6 years ago
Note that the Airflow job is marked as "depends on history" so the one failure on 6/30 has prevented subsequent runs from being scheduled. I'm taking a look at logs to see if I can understand why 6/30 failed.
Assignee: nobody → jklukas
Flags: needinfo?(jklukas)
Priority: -- → P1
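For context, a minimal sketch of the Airflow behavior in question, assuming the job uses the standard depends_on_past setting; the DAG and task names below are illustrative, not the actual telemetry-airflow definitions:

# Illustrative sketch only: with depends_on_past=True, a task instance is
# not scheduled until the same task succeeded for the previous execution
# date, so a single failed day blocks all later runs until it is cleared.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "telemetry",
    "depends_on_past": True,  # the 6/30 failure holds back 7/1, 7/2, ...
    "retries": 2,
    "retry_delay": timedelta(minutes=30),
}

dag = DAG(
    "example_aggregate_view",  # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2018, 6, 1),
    schedule_interval="@daily",
)

run_aggregator = BashOperator(
    task_id="run_aggregator",
    bash_command="echo 'run mozaggregator for {{ ds }}'",
    dag=dag,
)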
Assignee
Comment 6 • 6 years ago
Found a log that looks to show the failure [0]:
DataError: value "10141356896210918027" is out of range for type bigint
So this may be due to particulars of the data rather than any code regression. I'll try to reproduce on the DB directly.
[0] https://telemetry-airflow.s3-us-west-2.amazonaws.com/logs/frank%40mozilla.com/Telemetry Aggregate View/j-25D1GA2RQ3WW5/node/i-0162dc5ee3444fde9/applications/spark/spark.log.gz
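As a quick sanity check, the failing value is indeed beyond what a postgres bigint (a signed 64-bit integer) can hold:

BIGINT_MAX = 2**63 - 1                 # 9223372036854775807

failing_value = 10141356896210918027   # value from the DataError above

print(failing_value > BIGINT_MAX)      # True -> out of range for bigint
print(failing_value - BIGINT_MAX)      # ~9.2e17 over the limit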
Assignee
Comment 7 • 6 years ago
I don't particularly know what to expect for values in these histograms, but it does appear that build_id_beta_62_20180625 contains at least one particularly large value:
> select max(x) from (select dimensions, unnest(histogram) as x from build_id_beta_62_20180625) as s;
max
---------------------
7594948312643864240
which is about 80% the size of the max allowable bigint in postgres [1], so the problem looks to be that the aggregation job is trying to add another large value to this already massive integer.
This may need to wait until :frank is back to help wade through the logic and understand if something is out of whack causing these huge values to appear. How urgent is this problem? Is it feasible to wait until early next week?
[1] https://www.postgresql.org/docs/10/static/datatype-numeric.html
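For the record, a back-of-the-envelope check of that figure, assuming bigint is a signed 64-bit integer per the linked docs:

BIGINT_MAX = 2**63 - 1

max_histogram_value = 7594948312643864240   # from the query above

print(max_histogram_value / BIGINT_MAX)     # ~0.82
print(BIGINT_MAX - max_histogram_value)     # ~1.6e18 of headroom before overflow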
Assignee
Updated • 6 years ago
Priority: P1 → P2
Assignee
Comment 8 • 6 years ago
Here's the specific metric containing the huge value:
select * from build_id_beta_62_20180625 where histogram @> ARRAY[7594948312643864240];
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
dimensions | {"os": "Windows_NT", "child": "true", "label": "dl", "metric": "CONTENT_SMALL_PAINT_PHASE_WEIGHT", "osVersion": "6.1", "application": "Firefox", "architecture": "x86"}
histogram | {4228796587,281492766,1605874755,5794721423,12484094600,12637154836,9152820930,3940763966,3918114875,3833815464,7594948312643864240,3713482}
Comment 9 • 6 years ago
It's a keyed metric, so I expected it to have a "#keyname" suffix or something.
"paint phases in content" sounds like something that might legitimately have a large value in it. Can we perform a truncating addition to keep it within representable values? I'm not a fan of how it'll unevenly truncate, but I don't see many other choices.
Assignee
Comment 10 • 6 years ago
Truncating addition may be the only reasonable option. We'll get frank's take and attack this early next week.
Flags: needinfo?(fbertsch)
Assignee
Updated • 6 years ago
Priority: P2 → P1
Assignee
Comment 11 • 6 years ago
Discussed with Frank this morning and we settled on truncating, but logging what the value would have been so we have some way of knowing the magnitude of truncation.
Posted a PR: https://github.com/mozilla/python_mozaggregator/pull/99
Flags: needinfo?(fbertsch)
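A rough sketch of the truncate-and-log approach described above (illustrative only; the real change lives in the linked python_mozaggregator PR, and the function and its metric parameter here are hypothetical):

import logging

logger = logging.getLogger(__name__)

BIGINT_MAX = 2**63 - 1

def clamp_to_bigint(value, metric=None):
    """Clamp an aggregated count to the postgres bigint range, logging overflows."""
    if value > BIGINT_MAX:
        logger.warning(
            "Truncating aggregate for %s: %d exceeds bigint max by %d",
            metric, value, value - BIGINT_MAX,
        )
        return BIGINT_MAX
    return value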
Assignee
Comment 12 • 6 years ago
PR is merged and I've cleared Airflow state for the failed job. It's running again and we should know in ~5 hours whether it completes successfully. Assuming it does, the subsequent days' jobs will start running as well and we should be caught up to the present by EOD tomorrow.
Comment 13 • 6 years ago
When this is resolved we'll need a PR to telemetry-dashboard to comment out the error notice on the Measurement Dashboard.
Assignee
Comment 14 • 6 years ago
We've worked through 4 days of backfill so far, taking about 5 hours per job. 20180704 is currently processing, so we likely won't be completely up to date until Friday. It makes sense to keep the error notice until then.
Assignee
Comment 15 • 6 years ago
Airflow is all green for this job again! We can remove the error notice.
Assignee
Comment 16 • 6 years ago
Posted a PR to remove the error notice: https://github.com/mozilla/telemetry-dashboard/pull/568
Can :chutten or someone else review, so we can close this out?
Comment 17 • 6 years ago
PR 568 merged & deployed!
Comment 18 • 6 years ago
Thanks Jeff and Jan-Erik!
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated • 2 years ago
Component: Telemetry Aggregation Service → General