Bug 1472621 (Closed): Opened 6 years ago, Closed 6 years ago

TMO aggregator has been failing since June 30, 2018

Categories

(Data Platform and Tools :: General, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: klukas)

Details

Looks like the data on TMO is stuck at last Friday. Airflow seems to confirm that the job has been failing since then.
Hey, :klukas, any ideas?
Flags: needinfo?(jklukas)
Note: I wonder if this is related to bug 1472627
If bug 1472627 was caused by the Events.yaml problem, then I don't think it's likely related. The aggregator doesn't care about anything that isn't a histogram, scalar, or simpleMeasurement.
Are there any failure logs anywhere to look at?
Note that the Airflow job is configured with "depends_on_past", so the one failure on 6/30 has prevented subsequent runs from being scheduled. I'm taking a look at logs to see if I can understand why 6/30 failed.
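For context, a minimal sketch of that scheduling behavior, assuming an Airflow 1.x-style DAG; the DAG id, owner, schedule, and task below are illustrative, not the actual telemetry-airflow definition:

# Illustrative sketch only, not the real telemetry-airflow DAG:
# with depends_on_past=True, Airflow will not schedule a day's task until the
# previous day's instance succeeded, so one failure blocks all later runs.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "telemetry",                  # hypothetical owner
    "depends_on_past": True,               # the 6/30 failure holds back 7/1, 7/2, ...
    "start_date": datetime(2018, 6, 1),
    "retries": 2,
    "retry_delay": timedelta(minutes=30),
}

dag = DAG("example_aggregate_view",        # hypothetical DAG id
          default_args=default_args,
          schedule_interval="@daily")

aggregate = BashOperator(
    task_id="aggregate",
    bash_command="echo 'run the aggregation job here'",  # placeholder command
    dag=dag,
)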
Assignee: nobody → jklukas
Flags: needinfo?(jklukas)
Priority: -- → P1
Found a log that looks to show the failure [0]:

DataError: value "10141356896210918027" is out of range for type bigint

So this may be due to particulars of the data rather than any code regression. I'll try to reproduce on the DB directly.

[0] https://telemetry-airflow.s3-us-west-2.amazonaws.com/logs/frank%40mozilla.com/Telemetry Aggregate View/j-25D1GA2RQ3WW5/node/i-0162dc5ee3444fde9/applications/spark/spark.log.gz
I don't particularly know what to expect for values in these histograms, but it does appear that build_id_beta_62_20180625 contains at least one particularly large value:

> select max(x) from (select dimensions, unnest(histogram) as x from build_id_beta_62_20180625) as s;
         max
---------------------
 7594948312643864240

which is about 80% the size of the max allowable bigint in postgres [1], so the problem looks to be that the aggregation job is trying to add another large value to this already massive integer. This may need to wait until :frank is back to help wade through the logic and understand if something is out of whack causing these huge values to appear.

How urgent is this problem? Is it feasible to wait until early next week?

[1] https://www.postgresql.org/docs/10/static/datatype-numeric.html
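A quick arithmetic check of the figures above (plain Python, no external dependencies; BIGINT_MAX is the documented Postgres bigint ceiling, and the two large values are taken from the query above and the Airflow log):

BIGINT_MAX = 2**63 - 1             # 9223372036854775807, Postgres bigint upper bound
observed = 7594948312643864240     # max(x) from build_id_beta_62_20180625
failing = 10141356896210918027     # value from the DataError in the Airflow log

print(observed / BIGINT_MAX)       # ~0.82, i.e. roughly 80% of the bigint range
print(failing > BIGINT_MAX)        # True: the aggregated sum no longer fits in a bigint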
Priority: P1 → P2
Here's the specific metric containing the huge value:

select * from build_id_beta_62_20180625 where histogram @> ARRAY[7594948312643864240];

-[ RECORD 1 ]-------------------------------------------------------------------------
dimensions | {"os": "Windows_NT", "child": "true", "label": "dl", "metric": "CONTENT_SMALL_PAINT_PHASE_WEIGHT", "osVersion": "6.1", "application": "Firefox", "architecture": "x86"}
histogram  | {4228796587,281492766,1605874755,5794721423,12484094600,12637154836,9152820930,3940763966,3918114875,3833815464,7594948312643864240,3713482}
It's a keyed metric, so I expected it to have a "#keyname" suffix or something. "paint phases in content" sounds like something that might legitimately have a large value in it. Can we perform a truncating addition to keep it within representable values? I'm not a fan of how it'll unevenly truncate, but I don't see many other choices.
Truncating addition may be the only reasonable option. We'll get frank's take and attack this early next week.
Flags: needinfo?(fbertsch)
Priority: P2 → P1
Discussed with Frank this morning and we settled on truncating, but logging what the value would have been so we have some way of knowing the magnitude of truncation. Posted a PR: https://github.com/mozilla/python_mozaggregator/pull/99
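For reference, a hedged sketch of the truncate-and-log approach described above; the function and variable names are illustrative and not necessarily what the PR actually uses:

import logging

logger = logging.getLogger(__name__)
BIGINT_MAX = 2**63 - 1  # largest value Postgres can store in a bigint column

def add_histograms_clamped(left, right, metric_name="unknown"):
    """Element-wise sum of two histogram arrays, clamped to the bigint range."""
    summed = []
    for a, b in zip(left, right):
        total = a + b
        if total > BIGINT_MAX:
            # Log what the value would have been so the magnitude of the
            # truncation is recoverable later.
            logger.warning("Truncating %s: %d exceeds bigint max by %d",
                           metric_name, total, total - BIGINT_MAX)
            total = BIGINT_MAX
        summed.append(total)
    return summed

# The failing value from the log (7594948312643864240 + 2546408583567053787
# = 10141356896210918027) would now be clamped instead of raising a DataError:
print(add_histograms_clamped([7594948312643864240], [2546408583567053787],
                             metric_name="CONTENT_SMALL_PAINT_PHASE_WEIGHT"))
# -> [9223372036854775807]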
Flags: needinfo?(fbertsch)
PR is merged and I've cleared Airflow state for the failed job. It's running again and we should know in ~5 hours whether it completes successfully. Assuming it does, the subsequent days' jobs will start running as well and we should be caught up to present by EOD tomorrow.
When this is resolved we'll need a PR to telemetry-dashboard to comment out the error notice on the Measurement Dashboard.
We've worked through 4 days so far, at about 5 hours per job. 20180704 is currently processing, so we likely won't be completely up to date until Friday. Makes sense to keep the error notice until then.
Airflow is all green for this job again! We can remove the error notice.
Posted a PR to remove the error notice: https://github.com/mozilla/telemetry-dashboard/pull/568 Can :chutten or someone else review, and then we can close this out?
PR 568 merged & deployed!
Thanks Jeff and Jan-Erik!
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Telemetry Aggregation Service → General