I'm seeing some invalid JSON data in the 'payload.histograms' and 'payload.keyedHistograms' fields in data for the above day. None of the other backfilled days (20160704 through 20160709) appear to be affected.
My fennec-dashboard backfill was failing with possibly similar errors: json.loads() in heka_message_parser.py was throwing "ValueError: Unexpected character found when decoding object value". However, this is using "core" pings, not "main" pings.
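The failure mode above can be reproduced with a minimal sketch. This is not the actual heka_message_parser.py code; `parse_payload` is a hypothetical helper showing only the defensive json.loads() pattern around a possibly-corrupt payload string.

```python
import json

def parse_payload(raw):
    """Parse a ping's Payload field, returning None on invalid JSON.

    Hypothetical sketch: the real record layout and parser are in
    heka_message_parser.py and are not reproduced here.
    """
    try:
        return json.loads(raw)
    except ValueError as e:
        # Corrupt records surface here as a ValueError from the JSON decoder.
        print("skipping corrupt payload: %s" % e)
        return None
```

A backfill job could use this to skip corrupt records instead of crashing mid-run.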
Looking through all the core pings for 20160705, I was able to find at least one file that had a corrupted snappy/protobuf stream, with records whose Payload field was not valid JSON (despite e.g. a correct clientId field, which was derived from the parsed data). This suggests some kind of corruption, but I don't know where it's happening. In the meantime, I've started to re-run the backfill using a single instance (previously this was parallelized by hour), but since I didn't change anything else the corruption will probably still be there. The new data will be at s3://net-mozaws-prod-us-west-2-pipeline-analysis/backfill_bug1285621_5/, and the code being used to backfill (telemetry-backfill-bug1285621-0.4.tar.gz) is at s3://telemetry-analysis-code-2/jobs/telemetry-backfill-bug1285621/telemetry-backfill-bug1285621-0.4.tar.gz
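The "valid clientId but invalid Payload" symptom can be detected with a scan like the following. This is a hypothetical sketch: it assumes the snappy/protobuf stream has already been decoded into dicts with "clientId" and "Payload" fields, which is not how the real pipeline code is structured.

```python
import json

def find_corrupt_records(records):
    """Flag records whose Payload field is not valid JSON.

    `records` is assumed to be an iterable of dicts already extracted
    from the snappy/protobuf stream; the decoding itself is not shown.
    """
    corrupt = []
    for i, rec in enumerate(records):
        try:
            json.loads(rec.get("Payload", ""))
        except ValueError:
            # clientId can still look fine because it was derived from
            # parsed data before the payload bytes were corrupted.
            corrupt.append((i, rec.get("clientId")))
    return corrupt
```

Running this over a suspect file gives the record indices and clientIds to inspect by hand.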
Assignee: nobody → whd
Points: --- → 2
Priority: -- → P1
The new backfill looks good, so we've kicked off the delete-replace cycle for that day, which should finish in a couple of hours.
The delete-replace cycle is complete. It looks like there were fewer lambda indexing errors (due to only backfilling one day), so I don't think we need to do the simpledb reindex.
Per :gfritzsche, after backfilling the core ping data it looks like we actually need to redo the simpledb index again. NI'ing :rvitillo since I still don't know how to do this.
[hadoop@ip-172-31-13-141 moztelemetry]$ python filter_service.py -f 20160705 -t 20160705
Bucket: net-mozaws-prod-us-west-2-pipeline-data - Prefix: telemetry-2 - Date: 20160705
Looked at 0 total records in 0.784965 seconds, added 0
Looked at 100000 total records in 113.501136 seconds, added 100000
Overall, added 185856 of 185856 in 208.982343 seconds
Filter service stats:
Note that the following numbers are correct only if there isn't another entity concurrently pushing new submissions:
Day 20160705 - previous: 194709, current: 380564, added: 185855 (48.8367265427% missing)
AWS lambda stats:
Day 20160705 - lambda 194708, total 380564, (48.8369893106% missing)
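For reference, the "% missing" figure in the output above is the fraction of the day's final record count that was absent before the backfill. A minimal sketch (the function name is hypothetical; the formula is inferred from the numbers printed above):

```python
def percent_missing(previous, current):
    """Percent of the final record count that was missing before backfill:
    (current - previous) / current * 100, matching the filter service output.
    """
    added = current - previous
    return added / float(current) * 100

# e.g. day 20160705: previous=194709, current=380564 -> ~48.84% missing
```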
I still see heavy drops in the 20160701 - 20160707 timeframe, for both "core" ping & client counts. I mailed a gist off-bug due to DAU numbers. Could the corruption affect more than just 20160705?
After more simpledb index refreshing, missing data in the 20160701 - 20160708 time frame is now fixed. Quoting from mail:
Day 20160701 - previous: 333230, current: 545730, added: 212500 (38.9386693053% missing)
Day 20160702 - previous: 219267, current: 428817, added: 209550 (48.8669992095% missing)
Day 20160703 - previous: 213955, current: 417777, added: 203822 (48.7872716784% missing)
Day 20160704 - previous: 86640, current: 183390, added: 96750 (52.7564207427% missing)
Day 20160705 - previous: 380564, current: 380564, added: 0 (0.0% missing)
Day 20160706 - previous: 82517, current: 177966, added: 95449 (53.6332782666% missing)
Day 20160707 - previous: 82166, current: 178316, added: 96150 (53.9211287826% missing)
Day 20160708 - previous: 142590, current: 173790, added: 31200 (17.9527015363% missing)
Day 20160709 - previous: 168564, current: 168564, added: 0 (0.0% missing)
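As a sanity check, the per-day stats quoted above can be tallied. The values below are copied from the mail; the script just sums the "added" column and lists which days actually changed.

```python
# (day, previous, current) as quoted in the mail above
days = [
    ("20160701", 333230, 545730),
    ("20160702", 219267, 428817),
    ("20160703", 213955, 417777),
    ("20160704",  86640, 183390),
    ("20160705", 380564, 380564),
    ("20160706",  82517, 177966),
    ("20160707",  82166, 178316),
    ("20160708", 142590, 173790),
    ("20160709", 168564, 168564),
]

# Total records recovered across the reindexed range.
total_added = sum(cur - prev for _, prev, cur in days)

# Days that were already fully indexed show added == 0.
fixed_days = [d for d, prev, cur in days if cur > prev]
```

This confirms 20160705 and 20160709 needed no further fixing, consistent with the earlier comments.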
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED