Closed
Bug 1287585
Opened 8 years ago
Closed 8 years ago
Backfill: Some data for 20160705 appears to be corrupted
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mreid, Assigned: whd)
References
Details
(Whiteboard: [SvcOps])
I'm seeing some invalid JSON data in the 'payload.histograms' and 'payload.keyedHistograms' fields in data for the above day. None of the other backfilled days (20160704 through 20160709) appear to be affected.
Reporter
Updated•8 years ago
Comment 1•8 years ago
My fennec-dashboard backfill was failing with possibly similar errors: json.loads() in heka_message_parser.py was throwing "ValueError: Unexpected character in found when decoding object value". However, this is using "core" pings, not "main" pings.
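A minimal sketch of how such records could be flagged instead of aborting the job on the first ValueError. It assumes each record is a flat dict whose JSON-bearing fields hold raw strings; the helper name find_invalid_json_fields and the dict layout are hypothetical and are not taken from heka_message_parser.py.

```python
import json

# Field names taken from the bug description; the flat-dict record layout
# is an assumption made for illustration only.
JSON_FIELDS = ("payload.histograms", "payload.keyedHistograms")


def find_invalid_json_fields(record, fields=JSON_FIELDS):
    """Return the names of the fields whose string value is not valid JSON."""
    invalid = []
    for name in fields:
        raw = record.get(name)
        if raw is None:
            continue  # field absent: nothing to validate
        try:
            json.loads(raw)
        except ValueError:
            # json.loads raises ValueError (or a subclass) on malformed input
            invalid.append(name)
    return invalid


if __name__ == "__main__":
    records = [
        {"payload.histograms": "{}", "payload.keyedHistograms": "{}"},
        {"payload.histograms": '{"GC_MS": ', "payload.keyedHistograms": "{}"},
    ]
    for index, record in enumerate(records):
        bad = find_invalid_json_fields(record)
        if bad:
            print("record %d has invalid JSON in: %s" % (index, ", ".join(bad)))
```

Collecting the failing field names per record, rather than raising, makes it possible to tell whether the corruption is confined to particular files or hours.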
Assignee
Comment 2•8 years ago
Looking through all the core pings for 20160705, I found at least one file that had a corrupted snappy/protobuf stream, as well as records where the Payload field was not valid JSON (despite a correct clientId field, for example, which was derived from parsed data). This suggests some kind of corruption, but I don't know where it is introduced. In the meantime, I've started re-running the backfill on a single instance (previously it was parallelized by hour), but since I didn't change anything else, the corruption will probably still be there. The new data will be at s3://net-mozaws-prod-us-west-2-pipeline-analysis/backfill_bug1285621_5/, and the code being used to backfill is at s3://telemetry-analysis-code-2/jobs/telemetry-backfill-bug1285621/telemetry-backfill-bug1285621-0.4.tar.gz
Updated•8 years ago
Assignee: nobody → whd
Points: --- → 2
Priority: -- → P1
Whiteboard: [SvcOps]
Assignee
Comment 3•8 years ago
The new backfill looks good, so we've kicked off the delete-replace cycle for that day, which should finish in a couple of hours.
Assignee
Comment 4•8 years ago
The delete-replace cycle is complete. It looks like there were fewer lambda indexing errors (due to only backfilling one day), so I don't think we need to do the simpledb reindex.
Assignee
Comment 5•8 years ago
Per :gfritzsche backfilling core ping data, it looks like we actually need to redo the simpledb index again. NI'ing :rvitillo since I still don't know how to do this.
Flags: needinfo?(rvitillo)
Comment 6•8 years ago
[hadoop@ip-172-31-13-141 moztelemetry]$ python filter_service.py -f 20160705 -t 20160705
Bucket: net-mozaws-prod-us-west-2-pipeline-data - Prefix: telemetry-2 - Date: 20160705
Looked at 0 total records in 0.784965 seconds, added 0
Looked at 100000 total records in 113.501136 seconds, added 100000
Overall, added 185856 of 185856 in 208.982343 seconds
Filter service stats:
Note that the following numbers are correct only if there isn't another entity concurrently pushing new submissions:
Day 20160705 - previous: 194709, current: 380564, added: 185855 (48.8367265427% missing)
AWS lambda stats:
Day 20160705 - lambda 194708, total 380564, (48.8369893106% missing)
Flags: needinfo?(rvitillo)
Comment 7•8 years ago
I still see heavy drops in the 20160701 - 20160707 timeframe, for both "core" ping & client counts. I mailed a gist off-bug due to DAU numbers. Could the corruption affect more than just 20160705?
Comment 8•8 years ago
After more simpledb index refreshing, missing data in the 20160701 - 20160708 time frame is now fixed. Quoting from mail:
Day 20160701 - previous: 333230, current: 545730, added: 212500 (38.9386693053% missing)
Day 20160702 - previous: 219267, current: 428817, added: 209550 (48.8669992095% missing)
Day 20160703 - previous: 213955, current: 417777, added: 203822 (48.7872716784% missing)
Day 20160704 - previous: 86640, current: 183390, added: 96750 (52.7564207427% missing)
Day 20160705 - previous: 380564, current: 380564, added: 0 (0.0% missing)
Day 20160706 - previous: 82517, current: 177966, added: 95449 (53.6332782666% missing)
Day 20160707 - previous: 82166, current: 178316, added: 96150 (53.9211287826% missing)
Day 20160708 - previous: 142590, current: 173790, added: 31200 (17.9527015363% missing)
Day 20160709 - previous: 168564, current: 168564, added: 0 (0.0% missing)
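For readers checking these figures: the "% missing" values appear to be the number of added records as a share of the current total, i.e. 100 * (current - previous) / current. The sketch below reproduces that arithmetic. The function name percent_missing is hypothetical and this is not the filter_service.py implementation, only a reading of its output.

```python
def percent_missing(previous, current):
    """Share of the current index that was absent before the refresh, in percent."""
    if current == 0:
        return 0.0  # nothing indexed at all, so nothing can be reported missing
    added = current - previous
    return 100.0 * added / current


if __name__ == "__main__":
    # (day, previous, current) triples copied from the quoted stats
    stats = [
        ("20160701", 333230, 545730),
        ("20160705", 380564, 380564),
        ("20160708", 142590, 173790),
    ]
    for day, previous, current in stats:
        print("Day %s - added: %d (%.4f%% missing)"
              % (day, current - previous, percent_missing(previous, current)))
```

Under this reading, a day such as 20160705, where previous equals current, reports 0.0% missing because the refresh found nothing new to add.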
Reporter
Updated•8 years ago
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard