I'm seeing some invalid JSON data in the 'payload.histograms' and 'payload.keyedHistograms' fields in data for the above day. None of the other backfilled days (20160704 through 20160709) appear to be affected.
My fennec-dashboard backfill was failing with possibly similar errors: json.loads() in heka_message_parser.py was throwing "ValueError: Unexpected character found when decoding object value". However, this is using "core" pings, not "main" pings.
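The failure mode above can be reproduced with a minimal sketch. This is not the actual heka_message_parser.py code; `parse_payload` is a hypothetical helper showing only the defensive json.loads() pattern around a possibly-corrupt payload string.

```python
import json

def parse_payload(raw):
    """Parse a ping's Payload field, returning None on invalid JSON.

    Hypothetical sketch: the real record layout and parser are in
    heka_message_parser.py and are not reproduced here.
    """
    try:
        return json.loads(raw)
    except ValueError as e:
        # Corrupt records surface here as a ValueError from the JSON decoder.
        print("skipping corrupt payload: %s" % e)
        return None
```

A backfill job could use this to skip corrupt records instead of crashing mid-run.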
Looking through all the core pings for 20160705, I was able to find at least one file that had a corrupted snappy/protobuf stream, with records whose Payload field was not valid JSON (despite e.g. a correct clientId field, which was derived from the parsed data). This suggests some kind of corruption, but I don't know where it's happening. In the meantime, I've started to re-run the backfill using a single instance (previously this was parallelized by hour), but since I didn't change anything else the corruption will probably still be there. The new data will be at s3://net-mozaws-prod-us-west-2-pipeline-analysis/backfill_bug1285621_5/, and the code being used to backfill (telemetry-backfill-bug1285621-0.4.tar.gz) is at s3://telemetry-analysis-code-2/jobs/telemetry-backfill-bug1285621/telemetry-backfill-bug1285621-0.4.tar.gz
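The "valid clientId but invalid Payload" symptom can be detected with a scan like the following. This is a hypothetical sketch: it assumes the snappy/protobuf stream has already been decoded into dicts with "clientId" and "Payload" fields, which is not how the real pipeline code is structured.

```python
import json

def find_corrupt_records(records):
    """Flag records whose Payload field is not valid JSON.

    `records` is assumed to be an iterable of dicts already extracted
    from the snappy/protobuf stream; the decoding itself is not shown.
    """
    corrupt = []
    for i, rec in enumerate(records):
        try:
            json.loads(rec.get("Payload", ""))
        except ValueError:
            # clientId can still look fine because it was derived from
            # parsed data before the payload bytes were corrupted.
            corrupt.append((i, rec.get("clientId")))
    return corrupt
```

Running this over a suspect file gives the record indices and clientIds to inspect by hand.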
Assignee: nobody → whd
Points: --- → 2
Priority: -- → P1
The new backfill looks good, so we've kicked off the delete-replace cycle for that day, which should finish in a couple of hours.
The delete-replace cycle is complete. It looks like there were fewer lambda indexing errors (due to only backfilling one day), so I don't think we need to do the simpledb reindex.
Per :gfritzsche, after backfilling the core ping data it looks like we actually need to redo the simpledb index again. NI'ing :rvitillo since I still don't know how to do this.
[hadoop@ip-172-31-13-141 moztelemetry]$ python filter_service.py -f 20160705 -t 20160705
Bucket: net-mozaws-prod-us-west-2-pipeline-data - Prefix: telemetry-2 - Date: 20160705
Looked at 0 total records in 0.784965 seconds, added 0
Looked at 100000 total records in 113.501136 seconds, added 100000
Overall, added 185856 of 185856 in 208.982343 seconds
Filter service stats:
Note that the following numbers are correct only if there isn't another entity concurrently pushing new submissions:
Day 20160705 - previous: 194709, current: 380564, added: 185855 (48.8367265427% missing)
AWS lambda stats:
Day 20160705 - lambda 194708, total 380564, (48.8369893106% missing)
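For reference, the "% missing" figure in the output above is the fraction of the day's final record count that was absent before the backfill. A minimal sketch (the function name is hypothetical; the formula is inferred from the numbers printed above):

```python
def percent_missing(previous, current):
    """Percent of the final record count that was missing before backfill:
    (current - previous) / current * 100, matching the filter service output.
    """
    added = current - previous
    return added / float(current) * 100

# e.g. day 20160705: previous=194709, current=380564 -> ~48.84% missing
```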
I still see heavy drops in the 20160701 - 20160707 timeframe, for both "core" ping & client counts. I mailed a gist off-bug due to DAU numbers. Could the corruption affect more than just 20160705?
After more simpledb index refreshing, missing data in the 20160701 - 20160708 time frame is now fixed. Quoting from mail:
Day 20160701 - previous: 333230, current: 545730, added: 212500 (38.9386693053% missing)
Day 20160702 - previous: 219267, current: 428817, added: 209550 (48.8669992095% missing)
Day 20160703 - previous: 213955, current: 417777, added: 203822 (48.7872716784% missing)
Day 20160704 - previous: 86640, current: 183390, added: 96750 (52.7564207427% missing)
Day 20160705 - previous: 380564, current: 380564, added: 0 (0.0% missing)
Day 20160706 - previous: 82517, current: 177966, added: 95449 (53.6332782666% missing)
Day 20160707 - previous: 82166, current: 178316, added: 96150 (53.9211287826% missing)
Day 20160708 - previous: 142590, current: 173790, added: 31200 (17.9527015363% missing)
Day 20160709 - previous: 168564, current: 168564, added: 0 (0.0% missing)
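As a sanity check, the per-day stats quoted above can be tallied. The values below are copied from the mail; the script just sums the "added" column and lists which days actually changed.

```python
# (day, previous, current) as quoted in the mail above
days = [
    ("20160701", 333230, 545730),
    ("20160702", 219267, 428817),
    ("20160703", 213955, 417777),
    ("20160704",  86640, 183390),
    ("20160705", 380564, 380564),
    ("20160706",  82517, 177966),
    ("20160707",  82166, 178316),
    ("20160708", 142590, 173790),
    ("20160709", 168564, 168564),
]

# Total records recovered across the reindexed range.
total_added = sum(cur - prev for _, prev, cur in days)

# Days that were already fully indexed show added == 0.
fixed_days = [d for d, prev, cur in days if cur > prev]
```

This confirms 20160705 and 20160709 needed no further fixing, consistent with the earlier comments.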
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED