Some crash_summary_* and main_summary_* tables are missing

RESOLVED FIXED

Status

Cloud Services
Metrics: Pipeline
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: Benjamin Smedberg, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

2 years ago
Missing:
crash_summary_20151214
crash_summary_20151217

The matching main_summary_* tables are also missing.
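
A gap like this can be spotted by comparing the table names that exist against the per-day names that are expected. The following is only a minimal sketch: the function name `missing_tables` and the idea of passing in a plain list of existing table names are assumptions for illustration, not part of the pipeline.

```python
from datetime import date, timedelta

# Table name prefixes taken from this bug; each gets a YYYYMMDD suffix per day.
PREFIXES = ("crash_summary", "main_summary")


def missing_tables(existing, start, end, prefixes=PREFIXES):
    """Return the expected per-day table names that are absent from `existing`.

    `existing` is any iterable of table names (hypothetical input, e.g. the
    result of listing tables in the warehouse). `start` and `end` are
    inclusive datetime.date values.
    """
    existing = set(existing)
    missing = []
    day = start
    while day <= end:
        suffix = day.strftime("%Y%m%d")
        for prefix in prefixes:
            name = f"{prefix}_{suffix}"
            if name not in existing:
                missing.append(name)
        day += timedelta(days=1)
    return missing


if __name__ == "__main__":
    # Illustrative data only: 20151214 is deliberately left out.
    tables = [
        "crash_summary_20151213", "main_summary_20151213",
        "crash_summary_20151215", "main_summary_20151215",
    ]
    for name in missing_tables(tables, date(2015, 12, 13), date(2015, 12, 15)):
        print(name)
```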

Comment 1

2 years ago
We ran into a malformed "SEARCH_COUNTS" histogram, which blew up the derived streams.

PR to fix it: https://github.com/mozilla-services/data-pipeline/pull/176

Comment 2

2 years ago
Backfill jobs for those two days are running now. I will update / close this bug when they complete.

Comment 3

2 years ago
Missing tables have been populated.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED

Comment 4

2 years ago
For future reference, steps for backfilling:
- Fix the problem in the job code
- Update the job code using the relevant 'package.sh' script[1]
- Remove any existing data for the backfill target days from redshift and S3 (drop tables, delete data files)
- Run a scheduled job with a "run command" of "./run.sh 20151214" (one for each backfill date)
- Once the job has launched, delete the scheduled job. This can safely be done before the job finishes.
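
The per-day parts of the steps above could be written out roughly as follows. This is a sketch under assumptions: the helper name `backfill_plan` is hypothetical, the DROP TABLE statements assume that both the crash_summary_* and main_summary_* tables for a day need to be removed, and the S3 cleanup is left out because the data file locations are not given here. It only builds the strings; it does not connect to redshift or launch anything.

```python
from datetime import datetime

# Assumption: both derived tables for a day are dropped before the backfill.
TABLE_PREFIXES = ("crash_summary", "main_summary")


def backfill_plan(days):
    """Build the drop statements and run command for each YYYYMMDD day."""
    plan = []
    for day in days:
        # Reject anything that is not exactly eight digits forming a real date.
        if len(day) != 8 or not day.isdigit():
            raise ValueError(f"expected YYYYMMDD, got {day!r}")
        datetime.strptime(day, "%Y%m%d")  # raises ValueError for e.g. month 13
        plan.append({
            "day": day,
            "drop_statements": [
                f"DROP TABLE IF EXISTS {prefix}_{day};"
                for prefix in TABLE_PREFIXES
            ],
            # The "run command" to enter for the scheduled job, one per day.
            "run_command": f"./run.sh {day}",
        })
    return plan


if __name__ == "__main__":
    for step in backfill_plan(["20151214", "20151217"]):
        for statement in step["drop_statements"]:
            print(statement)
        print(step["run_command"])
```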
(Reporter)

Comment 5

2 years ago
20160102 has also failed
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Reporter)

Updated

2 years ago
Flags: needinfo?(mreid)

Comment 6

2 years ago
I kicked off the backfill job a bit over an hour ago. I will update when it completes. Meanwhile, I'll add some monitoring / alerting to the job.
Flags: needinfo?(mreid)

Comment 7

2 years ago
The missing tables are now available.
Status: REOPENED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
(Reporter)

Comment 8

2 years ago
20160106 is broken now as well.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Reporter)

Updated

2 years ago
Flags: needinfo?(mreid)

Comment 9

2 years ago
Backfill job kicked off. Trink deployed the updated code to fix the underlying problem too, so this particular problem shouldn't come up again.

I filed Bug 1238676 to add monitoring so we can deal with this in a more timely way going forward.

Again, I'll update the bug when the job completes, which I estimate to be around Jan 12 @ 09:00 UTC
Flags: needinfo?(mreid)

Comment 10

2 years ago
The tables are now available.
Status: REOPENED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
(Reporter)

Comment 11

2 years ago
20160115 is also broken
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Reporter)

Updated

2 years ago
Flags: needinfo?(mreid)

Comment 12

2 years ago
This is a new issue. The job for that day appears to not have run.

I'm running it now; the tables should be available in the morning.

I'll clear the needinfo when the missing tables are actually available.

Thanks for your patience while we get monitoring up and running in bug 1238676.

Comment 13

2 years ago
The tables from 20160115 are available.
Flags: needinfo?(mreid)

Updated

2 years ago
Status: REOPENED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED

Comment 14

2 years ago
20160122 is missing
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Comment 15

2 years ago
Update on Comment 12: it looks like the analysis.t.m.o service encountered an OOM error. We tracked the cause down to running a whole bunch of backfill jobs at the same time. We're planning to move to a larger EC2 instance type to give a bit more breathing room for that case.

The job from 20160122 is missing due to a job timeout after 1400 minutes (!!). I'm running the backfill now. It shouldn't take that long to process a day, so we will need to investigate further if the job times out again.

Comment 16

2 years ago
The tables for 20160122 are now in place. 

The backfill job ran in about the usual amount of time, around 14 hours.

Updated

2 years ago
Status: REOPENED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED