Closed Bug 1234286 Opened 9 years ago Closed 8 years ago

Some crash_summary_* and main_summary_* tables are missing

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benjamin, Unassigned)

Details

Missing:
crash_summary_20151214
crash_summary_20151217

The matching main_summary_* tables are also missing.
We ran into a malformed "SEARCH_COUNTS" histogram, which blew up the derived streams.

PR to fix it: https://github.com/mozilla-services/data-pipeline/pull/176
Backfill jobs for those two days are running now. I will update / close this bug when they complete.
Missing tables have been populated.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
For future reference, steps for backfilling:
- Fix the problem in the job code
- Update the job code using the relevant 'package.sh' script[1]
- Remove any existing data for the backfill target days from Redshift and S3 (drop tables, delete data files)
- Run a scheduled job with a "run command" of "./run.sh 20151214" (one for each backfill date)
- Once the job has launched, delete the scheduled job. This can safely be done before the job finishes.
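The "run command" step above can be sketched as follows. This is a minimal Python sketch, assuming only the "./run.sh YYYYMMDD" form quoted in the steps; the helper name `backfill_commands` is hypothetical, and it only builds the command strings rather than creating or deleting any scheduled job.

```python
from datetime import date


def backfill_commands(days):
    """Return one "./run.sh YYYYMMDD" run command per backfill date.

    `days` is an iterable of datetime.date objects. Duplicates are dropped
    and the result is sorted, so each date gets exactly one command.
    (Hypothetical helper; not part of the data-pipeline repository.)
    """
    return ["./run.sh " + d.strftime("%Y%m%d") for d in sorted(set(days))]


if __name__ == "__main__":
    # The two days originally reported missing in this bug.
    for cmd in backfill_commands([date(2015, 12, 17), date(2015, 12, 14)]):
        print(cmd)
```

Each resulting string would be entered as the run command of its own scheduled job, one job per date.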
20160102 has also failed
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Flags: needinfo?(mreid)
I kicked off the backfill job a bit over an hour ago. I will update when it completes. Meanwhile, I'll add some monitoring / alerting to the job.
Flags: needinfo?(mreid)
The missing tables are now available.
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
20160106 is broken now as well.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Flags: needinfo?(mreid)
Backfill job kicked off. Trink deployed the updated code to fix the underlying problem too, so this particular problem shouldn't come up again.

I filed Bug 1238676 to add monitoring so we can deal with this in a more timely way going forward.
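
As an illustration of the kind of check such monitoring could perform, here is a minimal Python sketch that lists which per-day tables are absent for a date range. It is an assumption for illustration only, not what Bug 1238676 implemented: the function name `missing_tables` is hypothetical, and how the list of existing table names is obtained is deliberately left out.

```python
from datetime import date, timedelta

# Table name prefixes taken from this bug's summary.
PREFIXES = ("crash_summary_", "main_summary_")


def missing_tables(existing, start, end, prefixes=PREFIXES):
    """List expected per-day tables from start to end (inclusive)
    that do not appear in `existing`.

    `existing` is any iterable of table names; fetching it is out of scope.
    """
    existing = set(existing)
    missing = []
    day = start
    while day <= end:
        suffix = day.strftime("%Y%m%d")
        for prefix in prefixes:
            name = prefix + suffix
            if name not in existing:
                missing.append(name)
        day += timedelta(days=1)
    return missing
```

A non-empty result for yesterday's date would be the signal to alert and start a backfill.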

Again, I'll update the bug when the job completes, which I estimate to be around Jan 12 @ 09:00 UTC
Flags: needinfo?(mreid)
The tables are now available.
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
20160115 is also broken
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Flags: needinfo?(mreid)
This is a new issue. The job for that day appears to not have run.

I'm running it now; the tables should be available in the morning.

I'll clear the needinfo when the missing tables are actually available.

Thanks for your patience while we get monitoring up and running in bug 1238676.
The tables from 20160115 are available.
Flags: needinfo?(mreid)
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
20160122 is missing
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Update on Comment 12: it looks like the analysis.t.m.o service encountered an OOM error. We tracked the cause down to running a whole bunch of backfill jobs at the same time. We're planning to move to a larger EC2 instance type to give a bit more breathing room for that case.

The tables for 20160122 are missing because the job timed out after 1400 minutes (!!). I'm running the backfill now. It shouldn't take that long to process a day, so we will need to investigate further if the job times out again.
The tables for 20160122 are now in place. 

The backfill job ran in about the usual amount of time, around 14 hours.
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard