1234286 - Some crash_summary_* and main_summary_* tables are missing

Benjamin Smedberg

Reporter

Description

9 years ago

Missing:
crash_summary_20151214
crash_summary_20151217

The matching main_summary_* tables are also missing.

Mark Reid [:mreid]

Comment 1

9 years ago

We ran into a malformed "SEARCH_COUNTS" histogram, which blew up the derived streams.

PR to fix it: https://github.com/mozilla-services/data-pipeline/pull/176

Mark Reid [:mreid]

Comment 2

8 years ago

Backfill jobs for those two days are running now. I will update / close this bug when they complete.

Mark Reid [:mreid]

Comment 3

8 years ago

Missing tables have been populated.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Mark Reid [:mreid]

Comment 4

8 years ago

For future reference, steps for backfilling:
- Fix the problem in the job code
- Update the job code using the relevant 'package.sh' script[1]
- Remove any existing data for the backfill target days from Redshift and S3 (drop tables, delete data files)
- Run a scheduled job with a "run command" of "./run.sh 20151214" (one for each backfill date)
- Once the job has launched, delete the scheduled job. This can safely be done before the job finishes.
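
The cleanup and re-run steps above can be sketched roughly as follows. This is only an illustration, not the actual tooling: the helper names are hypothetical, and it assumes the tables to drop follow the crash_summary_YYYYMMDD / main_summary_YYYYMMDD naming from this bug and that one "run command" is needed per date. It only builds strings; it does not talk to Redshift, S3, or the job scheduler, and the S3 data file locations are not given in this bug, so they are left out.

```python
from datetime import datetime

# Table prefixes taken from this bug's summary.
TABLE_PREFIXES = ("crash_summary", "main_summary")


def validate_day(day):
    """Return day unchanged if it is a valid YYYYMMDD string, else raise ValueError."""
    if len(day) != 8 or not day.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {day!r}")
    datetime.strptime(day, "%Y%m%d")  # rejects impossible dates such as 20151301
    return day


def backfill_plan(days):
    """Build, per backfill day, the tables to drop and the run command to schedule."""
    plan = []
    for day in days:
        validate_day(day)
        plan.append({
            "day": day,
            # Assumption: one table per prefix per day, named <prefix>_<YYYYMMDD>.
            "drop_tables": [f"DROP TABLE IF EXISTS {p}_{day};" for p in TABLE_PREFIXES],
            # The run command format is the one quoted in the steps above.
            "run_command": f"./run.sh {day}",
        })
    return plan


if __name__ == "__main__":
    for step in backfill_plan(["20151214", "20151217"]):
        print(step["day"], step["drop_tables"], step["run_command"])
```

Validating the date up front is a design choice here: a mistyped day would otherwise drop or rebuild the wrong tables.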

Benjamin Smedberg

Reporter

Comment 5

8 years ago

20160102 has also failed

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Benjamin Smedberg

Reporter

Updated

8 years ago

Flags: needinfo?(mreid)

Mark Reid [:mreid]

Comment 6

8 years ago

I kicked off the backfill job a bit over an hour ago. I will update when it completes. Meanwhile, I'll add some monitoring / alerting to the job.

Flags: needinfo?(mreid)

Mark Reid [:mreid]

Comment 7

8 years ago

The missing tables are now available.

Status: REOPENED → RESOLVED

Closed: 8 years ago → 8 years ago

Resolution: --- → FIXED

Benjamin Smedberg

Reporter

Comment 8

8 years ago

20160106 is broken now as well.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Benjamin Smedberg

Reporter

Updated

8 years ago

Flags: needinfo?(mreid)

Mark Reid [:mreid]

Comment 9

8 years ago

Backfill job kicked off. Trink deployed the updated code to fix the underlying problem too, so this particular problem shouldn't come up again.

I filed Bug 1238676 to add monitoring so we can deal with this in a more timely way going forward.

Again, I'll update the bug when the job completes, which I estimate to be around Jan 12 @ 09:00 UTC.
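
As a rough idea of the kind of check such monitoring could perform, here is a minimal sketch that lists which daily tables are absent for a date range. It is not what Bug 1238676 implemented: the function name is hypothetical, the list of existing table names is assumed to come from somewhere else (for example a catalog query), and the table naming is assumed to be crash_summary_YYYYMMDD / main_summary_YYYYMMDD as in this bug.

```python
from datetime import date, timedelta

# Table prefixes taken from this bug's summary.
TABLE_PREFIXES = ("crash_summary", "main_summary")


def missing_tables(existing, start, end):
    """Return expected daily table names in [start, end] that are not in existing."""
    existing = set(existing)
    missing = []
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d")
        for prefix in TABLE_PREFIXES:
            name = f"{prefix}_{stamp}"
            if name not in existing:
                missing.append(name)
        day += timedelta(days=1)
    return missing


if __name__ == "__main__":
    # Hypothetical input: tables exist for Dec 13 and Dec 15 but not Dec 14.
    have = [
        "crash_summary_20151213", "main_summary_20151213",
        "crash_summary_20151215", "main_summary_20151215",
    ]
    print(missing_tables(have, date(2015, 12, 13), date(2015, 12, 15)))
```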

Flags: needinfo?(mreid)

Mark Reid [:mreid]

Comment 10

8 years ago

The tables are now available.

Status: REOPENED → RESOLVED

Closed: 8 years ago → 8 years ago

Resolution: --- → FIXED

Benjamin Smedberg

Reporter

Comment 11

8 years ago

20160115 is also broken

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Benjamin Smedberg

Reporter

Updated

8 years ago

Flags: needinfo?(mreid)

Mark Reid [:mreid]

Comment 12

8 years ago

This is a new issue. The job for that day appears to not have run.

I'm running it now; the tables should be available in the morning.

I'll clear the needinfo when the missing tables are actually available.

Thanks for your patience while we get monitoring up and running in bug 1238676.

Mark Reid [:mreid]

Comment 13

8 years ago

The tables from 20160115 are available.

Flags: needinfo?(mreid)

Mark Reid [:mreid]

Updated

8 years ago

Status: REOPENED → RESOLVED

Closed: 8 years ago → 8 years ago

Resolution: --- → FIXED

Mark Reid [:mreid]

Comment 14

8 years ago

20160122 is missing

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Mark Reid [:mreid]

Comment 15

8 years ago

Update on Comment 12: it looks like the analysis.t.m.o service encountered an OOM error, and we tracked the cause down to running a whole bunch of backfill jobs at the same time. We're planning to boost the EC2 instance type to give a bit more breathing room for that case.

The job from 20160122 is missing due to a job timeout after 1400 minutes (!!). I'm running the backfill now. It shouldn't take that long to process a day, so we will need to investigate further if the job times out again.
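
For illustration only, here is one way to avoid launching many backfills at once while still enforcing a per-job timeout: run them one at a time. This is a different approach from the planned instance-type boost, and it assumes the jobs can be started as local processes, which is not how the scheduled jobs in this bug are launched. The function and the build_command callable are hypothetical; the 1400 minute default is the timeout quoted above.

```python
import subprocess

TIMEOUT_MINUTES = 1400  # the job timeout mentioned in this bug


def run_backfills(days, build_command, timeout_minutes=TIMEOUT_MINUTES):
    """Run one backfill per day, sequentially, and record the outcome of each.

    build_command is a hypothetical callable mapping a YYYYMMDD string
    to an argv list, e.g. lambda day: ["./run.sh", day].
    """
    results = {}
    for day in days:
        try:
            # subprocess.run blocks until the job exits, so only one job
            # is in flight at a time; on timeout the child is killed.
            proc = subprocess.run(build_command(day), timeout=timeout_minutes * 60)
        except subprocess.TimeoutExpired:
            results[day] = "timed out"
            continue
        results[day] = "ok" if proc.returncode == 0 else f"failed ({proc.returncode})"
    return results
```

A call such as `run_backfills(["20160115", "20160122"], lambda day: ["./run.sh", day])` would process the two days back to back, trading total wall-clock time for lower peak memory use.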

Mark Reid [:mreid]

Comment 16

8 years ago

The tables for 20160122 are now in place. 

The backfill job ran in about the usual amount of time, around 14 hours.

Mark Reid [:mreid]

Updated

8 years ago

Status: REOPENED → RESOLVED

Closed: 8 years ago → 8 years ago

Resolution: --- → FIXED

BMO Automation

Updated

6 years ago

Product: Cloud Services → Cloud Services Graveyard

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect)

Tracking

(Not tracked)

People

(Reporter: benjamin, Unassigned)
