Status

RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: rhelmer, Assigned: rhelmer)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Assignee)

Description

3 years ago
Socorro is complaining that there is no ADI for 2015-07-09; however, the ADI crontabber job in SCL (on socorro-adi1.metrics.scl3) ran OK:

2015-07-10 08:01:58,530 DEBUG - MainThread -  - MainThread - successfully ran <class 'socorro.cron.jobs.fetch_adi_from_hive.FetchADIFromHiveCronApp'> on 2015-07-10 08:00:00+00:00
(Assignee)

Comment 1

3 years ago
I forced this job to run again and it still found no results in Hive.

The Hive query being executed is:
https://github.com/mozilla/socorro/blob/master/socorro/cron/jobs/fetch_adi_from_hive.py#L42-L66
(Assignee)

Updated

3 years ago
Assignee: rhelmer → nobody
(Assignee)

Comment 2

3 years ago
Looks like ADI is there now (I was out on PTO all day, but I saw kairo and sheeri discussing it in #breakpad - thanks!)

The current "fetch ADI from hive" job doesn't detect missing ADI as a failure, so I had to reset crontabber's "last_success" column for this job:

UPDATE crontabber SET last_success = '2015-07-09 08:00:00+00' WHERE app_name = 'fetch-adi-from-hive'

Then force the job to run on the socorro-adi1.metrics node in SCL:

cd /data/socorro
./socorro-virtualenv/bin/python application/socorro/cron/crontabber_app.py --admin.conf=/etc/socorro/crontabber.ini --job='socorro.cron.jobs.fetch_adi_from_hive.FetchADIFromHiveCronApp' --force
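
For illustration only, here is a minimal sketch of how a job like this could treat an empty Hive result as a failure, so that a manual last_success reset would not be needed. This is not the actual FetchADIFromHiveCronApp code: NoRowsError and require_rows are hypothetical names, and the query result is assumed to arrive as an ordinary Python iterable of rows.

```python
class NoRowsError(Exception):
    """Raised when a query that should return data returns nothing (hypothetical)."""


def require_rows(rows, target_date):
    """Return the rows as a list, or raise if the result set is empty.

    Raising here would let a scheduler record a failure instead of a
    success, so the job could be retried rather than marked as done.
    """
    rows = list(rows)
    if not rows:
        raise NoRowsError("no ADI rows found for %s" % target_date)
    return rows
```

With a check like this, an empty result for 2015-07-09 would surface as a job error rather than a successful run.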
Assignee: nobody → rhelmer
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(Assignee)

Comment 3

3 years ago
Socorro should self-heal reports for the 9th now that the ADI data is there.

Comment 4

3 years ago
Somehow looks like most jobs just didn't schedule though to produce matviews for the 10th (on https://crash-stats.mozilla.com/crontabber-state/ they say that last run "a day ago" and the next run is scheduled "in a day"). Rob, did your reset cause this?
Status: RESOLVED → REOPENED
Flags: needinfo?(rhelmer)
Resolution: FIXED → ---
(Assignee)

Comment 5

3 years ago
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #4)
> Somehow looks like most jobs just didn't schedule though to produce matviews
> for the 10th (on https://crash-stats.mozilla.com/crontabber-state/ they say
> that last run "a day ago" and the next run is scheduled "in a day"). Rob,
> did your reset cause this?

Hm, I don't think it should have... I only reset the single job (fetch-adi-from-hive).

I'll go ahead and run a backfill for the 10th right now, and hopefully crontabber will be back to normal for the 11th.

peterbe, any idea why this would have happened?
Flags: needinfo?(rhelmer) → needinfo?(peterbe)
(Assignee)

Comment 6

3 years ago
Oh, I see - the fetch-adi-from-hive job skipped today's run. I reset its next_run to a time in the past, which should hopefully get it to run:

update crontabber set next_run = '2015-07-10 08:00:00+00' where app_name = 'fetch-adi-from-hive';
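
As a rough illustration of why moving next_run into the past helps: a scheduler of this kind typically treats a job as due once its next_run is at or before the current time. The sketch below assumes that behaviour. is_due is a hypothetical helper, not crontabber's actual implementation, and the value of now is an assumed example.

```python
from datetime import datetime, timezone


def is_due(next_run, now):
    """Return True when the job's next_run is at or before the current time."""
    return next_run <= now


# next_run mirrors the UPDATE above; "now" is an assumed example value.
next_run = datetime(2015, 7, 10, 8, 0, tzinfo=timezone.utc)
now = datetime(2015, 7, 10, 20, 0, tzinfo=timezone.utc)
print(is_due(next_run, now))
```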

Comment 7

3 years ago
FWIW, doesn't look like
(In reply to Robert Helmer [:rhelmer] from comment #5)
> I'll go ahead and run a backfill for the 10th right now, and hopefully
> crontabber will be back to normal for the 11th.

Thanks, looks like the data is there now.

Comment 8

3 years ago
We're still missing adi for 2015-07-09 https://errormill.mozilla.org/webtools/socorro-prod/group/166958/
the backfillers are struggling to get over that.
Flags: needinfo?(peterbe)
(Assignee)

Comment 9

3 years ago
(In reply to Peter Bengtsson [:peterbe] from comment #8)
> We're still missing adi for 2015-07-09
> https://errormill.mozilla.org/webtools/socorro-prod/group/166958/
> the backfillers are struggling to get over that.

Er, the data is there and we have reports for that day... also we have no errors in https://crash-stats.mozilla.com/crontabber-state/ on prod, how can this be that it's still erroring?
(Assignee)

Comment 10

3 years ago
(In reply to Robert Helmer [:rhelmer] from comment #9)
> (In reply to Peter Bengtsson [:peterbe] from comment #8)
> > We're still missing adi for 2015-07-09
> > https://errormill.mozilla.org/webtools/socorro-prod/group/166958/
> > the backfillers are struggling to get over that.
> 
> Er, the data is there and we have reports for that day... also we have no
> errors in https://crash-stats.mozilla.com/crontabber-state/ on prod, how can
> this be that it's still erroring?

This hasn't been in the crontabber.log in prod since the 11th... I don't know why this is showing up on errormill?

Comment 11

3 years ago
(In reply to Robert Helmer [:rhelmer] from comment #10)
> (In reply to Robert Helmer [:rhelmer] from comment #9)
> > (In reply to Peter Bengtsson [:peterbe] from comment #8)
> > > We're still missing adi for 2015-07-09
> > > https://errormill.mozilla.org/webtools/socorro-prod/group/166958/
> > > the backfillers are struggling to get over that.
> > 
> > Er, the data is there and we have reports for that day... also we have no
> > errors in https://crash-stats.mozilla.com/crontabber-state/ on prod, how can
> > this be that it's still erroring?
> 
> This hasn't been in the crontabber.log in prod since the 11th... I don't
> know why this is showing up on errormill?

JP and I are investigating this. Clearly the new staging crontabber is struggling (https://crash-stats.mocotoolsstaging.net/crontabber-state/) but when we looked into Consul settings we found that secrets.sentry.dsn was set to null :)

It's been updated, and the errors are now being sent to Socorro Stage (https://errormill.mozilla.org/webtools/socorro-stage/group/170207/) instead of Socorro Prod (https://errormill.mozilla.org/webtools/socorro-prod/)

Keeping an eye on it.
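
A null DSN like this could be caught at startup with a small settings check. The sketch below is only an assumption about how that might look: missing_settings is a hypothetical helper, and the config is modelled as a flat dict keyed by the setting name mentioned above rather than read from Consul.

```python
def missing_settings(config, required_keys):
    """Return the required keys whose value is missing, None, or empty."""
    return [key for key in required_keys if not config.get(key)]


# Hypothetical flat config dict; the key name is taken from the comment above.
config = {"secrets.sentry.dsn": None}
print(missing_settings(config, ["secrets.sentry.dsn"]))
```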
(Assignee)

Comment 12

3 years ago
I think the problem of missing ADI on 2015-07-09 is resolved. If there's a new problem, please file a new bug. Thanks!
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Comment 13

3 years ago
For completeness: the problem was that we are still running the old PHX admin node, which also suffered from the 2015-07-09 "outage". When Rob recovered the new AWS admin node, he didn't do the same for the old one, and the old node is the one still sending errors to Errormill.