[prod] crontabber hanging since 2014-07-11 13:00:07

Status

Product: Socorro
Component: Infra
Status: RESOLVED FIXED
Opened: 4 years ago
Closed: 3 years ago

People

(Reporter: rhelmer, Assigned: rhelmer)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Assignee)

Description

4 years ago
Crontabber hasn't been running in prod - here's the last line it wrote to the log:

2014-07-11 13:00:07,391 DEBUG  - MainThread - about to run <class 'socorro.cron.jobs.fetch_adi_from_hive.FetchADIFromHiveCronApp'>
(Assignee)

Comment 1

4 years ago
Not much going on:

$ sudo strace -p 778
Process 778 attached - interrupt to quit
recvfrom(6,
(Assignee)

Comment 2

4 years ago
Looks like it's connected to Hive, Postgres, and its output file; the descriptor that seems to be hanging on recvfrom() is the Hive connection (fd 6):

python      778   socorro    4u     IPv4   62889546   0t0   TCP sp-admin01.phx1.mozilla.com:39256->tp-socorro01-rw-zeus.phx1.mozilla.com:postgres (ESTABLISHED)
python      778   socorro    5w     REG    253,0      0     6986758 /tmp/2014-07-10.raw_adi_logs.TEMPORARY.txt
python      778   socorro    6u     IPv4   62889576   0t0   TCP sp-admin01.phx1.mozilla.com:47766->peach-gw.peach.metrics.scl3.mozilla.com:ndmp (ESTABLISHED)
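
A client-side socket timeout would turn an indefinite recvfrom() like this into an exception the job can handle. A rough sketch below, assuming a plain socket connection; the host, port, and timeout values are placeholders, and this is not the actual Hive client code that FetchADIFromHiveCronApp uses.

import socket

# Illustrative only; not the Hive client the cron job actually uses.
# create_connection() sets the timeout on the socket, so a stalled recv()
# raises socket.timeout instead of blocking forever.
HIVE_HOST = "hive.example.internal"   # placeholder
HIVE_PORT = 10000                     # placeholder
TIMEOUT_SECONDS = 300                 # placeholder

sock = socket.create_connection((HIVE_HOST, HIVE_PORT), timeout=TIMEOUT_SECONDS)
try:
    data = sock.recv(4096)   # raises socket.timeout after TIMEOUT_SECONDS of silence
except socket.timeout:
    # Give up cleanly so crontabber can record the failure and move on.
    raise
finally:
    sock.close()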
(Assignee)

Comment 3

4 years ago
Killed the hanging crontabber so things can resolve.
(Assignee)

Updated

4 years ago
Depends on: 1037873
(Assignee)

Updated

4 years ago
Depends on: 1037874
(Assignee)

Updated

4 years ago
No longer depends on: 1037873, 1037874
See Also: → bug 1037874, bug 1037873
(Assignee)

Comment 5

4 years ago
Prod is back to normal - see bug 1037873 and bug 1037874 for followups to make sure this doesn't happen again.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED

Updated

4 years ago
Blocks: 1037959
:rhelmer this happened again, same symptoms. Is there a reason the crontabber doesn't have timeouts for jobs that hang forever?

<nagios-phx1:#sysadmins> Mon 03:56:25 PDT [1025] 
  sp-admin01.phx1.mozilla.com:Socorro Admin - crontab log file age is WARNING: 
  FILE_AGE WARNING: /var/log/socorro/crontabber.log is 11175 seconds old and 
  33275644 bytes (http://m.mozilla.org/Socorro+Admin+-+crontab+log+file+age)

socorro  21914  0.0  0.3 188756 24296 ?        S    00:50   0:02 /data/socorro/socorro-virtualenv/bin/python /data/socorro/application/socorro/cron/crontabber_app.py --admin.conf=/etc/socorro/crontabber.ini
[root@sp-admin01.phx1 pradcliffe]# kill 21914
[root@sp-admin01.phx1 pradcliffe]# kill 21914
bash: kill: (21914) - No such process


2015-05-04 00:50:06,884 DEBUG  - MainThread - about to run <class 'socorro.cron.jobs.matviews.ReportsCleanCronApp'>
/data/socorro/socorro-virtualenv/lib/python2.6/site-packages/configman/config_manager.py:743: UserWarning: Invalid options: secrets.exacttarget.exacttarget_password, resource.elasticsearch.use_mapping_file, crontabber.database_file, crontabber.class-ServerStatusCronApp.queue_class, crontabber.class-ReprocessingJobsApp.queue_class, crontabber.class-ReprocessingJobsApp.filter_on_legacy_processing, resource.elasticsearch.elasticsearch_index_settings, resource.elasticsearch.timeout, secrets.exacttarget.exacttarget_user, resource.elasticsearch.elasticSearchHostname, crontabber.class-AutomaticEmailsCronApp.email_template, crontabber.class-ReprocessingJobsApp.routing_key
  'Invalid options: %s' % ', '.join(unmatched_keys)
2015-05-04 04:00:05,792 INFO  - MainThread - app_name: crontabber
2015-05-04 04:00:05,793 INFO  - MainThread - app_version: 0.16.1
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
> :rhelmer this happened again, same symptoms. Is there a reason the crontabber doesn't have timeouts for jobs that hang forever?

No particular reason, just complexity. Crontabber runs in a single thread on a single processor, and executing a job is just like calling a Python function. Making jobs die after a timeout is very hard.
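
To give a sense of why, here is a rough sketch of the signal-based approach a per-job timeout would need in a single-threaded Unix process. The helper name and exception are made up for illustration; this is not how crontabber is structured.

import signal

class JobTimeout(Exception):
    pass

def run_with_timeout(job_func, seconds):
    # Hypothetical helper, not crontabber API: schedule SIGALRM so that a
    # job running longer than `seconds` gets a JobTimeout raised into it.
    def _handler(signum, frame):
        raise JobTimeout('job ran longer than %d seconds' % seconds)

    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        return job_func()
    finally:
        signal.alarm(0)                             # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)  # restore previous handler

This only works in the main thread, and the exception is only delivered once the interpreter regains control, so a job stuck inside a blocking C call can still overrun the deadline, which is a big part of why bolting this onto crontabber is hard.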

Which of the information above shows how long it had been running?
FILE_AGE WARNING: /var/log/socorro/crontabber.log is 11175 seconds old
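
(11175 seconds is roughly 3 hours 6 minutes, which lines up with the 00:50 start time in the ps output above: nothing was written to crontabber.log between that run and the 03:56 alert.)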
(Assignee)

Comment 9

3 years ago
This has been moved to AWS.
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago → 3 years ago
Resolution: --- → FIXED