Closed
Bug 1157436
Opened 10 years ago
Closed 10 years ago
When pulse is unavailable, pulse_publisher will exit and not restart
Categories
(Infrastructure & Operations :: RelOps: Puppet, task)
Infrastructure & Operations
RelOps: Puppet
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: coop, Unassigned)
Details
We had Zeus issues in PHX this morning (bug 1157387) and this knocked pulse offline. The pulse_publisher service that runs on some buildbot masters failed out when this happened and didn't restart. This led to high queue counts for outgoing results on those masters.
On each puppet iteration, we should check whether pulse_publisher is running on the master, and restart the service if it is not.
Comment 1•10 years ago
Doesn't this run in supervisord? Doesn't supervisord automatically restart failed processes?
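For reference, this is roughly what supervisord-managed auto-restart looks like. This is a hypothetical sketch only; the program name and command path are illustrative assumptions, not the actual pulse_publisher configuration:

```ini
; Hypothetical sketch: a supervisord program stanza with automatic restart.
; The command path here is an assumption, not the real deployment.
[program:pulse_publisher]
command=/usr/local/bin/pulse_publisher   ; illustrative path
autostart=true
autorestart=true       ; restart whenever the process exits unexpectedly
startretries=3         ; give up after 3 rapid failures
```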
Comment 2•10 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> Doesn't this run in supervisord? Doesn't supervisord automatically restart
> failed processes?
Assuming this is managed by puppet, I see the following uses of supervisord::supervise: releaserunner, shipit, mozpool, and things that inherit from mockbuild (our linux build slaves).
I don't see any supervisord setup for masters in puppet, but I could easily be grokking this incorrectly.
Comment 3•10 years ago
Indeed, it looks like it's run directly from init.d:
modules/buildmaster/templates/pulse_publisher.initd.erb
and that's started by a service statement:
modules/buildmaster/manifests/queue.pp
service {
    ...
    "pulse_publisher":
        hasstatus => true,
        require   => [
            File["/etc/init.d/pulse_publisher"],
            File["${buildmaster::settings::queue_dir}/passwords.py"],
            Exec["install-tools"],
        ],
        enable    => true,
        ensure    => running;
}
so it should be started if it's not running. How did you end up restarting it? Did '/etc/init.d/pulse_publisher status' give inaccurate results?
Reporter
Comment 4•10 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #3)
> so it should be started if it's not running. How did you end up restarting
> it? Did '/etc/init.d/pulse_publisher status' give inaccurate results?
We didn't run status directly. I used |/etc/init.d/pulse_publisher restart| to restart, which resulted in the affected masters returning:
Stopping pulse publisher[FAILED]
Starting pulse publisher[ OK ]
Hard to reconstruct now, but this probably means pulse_publisher crashed and left its LOCKFILE/PIDFILE around. Again, not sure why the service check didn't recognize this.
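The stale-pidfile failure mode described above can be sketched as follows. This is an illustrative assumption about the behavior, not the actual init script; the pid-file path and checks are hypothetical:

```shell
# Hypothetical sketch of the stale-PIDFILE failure mode: a crashed daemon
# leaves its pid file behind, and a naive status check that only tests for
# the file's existence wrongly reports the service as running.
PIDFILE=$(mktemp)    # stand-in for a real path like /var/run/pulse_publisher.pid

# Create a PID that is guaranteed dead: spawn and reap a short-lived process.
sleep 0 &
dead_pid=$!
wait "$dead_pid"
echo "$dead_pid" > "$PIDFILE"

# Naive check: the file exists, so report "running" -- the inaccurate result.
[ -f "$PIDFILE" ] && naive_status="running"

# Better check: verify the recorded PID actually belongs to a live process.
if kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    real_status="running"
else
    real_status="dead but pid file exists"
fi

echo "naive: $naive_status / real: $real_status"
rm -f "$PIDFILE"
```

If the real init script's status check behaved like the naive version, puppet's `hasstatus => true` service check would see "running" and never attempt a restart.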
I'm going to run through a few potential failure states on bm100 (orphaned LOCKFILE/PIDFILE) and run puppet by hand to see what happens. It seems that there are no puppet logs on bm100, so I can't tell what happened previously.
Reporter
Comment 5•10 years ago
After checking the failure states (no process + LOCKFILE and/or PIDFILE present), a puppet agent run successfully restarted the pulse_publisher service. I think I may have just been impatient.
The puppet check for masters runs at 28 and 58 minutes after the hour. Pulse was down until ~1:15pm PT. I kicked off the pulse_publisher restart at 1:38pm PT, and based on the output, about 1/4 of the masters had already recovered on their own. Given the slowdowns inherent in all the buildbot masters hitting puppet masters at the same time, and a 30min delay between nagios checks, I suspect all these buildbot masters would have fixed themselves given another nagios cycle (30min).
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Comment 6•10 years ago
The minutes-past-the-hour are randomized per host, so they're not *all* at that time. That doesn't change your conclusion -- just worth a note.
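Per-host randomization of run minutes is typically done with puppet's fqdn_rand function, which seeds on the host's FQDN so each host gets a stable but different offset. A minimal sketch, assuming a cron-driven agent (the resource name and command are illustrative, not the actual manifest):

```puppet
# Hypothetical sketch: stagger puppet agent runs across hosts.
# fqdn_rand(30) yields a stable per-host integer in 0..29.
cron {
    "puppet-agent":
        command => "/usr/bin/puppet agent --onetime --no-daemonize",
        minute  => fqdn_rand(30);
}
```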