Bug 1157436 (closed): Opened 9 years ago, Closed 9 years ago

When pulse is unavailable, pulse_publisher will exit and not restart

Categories: Infrastructure & Operations :: RelOps: Puppet
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WORKSFORME
Reporter: coop; Assignee: Unassigned

Details

We had Zeus issues in PHX this morning (bug 1157387) and this knocked pulse offline. The pulse_publisher service that runs on some buildbot masters failed out when this happened and didn't restart. This led to high queue counts for outgoing results on those masters.

On each puppet iteration, we should check whether pulse_publisher is running on the master, and restart the service if it is not.
Doesn't this run in supervisord?  Doesn't supervisord automatically restart failed processes?
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> Doesn't this run in supervisord?  Doesn't supervisord automatically restart
> failed processes?

Assuming this is managed by Puppet, the only users of supervisord::supervise I can find are releaserunner, shipit, mozpool, and things that inherit from mockbuild (our Linux build slaves).

I don't see any supervisord configuration for the masters in Puppet, but I could easily be grokking it incorrectly.
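For reference (and per the comments above, this is not how these masters are actually set up): supervisord's autorestart option defaults to "unexpected", so a crashed program under its supervision would normally come back on its own. A hypothetical stanza, with a made-up command path, would look like:

```ini
; Hypothetical sketch only -- pulse_publisher does NOT run under
; supervisord on these masters; this shows what that would require.
[program:pulse_publisher]
command=/path/to/pulse_publisher   ; placeholder path, not the real one
autostart=true
autorestart=true                   ; restart on any exit; the default
                                   ; ("unexpected") only restarts on
                                   ; exit codes outside "exitcodes"
startretries=3
```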
Indeed, it looks like it's run directly from initd:

  modules/buildmaster/templates/pulse_publisher.initd.erb

and that's started by a service statement:

modules/buildmaster/manifests/queue.pp
    service {
        ...
        "pulse_publisher":
            hasstatus => true,
            require => [
                File["/etc/init.d/pulse_publisher"],
                File["${buildmaster::settings::queue_dir}/passwords.py"],
                Exec["install-tools"],
                ],  
            enable => true,
            ensure => running;
    }   

so it should be started if it's not running.  How did you end up restarting it?  Did '/etc/init.d/pulse_publisher status' give inaccurate results?
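One plausible way for status and reality to disagree (a sketch with a made-up PIDFILE path, not the actual init script): SysV-style status implementations commonly key off the pid file, and a crashed daemon leaves a stale one behind. Whether Puppet's `ensure => running` then acts depends on the exit code the status action returns for that state:

```shell
# Sketch of typical SysV init "status" logic; PIDFILE path is hypothetical.
PIDFILE="${TMPDIR:-/tmp}/pulse_publisher.pid.example"

status() {
    if [ -f "$PIDFILE" ]; then
        pid=$(cat "$PIDFILE")
        # kill -0 probes for process existence without sending a signal
        if kill -0 "$pid" 2>/dev/null; then
            echo "running (pid $pid)"
            return 0
        else
            # stale pid file: process died without cleaning up
            echo "dead but pid file exists"
            return 1   # nonzero, so Puppet would treat it as stopped
        fi
    fi
    echo "stopped"
    return 3
}
```

A naive script that only tests for the pid file's existence (without the `kill -0` probe) would report "running" here, which would keep both Puppet and a manual `stop` from doing the right thing.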
(In reply to Dustin J. Mitchell [:dustin] from comment #3) 
> so it should be started if it's not running.  How did you end up restarting
> it?  Did '/etc/init.d/pulse_publisher status' give inaccurate results?

We didn't run status directly. I used |/etc/init.d/pulse_publisher restart| to restart it, which resulted in the affected masters returning:

Stopping pulse publisher[FAILED]
Starting pulse publisher[  OK  ]

Hard to reconstruct now, but this probably means pulse_publisher crashed and left its LOCKFILE/PIDFILE around. Again, not sure why the service didn't recognize this.

I'm going to run through a few potential failure states on bm100 (orphaned LOCKFILE/PIDFILE) and run puppet by hand to see what happens. It seems that there are no puppet logs on bm100, so I can't tell what happened previously.
After checking the failure states (no process + LOCKFILE and/or PIDFILE present), a puppet agent run successfully restarted the pulse_publisher service. I think I may have just been impatient. 

The puppet check for masters runs at 28 and 58 minutes after the hour. Pulse was down until ~1:15pm PT. I kicked off the pulse_publisher restart at 1:38pm PT, and based on the output, about 1/4 of the masters had already recovered on their own. Given the slowdowns inherent in all the buildbot masters hitting puppet masters at the same time, and a 30min delay between nagios checks, I suspect all these buildbot masters would have fixed themselves given another nagios cycle (30min).
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
The minutes-past-the-hour are randomized per host, so they're not *all* at that time.  That doesn't change your conclusion -- just worth a note.
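A common way to get those per-host offsets (a sketch of the usual pattern, not verified against this repo's manifests) is fqdn_rand, which is deterministic per hostname, so each host lands on its own stable pair of minutes:

```puppet
# Sketch only: the resource title and command are assumptions,
# not code from this repository.
cron { 'puppet-agent':
    command => '/usr/bin/puppet agent --onetime --no-daemonize',
    # fqdn_rand(30) yields a stable value in 0..29 per host, giving
    # two runs per hour at host-specific minutes.
    minute  => [fqdn_rand(30), fqdn_rand(30) + 30],
}
```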