Bug 1157436 (closed): Opened 9 years ago, Closed 9 years ago

When pulse is unavailable, pulse_publisher will exit and not restart

Categories: Infrastructure & Operations :: RelOps: Puppet
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WORKSFORME
Reporter: coop; Assignee: Unassigned

Details

We had Zeus issues in PHX this morning (bug 1157387) and this knocked pulse offline. The pulse_publisher service that runs on some buildbot masters failed out when this happened and didn't restart. This led to high queue counts for outgoing results on those masters.

On each puppet iteration, we should check whether pulse_publisher is running on the master, and restart the service if it is not.
Doesn't this run in supervisord?  Doesn't supervisord automatically restart failed processes?
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> Doesn't this run in supervisord?  Doesn't supervisord automatically restart
> failed processes?

Assuming this is managed by Puppet, the only users of supervisord::supervise I can find are releaserunner, shipit, mozpool, and things that inherit from mockbuild (our Linux build slaves).

I don't see any supervisord configuration for the masters in Puppet, but I could easily be grokking it incorrectly.
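For reference (and per the comments above, this is not how these masters are actually set up): supervisord's autorestart option defaults to "unexpected", so a crashed program under its supervision would normally come back on its own. A hypothetical stanza, with a made-up command path, would look like:

```ini
; Hypothetical sketch only -- pulse_publisher does NOT run under
; supervisord on these masters; this shows what that would require.
[program:pulse_publisher]
command=/path/to/pulse_publisher   ; placeholder path, not the real one
autostart=true
autorestart=true                   ; restart on any exit; the default
                                   ; ("unexpected") only restarts on
                                   ; exit codes outside "exitcodes"
startretries=3
```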
Indeed, it looks like it's run directly from initd:

  modules/buildmaster/templates/pulse_publisher.initd.erb

and that's started by a service statement:

modules/buildmaster/manifests/queue.pp
    service {
        ...
        "pulse_publisher":
            hasstatus => true,
            require => [
                File["/etc/init.d/pulse_publisher"],
                File["${buildmaster::settings::queue_dir}/passwords.py"],
                Exec["install-tools"],
                ],  
            enable => true,
            ensure => running;
    }   

so it should be started if it's not running.  How did you end up restarting it?  Did '/etc/init.d/pulse_publisher status' give inaccurate results?
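One plausible way for status and reality to disagree (a sketch with a made-up PIDFILE path, not the actual init script): SysV-style status implementations commonly key off the pid file, and a crashed daemon leaves a stale one behind. Whether Puppet's `ensure => running` then acts depends on the exit code the status action returns for that state:

```shell
# Sketch of typical SysV init "status" logic; PIDFILE path is hypothetical.
PIDFILE="${TMPDIR:-/tmp}/pulse_publisher.pid.example"

status() {
    if [ -f "$PIDFILE" ]; then
        pid=$(cat "$PIDFILE")
        # kill -0 probes for process existence without sending a signal
        if kill -0 "$pid" 2>/dev/null; then
            echo "running (pid $pid)"
            return 0
        else
            # stale pid file: process died without cleaning up
            echo "dead but pid file exists"
            return 1   # nonzero, so Puppet would treat it as stopped
        fi
    fi
    echo "stopped"
    return 3
}
```

A naive script that only tests for the pid file's existence (without the `kill -0` probe) would report "running" here, which would keep both Puppet and a manual `stop` from doing the right thing.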
(In reply to Dustin J. Mitchell [:dustin] from comment #3) 
> so it should be started if it's not running.  How did you end up restarting
> it?  Did '/etc/init.d/pulse_publisher status' give inaccurate results?

We didn't run status directly. I used |/etc/init.d/pulse_publisher restart| to restart it, which resulted in the affected masters returning:

Stopping pulse publisher[FAILED]
Starting pulse publisher[  OK  ]

Hard to reconstruct now, but this probably means pulse_publisher crashed and left its LOCKFILE/PIDFILE around. Again, not sure why the service didn't recognize this.

I'm going to run through a few potential failure states on bm100 (orphaned LOCKFILE/PIDFILE) and run puppet by hand to see what happens. It seems that there are no puppet logs on bm100, so I can't tell what happened previously.
After checking the failure states (no process + LOCKFILE and/or PIDFILE present), a puppet agent run successfully restarted the pulse_publisher service. I think I may have just been impatient. 

The puppet check for masters runs at 28 and 58 minutes after the hour. Pulse was down until ~1:15pm PT. I kicked off the pulse_publisher restart at 1:38pm PT, and based on the output, about 1/4 of the masters had already recovered on their own. Given the slowdowns inherent in all the buildbot masters hitting puppet masters at the same time, and a 30min delay between nagios checks, I suspect all these buildbot masters would have fixed themselves given another nagios cycle (30min).
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME
The minutes-past-the-hour are randomized per host, so they're not *all* at that time.  That doesn't change your conclusion -- just worth a note.
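A common way to get those per-host offsets (a sketch of the usual pattern, not verified against this repo's manifests) is fqdn_rand, which is deterministic per hostname, so each host lands on its own stable pair of minutes:

```puppet
# Sketch only: the resource title and command are assumptions,
# not code from this repository.
cron { 'puppet-agent':
    command => '/usr/bin/puppet agent --onetime --no-daemonize',
    # fqdn_rand(30) yields a stable value in 0..29 per host, giving
    # two runs per hour at host-specific minutes.
    minute  => [fqdn_rand(30), fqdn_rand(30) + 30],
}
```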