During today's maintenance, a database connection used by buildapi01 dropped while work was being done on the network. buildapi crashed at that point, and nothing is in place to bring it back up again (no active puppet, no supervisord).
Grabbing to apply some bandaids
Assignee: nobody → hwine
Status: NEW → ASSIGNED
Created attachment 819246: hourly_check -- cronjob to ensure buildapi is running
BANDAID - until something better is done. Will email release@ if it finds buildapi down, and page hwine if it doesn't come up.
Created attachment 819248: crontab.buildapi01 -- additions to existing one
BANDAID - run the hourly check and email release@ if there are any issues.
Created attachment 819254: hourly_check -- cronjob to ensure selfserve agent is running
BANDAID - script to restart selfserve-agent if it is not running. This runs on buildbot-master36; this bug seemed the closest place to record that.
Created attachment 819256: crontab.bm36 -- lines added to existing crontab
BANDAID - restart selfserve agent if it is not running.
bandaids applied -- please remove when proper solution is applied
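For readers without access to the attachments, the check-restart-notify flow described for the bandaids could look roughly like the sketch below. This is not the attached script: the function names, the `pgrep`-based process check, and the 30-second wait are all assumptions made for illustration, and the restart, email, and paging actions are passed in as callables because the real commands are not given in this bug.

```python
import subprocess
import time
from typing import Callable


def is_running(pattern: str) -> bool:
    """Return True if any process command line matches `pattern`.

    Assumption: `pgrep` is available on the host; the real check may differ.
    """
    result = subprocess.run(
        ["pgrep", "-f", pattern],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def check_and_restart(
    check: Callable[[], bool],
    restart: Callable[[], None],
    notify: Callable[[str], None],
    page: Callable[[str], None],
    wait_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run one pass of the bandaid check.

    Returns "ok" if the service was already up, "restarted" if a restart
    brought it back, and "failed" if it is still down after the restart.
    """
    if check():
        return "ok"

    # Service is down: tell the mailing list, then try to bring it back.
    notify("service was down; attempting restart")
    restart()

    # Give the service time to start before checking again.
    sleep(wait_seconds)
    if check():
        return "restarted"

    # Still down after the restart attempt: escalate to a human.
    page("service did not come back after restart")
    return "failed"
```

A cron entry would call something like `check_and_restart(lambda: is_running("buildapi"), ...)` once an hour, with `notify` and `page` wired to whatever mail and paging commands the host provides (the `"buildapi"` pattern here is a guess at the process name).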
Assignee: hwine → nobody
Status: ASSIGNED → NEW
moved to releng cluster
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → INVALID
So now that we have Apache + WSGI, does making a request relaunch buildapi if required?
If it "crashes" that's caught either by mod_wsgi (Python exception) or by the Apache parent process (segfault), and restarted immediately. If it gets wedged somehow, I believe the request would eventually time out, again either at the mod_wsgi or Apache levels. But I haven't seen this happen so I'm not sure.
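Since the wedged case has not been observed, one way to tell the failure modes apart from the client side is a probe with its own timeout. The sketch below is only an illustration using the Python standard library; the `probe` name, the result labels, and the 10-second default are assumptions, and no real buildapi URL is implied.

```python
import socket
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 10.0) -> str:
    """Classify one HTTP request against `url`.

    "up"          - the server answered with a non-error status
    "error"       - the server answered with an HTTP error (4xx/5xx)
    "timeout"     - no answer within `timeout` seconds (possibly wedged)
    "unreachable" - the connection could not be made or was dropped
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return "up"
    except urllib.error.HTTPError:
        # HTTPError subclasses URLError, so it has to be caught first.
        return "error"
    except urllib.error.URLError as exc:
        # A timeout while connecting is wrapped in URLError.reason.
        if isinstance(exc.reason, socket.timeout):
            return "timeout"
        return "unreachable"
    except socket.timeout:
        # A timeout while reading the response is raised directly.
        return "timeout"
    except ConnectionError:
        return "unreachable"
```

A "timeout" result from such a probe would be the signal that the application is wedged rather than crashed, since a crash should be followed by an immediate restart on the next request.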
Component: Tools → General
Product: Release Engineering → Release Engineering