Closed
Bug 1312513
Opened 8 years ago
Closed 7 years ago
How do we (temporarily) stop an app on the releng web cluster?
Categories
(Infrastructure & Operations :: IT-Managed Tools, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: hwine, Assigned: fox2mike)
Details
(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/4083])
Situation came up during the 2016-10-08 TCW, looking to see if there's an "official" way to handle it.
One of the apps RelEng runs on the web cluster is "self service". As part of its operation, it maintains a r/w connection to the buildbot databases. During certain database operations (such as failover), we need to be able to:
- stop that backend connection
- (re)start that backend connection
As I understand it, the only "built in" support is to hardhat the vip for the frontend of the webapp. Are there any other "built in" options? Or recommended ways to achieve this? We do have access to the admin node for app deployment.
Comment 1•8 years ago
You probably mean the self-serve agent, which runs on a few masters (look for include toplevel::mixin::selfserve_agent in https://hg.mozilla.org/build/puppet/file/default/manifests/moco-nodes.pp; so bm70, bm71 etc). That's managed by supervisord.
Reporter
Comment 2•8 years ago
During the last TCW, we had some active connections to the R/W VIP from the releng web cluster after all buildbot masters were gracefully shut down. I'll see if I still have details available and update here.
Flags: needinfo?(hwine)
Assignee
Comment 3•8 years ago
There are harsher ways to do this (iptables etc) but I'd like to maybe discuss what we're trying to do a little better before suggesting more paths forward.
Reporter
Comment 4•8 years ago
Update to comment 0:
- the app with the db connections is 'buildapi' (not selfserve)
- buildapi connects to the buildapi database via the buildbot-rw-vip connection
:fox2mike - the goal is as in comment 0 - we need to be able to shut down an app temporarily so that db connections will close.
Flags: needinfo?(hwine) → needinfo?(smani)
FWIW, I see five ways you could do this. #5 is the safest option, giving WebOps the ability to shutter the app at any time and have it behave towards the databases and respond to clients in a RelEng-approved manner for that time.
1. TrafficScript reject all requests to the app, restart the webservers to close any open sessions, and then ignore open sessions from the app as it's hardhat'd. This isn't really a great plan, because it's not provably certain that the app is writeless during the work.
2. Alter the app's credentials on the write master to prohibit its access, and terminate any open sessions from the database server. This is guaranteed to work, but requires the database team to alter and then revert a change to credentials, which is fragile.
3. Alter the app's database hostname to force it to error when trying to connect to the database, restart the webservers to close any open sessions. This is guaranteed to work, but leaves the app running in a broken state for the duration.
4. Remove all httpd configs for the app entirely. This is guaranteed to work, but will produce undefined results for any requests to the app for the duration (404, 500, etc) unless combined with a hardhat solution.
5. Alter the app's config to enable "maintenance mode" operation, restart the webservers to close any open sessions. This is the best solution, but will require the app developers to implement a maintenance mode that can be enabled/disabled easily.
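As a rough illustration of option 5, and assuming the app is served as a Python WSGI application (an assumption, not something this bug confirms), a maintenance mode could be a small wrapper that answers 503 while a flag file exists. The class name, flag path, and Retry-After value below are all hypothetical.

```python
import os

# Hypothetical flag file; its presence means "maintenance mode is on".
MAINTENANCE_FLAG = "/tmp/app.maintenance"


class MaintenanceMode:
    """WSGI middleware sketch: short-circuit requests while the flag exists."""

    def __init__(self, app, flag_path=MAINTENANCE_FLAG):
        self.app = app
        self.flag_path = flag_path

    def __call__(self, environ, start_response):
        if os.path.exists(self.flag_path):
            # Answer without calling the wrapped app, so this request
            # cannot trigger any database work.
            body = b"Down for maintenance\n"
            start_response(
                "503 Service Unavailable",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                    ("Retry-After", "600"),  # illustrative value, in seconds
                ],
            )
            return [body]
        # Flag absent: pass the request through unchanged.
        return self.app(environ, start_response)
```

Creating the flag file would enable the mode and removing it would disable it. Connections the app already holds open are not closed by this wrapper, so the webserver restart mentioned in option 5 would still be needed.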
Reporter
Comment 6•8 years ago
6. push an "empty/null app", which is completely doable from our side. (we could keep such an app/webpage on a different branch) -- similar to (5), but easier to test.
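One possible shape for such an empty/null app, assuming a WSGI deployment (not confirmed in this bug): a callable that opens no database connection and answers every request with 503. The name `application` and the message text are illustrative only.

```python
def application(environ, start_response):
    """Null app sketch: no imports of the real app, no database access."""
    body = b"Service temporarily stopped for maintenance\n"
    start_response(
        "503 Service Unavailable",
        [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

Because the null app shares no code with the real one, deploying it and restarting the webservers should leave nothing on the web cluster that can reconnect to the database, which is the behavior that would need verifying in a test window.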
Assignee
Comment 7•7 years ago
Or 7. Just modify the write password on the app config for the database and then follow 2 by terminating any DB sessions on the DB server?
Flags: needinfo?(smani)
Assignee
Updated•7 years ago
Flags: needinfo?(hwine)
Reporter
Comment 8•7 years ago
I'm fine with whatever process you think will work best. But I'd like to get enough detail to document who needs to do what to shut an app down. If I'm understanding these correctly:
- approaches 1 & 4 can be done completely by webops (but probably not moc)
- approaches 2, 3, & 7 take coordination between releng & webops
- approach 5 takes resources not available from releng
- approach 6 can be done completely by releng, but takes coordinated testing to see if it has a reasonable behavior
Since there isn't a slam dunk answer, I don't know what resources it would take to investigate/document anything on your side. I can help verify approach 6 during the next TCW. I am also open to resolving wontfix, and dealing with the issue adhoc if we ever have the need arise again. It was a pretty special situation that led to asking this question. :)
Shyam: your call on next step ;)
Flags: needinfo?(hwine) → needinfo?(smani)
Assignee
Comment 9•7 years ago
Cool, we shall go with WONTFIX for now, and let you try 6 and you can always come back to us if you need help with things in the future. Thanks!
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(smani)
Resolution: --- → WONTFIX
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/3583] → [kanban:https://webops.kanbanize.com/ctrl_board/2/4011]
Assignee
Updated•7 years ago
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/4011] → [kanban:https://webops.kanbanize.com/ctrl_board/2/4083]
Assignee
Comment 10•7 years ago
My apologies for all the spam. Sometimes automation can be a pain, and this is clearly one of those times. We'll try and fix this before we try and go through this process again.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED