Closed Bug 1312513 Opened 8 years ago Closed 7 years ago

How do we (temporarily) stop an app on the releng web cluster?

Categories

(Infrastructure & Operations :: IT-Managed Tools, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Assigned: fox2mike)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/4083])

Situation came up during the 2016-10-08 TCW, looking to see if there's an "official" way to handle it.

One of the apps RelEng runs on the web cluster is "self service". As part of its operation, it maintains a r/w connection to the buildbot databases. During certain database operations (such as failover), we need to be able to:
 - stop that backend connection
 - (re)start that backend connection.

As I understand it, the only "built in" support is to hardhat the vip for the frontend of the webapp.

Are there any other "built in" options? Or recommended ways to achieve this? We do have access to the admin node for app deployment.
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/3583]
You probably mean the self-serve agent, which runs on a few masters (look for "include toplevel::mixin::selfserve_agent" in https://hg.mozilla.org/build/puppet/file/default/manifests/moco-nodes.pp; that gives bm70, bm71, etc.). That's managed by supervisord.
During the last TCW, we had some active connections to the R/W VIP from the releng web cluster after all buildbot masters were gracefully shut down.

I'll see if I still have details available and update here.
Flags: needinfo?(hwine)
There are harsher ways to do this (iptables, etc.), but I'd like to discuss what we're trying to do in a bit more detail before suggesting more paths forward.
Update to comment 0:
 - the app with the db connections is 'buildapi' (not selfserve)
 - buildapi connects to the buildapi database via the buildbot-rw-vip connection

:fox2mike - the goal is as in comment 0 - we need to be able to shut down an app temporarily so that db connections will close.
Flags: needinfo?(hwine) → needinfo?(smani)
FWIW, I see five ways you could do this. #5 is the safest option, giving WebOps the ability to shutter the app at any time and have it behave towards the databases and respond to clients in a RelEng-approved manner for that time.

1. Use TrafficScript to reject all requests to the app, restart the webservers to close any open sessions, and then ignore any open sessions from the app while it's hardhat'd.

This isn't really a great plan, because it's not provably certain that the app is writeless during the work.

2. Alter the app's credentials on the write master to prohibit its access, and terminate any open sessions from the database server.

This is guaranteed to work, but requires the database team to alter and then revert a change to credentials, which is fragile.

3. Alter the app's database hostname to force it to error when trying to connect to the database, restart the webservers to close any open sessions.

This is guaranteed to work, but leaves the app running in a broken state for the duration.

4. Remove all httpd configs for the app entirely.

This is guaranteed to work, but will produce undefined results for any requests to the app for the duration (404, 500, etc) unless combined with a hardhat solution.

5. Alter the app's config to enable "maintenance mode" operation, restart the webservers to close any open sessions.

This is the best solution, but will require the app developers to implement a maintenance mode that can be enabled/disabled easily.
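
As a rough illustration of what option 5 could look like, here is a minimal sketch that assumes the app is served as a Python WSGI application. The class name MaintenanceMiddleware, the flag file path, and the Retry-After value are all hypothetical and not taken from the real app. The wrapper answers every request with a 503 while the flag file exists, so the wrapped app (and its database code) is never entered. It does not close sessions that are already open, which is why the webserver restart in option 5 is still needed.

```python
import os


class MaintenanceMiddleware:
    """WSGI wrapper that short-circuits requests while a flag file exists.

    Hypothetical sketch: the flag path and response text are assumptions.
    """

    def __init__(self, app, flag_path):
        self.app = app              # the real WSGI application
        self.flag_path = flag_path  # touching this file enables maintenance mode

    def __call__(self, environ, start_response):
        if os.path.exists(self.flag_path):
            body = b"Service temporarily unavailable for maintenance.\n"
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
                ("Retry-After", "600"),  # assumed value, in seconds
            ])
            return [body]
        # Flag absent: hand the request to the real application unchanged.
        return self.app(environ, start_response)
```

With something like this in place, enabling or disabling maintenance mode would be a matter of creating or removing the flag file on each web node and restarting the webservers.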
Assignee: server-ops-webops → smani
6. Push an "empty/null app", which is completely doable from our side. (We could keep such an app/webpage on a different branch.) This is similar to (5), but easier to test.

Or

7. Just modify the write password in the app's database config, and then, as in (2), terminate any DB sessions on the DB server?
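
For option 6, a "null app" could be as small as the sketch below, again assuming a Python WSGI deployment. The entry point name application is the conventional WSGI callable name, and the response text is made up for illustration. Because it imports no database code at all, there is nothing in it that could open a connection.

```python
def application(environ, start_response):
    """Placeholder WSGI app: answers every request with a 503 and does no DB work."""
    body = b"This application is temporarily shut down.\n"
    start_response("503 Service Unavailable", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Keeping this on its own branch, as suggested above, would let it be deployed and rolled back through the normal app deployment path.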
Flags: needinfo?(smani)
Flags: needinfo?(hwine)
I'm fine with whatever process you think will work best. But I'd like to get enough detail to document who needs to do what to shut an app down.

If I'm understanding these correctly:
 - approaches 1 & 4 can be done completely by webops (but probably not moc)
 - approaches 2, 3, & 7 take coordination between releng & webops
 - approach 5 takes resources not available from releng
 - approach 6 can be done completely by releng, but takes coordinated testing to see if it has a reasonable behavior

Since there isn't a slam dunk answer, I don't know what resources it would take to investigate/document anything on your side. I can help verify approach 6 during the next TCW.

I am also open to resolving wontfix, and dealing with the issue ad hoc if the need ever arises again. It was a pretty special situation that led to asking this question. :)

Shyam: your call on next step ;)
Flags: needinfo?(hwine) → needinfo?(smani)
Cool, we shall go with WONTFIX for now, and let you try 6 and you can always come back to us if you need help with things in the future.

Thanks!
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(smani)
Resolution: --- → WONTFIX
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/3583] → [kanban:https://webops.kanbanize.com/ctrl_board/2/4011]
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/4011] → [kanban:https://webops.kanbanize.com/ctrl_board/2/4083]
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
My apologies for all the spam. Sometimes automation can be a pain, and this is clearly one of those times. We'll try to fix this before we go through this process again.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED