How do we (temporarily) stop an app on the releng web cluster?

RESOLVED FIXED

Status

Infrastructure & Operations
WebOps: IT-Managed Tools
RESOLVED FIXED
a year ago
11 months ago

People

(Reporter: hwine, Assigned: fox2mike)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/4083])

(Reporter)

Description

a year ago
Situation came up during the 2016-10-08 TCW, looking to see if there's an "official" way to handle it.

One of the apps RelEng runs on the web cluster is "self service". As part of its operation, it maintains a r/w connection to the buildbot databases. During certain database operations (such as failover), we need to be able to:
 - stop that backend connection
 - (re)start that backend connection.

As I understand it, the only "built in" support is to hardhat the vip for the frontend of the webapp.

Are there any other "built in" options? Or recommended ways to achieve this? We do have access to the admin node for app deployment.

Updated

a year ago
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/3583]
You probably mean the self-serve agent, which runs on a few masters (look for include toplevel::mixin::selfserve_agent in https://hg.mozilla.org/build/puppet/file/default/manifests/moco-nodes.pp; so bm70, bm71 etc). That's managed by supervisord.
(Reporter)

Comment 2

a year ago
During the last TCW, we had some active connections to the R/W VIP from the releng web cluster after all buildbot masters were gracefully shut down.

I'll see if I still have details available and update here.
Flags: needinfo?(hwine)
(Assignee)

Comment 3

a year ago
There are harsher ways to do this (iptables etc) but I'd like to maybe discuss what we're trying to do a little better before suggesting more paths forward.
(Reporter)

Comment 4

a year ago
Update to comment 0:
 - the app with the db connections is 'buildapi' (not selfserve)
 - buildapi connects to the buildapi database via the buildbot-rw-vip connection

:fox2mike - the goal is as in comment 0 - we need to be able to shut down an app temporarily so that db connections will close.
Flags: needinfo?(hwine) → needinfo?(smani)
FWIW, I see five ways you could do this. #5 is the safest option, giving WebOps the ability to shutter the app at any time and have it behave towards the databases and respond to clients in a RelEng-approved manner for that time.

1. Use TrafficScript to reject all requests to the app, restart the webservers to close any open sessions, and then ignore any remaining open sessions from the app while it's hardhat'd.

This isn't really a great plan, because it's not provably certain that the app is writeless during the work.

2. Alter the app's credentials on the write master to prohibit its access, and terminate any open sessions from the database server.

This is guaranteed to work, but requires the database team to alter and then revert a change to credentials, which is fragile.

3. Alter the app's database hostname to force it to error when trying to connect to the database, restart the webservers to close any open sessions.

This is guaranteed to work, but leaves the app running in a broken state for the duration.

4. Remove all httpd configs for the app entirely.

This is guaranteed to work, but will produce undefined results for any requests to the app for the duration (404, 500, etc) unless combined with a hardhat solution.

5. Alter the app's config to enable "maintenance mode" operation, restart the webservers to close any open sessions.

This is the best solution, but will require the app developers to implement a maintenance mode that can be enabled/disabled easily.
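
A minimal sketch of what the "maintenance mode" in option 5 could look like, assuming the app is exposed as a WSGI callable. The class name, the flag-file approach, and the path are hypothetical and not taken from buildapi. While the flag file exists, every request gets a 503 and the wrapped app (and so its database code) is never called. Connections that are already open or pooled are not closed by this; the webserver restart from option 5 is still needed for that.

```python
import os


class MaintenanceMiddleware:
    """Wrap a WSGI app and answer 503 while a flag file exists.

    Hypothetical illustration: the flag path would be created and
    removed by whoever is allowed to shutter the app.
    """

    def __init__(self, app, flag_path, retry_after=600):
        self.app = app
        self.flag_path = flag_path      # assumption: a file on local disk
        self.retry_after = retry_after  # seconds, sent as Retry-After

    def __call__(self, environ, start_response):
        if os.path.exists(self.flag_path):
            body = b"Down for scheduled maintenance.\n"
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
                ("Retry-After", str(self.retry_after)),
            ])
            return [body]
        # Flag absent: pass the request through untouched.
        return self.app(environ, start_response)
```

Enabling and disabling then comes down to creating or deleting the flag file, which keeps the toggle outside the app's own config and code paths.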

Updated

a year ago
Assignee: server-ops-webops → smani
(Reporter)

Comment 6

a year ago
6. push an "empty/null app", which is completely doable from our side. (we could keep such an app/webpage on a different branch) -- similar to (5), but easier to test.
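
For illustration, an "empty/null app" could be as small as the WSGI callable below. This is a sketch only: the name `null_app` and the response text are made up, and it assumes the web cluster serves the app through WSGI. It opens no database connection at all, so once it is deployed and the webservers are restarted, nothing on the web cluster should hold a session against the r/w VIP.

```python
def null_app(environ, start_response):
    """Placeholder WSGI app that never touches the database."""
    body = b"This service is temporarily offline for maintenance.\n"
    start_response("503 Service Unavailable", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

For a quick local check before a TCW, the standard library's `wsgiref.simple_server.make_server` can serve this callable on a test port.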
(Assignee)

Comment 7

11 months ago
Or 

7. Just modify the write password on the app config for the database and then follow 2 by terminating any DB sessions on the DB server?
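
The "terminate any DB sessions" step in options 2 and 7 might be sketched like this, assuming a MySQL-compatible server (the thread does not name the database engine). The helper only builds `KILL` statements from process-list rows, so a person can review them before running anything. The function name, user names, and sample rows are hypothetical.

```python
def kill_statements(processlist, app_user):
    """Return one KILL statement per session owned by app_user.

    processlist: iterable of (id, user, host) tuples, for example rows from
    SELECT id, user, host FROM information_schema.processlist
    """
    return [
        "KILL %d;" % int(pid)
        for pid, user, _host in processlist
        if user == app_user
    ]


# Hypothetical sample rows; real ones would come from the database server.
rows = [
    (101, "buildapi", "web1:51234"),
    (102, "repl", "db2:3306"),
    (103, "buildapi", "web2:40022"),
]
print(kill_statements(rows, "buildapi"))  # ['KILL 101;', 'KILL 103;']
```

Killing sessions only helps once the app can no longer reconnect, which is why option 7 changes the password first.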
Flags: needinfo?(smani)
(Assignee)

Updated

11 months ago
Flags: needinfo?(hwine)
(Reporter)

Comment 8

11 months ago
I'm fine with whatever process you think will work best. But I'd like to get enough detail to document who needs to do what to shut an app down.

If I'm understanding these correctly:
 - approaches 1 & 4 can be done completely by webops (but probably not moc)
 - approaches 2, 3, & 7 take coordination between releng & webops
 - approach 5 takes resources not available from releng
 - approach 6 can be done completely by releng, but takes coordinated testing to see if it has a reasonable behavior

Since there isn't a slam dunk answer, I don't know what resources it would take to investigate/document anything on your side. I can help verify approach 6 during the next TCW.

I am also open to resolving wontfix, and dealing with the issue ad hoc if the need ever arises again. It was a pretty special situation that led to asking this question. :)

Shyam: your call on next step ;)
Flags: needinfo?(hwine) → needinfo?(smani)
(Assignee)

Comment 9

11 months ago
Cool, we shall go with WONTFIX for now, and let you try 6 and you can always come back to us if you need help with things in the future.

Thanks!
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Flags: needinfo?(smani)
Resolution: --- → WONTFIX

Updated

11 months ago
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/3583] → [kanban:https://webops.kanbanize.com/ctrl_board/2/4011]

Updated

11 months ago
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
(Assignee)

Updated

11 months ago
Status: REOPENED → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → FIXED

Updated

11 months ago
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/4011] → [kanban:https://webops.kanbanize.com/ctrl_board/2/4083]

Updated

11 months ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 10

11 months ago
My apologies for all the spam. Sometimes automation can be a pain, and this is clearly one of those times. We'll try to fix this before we go through this process again.
Status: REOPENED → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → FIXED