1312513 - How do we (temporarily) stop an app on the releng web cluster?

Reporter

Description

•

8 years ago

Situation came up during the 2016-10-08 TCW, looking to see if there's an "official" way to handle it.

One of the apps RelEng runs on the web cluster is "self service". As part of its operation, it maintains an r/w connection to the buildbot databases. During certain database operations (such as failover), we need to be able to:
 - stop that backend connection
 - (re)start that backend connection.

As I understand it, the only "built in" support is to hardhat the vip for the frontend of the webapp.

Are there any other "built in" options? Or recommended ways to achieve this? We do have access to the admin node for app deployment.

:kanban

Updated

•

8 years ago

Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/3583]

Nick Thomas [:nthomas] (UTC+12)

Comment 1

•

8 years ago

You probably mean the self-serve agent, which runs on a few masters (look for include toplevel::mixin::selfserve_agent in https://hg.mozilla.org/build/puppet/file/default/manifests/moco-nodes.pp; so bm70, bm71 etc). That's managed by supervisord.

Hal Wine [:hwine] use NI!

Reporter

Comment 2

•

8 years ago

During the last TCW, we had some active connections to the R/W VIP from the releng web cluster after all buildbot masters were gracefully shut down.

I'll see if I still have details available and update here.

Flags: needinfo?(hwine)

Shyam Mani [:fox2mike]

Assignee

Comment 3

•

8 years ago

There are harsher ways to do this (iptables etc) but I'd like to maybe discuss what we're trying to do a little better before suggesting more paths forward.

Hal Wine [:hwine] use NI!

Reporter

Comment 4

•

8 years ago

Update to comment 0:
 - the app with the db connections is 'buildapi' (not selfserve)
 - buildapi connects to the buildapi database via the buildbot-rw-vip connection

:fox2mike - the goal is as in comment 0 - we need to be able to shut down an app temporarily so that db connections will close.

Flags: needinfo?(hwine) → needinfo?(smani)

:Atoll

Comment 5

•

8 years ago

FWIW, I see five ways you could do this. #5 is the safest option, giving WebOps the ability to shutter the app at any time and have it behave towards the databases and respond to clients in a RelEng-approved manner for that time.

1. TrafficScript reject all requests to the app, restart the webservers to close any open sessions, and then ignore open sessions from the app as it's hardhat'd.

This isn't really a great plan, because it's not provably certain that the app is writeless during the work.

2. Alter the app's credentials on the write master to prohibit its access, and terminate any open sessions from the database server.

This is guaranteed to work, but requires the database team to alter and then revert a change to credentials, which is fragile.

3. Alter the app's database hostname to force it to error when trying to connect to the database, restart the webservers to close any open sessions.

This is guaranteed to work, but leaves the app running in a broken state for the duration.

4. Remove all httpd configs for the app entirely.

This is guaranteed to work, but will produce undefined results for any requests to the app for the duration (404, 500, etc) unless combined with a hardhat solution.

5. Alter the app's config to enable "maintenance mode" operation, restart the webservers to close any open sessions.

This is the best solution, but will require the app developers to implement a maintenance mode that can be enabled/disabled easily.
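For illustration only, here is a minimal sketch of what a "maintenance mode" switch for option 5 could look like, assuming the app is served as a Python WSGI application. The flag-file path, the class name, and the 600-second `Retry-After` value are hypothetical and not taken from buildapi.

```python
import os

# Hypothetical flag file; creating it turns maintenance mode on.
MAINTENANCE_FLAG = "/tmp/buildapi.maintenance"


class MaintenanceMiddleware:
    """Wrap a WSGI app and answer 503 while the flag file exists."""

    def __init__(self, app, flag_path=MAINTENANCE_FLAG):
        self.app = app
        self.flag_path = flag_path

    def __call__(self, environ, start_response):
        if os.path.exists(self.flag_path):
            body = b"Down for maintenance, please retry later.\n"
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
                ("Retry-After", "600"),  # arbitrary illustrative value
            ])
            return [body]
        # Flag absent: pass the request through to the real app.
        return self.app(environ, start_response)
```

A wrapper like this only stops new requests from reaching the app. Database connections that are already open would still need the webserver restart described in option 5.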

:kanban

Updated

•

8 years ago

Assignee: server-ops-webops → smani

Hal Wine [:hwine] use NI!

Reporter

Comment 6

•

8 years ago

6. Push an "empty/null app", which is completely doable from our side (we could keep such an app/webpage on a different branch). This is similar to (5), but easier to test.
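As a rough sketch of what such a "null app" could be, assuming the web cluster serves the app through a WSGI entry point named `application` (the default name mod_wsgi looks for; how buildapi is actually deployed is not stated here), something like the following opens no database connections at all. The message text is made up.

```python
def application(environ, start_response):
    """Stand-in app: never touches the database, always answers 503."""
    body = b"This app is temporarily shut down for maintenance.\n"
    start_response("503 Service Unavailable", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]


if __name__ == "__main__":
    # Local smoke test only; not how the cluster would serve it.
    from wsgiref.simple_server import make_server
    make_server("127.0.0.1", 8000, application).serve_forever()
```

Running the file directly serves the placeholder on port 8000, which allows a quick check of the response before pushing the branch.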

Shyam Mani [:fox2mike]

Assignee

Comment 7

•

7 years ago

Or 

7. Just modify the write password in the app config for the database, and then, as in option 2, terminate any DB sessions on the DB server?
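To make the "terminate any DB sessions" step concrete, here is a hedged sketch, assuming the database is MySQL, where `SHOW PROCESSLIST` returns rows with `Id` and `User` columns and `KILL <id>` closes a connection. The helper name, the account name, and the sample rows are hypothetical.

```python
def kill_statements(processlist_rows, app_user):
    """Build one KILL statement per session owned by app_user.

    processlist_rows: dicts shaped like SHOW PROCESSLIST output,
    with at least the "Id" and "User" keys.
    """
    return [
        "KILL %d;" % int(row["Id"])
        for row in processlist_rows
        if row["User"] == app_user
    ]


if __name__ == "__main__":
    # Made-up rows, purely for illustration.
    rows = [
        {"Id": 101, "User": "buildapi_rw", "Host": "web1:50312"},
        {"Id": 102, "User": "repl", "Host": "db2:3306"},
        {"Id": 103, "User": "buildapi_rw", "Host": "web2:41877"},
    ]
    for stmt in kill_statements(rows, "buildapi_rw"):
        print(stmt)
```

The statements are only generated, not executed, so whoever holds database admin rights can review them before running them.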

Flags: needinfo?(smani)

Shyam Mani [:fox2mike]

Assignee

Updated

•

7 years ago

Flags: needinfo?(hwine)

Hal Wine [:hwine] use NI!

Reporter

Comment 8

•

7 years ago

I'm fine with whatever process you think will work best. But I'd like to get enough detail to document who needs to do what to shut an app down.

If I'm understanding these correctly:
 - approaches 1 & 4 can be done completely by webops (but probably not moc)
 - approaches 2, 3, & 7 take coordination between releng & webops
 - approach 5 takes resources not available from releng
 - approach 6 can be done completely by releng, but takes coordinated testing to see whether it behaves reasonably

Since there isn't a slam dunk answer, I don't know what resources it would take to investigate/document anything on your side. I can help verify approach 6 during the next TCW.

I am also open to resolving wontfix, and dealing with the issue ad hoc if the need ever arises again. It was a pretty special situation that led to asking this question. :)

Shyam: your call on next step ;)

Flags: needinfo?(hwine) → needinfo?(smani)

Shyam Mani [:fox2mike]

Assignee

Comment 9

•

7 years ago

Cool, we shall go with WONTFIX for now, and let you try 6 and you can always come back to us if you need help with things in the future.

Thanks!

Status: NEW → RESOLVED

Closed: 7 years ago

Flags: needinfo?(smani)

Resolution: --- → WONTFIX

:kanban

Updated

•

7 years ago

Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/3583] → [kanban:https://webops.kanbanize.com/ctrl_board/2/4011]

:kanban

Updated

•

7 years ago

Status: RESOLVED → REOPENED

Resolution: WONTFIX → ---

Shyam Mani [:fox2mike]

Assignee

Updated

•

7 years ago

Status: REOPENED → RESOLVED

Closed: 7 years ago → 7 years ago

Resolution: --- → FIXED

:kanban

Updated

•

7 years ago

Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/4011] → [kanban:https://webops.kanbanize.com/ctrl_board/2/4083]

:kanban

Updated

•

7 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Shyam Mani [:fox2mike]

Assignee

Comment 10

•

7 years ago

My apologies for all the spam. Sometimes automation can be a pain, and this is clearly one of those times. We'll try to fix this before we go through this process again.

Status: REOPENED → RESOLVED

Closed: 7 years ago → 7 years ago

Resolution: --- → FIXED

Categories

(Infrastructure & Operations :: IT-Managed Tools, task)

Tracking

(Not tracked)

People

(Reporter: hwine, Assigned: fox2mike)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/4083])

Security

(public)
