Back-end databases often need to go down for maintenance or repair, and we need a way to tell the system that this has happened. As it is, simply taking down a back-end DB host causes a couple of bad behaviors:

1. We may still assign nodes to the down server. Mitigating this in the current system requires removing the nodes from node_config.json (which is JSON, so the entries can't just be commented out) and setting `ct` to 0 in the available_nodes table on the admin host. A hackish alternative is to crank up the actives on the node, which keeps new assignments from happening until approximately 1am but could pollute metrics. If a new user gets 503'd, we think they get an 'unknown' error. This is untested.

2. If the host is all the way down, instead of merely refusing MySQL connections, the webheads run out of Apache processes. This is because the MySQL connection timeout is long (60s). We can shorten it, but even at 5s I could see us running out of Apache processes at high load. Mitigating this requires repointing the shard_constants entries for the host at something that will refuse the DB connection quickly. I've been using 127.0.0.1 for this.

So having a single place to flag a server as 'down' that avoids these ugly behaviors would be extremely valuable to ops.
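To make the request concrete, here is a rough Python sketch of what a single 'down' flag could look like. It is illustrative only and is not the actual reg-server code: `DownRegistry`, `assignable_nodes`, `connect`, and `HostDownError` are hypothetical names, and the in-memory set stands in for wherever the flag would really live. The idea is that both node assignment and connection setup consult the same flag, so a downed host gets no new nodes and callers fail immediately instead of waiting out the MySQL connection timeout.

```python
from dataclasses import dataclass, field


class HostDownError(Exception):
    """Raised instead of attempting a connection to a host flagged as down."""


@dataclass
class DownRegistry:
    # Hypothetical single source of truth for hosts flagged as down.
    # A real system would persist this somewhere shared, not in memory.
    down_hosts: set = field(default_factory=set)

    def mark_down(self, host):
        self.down_hosts.add(host)

    def mark_up(self, host):
        self.down_hosts.discard(host)

    def is_down(self, host):
        return host in self.down_hosts


def assignable_nodes(nodes, registry):
    """Return node names (sorted) whose back-end host is not flagged down.

    `nodes` maps node name -> back-end DB host (an assumed layout).
    """
    return [name for name, host in sorted(nodes.items())
            if not registry.is_down(host)]


def connect(host, registry, connector):
    """Fail fast for downed hosts; otherwise delegate to `connector`.

    `connector` is any callable that opens a connection to `host`.
    """
    if registry.is_down(host):
        raise HostDownError(host)
    return connector(host)
```

With something like this, ops would flag the host once (`registry.mark_down("db2")`) rather than editing node_config.json, zeroing `ct`, and repointing shard_constants by hand.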
http://hg.mozilla.org/services/reg-server-secure/rev/8c06e8ab82cb allows us to mark nodes as downed.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED