trigger_indexer.py should terminate stopped indexing jobs when spawning a new indexer, not just running jobs
Categories
(Webtools :: Searchfox, enhancement)
Tracking
(Not tracked)
People
(Reporter: asuth, Assigned: asuth)
Details
Attachments
(1 file)
In discussion of a failed config4 indexer, it came up that the logic to terminate existing indexers when triggering a new indexer explicitly only terminates running indexers, not stopped indexers (thanks to :kats for noticing that![1]). I think we should terminate stopped indexers (which is the state that a failed indexing job will put itself in) as well as running indexing jobs. The general idea is that we shouldn't let stopped indexers stack up in the event no one is actively triaging indexer failures.
It may be appropriate to do some kind of carve-out here for situations where we're actively investigating so that the investigation doesn't get interrupted. Possibilities:
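A minimal sketch of the proposed change, not the actual trigger_indexer.py code. It assumes indexers carry an `indexer` tag and a `channel` tag (both tag names are assumptions for illustration) and works on a response dict shaped like the one EC2's `describe_instances` returns. The helper names are hypothetical.

```python
# States to clean up. "stopped" is the state a failed indexing job leaves
# itself in, so it is included alongside "running".
TERMINATE_STATES = ("running", "stopped")


def indexer_filters(channel):
    """Build describe_instances-style filters for one channel's indexers.

    The tag names "indexer" and "channel" are assumptions, not confirmed
    against the real script.
    """
    return [
        {"Name": "tag-key", "Values": ["indexer"]},
        {"Name": "tag:channel", "Values": [channel]},
        {"Name": "instance-state-name", "Values": list(TERMINATE_STATES)},
    ]


def select_indexers_to_terminate(response, channel):
    """Return the ids of running or stopped indexers in the given channel.

    `response` is a dict shaped like an EC2 describe_instances response.
    This repeats the filter logic locally so it can be checked without AWS.
    """
    ids = []
    for reservation in response.get("Reservations", []):
        for inst in reservation.get("Instances", []):
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            if "indexer" not in tags:
                continue  # not an indexer (for example, a web-server)
            if tags.get("channel") != channel:
                continue  # belongs to another channel
            if inst["State"]["Name"] in TERMINATE_STATES:
                ids.append(inst["InstanceId"])
    return ids
```

With a real client, the filters would be passed as `Filters=indexer_filters(channel)` to `describe_instances`, and the resulting ids handed to `terminate_instances`.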
- Notice if the security mode is indexer instead of indexer-secure and don't terminate in that case.
- Have us explicitly re-tag channels when there's an indexing failure we want to investigate. I've recently taken to moving config1 web-servers across channels to make it easier to stand up a "dev" channel that only involves changes to static resources. If we add this script, it would be easy enough to intentionally move the stopped indexer to an "investigating" tag or "investigating-asuth" or the like. This could also handle assigning the server to an ELB target group if it's a "web-server" and there's an ELB with the same name for the channel (and deregistering existing targets, or just failing saying that the existing target should be terminated first).
  - This could possibly eventually lead to some kind of cleverness where instead of having separate ELBs per-developer, we could have just the single "dev" channel ELB work for that. In combination with bug 1703115 this would seem to let us decrease one of our main cost areas at the expense of wackier URLs. (Like our URLs may end up becoming https://dev.searchfox.org/asuth/mozilla-central/ or https://dev.searchfox.org/asuth-mozilla-central/.)
[1] My morning coffee routine is a convenient time to look at searchfox failures, but maybe not the wisest time to be reading code ;)
Updated • 5 years ago (Assignee)
Comment 1 • 3 years ago (Assignee)
I think re-tagging so the indexer no longer looks like an indexer is the way to go. At least currently, it's very possible for ssh.py to fail to transition the security mode back to indexer-secure due to a stale request when the connection finally terminates. Even if the request was re-issued from scratch, the LDAP grant could also already have expired.
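As a rough illustration of the re-tagging idea, the sketch below computes a new tag set under which a stopped indexer no longer matches an indexer filter. It assumes indexers are recognized by an `indexer` tag key and a `channel` tag. Those tag names, the choice to drop the `indexer` tag, and the function names are all assumptions for illustration, not the script's confirmed behavior.

```python
def looks_like_indexer(tags):
    """True if a tag dict would still match an indexer filter (assumed rule)."""
    return "indexer" in tags


def retag_for_investigation(tags, who=None):
    """Return a copy of `tags` moved to an "investigating" channel.

    The "indexer" tag is dropped so automatic cleanup skips the instance,
    and the channel becomes "investigating" or "investigating-<who>".
    The input dict is left unmodified.
    """
    new_tags = dict(tags)
    new_tags.pop("indexer", None)
    new_tags["channel"] = "investigating-" + who if who else "investigating"
    return new_tags


def to_ec2_tags(tags):
    """Convert a plain dict into the Key/Value list form EC2 tag calls use."""
    return [{"Key": k, "Value": v} for k, v in sorted(tags.items())]
```

Because this depends only on tags, it is unaffected by whether the security mode was successfully switched back to indexer-secure.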
I'm going to try to fix this now because config4 has been falling over all week. The indexers would have stacked up if I hadn't manually terminated them, and there was clearly no benefit to this for anyone.
Comment 2 • 3 years ago (Assignee)
Comment 3 • 3 years ago (Assignee)
lambda jobs have been updated
Comment 4 • 3 years ago (Assignee)
This worked; Saturday's release4 run failed because of the ongoing "rust" tree breakage and today, Sunday, there was only one stopped indexer after release4 failed again. (Presumably Saturday's stopped release4 indexer was terminated at lambda function time, as intended, since the indexer itself will not take any such action.)