Closed Bug 1705961 Opened 5 years ago Closed 3 years ago

trigger_indexer.py should terminate stopped indexing jobs when spawning a new indexer, not just running jobs

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: asuth, Assigned: asuth)

Details

Attachments

(1 file)

Landed PR: terminate stopped indexers too when triggering indexer (3 years ago, Andrew Sutherland [:asuth] (he/him), 47 bytes, text/x-github-pull-request)

Andrew Sutherland [:asuth] (he/him)

Assignee

Description

•

5 years ago

In discussion of a failed config4 indexer, it came up that the logic to terminate existing indexers when triggering a new indexer explicitly only terminates running indexers, not stopped indexers (thanks to :kats for noticing that![1]). I think we should terminate stopped indexers (stopped being the state that a failed indexing job puts itself in) as well as running indexers. The general idea is that we shouldn't let stopped indexers stack up in the event that no one is actively triaging indexer failures.
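
As a rough illustration of the proposed change, here is a minimal sketch of selecting both running and stopped indexers for termination. It is not the actual code in trigger_indexer.py: the tag keys `channel` and `role`, the channel name `release`, and the helper names are all assumptions made for the example. The `instance-state-name` filter and the response shape follow the EC2 `describe_instances` API.

```python
# Hypothetical sketch; tag keys and helper names are assumptions,
# not taken from trigger_indexer.py.

# Previously only "running" would have been matched; the proposal adds "stopped".
INDEXER_STATES = ["running", "stopped"]


def indexer_filters(channel, states=INDEXER_STATES):
    """Build EC2 describe_instances filters matching indexers on one channel."""
    return [
        {"Name": "tag:channel", "Values": [channel]},  # assumed tag key
        {"Name": "tag:role", "Values": ["indexer"]},   # assumed tag key/value
        {"Name": "instance-state-name", "Values": list(states)},
    ]


def instance_ids_to_terminate(response, states=INDEXER_STATES):
    """Collect instance IDs in the given states from a describe_instances-shaped dict."""
    ids = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            if instance["State"]["Name"] in states:
                ids.append(instance["InstanceId"])
    return ids


# With boto3 the pieces would be wired together roughly like this:
#   ec2 = boto3.client("ec2")
#   resp = ec2.describe_instances(Filters=indexer_filters("release"))
#   ids = instance_ids_to_terminate(resp)
#   if ids:
#       ec2.terminate_instances(InstanceIds=ids)
```

Keeping the state list in one place means the only behavioral change is widening it from running to running plus stopped.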

It may be appropriate to do some kind of carve-out here for situations where we're actively investigating, so that the investigation doesn't get interrupted. Possibilities:

- Notice if the security mode is indexer instead of indexer-secure and don't terminate in that case.
- Have us explicitly re-tag channels when there's an indexing failure we want to investigate. I've recently taken to moving config1 web-servers across channels to make it easier to stand up a "dev" channel that only involves changes to static resources. If we add a script for this, it would be easy enough to intentionally move the stopped indexer to an "investigating" tag or "investigating-asuth" or the like. The script could also handle assigning the server to an ELB target group if it's a "web-server" and there's an ELB with the same name for the channel (and deregistering existing targets, or just failing with a message that the existing target should be terminated first).
  - This could possibly eventually lead to some kind of cleverness where, instead of having separate ELBs per developer, we could have just the single "dev" channel ELB work for that. In combination with bug 1703115 this would seem to let us decrease one of our main cost areas at the expense of wackier URLs. (For example, our URLs may end up becoming https://dev.searchfox.org/asuth/mozilla-central/ or https://dev.searchfox.org/asuth-mozilla-central/.)
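
The re-tagging option could be sketched as below. This is only an illustration under assumptions: the tag key `channel` and the function name are hypothetical, and no such script is described in this bug beyond the idea. The tag list uses the `Key`/`Value` shape that EC2 uses for instance tags.

```python
# Hypothetical sketch of the re-tagging carve-out; the tag key and the
# function name are assumptions, not an existing searchfox script.

CHANNEL_TAG_KEY = "channel"  # assumed tag key


def retag_for_investigation(tags, who=None):
    """Return a new tag list whose channel tag marks the instance as under investigation.

    The input list is not modified. If no channel tag exists, one is appended.
    """
    marker = "investigating" if who is None else f"investigating-{who}"
    new_tags = []
    replaced = False
    for tag in tags:
        if tag["Key"] == CHANNEL_TAG_KEY:
            new_tags.append({"Key": CHANNEL_TAG_KEY, "Value": marker})
            replaced = True
        else:
            new_tags.append(dict(tag))
    if not replaced:
        new_tags.append({"Key": CHANNEL_TAG_KEY, "Value": marker})
    return new_tags


# With boto3, the result could be applied roughly like this:
#   ec2 = boto3.client("ec2")
#   ec2.create_tags(Resources=[instance_id], Tags=retag_for_investigation(tags, "asuth"))
```

Once the channel tag no longer matches, a termination pass that filters by channel would skip the instance without needing any special-case logic.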

1: My morning coffee routine is a convenient time to look at searchfox failures, but maybe not the wisest time to be reading code ;)

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

5 years ago

Summary: trigger_indexer.py should terminate stopped indexing jobs when spawning a new indexer, not just running jops → trigger_indexer.py should terminate stopped indexing jobs when spawning a new indexer, not just running jobs

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 1

•

3 years ago

I think re-tagging so the indexer no longer looks like an indexer is the way to go. At least currently, it's very possible for ssh.py to fail to transition the security mode back to indexer-secure due to a stale request when the connection finally terminates. Even if the request was re-issued from scratch, the LDAP grant could also already have expired.

I'm going to try to fix this now because config4 has been falling over all week. The indexers would have stacked up if I hadn't manually terminated them, and keeping them around clearly had no benefit for anyone.

Assignee: nobody → bugmail

Status: NEW → ASSIGNED

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 2

•

3 years ago

Attached file: Landed PR: terminate stopped indexers too when triggering indexer

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 3

•

3 years ago

The lambda jobs have been updated.

Status: ASSIGNED → RESOLVED

Closed: 3 years ago

Resolution: --- → FIXED

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 4

•

3 years ago

This worked: Saturday's release4 run failed because of the ongoing "rust" tree breakage, and today, Sunday, there was only one stopped indexer after release4 failed again. (Presumably Saturday's stopped release4 indexer was terminated at lambda function time, as intended, since the indexer itself will not take any such action.)

Categories

(Webtools :: Searchfox, enhancement)


Security

(public)
