Bug 1048358 (Closed) - Opened 10 years ago - Closed 10 years ago

slaverebooter should attempt reboot if graceful shutdown fails

Categories

(Release Engineering :: General, defect)

Platform: x86_64 Windows 7
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: Callek)

References

Details

Attachments

(3 files, 1 obsolete file)

So slaverebooter currently takes a very long time to execute, due to graceful shutdowns blocking other work.

This also creates a feedback loop: on re-entry we will attempt a graceful shutdown again, fail again, and never actually reboot.

We should allow the script to complete sooner and, on re-entry, check the status of prior results!

This bug aims to achieve the following:

* Make the script run in a shorter timeframe, ideally less than 30 minutes (rather than the 10-12 hours it takes now)
* Make the script re-entrant (if the machine needs to reboot, the logic shouldn't care)
** This should even allow us to run the script on another host without having to care deeply.
* Force a reboot on any machine that fails to gracefully shut down buildbot.
Assignee: nobody → bugspam.Callek
Attached patch [tools] v1 (obsolete) — Splinter Review
Hey coop, how does this look?

This is a pretty big refactor; however, I left out the "continue to reboot if graceful fails" piece just because I'd like to be able to test this as-is.

What it *does* do, besides the refactor, is:

* Makes the codepath more atomic (if the script dies, we can re-enter without as much worry)
* Checks slaveapi history to see if a graceful was already issued.
* Doesn't attempt to do work on a slave if <work item> is in progress or if <work item> was completed within IDLE_THRESHOLD (currently 5 hours)
* Doesn't attempt to wait for a graceful shutdown to complete, which allows the script to finish much faster than previously.
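
As a rough illustration of that skip logic, here's a minimal sketch (not the actual patch; the state constants and the exact result keys from slaveapi are assumptions here):

    import time

    IDLE_THRESHOLD = 5 * 60 * 60  # 5 hours, as described above

    # Placeholder state values; the real constants come from slaveapi.
    PENDING, RUNNING = 0, 1

    def needs_work(result):
        # Decide whether to act on a slave, given its most recent
        # slaveapi result (None when there is no history at all).
        if result is None:
            return True  # nothing was ever issued, so it's safe to act
        if result["state"] in (PENDING, RUNNING):
            return False  # a work item is still in progress
        latest = max(result["start_timestamp"],
                     result["request_timestamp"],
                     result["end_timestamp"])
        # Only act again if the last action finished outside the threshold.
        return (time.time() - latest) > IDLE_THRESHOLD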


I need to test this before a deploy, which will consist of a dry run or two on the bm our official runs on, as well as temporarily disabling the official run for a day or two in order to run this copy instead.

Once this deploys officially it should be < 1 hour of work to make the graceful failures be ignored.
Attachment #8471923 - Flags: review?(coop)
Attachment #8471923 - Attachment is patch: true
Comment on attachment 8471923 [details] [diff] [review]
[tools] v1

Review of attachment 8471923 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Justin Wood (:Callek) from comment #1)
> Created attachment 8471923 [details] [diff] [review]
> [tools] v1
> 
> Hey coop, how does this look?
> 
> This is a pretty big refactor; however, I left out the "continue to reboot
> if graceful fails" piece just because I'd like to be able to test this as-is.
> 
> What it *does* do, besides the refactor, is:
> 
> * Makes the codepath more atomic (if the script dies, we can re-enter
> without as much worry)
> * Checks slaveapi history to see if a graceful was already issued.
> * Doesn't attempt to do work on a slave if <work item> is in progress or if
> <work item> was completed within IDLE_THRESHOLD (currently 5 hours)
> * Doesn't attempt to wait for a graceful shutdown to complete, which allows
> the script to finish much faster than previously.
> 
> 
> I need to test this before a deploy, which will consist of a dry run or two
> on the bm our official runs on, as well as temporarily disabling the
> official run for a day or two in order to run this copy instead.
> 
> Once this deploys officially it should be < 1 hour of work to make the
> graceful failures be ignored.

::: buildfarm/maintenance/reboot-idle-slaves.py
@@ +49,4 @@
>      return dt.strftime("%A, %B %d, %H:%M")
>  
>  
> +def get_latest_timestamp_from_result(result):

You could probably inline this whole function below as:

res_ts = max(result['start_timestamp'], result['request_timestamp'], result['end_timestamp'])

@@ +61,5 @@
> +    res = None
> +    for result in results:
> +        if not res:
> +            res = result
> +        res_ts = get_latest_timestamp_from_result(res)

Wrong level of indentation here, I think. You'll want to initialize this once in the if clause and then 'continue.'
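
Roughly this shape, i.e. (just a sketch of the suggestion, not patch code):

    res = None
    for result in results:
        if not res:
            res = result
            res_ts = get_latest_timestamp_from_result(res)
            continue
        if res_ts < get_latest_timestamp_from_result(result):
            res = result
            res_ts = get_latest_timestamp_from_result(res)  # keep res_ts in sync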

@@ +63,5 @@
> +        if not res:
> +            res = result
> +        res_ts = get_latest_timestamp_from_result(res)
> +        if res_ts < get_latest_timestamp_from_result(result):
> +            res = result

You don't return anything from this function.

@@ +66,5 @@
> +        if res_ts < get_latest_timestamp_from_result(result):
> +            res = result
> +
> +
> +def get_recent_foo(slaveapi, slave, action):

Can we change this name to 'get_recent_action,' please?

@@ +80,5 @@
> +    return get_latest_result(results)
> +
> +
> +def get_recent_graceful(slaveapi, slave):
> +    return get_recent_foo(slaveapi, slave, "shutdown_buildslave")

Make this call match the new function name above.

@@ +84,5 @@
> +    return get_recent_foo(slaveapi, slave, "shutdown_buildslave")
> +
> +
> +def get_recent_reboot(slaveapi, slave):
> +    return get_recent_foo(slaveapi, slave, "reboot")

Make this call match the new function name above.

@@ +93,5 @@
> +    if not info["recent_jobs"]:
> +        log.info("%s - Skipping reboot because no recent jobs found", slave)
> +        return None
> +
> +    return info["recent_jobs"][0]["endtime"]

Is a recent job always guaranteed to have an endtime? You should check this before returning it.

@@ +111,5 @@
> +
> +    time.sleep(30)  # Sleep to give a graceful some leeway to complete
> +    log.debug("%s - issued graceful, re-adding to queue", slave)
> +    SLAVE_QUEUE.put_nowait(slave)
> +    return

No return value?

@@ +121,5 @@
> +    retry(requests.post, args=(str(url),))
> +    # Because SlaveAPI fully escalates reboots (all the way to IT bug filing),
> +    # there's no reason for us to watch for it to complete.
> +    log.info("%s - Reboot queued", slave)
> +

No return value?

@@ +140,1 @@
>              log.info("%s - Skipping reboot because no recent jobs found", slave)

Maybe it's not taking jobs because it needs a reboot? May be a chicken-and-egg problem here.

If the slave is enabled and *should* be taking jobs, I think we should reboot it anyway. We could make this smarter by not doing this on weekends, etc.

@@ +150,5 @@
> +        recent_graceful = get_recent_graceful(slaveapi, slave)
> +        recent_graceful_ts = get_latest_timestamp_from_result(recent_graceful)
> +        recent_reboot = get_recent_reboot(slaveapi, slave)
> +        recent_reboot_ts = get_latest_timestamp_from_result(recent_reboot)
> +        # Determine the timestamp we care about for doing work

As mentioned above, maybe this shouldn't be static, i.e. maybe we have different criteria based on time of day, day of week, etc. Possibly break this into its own function where we can encapsulate that logic?
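
For example, a hypothetical helper along these lines (the name and numbers here are made up):

    from datetime import datetime

    def reboot_threshold(now=None):
        # Encapsulate the idle threshold so it can vary by time of day,
        # day of week, etc., instead of being a static constant.
        now = now or datetime.now()
        if now.weekday() >= 5:  # Saturday or Sunday
            return 10 * 60 * 60  # be more lenient on weekends
        return 5 * 60 * 60  # the current 5-hour IDLE_THRESHOLD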

@@ +191,5 @@
> +                         slave, get_formatted_time(last_job_dt))
> +                return
> +            # TODO: XXXCallek remove this FAILURE clause when we want to reboot anyway
> +            if recent_graceful["state"] in (FAILURE,):
> +                log.info("%s - Graceful shutdown failed, aborting reboot", slave)

Isn't this one of the cases we were trying to fix with a hard reboot, i.e. the graceful shutdown failed, so now we try the reboot anyway?
Attachment #8471923 - Flags: review?(coop) → review-
(In reply to Chris Cooper [:coop] from comment #2)
> ::: buildfarm/maintenance/reboot-idle-slaves.py
> @@ +49,4 @@
> >      return dt.strftime("%A, %B %d, %H:%M")
> >  
> >  
> > +def get_latest_timestamp_from_result(result):
> 
> You could probably inline this whole function below as:
> 
> res_ts = max(result['start_timestamp'], result['request_timestamp'], result['end_timestamp'])

It's used at least twice there, and I wanted to keep the line length down and avoid confusion. (Also, if we ever extend slaveapi to have a result['progress_timestamp'] while in progress, we can add it in one place more easily.)

However, I did reduce the function to a one-line return rather than using a list/for loop.
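
i.e. roughly:

    def get_latest_timestamp_from_result(result):
        return max(result['start_timestamp'],
                   result['request_timestamp'],
                   result['end_timestamp'])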

> @@ +61,5 @@
> > +    res = None
> > +    for result in results:
> > +        if not res:
> > +            res = result
> > +        res_ts = get_latest_timestamp_from_result(res)
> 
> Wrong level of indentation here, I think. You'll want to initialize this
> once in the if clause and then 'continue.'

Well, it actually is fine as-is, but I can see the value in the continue and in pushing the assignment below the if.

That said, pushing the res_ts assignment below the if would mean also setting it below the later if, making two calls to get_...timestamp... each time we need to update res. So I left the assignment as-is, but added the continue.
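
So the loop ends up roughly like this (a sketch; the return was added per your next point):

    def get_latest_result(results):
        res = None
        for result in results:
            if not res:
                res = result
                continue
            res_ts = get_latest_timestamp_from_result(res)
            if res_ts < get_latest_timestamp_from_result(result):
                res = result
        return res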

> @@ +63,5 @@
> > +        if not res:
> > +            res = result
> > +        res_ts = get_latest_timestamp_from_result(res)
> > +        if res_ts < get_latest_timestamp_from_result(result):
> > +            res = result
> 
> You don't return anything from this function.

fixed

> 
> @@ +66,5 @@
> > +        if res_ts < get_latest_timestamp_from_result(result):
> > +            res = result
> > +
> > +
> > +def get_recent_foo(slaveapi, slave, action):
> 
> Can we change this name to 'get_recent_action,' please?

done

> @@ +93,5 @@
> > +    if not info["recent_jobs"]:
> > +        log.info("%s - Skipping reboot because no recent jobs found", slave)
> > +        return None
> > +
> > +    return info["recent_jobs"][0]["endtime"]
> 
> Is a recent job always guaranteed to have an endtime? You should check this
> before returning it.

Unsure, *however* it was like this in the old code, so I'm inclined to keep it.
http://hg.mozilla.org/build/tools/file/f9c24112003a/buildfarm/maintenance/reboot-idle-slaves.py#l61


> @@ +111,5 @@
> > +
> > +    time.sleep(30)  # Sleep to give a graceful some leeway to complete
> > +    log.debug("%s - issued graceful, re-adding to queue", slave)
> > +    SLAVE_QUEUE.put_nowait(slave)
> > +    return
> 
> No return value?

I've been using this since we're not error checking here, so any errors would be raised in the form of tracebacks, and ignored anyway.

This is a possible improvement if we need though.

> @@ +121,5 @@
> > +    retry(requests.post, args=(str(url),))
> > +    # Because SlaveAPI fully escalates reboots (all the way to IT bug filing),
> > +    # there's no reason for us to watch for it to complete.
> > +    log.info("%s - Reboot queued", slave)
> > +
> 
> No return value?

At least here I added an explicit return

> @@ +140,1 @@
> >              log.info("%s - Skipping reboot because no recent jobs found", slave)
> 
> Maybe it's not taking jobs because it needs a reboot? May be a
> chicken-and-egg problem here.
> 
> If the slave is enabled and *should* be taking jobs, I think we should
> reboot it anyway. We could make this smarter by not doing this on weekends,
> etc.

Per slaveapi, "recent jobs" is a bit of a misnomer; it actually returns the *most recent* job per buildapi, whatever that is.

This code is protecting against the case of "this slave is newly in slavealloc and has not had ANY jobs yet, might mean we didn't get it added to buildbot yet."  I'm inclined to leave it given this, but perhaps the log.info needs adjusting or better comments.

> @@ +150,5 @@
> > +        recent_graceful = get_recent_graceful(slaveapi, slave)
> > +        recent_graceful_ts = get_latest_timestamp_from_result(recent_graceful)
> > +        recent_reboot = get_recent_reboot(slaveapi, slave)
> > +        recent_reboot_ts = get_latest_timestamp_from_result(recent_reboot)
> > +        # Determine the timestamp we care about for doing work
> 
> As mentioned above, maybe this shouldn't be static, i.e. maybe we have
> different criteria based on time of day, day of week, etc. Possibly break
> this into its own function where we can encapsulate that logic?

Possible future improvement? (Or would you like to expand on that thinking now so I can try to incorporate it? I just don't see strong value in making that change now.)

> @@ +191,5 @@
> > +                         slave, get_formatted_time(last_job_dt))
> > +                return
> > +            # TODO: XXXCallek remove this FAILURE clause when we want to reboot anyway
> > +            if recent_graceful["state"] in (FAILURE,):
> > +                log.info("%s - Graceful shutdown failed, aborting reboot", slave)
> 
> Isn't this one of the cases we were trying to fix with a hard reboot, i.e.
> the graceful shutdown failed, so now we try the reboot anyway?

As per the TODO and the comment I posted this with (plus a talk with coop on IRC), this piece is for a part 2, since I want to make sure I don't break existing things with this rather large refactor. That way we can also evaluate the other change on its own merits.
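
For reference, part 2 would roughly turn that clause into a fall-through to the reboot path, something like this (a hypothetical sketch reusing the names from the quoted hunk; reboot_slave is an assumed helper name, not code from the patch):

    if recent_graceful["state"] in (FAILURE,):
        log.info("%s - Graceful shutdown failed, rebooting anyway", slave)
        reboot_slave(slaveapi, slave)
        return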

New Patch after I test this a bit
This is the interdiff....
Attachment #8471923 - Attachment is obsolete: true
Comment on attachment 8473189 [details] [diff] [review]
[tools] part 1 - interdiff v1->v2

Review of attachment 8473189 [details] [diff] [review]:
-----------------------------------------------------------------

::: buildfarm/maintenance/reboot-idle-slaves.py
@@ +153,4 @@
>  
>          recent_graceful = get_recent_graceful(slaveapi, slave)
>          recent_graceful_ts = get_latest_timestamp_from_result(recent_graceful)
> +        recent_graceful_dt = datetime.fromtimestamp(recent_graceful_ts)

Err, this var isn't used; it's removed from the real patch coming up next.
So this patch does everything the other patch did, only now properly tested.

Minor changes, and a few explanations versus the prior patch, are in my reply to your review.

You can see a log of most of this (successful) testing, with repeated runs (separated by a string of ====), on
buildbot-master74.srv.releng.usw2.mozilla.com
 -->> /builds/slaverebooter/testing_callek_log.log
Attachment #8473200 - Flags: review?(coop)
I hope to deploy this part _after_ I have deployed part 1, so we get a chance to see whether (and what) part 1 breaks.

But I'm open to deploying it all together at your discretion.
Attachment #8473202 - Flags: review?(coop)
(In reply to Justin Wood (:Callek) from comment #3)
> (In reply to Chris Cooper [:coop] from comment #2)
> > @@ +93,5 @@
> > > +    if not info["recent_jobs"]:
> > > +        log.info("%s - Skipping reboot because no recent jobs found", slave)
> > > +        return None
> > > +
> > > +    return info["recent_jobs"][0]["endtime"]
> > 
> > Is a recent job always guaranteed to have an endtime? You should check this
> > before returning it.
> 
> Unsure, *however* it was like this in the old code, so I'm inclined to keep
> it.
> http://hg.mozilla.org/build/tools/file/f9c24112003a/buildfarm/maintenance/reboot-idle-slaves.py#l61

I don't like hiding behind a bad precedent (even if it's not yours). 

Could we check for endtime and return 0 if it doesn't exist, or would that screw something else up?
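
e.g. (a sketch of the suggestion, reusing the names from the quoted hunk):

    return info["recent_jobs"][0].get("endtime", 0)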

> > @@ +111,5 @@
> > > +
> > > +    time.sleep(30)  # Sleep to give a graceful some leeway to complete
> > > +    log.debug("%s - issued graceful, re-adding to queue", slave)
> > > +    SLAVE_QUEUE.put_nowait(slave)
> > > +    return
> > 
> > No return value?
> 
> I've been using this since we're not error checking here, so any errors
> would be raised in the form of tracebacks, and ignored anyway.
> 
> This is a possible improvement if we need though.

I think you've hit the nail on the head here: we should be aware of errors/tracebacks and deal with them. If we continue to ignore them, slaves will go AWOL and we'll become aware of them by other means, e.g. philor.

It doesn't need to be fixed in the context of *this* bug, but please file follow-ups for improvements.
 
> > @@ +140,1 @@
> > >              log.info("%s - Skipping reboot because no recent jobs found", slave)
> > 
> > Maybe it's not taking jobs because it needs a reboot? May be a
> > chicken-and-egg problem here.
> > 
> > If the slave is enabled and *should* be taking jobs, I think we should
> > reboot it anyway. We could make this smarter by not doing this on weekends,
> > etc.
> 
> Per slaveapi, "recent jobs" is a bit of a misnomer; it actually returns the
> *most recent* job per buildapi, whatever that is.
> 
> This code is protecting against the case of "this slave is newly in
> slavealloc and has not had ANY jobs yet, might mean we didn't get it added
> to buildbot yet."  I'm inclined to leave it given this, but perhaps the
> log.info needs adjusting or better comments.

How does a reboot every 6 hours hurt us in this situation though? If we're not blocking other actions any longer, at least the slave will make it back into production sooner than if left to chance.
 
> > @@ +150,5 @@
> > > +        recent_graceful = get_recent_graceful(slaveapi, slave)
> > > +        recent_graceful_ts = get_latest_timestamp_from_result(recent_graceful)
> > > +        recent_reboot = get_recent_reboot(slaveapi, slave)
> > > +        recent_reboot_ts = get_latest_timestamp_from_result(recent_reboot)
> > > +        # Determine the timestamp we care about for doing work
> > 
> > As mentioned above, maybe this shouldn't be static, i.e. maybe we have
> > different criteria based on time of day, day of week, etc. Possibly break
> > this into its own function where we can encapsulate that logic?
> 
> Possible future improvement? (Or would you like to expand on that thinking
> now so I can try to incorporate it? I just don't see strong value in
> making that change now.)

I'm fine with it being a future enhancement, but let's get that bug on file, please.
(In reply to Chris Cooper [:coop] from comment #8)
> (In reply to Justin Wood (:Callek) from comment #3)
> > (In reply to Chris Cooper [:coop] from comment #2)
> > > @@ +140,1 @@
> > > >              log.info("%s - Skipping reboot because no recent jobs found", slave)
> > > 
> > > Maybe it's not taking jobs because it needs a reboot? May be a
> > > chicken-and-egg problem here.
> > > 
> > > If the slave is enabled and *should* be taking jobs, I think we should
> > > reboot it anyway. We could make this smarter by not doing this on weekends,
> > > etc.
> > 
> > Per slaveapi, "recent jobs" is a bit of a misnomer; it actually returns the
> > *most recent* job per buildapi, whatever that is.
> > 
> > This code is protecting against the case of "this slave is newly in
> > slavealloc and has not had ANY jobs yet, might mean we didn't get it added
> > to buildbot yet."  I'm inclined to leave it given this, but perhaps the
> > log.info needs adjusting or better comments.
> 
> How does a reboot every 6 hours hurt us in this situation though? If we're
> not blocking other actions any longer, at least the slave will make it back
> into production sooner than if left to chance.

To address this point, part of my coming enhancements here is to run this script every hour, or even more often. The logic will hold up at that frequency, assuming the total runtime is not much longer.

(A future enhancement might be to file a bug if we reboot 3 or more times and no job comes out of it.)
Attachment #8473200 - Flags: review?(coop) → review+
Attachment #8473202 - Flags: review?(coop) → review+
Depends on: 1054350
Part 1 - http://hg.mozilla.org/build/tools/rev/5971b4dc1994

I'll file the follow-ups on Monday.
Depends on: 1055392
Blocks: 1055401
This is now fully deployed. I'm leaving this open for a day or two in case problems crop up, but I think it's done [except for the promised bugs].
No fallout reported --> resolving.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: Tools → General