Closed Bug 1967787 Opened 6 months ago Closed 5 months ago

landoscript should handle terminations

Categories

(Release Engineering :: Release Automation, enhancement, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Details

Attachments

(1 file)

Although unlikely, it is possible for a landoscript task to time out (most likely due to terminationGracePeriodSeconds). When this happens, the task fails, but the lando job it was polling may still go on to succeed or fail. As things stand now, a rerun of the landoscript task is unaware of the previous attempt and tries to redo the work from scratch. This may be desirable, or at least workable, in some cases, but it is certainly not what we want in all of them. For example, if the underlying lando job from the first landoscript task succeeded, a rerun landoscript task ought to do nothing at all.

Some scenarios where this could happen and what we can do about them, largely courtesy of jcristau:

  • Lando job in previous run can't be found, or was never submitted -> we can retry the job from scratch
  • Lando job in previous run succeeded -> do nothing, return success
  • Lando job in previous run failed -> we can retry the job from scratch
  • Lando job in previous run is still in flight -> poll the existing job and wait for a result

I think this covers all of the possible cases, and is fairly straightforward to implement if we publish an artifact with the lando status URL (to avoid the need to scrape it out of the previous run's logs).
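The four cases above could be sketched roughly like this. This is only an illustration of the decision table, not landoscript's actual code: `JobState`, `Action`, `decide_rerun_action` and the `fetch_state` callable are all hypothetical names, and the state values are not Lando's real status strings. It assumes the previous run's status URL (if any) has already been read from the published artifact.

```python
from enum import Enum
from typing import Callable, Optional


class JobState(Enum):
    # Hypothetical states; Lando's real status values may differ.
    NOT_FOUND = "not_found"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    IN_FLIGHT = "in_flight"


class Action(Enum):
    SUBMIT_NEW_JOB = "submit_new_job"
    RETURN_SUCCESS = "return_success"
    POLL_EXISTING_JOB = "poll_existing_job"


def decide_rerun_action(
    status_url: Optional[str],
    fetch_state: Callable[[str], JobState],
) -> Action:
    """Pick what a rerun should do, given the previous run's status URL.

    status_url is None when the previous run published no artifact,
    i.e. the lando job was never submitted.
    fetch_state is a caller-supplied (hypothetical) function that
    queries the status URL and maps the response to a JobState.
    """
    if status_url is None:
        # Never submitted: start from scratch.
        return Action.SUBMIT_NEW_JOB

    state = fetch_state(status_url)
    if state is JobState.SUCCEEDED:
        # The work already landed: do nothing, report success.
        return Action.RETURN_SUCCESS
    if state is JobState.IN_FLIGHT:
        # Pick up where the previous run left off.
        return Action.POLL_EXISTING_JOB
    # NOT_FOUND or FAILED: retry the job from scratch.
    return Action.SUBMIT_NEW_JOB
```

Passing `fetch_state` in as a parameter keeps the decision itself free of network calls, so each of the four scenarios can be exercised with a stub.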

https://github.com/mozilla-releng/scriptworker/pull/696 is not directly related to this, but it will make re-run handling easier.

This was deployed today.

Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED