landoscript should handle terminations
Categories
(Release Engineering :: Release Automation, enhancement, P1)
Tracking
(Not tracked)
People
(Reporter: bhearsum, Assigned: bhearsum)
Details
Attachments
(1 file)
Although unlikely, it is possible for a landoscript task to time out (most likely due to terminationGracePeriodSeconds). When this happens, the task will fail, but the lando job that it was polling may still succeed or fail on its own. As things stand now, if the landoscript task is rerun it will be unaware of the previous attempt and will try to redo the work from scratch. Starting over may be acceptable in some cases, but it is certainly not what we want in all of them. For example, if the underlying lando job from the first landoscript task succeeded, a rerun landoscript task ought to do nothing at all.
Some scenarios where this could happen and what we can do about them, largely courtesy of jcristau:
- Lando job in previous run can't be found, or was never submitted -> we can retry the job from scratch
- Lando job in previous run succeeded -> do nothing, return success
- Lando job in previous run failed -> we can retry the job from scratch
- Lando job is still in flight -> poll the existing job and wait for a result
I think this covers all of the possible cases, and is fairly straightforward to implement if we publish an artifact with the lando status URL (to avoid the need to scrape it out of the previous run's logs).
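To make the four cases concrete, here is a minimal sketch of the decision a rerun could make. It is not the actual landoscript code: `JobState`, `Action`, `decide_rerun_action`, and the `fetch_state` callable are hypothetical names, and it assumes the previous run's artifact yields either a status URL or nothing.

```python
from enum import Enum
from typing import Callable, Optional


class JobState(Enum):
    """Hypothetical states of the lando job from the previous run."""
    NOT_FOUND = "not_found"
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    IN_FLIGHT = "in_flight"


class Action(Enum):
    """What the rerun landoscript task should do."""
    SUBMIT_NEW_JOB = "submit_new_job"
    RETURN_SUCCESS = "return_success"
    POLL_EXISTING_JOB = "poll_existing_job"


def decide_rerun_action(
    previous_status_url: Optional[str],
    fetch_state: Callable[[str], JobState],
) -> Action:
    """Map the previous run's lando job state to an action.

    previous_status_url is assumed to come from an artifact published by
    the previous run; None means no artifact exists, so no job was
    ever submitted (or we cannot tell).
    fetch_state is a hypothetical helper that looks up the job's state.
    """
    if previous_status_url is None:
        # Never submitted: retry from scratch.
        return Action.SUBMIT_NEW_JOB

    state = fetch_state(previous_status_url)
    if state is JobState.SUCCEEDED:
        # The work already landed: do nothing, report success.
        return Action.RETURN_SUCCESS
    if state is JobState.IN_FLIGHT:
        # Keep waiting on the existing job instead of submitting a new one.
        return Action.POLL_EXISTING_JOB
    # NOT_FOUND or FAILED: retry from scratch.
    return Action.SUBMIT_NEW_JOB
```

Passing the lookup in as a callable keeps the decision itself free of network access, so each of the four cases can be exercised without a lando instance.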
Comment 1•5 months ago
https://github.com/mozilla-releng/scriptworker/pull/696 is not directly related to this, but it will make re-run handling easier.
Comment 2•5 months ago
Comment 3•5 months ago
This was deployed today.