Open
Bug 1501520
Opened 6 years ago
Updated 6 years ago
[tracking] make robustcheckout more reliable
Categories
(Developer Services :: Mercurial: robustcheckout, enhancement)
Tracking
(Not tracked)
NEW
People
(Reporter: jlund, Unassigned)
References
(Depends on 4 open bugs)
Details
Currently, hg robustcheckout failures account for some of the most frequent manual reruns in release automation. This puts a burden on releaseduty and delays releases.
This bug tracks examples of the various ways it fails.
Reporter
Updated•6 years ago
Reporter
Comment 1•6 years ago
@gps - any guidance or help on these would be greatly appreciated.
Flags: needinfo?(gps)
Comment 2•6 years ago
The "robust" in robustcheckout is supposed to mean something. The extension is a glorified wrapper around Mercurial internals that is supposed to retry (with intelligent backoffs) when intermittent network errors occur. To a large extent, we're successful in doing this.
But there are a handful of failures that still manage to creep in. We're essentially engaged in a game of whack-a-mole with failures.
It also doesn't help that robustcheckout.py is vendored into a few different repos. Sometimes we forget to update it everywhere. So e.g. TaskCluster Windows workers may not get all fixes as quickly as mozilla-central. And we may not uplift robustcheckout.py changes to e.g. mozilla-release.
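One way to catch that drift would be a small consistency check over the vendored copies; the paths below are hypothetical examples, not a definitive list of where robustcheckout.py lives:

import hashlib
import sys

# Hypothetical locations of vendored copies; actual paths vary by repo.
VENDORED_COPIES = [
    "mozilla-central/testing/mozharness/external_tools/robustcheckout.py",
    "version-control-tools/hgext/robustcheckout/__init__.py",
]

def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

digests = {path: sha256(path) for path in VENDORED_COPIES}
if len(set(digests.values())) > 1:
    for path, digest in sorted(digests.items()):
        print(digest[:12], path)
    sys.exit("vendored robustcheckout.py copies have diverged")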
I think we should treat all VCS checkout failures like we do any other failure in CI: prioritize fixing problems by the impact of their failures (read: count and disruption to release processes) and chase the long tail as long as we can justify it.
Do you have particular failures that are causing significant pain? Bug 1371378 had 1 failure last week and bug 1318173 had 3. These seem pretty low frequency...
Flags: needinfo?(gps)
Keywords: in-triage
Summary: [tracking] make robustcheckout more reliable for release tasks → [tracking] make robustcheckout more reliable
Reporter
Comment 3•6 years ago
(In reply to Gregory Szorc [:gps] from comment #2)
> Do you have particular failures that are causing significant pain? Bug
> 1371378 had 1 failure last week and bug 1318173 had 3. These seem pretty low
> frequency...
Seems like these examples haven't been hit in the last two betas. Perhaps there were improvements, or perhaps we were just lucky. Fine to ignore until it happens again.
Reporter
Comment 4•6 years ago
(In reply to Gregory Szorc [:gps] from comment #2)
> The "robust" in robustcheckout is supposed to mean something. The extension
> is a glorified wrapper around Mercurial internals that is supposed to retry
> (with intelligent backoffs) when intermittent network errors occur. To a
> large extent, we're successful in doing this.
To be clear, robustcheckout seems to be very successful at this, and it's awesome.
However, when we do have release automation failures, even if they are rare, they put extra operational pressure on Releng and delay the release from getting into QA's hands. Perhaps we could invest some time getting CIDuty to help rerun the intermittents, but that only sweeps the failures under the rug.
> I think we should treat all VCS checkout failures like we do any other failure in CI: prioritize fixing problems by the impact of their failures (read: count and disruption to release processes) and chase the long tail as long as we can justify it.
I've added some more failures we hit this past week that seem to be new: bug 1504346 and bug 1504345.
Release automation tasks are not often starred. Perhaps they should be, but that's out of scope here. We do have some historical data in our weekly postmortems, but no frequency metrics (yet): https://github.com/mozilla-releng/releasewarrior-data/tree/master/postmortems
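Until real metrics exist, a rough frequency count could be scraped from those postmortems. This sketch assumes a local clone of releasewarrior-data and that postmortems are Markdown files mentioning bugs as "bug NNNNNNN"; both are assumptions about the data layout, not a documented schema:

import collections
import pathlib
import re

postmortems = pathlib.Path("releasewarrior-data/postmortems")
bug_re = re.compile(r"[Bb]ug (\d{6,7})")

# Tally how often each bug number is mentioned across all postmortems.
counts = collections.Counter()
for path in postmortems.rglob("*.md"):
    counts.update(bug_re.findall(path.read_text(encoding="utf-8")))

for bug, n in counts.most_common(10):
    print("bug %s: %d mention(s)" % (bug, n))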