I had originally thought to use Push Health for this, but I find that this is different enough that using Push Health would require a lot of changes.
The use case here is verifying parity after large changes are made to infrastructure. This could be different machines, cloud providers, upgrades to the OS, etc.
I view the workflow like this:
- push to try, based on a recent push, with all jobs for the affected platform and --rebuild 5
- apply the change to use the new version, then push to try again as in the first step
- when all jobs are done, compare push 1 vs push 2:
  - compare average runtime of all passing jobs, report anything >3% difference
  - compare total failures, failures by suite, failures by job
  - compare failure types (infra, timeout, crashes, test fail, etc.)
In a perfect world everything is the same; some small variance (<5%) is usually ok.
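To make the comparison step concrete, here is a rough sketch in Python of what comparing the two pushes could look like. Everything in it is hypothetical: the `JobResult` record, the function names, and the assumption that job results have already been fetched into memory (from Treeherder or wherever) are mine, not an existing API. It only illustrates the runtime and failure comparisons listed above.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class JobResult:
    """Hypothetical record for one job run in a push (not a real API)."""
    job: str                            # job name; repeated once per rebuild
    suite: str                          # suite the job belongs to
    runtime: float                      # wall-clock runtime in seconds
    failure_type: Optional[str] = None  # None if passed, else "infra", "timeout", ...

    @property
    def passed(self) -> bool:
        return self.failure_type is None


def mean_runtimes(results):
    """Average runtime per job name, counting passing runs only."""
    by_job = defaultdict(list)
    for r in results:
        if r.passed:
            by_job[r.job].append(r.runtime)
    return {job: mean(times) for job, times in by_job.items()}


def runtime_differences(base, new, threshold=0.03):
    """Relative runtime change per job where it exceeds the threshold (3%)."""
    base_avg, new_avg = mean_runtimes(base), mean_runtimes(new)
    flagged = {}
    # Only jobs that passed at least once in both pushes can be compared.
    for job in base_avg.keys() & new_avg.keys():
        if base_avg[job] == 0:
            continue  # avoid dividing by zero on degenerate data
        change = (new_avg[job] - base_avg[job]) / base_avg[job]
        if abs(change) > threshold:
            flagged[job] = change
    return flagged


def failure_differences(base, new, key):
    """(base, new) failure counts grouped by `key` where the counts differ.

    `key` is one of "job", "suite", or "failure_type".
    """
    b = Counter(getattr(r, key) for r in base if not r.passed)
    n = Counter(getattr(r, key) for r in new if not r.passed)
    return {k: (b[k], n[k]) for k in sorted(b.keys() | n.keys()) if b[k] != n[k]}
```

With two lists of results, `runtime_differences(push1, push2)` would give the jobs to report, and `failure_differences(push1, push2, "suite")` (or `"job"`, `"failure_type"`) would give the failure breakdowns. Total failures is just the count of non-passing results in each list.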
This could either be part of push health, perfherder compare, or something new that is modeled after one of those.
If this existed, release operations would be able to make changes without CI-A helping out all the time. Likewise, taskcluster could upgrade workers, and wpt-sync could benefit as well.