Bug 1490758 - Define and document relops staging environment for worker changes
Opened 6 years ago · Closed 3 years ago
Categories: Infrastructure & Operations :: RelOps: General (task)
Tracking: Not tracked
Status: RESOLVED INVALID
People: Reporter: coop; Assignee: Unassigned
Relops provides AMIs and hardware nodes for testing across our four target platforms: Android, Linux, Mac, and Windows. Those platforms are constantly evolving, requiring a well-defined, _safe_ staging environment.
Safety is important here, because during the generic-worker upgrade on Windows (bug 1443589), we ended up testing against live production tasks and burned more than a few. This could have been avoided with a proper staging environment.
The Taskcluster team is now at the point where it is "easy" to deploy a new cluster, so if the relops team could benefit from a standalone cluster, we can do that. This might be useful for staging accessory services like OCC and puppet.
However, we may not want a distinct staging environment. There are benefits to the current model where we run staging workers in the production environment, chief among them that you are only changing one variable: the worker itself. The important caveat here is that we need a reliable way to ensure that staging workers do not take production jobs.
This could take a few paths:
* standardize around Pete's transforms patch to rename beta workers in Try (see the sketch after this list): https://hg.mozilla.org/try/rev/f2bf1f4f7b0f6c046b7ae42398b940cafef5333d
* a tool to automatically resubmit a portion of a taskgraph (e.g. only the Windows 10 tests) to a pool of staging workers. This would allow direct comparison of the existing run on production workers with those run on beta workers
* a tool/transform to automatically fork the testing of a new taskgraph to go to both production and staging workers
...or some combination of those, or something else entirely.
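For concreteness, here is a minimal sketch of the transforms approach, assuming the in-tree taskgraph transform API (`TransformSequence`, `config.params`, and a `worker-type` field on each task). The worker-type mapping, module path, and function name are illustrative assumptions, not the contents of the linked patch:

```python
# Hypothetical sketch only: retarget selected tasks onto "-beta" worker pools
# on try pushes. The mapping below is an assumption for illustration, not the
# contents of the linked patch.
from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()

# Production worker type -> beta/staging worker type (example entries only).
BETA_WORKER_TYPES = {
    "gecko-t-win10-64": "gecko-t-win10-64-beta",
    "gecko-t-linux-talos": "gecko-t-linux-talos-beta",
}


@transforms.add
def use_beta_workers(config, tasks):
    """Rewrite worker types so the affected tasks land on beta pools.

    Because the beta worker types exist only in the staging pools, this is
    also what keeps these tasks from being claimed by production workers.
    """
    for task in tasks:
        if config.params["project"] == "try":
            worker_type = task.get("worker-type", "")
            task["worker-type"] = BETA_WORKER_TYPES.get(worker_type, worker_type)
        yield task
```

The same idea extends to hardware pools: as long as a staging pool uses a worker type name that no production task refers to, production jobs cannot land on staging workers and vice versa.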
Whichever route we choose, we should figure out the requirements and file some dependent bugs. The Taskcluster team will help with whichever parts fall under our domain.
Comment 1•6 years ago
FWIW, I'm both a fan and a consumer of the transforms patch and testing on beta workers, mostly because it's quite simple to use. It works quite well *most* of the time. It lacks the robustness to give us a feel for how a patch will perform at scale, which would be useful for testing certain types of patches, but for most of our day-to-day workload it works well. The technique can be (and has been) used even on hardware workers; we just have yet to develop a rhythm for regularly using it there. We would need to create some worker types and develop some working habits to exploit it better, but I think it's entirely feasible.
I'm intrigued by the possibility of modifying the taskgraph generator to automatically send load to beta/staging workers, and I can see that being useful if it's simple to configure, throttle, and switch on/off.
I don't fully understand (I haven't gotten my head around) how a staging taskcluster environment would work or what it would enable in terms of a deployment-testing workflow, but I'm happy to explore that further too.
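As a rough sketch of the "configure, throttle, and switch on/off" idea above, the following hedged example duplicates a deterministic fraction of tasks onto staging worker types. The parameter names (`staging_fork_enabled`, `staging_fork_rate`), the `-beta` suffix convention, and the `staging` attribute are assumptions for illustration, not an existing feature:

```python
# Hypothetical sketch: fork a throttled fraction of tasks onto staging
# workers alongside their normal production runs. All parameter and field
# names here are assumptions for illustration.
import copy
import hashlib

from taskgraph.transforms.base import TransformSequence

transforms = TransformSequence()


@transforms.add
def fork_to_staging(config, tasks):
    enabled = config.params.get("staging_fork_enabled", False)  # on/off switch
    rate = config.params.get("staging_fork_rate", 0.1)          # throttle (fraction of tasks)

    for task in tasks:
        # The production copy always runs unchanged.
        yield task

        if not enabled:
            continue

        # Deterministic sampling keyed on the task label, so the same subset
        # of tasks forks to staging on every push at a given rate.
        digest = hashlib.sha256(task["label"].encode("utf-8")).hexdigest()
        if int(digest, 16) % 1000 >= rate * 1000:
            continue

        staged = copy.deepcopy(task)
        staged["label"] = task["label"] + "-staging"
        staged["worker-type"] = task["worker-type"] + "-beta"
        # Mark the forked copy so reporting (e.g. Treeherder) can keep
        # staging bustage out of the default sheriffing views.
        staged.setdefault("attributes", {})["staging"] = True
        yield staged
```

Deterministic sampling (rather than a random draw) keeps the forked subset stable across pushes, which makes production-vs-staging comparisons of the same tasks easier.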
Comment 2•6 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #0)
>
> The Taskcluster team is at the point now where it is "easy" to deploy a new
> cluster, so if the relops team could benefit from a standalone cluster, we
> can do that.
How "easy" are we talking to do a deploy, and what does the process look like to keep it up to date? Is it something that we could spin up as needed, and/or have multiple staging clusters within an AWS account?
> This might be useful for staging accessory services like OCC and puppet.
Possibly; I think those two might be better suited to testing on staging workers in the production environment, but there may well be cases where a staging cluster would be better.
The place where I'd really like to have a staging cluster is for testing image generation and eventual deployment, particularly for hardware. On the other hand, our biggest concern is keeping the staging cluster up to date and managing it: if it bit-rots, it could make things worse, and we're not very familiar with various other parts of TC, so managing them might be painful.
> The important caveat here is that we need a reliable way to
> ensure that staging workers do not take production jobs.
Agreed; we have one or two bugs open around this, and Dave is going to see where we currently stand with them.
> * a tool to automatically resubmit a portion of a taskgraph (e.g. only the
> Windows 10 tests) to a pool of staging workers. This would allow direct
> comparison of the existing run on production workers with those run on beta
> workers
>
> * a tool/transform to automatically fork the testing of a new taskgraph to
> go to both production and staging workers
I think both of these would be very useful, both for us and for QA! How do we handle reporting to Treeherder, though? I can see how it would be useful (comparing changes), but it's also potentially a problem (staging bustage should be ignored by most people).
Backlog grooming: this likely lies with SRE at this point.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INVALID