Closed Bug 1498256 Opened 7 years ago Closed 6 years ago

grant ciduty permission to restart the taskcluster queue

Categories

(Taskcluster :: Services, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Unassigned)

References

Details

Recently we had an event where pushes had only their gecko decision tasks ("D") run, and nothing was scheduled after those succeeded. Brian restarted the queue (?) and jobs got scheduled again. This is a rare event nowadays, but when it happens, a) it's fatal and requires tree closure until a taskcluster person restarts the queue, and b) it creates a backlog after the restart, because all pushes without jobs get them at that point, which creates huge demand for workers. If CIduty has access to that and got training, this allows to timely resolve the issue, especially at times when no TC developer with the required knowledge is available.
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #0)
> If CIduty has access to that and got training, this allows to timely resolve
> the issue, especially at times when no TC developer with the required
> knowledge is available.

So I'm all for this. Particularly if it means one less ping for #taskcluster. However, I'd like to stress the above "and got training" part. We should think about footguns, security, documentation, and escalation strategy before handing over keys.

Brian, what do you think? Obviously this would be helpful outside of Taskcluster's working hours and possibly lead to less interrupts during work. However, if this is niche or will take considerable time to hand off, I'm fine not granting permission.
Flags: needinfo?(bstack)
There's very little damage that can be done with just a restart, so I support this in general. I don't think that the heroku permissions are fine-grained enough to support this, unfortunately :/ (I'd love to be shown wrong, however!) "Operate" is the closest permission I see to this, but it allows viewing/changing config, which contains a lot of very important secrets. I think we could give this to ciduty, but we would need to think a bit harder first. It might also make sense to wait on a newly deployed cluster in kubernetes and see if we can do permissions correctly there. Also mozilla heroku org is changing auth this week I think so maybe that will allow for more flexibility?
Flags: needinfo?(bstack)
We seem to keep crossing the question of "should ciduty be able to .." with reasonable arguments for "yes" and "no" in every case. Maybe we should reach out to someone who can make a broader policy decision on this?
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #3)
> We seem to keep crossing the question of "should ciduty be able to .." with
> reasonable arguments for "yes" and "no" in every case. Maybe we should
> reach out to someone who can make a broader policy decision on this?

Fair and good point. My comment was more of "I don't know what it would take to give them permission to do this, but if it's just to be able to restart the queue, yes please, as this falls in line with our expectation".

CIDuty should have the tools to stop/start/restart/reimage workers and services within firefox-ci, ideally through another service or API, as well as adjust task configuration through commit-level access in-tree. That's the general scope and broader policy you may be wanting. Outside of that, we should limit exposing secrets or granting access to machines that could compromise a release. Of course this is not a trust thing but is about restricting access to important keys to those who absolutely need it.

Perhaps it would make sense to have secops involved with each of these requests, and have the owner of a given team work with them to do a risk assessment.
(In reply to Brian Stack [:bstack] from comment #2)
> It might also make sense to wait on a newly deployed cluster in kubernetes
> and see if we can do permissions correctly there. Also mozilla heroku org is
> changing auth this week I think so maybe that will allow for more
> flexibility?

Sounds good. Let's wait. Is there a bug to block this on? Thanks for the quick response.
Ah, good call. Added!
Component: Queue → Services

This has been handled in bug 1542168.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED