Bug 1505000: beetmover android-components release tasks mistakenly run on staging instances
Opened 6 years ago • Closed 6 years ago
Categories: Release Engineering :: Release Automation (enhancement)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: mtabara; Assigned: mtabara
Attachments (1 file)
Attachment 9022971: [puppet] Fix workerType in mobile-beetmover-dev instances. (61 bytes, patch) - mtabara: review+, mtabara: checked-in+
Description
Most likely the staging instance added in bug 1503548 was misconfigured and queried for production tasks.
We need to audit:
* the TC clients
* the scopes between production and staging
* the workerType/workerId
Investigate how a staging instance ended up claiming jobs in production.
On the bright side, thank you CoT for failing fast and showing something is wrong.
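A rough sketch of what that audit could look like with the Taskcluster Python client (assuming `pip install taskcluster` and configured credentials; the client IDs are the dev/production beetmover clients named later in this bug):

# Audit sketch: compare the scopes of the dev and production beetmover
# clients and flag anything they share, since shared scopes are the
# likeliest copy-paste mistakes.
import taskcluster

auth = taskcluster.Auth()

DEV = 'project/mobile/android-components/releng/scriptworker/beetmover/dev'
PROD = 'project/mobile/android-components/releng/scriptworker/beetmover/production'

dev_scopes = set(auth.client(DEV)['scopes'])
prod_scopes = set(auth.client(PROD)['scopes'])

# Scopes held by both clients deserve a closer look:
for scope in sorted(dev_scopes & prod_scopes):
    print('shared scope:', scope)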
Assignee
Comment 1 • 6 years ago
For now I've shut down the staging worker from AWS - https://tools.taskcluster.net/provisioners/scriptworker-prov-v1/worker-types/mobile-beetmover-v1/workers/mobile-beetmover-v1/mobil-beetmover-dev1 - and rerun all the jobs so that they get picked up by the other two production instances.
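For reference, a minimal sketch of the rerun step via the queue API (Taskcluster Python client; the task IDs are hypothetical placeholders and credentials are assumed to be configured):

# Rerun sketch: ask the queue to rerun each affected (already-resolved)
# task so that a production worker can claim it.
import taskcluster

queue = taskcluster.Queue()

for task_id in ['<affected-taskId-1>', '<affected-taskId-2>']:  # placeholders
    queue.rerunTask(task_id)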
Assignee
Comment 2 • 6 years ago
Note to self:
* post-mortem this to see how it happened
* be cautious to add more checks for upcoming products and to clearly separate scopes/credentials
* Sebastian mentioned we should find a way to either publish all, or none. This might be tricky as we can't control the CDN push once it has happened. Maybe add a post-beetmover task that triggers the CDN to purge the contents in case something failed?
Assignee: nobody → mtabara
Assignee
Comment 3 • 6 years ago
Most likely this[1] is the culprit. Worth investigating the scopes and hiera secrets for the TC clients too, though.
As to this:
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #2)
> * Sebastian mentioned we should find a way to either publish all, or none.
> This might be tricky as we can't control the CDN push once it has happened.
> Maybe add a post-beetmover task that triggers the CDN to purge the contents
> in case something failed?
we could eventually migrate towards something like Firefox, where we push things to beetmover first and then, if things are fine, move them to releases.
[1]: https://github.com/mozilla-releng/build-puppet/blob/master/modules/beetmover_scriptworker/manifests/settings.pp#L158
Assignee
Comment 4 • 6 years ago
There were two culprits that enabled bug 1505000 to happen:
1. https://tools.taskcluster.net/provisioners/scriptworker-prov-v1/worker-types/mobile-beetmover-v1 lists the staging workerId, which it shouldn't. This PR fixes that and allocates a separate workerType.
2. https://tools.taskcluster.net/auth/clients/project%2Fmobile%2Fandroid-components%2Freleng%2Fscriptworker%2Fbeetmover%2Fdev had access to `queue:claim-work:scriptworker-prov-v1/mobile-beetmover-v*`, which enabled it to claim tasks from Taskcluster just like the production client. I fixed this on the TC side to take the new workerType from this PR into account.
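To illustrate point 2: a Taskcluster scope ending in `*` satisfies any required scope that starts with the prefix before the `*`, which is exactly how a dev client holding `mobile-beetmover-v*` could claim work on the production workerType. A minimal sketch of that matching rule (plain Python, illustrative names):

# Scope-satisfaction sketch: a scope ending in '*' covers every required
# scope that begins with the prefix before the '*'.
def satisfies(held: str, required: str) -> bool:
    if held.endswith('*'):
        return required.startswith(held[:-1])
    return held == required

held = 'queue:claim-work:scriptworker-prov-v1/mobile-beetmover-v*'
required = 'queue:claim-work:scriptworker-prov-v1/mobile-beetmover-v1'

# True: the dev client's wildcard also covered the production workerType.
assert satisfies(held, required)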
Attachment #9022971 - Flags: review?(jlorenzo)
Assignee
Comment 5 • 6 years ago
* CoT caught this, yay
* we need to be very careful when we add new TC scopes not to copy-paste them, as comment 4 shows how two factors contributed to the issue, not just one
* prevention-wise, amending scopes isn't covered by a sign-off process, so for now there's pretty much no action we can take. As to workerType, CoT catches this and enables us to recover fast, so hopefully this is good enough for now.
Leftovers:
* get the puppet patch reviewed and landed
* restart the staging instance from the AWS console
* make sure it shows up under its proper workerType in the TC provisioners[1] so that it's completely separated from its production counterpart (see the sketch after the footnote)
[1]: https://tools.taskcluster.net/provisioners/scriptworker-prov-v1/worker-types/
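One way to double-check that last leftover, assuming the queue's listWorkers endpoint (the data behind the provisioners UI) and configured credentials:

# Verification sketch: list the workers the queue knows about under the
# new dev workerType; the staging instance should show up here and no
# longer under mobile-beetmover-v1.
import taskcluster

queue = taskcluster.Queue()

for worker in queue.listWorkers('scriptworker-prov-v1',
                                'mobile-beetmover-dev')['workers']:
    print(worker['workerGroup'], worker['workerId'])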
Assignee
Comment 6 • 6 years ago
Comment on attachment 9022971 [details] [diff] [review]
[puppet] Fix workerType in mobile-beetmover-dev instances.
r+'ed by jlorenzo in PR.
Attachment #9022971 - Flags: review?(jlorenzo) → review+
Comment 7 • 6 years ago
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #0)
> On the bright side, thank you CoT for failing fast and showing something is
> wrong.
Phew!
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #2)
> * be cautious to add more checks for upcoming products and to clearly
> separate scopes/credentials
Yes please!
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #5)
> * we need to be very careful when we add new TC scopes not to copy-paste
> them, as comment 4 shows how two factors contributed to the issue, not
> just one
+1
> * prevention-wise, amending scopes isn't covered by a sign-off process, so
> for now there's pretty much no action we can take. As to workerType, CoT
> catches this and enables us to recover fast, so hopefully this is good
> enough for now.
We're moving towards using ci-configuration and ci-admin for everything, so there will be a review.
Assignee
Comment 8 • 6 years ago
(In reply to Aki Sasaki [:aki] from comment #7)
> > * prevention-wise, amending scopes isn't covered by a sign-off process, so
> > for now there's pretty much no action we can take. As to workerType, CoT
> > catches this and enables us to recover fast, so hopefully this is good
> > enough for now.
>
> We're moving towards using ci-configuration and ci-admin for everything, so
> there will be a review.
Good to know! Thank you!
Assignee
Comment 9 • 6 years ago
Comment on attachment 9022971 [details] [diff] [review]
[puppet] Fix workerType in mobile-beetmover-dev instances.
https://github.com/mozilla-releng/build-puppet/commit/92a043f5d2151c402b4f69b33ad14ad73aa9bbe2
Attachment #9022971 - Flags: checked-in+
Assignee
Comment 10 • 6 years ago
Conclusion(s):
* I switched the TC client[1] to use:
queue:claim-work:scriptworker-prov-v1/mobile-beetmover-dev (provisioner / workerType)
queue:worker-id:mobile-beetmover-v1/mobil-beetmover-dev* (workerGroup / workerId)
and we now have it showing up in the provisioners[2]
* more information about the worker hierarchy can be found here[3] and here[4]
* other ideas suggested by Tom and Aki included using the AWS region for the workerGroup and, to be safe, versioning it to ease updating the pools later on
This should work as expected now; see the sketch after the footnotes.
[1]: https://tools.taskcluster.net/auth/clients/project%2Fmobile%2Fandroid-components%2Freleng%2Fscriptworker%2Fbeetmover%2Fdev
[2]: https://tools.taskcluster.net/provisioners/scriptworker-prov-v1/worker-types/mobile-beetmover-dev
[3]: https://docs.taskcluster.net/docs/reference/platform/taskcluster-queue/docs/worker-hierarchy
[4]: https://docs.taskcluster.net/docs/manual/task-execution/queues
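For reference, the claimWork call the dev worker makes under these scopes would look roughly like this (a sketch; the worker's credentials and polling loop are elided):

# claimWork sketch matching the scopes above: provisionerId/workerType
# select the queue to poll; workerGroup/workerId identify this worker.
import taskcluster

queue = taskcluster.Queue()

claimed = queue.claimWork(
    'scriptworker-prov-v1',   # provisionerId
    'mobile-beetmover-dev',   # the new, dev-only workerType
    {
        'workerGroup': 'mobile-beetmover-v1',
        'workerId': 'mobil-beetmover-dev1',  # workerId as spelled in comment 1
        'tasks': 1,
    },
)
print(len(claimed['tasks']), 'task(s) claimed')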
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Assignee
Comment 11 • 6 years ago
On the same note, there's more cleanup we can do; thanks to Tom for catching this.
Specifically:
i) trim these[1] scopes from the TC client[2] that queries for dev beetmover jobs
ii) trim these[3] scopes from the TC client[4] that queries for production beetmover jobs
iii) the scopes above aren't needed on the clients, as we need them instead in the associated role for the github project (e.g. release[5] or pull-request[6]); a sketch follows the footnotes
[1]:
project:mobile:android-components:releng:beetmover:action:push-to-maven
project:mobile:android-components:releng:beetmover:bucket:maven-staging
[2]: https://tools.taskcluster.net/auth/clients/project%2Fmobile%2Fandroid-components%2Freleng%2Fscriptworker%2Fbeetmover%2Fdev
[3]:
project:mobile:android-components:releng:beetmover:action:push-to-maven
project:mobile:android-components:releng:beetmover:bucket:maven-production
[4]: https://tools.taskcluster.net/auth/clients/project%2Fmobile%2Fandroid-components%2Freleng%2Fscriptworker%2Fbeetmover%2Fproduction
[5]: https://tools.taskcluster.net/auth/roles/repo%3Agithub.com%2Fmozilla-mobile%2Fandroid-components%3Arelease
[6]: https://tools.taskcluster.net/auth/roles/repo%3Agithub.com%2Fmozilla-mobile%2Fandroid-components%3Apull-request
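A hedged sketch of that trim via the auth API (updateClient replaces the client's scope list wholesale; admin credentials assumed):

# Trim sketch: drop the beetmover action/bucket scopes from both clients;
# they belong on the repo roles ([5]/[6]) instead.
import taskcluster

auth = taskcluster.Auth()

TRIM = {
    'project:mobile:android-components:releng:beetmover:action:push-to-maven',
    'project:mobile:android-components:releng:beetmover:bucket:maven-staging',
    'project:mobile:android-components:releng:beetmover:bucket:maven-production',
}

BASE = 'project/mobile/android-components/releng/scriptworker/beetmover/'
for client_id in (BASE + 'dev', BASE + 'production'):
    client = auth.client(client_id)
    auth.updateClient(client_id, {
        'description': client['description'],
        'expires': client['expires'],
        'scopes': [s for s in client['scopes'] if s not in TRIM],
    })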
Updated • 2 months ago
Component: Release Automation: Uploading → Release Automation