Closed Bug 1460446 Opened 7 years ago Closed 6 years ago
Segregate Macs used for staging
Categories: (Infrastructure & Operations :: RelOps: General, task)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: coop, Assigned: dragrom)
Attachments
(1 file, 4 obsolete files)
202 bytes, text/plain | dragrom: checked-in+
Dragos is currently doing a bunch of work on t-yosemite-r7-380 to upgrade the generic-worker. This machine is the de facto Mac staging "pool."
Unfortunately, because there's no official segregation of staging Macs from those running in production, the same alerting rules apply to both. Since changes in staging have a much higher chance of going wrong and causing alerts, we've been getting *lots* of alerts via Papertrail from t-yosemite-r7-380: over 3,000 in the past week alone.
The recent errors are for worker crashes, e.g.:
May 08 03:25:11 t-yosemite-r7-380.test.releng.mdc1.mozilla.com generic-worker: 2018/05/07 20:25:11 goroutine 1 [running]:
May 08 03:25:34 t-yosemite-r7-380.test.releng.mdc1.mozilla.com generic-worker: 2018/05/07 20:25:34 goroutine 1 [running]:
May 08 03:26:02 t-yosemite-r7-380.test.releng.mdc1.mozilla.com generic-worker: 2018/05/07 20:26:02 goroutine 1 [running]:
I don't have access to the Papertrail search that reports these alerts, but the alerts link here:
https://papertrailapp.com/searches/21789221
Can we figure out a way to segregate the staging Macs, if only for alerting purposes?
Alternately, if we know t-yosemite-r7-380 will always be for staging, can we simply exclude it from production alerts?
Comment 1•7 years ago
Pete rightly suggested over in bug 1452095#c42 that we set up a staging pool (gecko-t-osx-1010-beta), so we should definitely exclude it from production alerts. We'll have a look and sort it out asap.
Reporter
Comment 2•7 years ago
t-yosemite-r7-449 is alerting via Papertrail email in a similar way today. Is it also in staging?
Comment 3•7 years ago
I have quarantined t-yosemite-r7-449 on Taskcluster as it had 10+ tasks resolved as exception.
The alerting via Papertrail has now stopped.
Comment 4•7 years ago
(In reply to Adrian Pop from comment #3)
> I have quarantined t-yosemite-r7-449 on Taskcluster as it had 10+ tasks
> resolved as exception.
> The alerting via Papertrail has now stopped.
Thanks Adrian.
I've created bug 1461914 for that worker - we probably just need to delete /Users/cltbld/file-caches.json on it.
For dealing with the general problem of macOS workers getting into a bad state - burning through tasks, alerting at high levels, and/or logging excessively from boot looping - I've created bug 1461913. In short, there is no reason they should do this: once the generic-worker has detected it is running in a bad environment (disk/memory problems, etc.), the launch daemon should be disabled and a bug raised for a human to investigate manually.
See Also: → 1461913
Comment 5•7 years ago
Quarantined t-yosemite-r7-449 once again. Seems like it went out of the quarantine and started reporting via Papertrail again. Pete opened a bug for this machine (https://bugzilla.mozilla.org/show_bug.cgi?id=1461914).
Comment 6•7 years ago
(In reply to Roland Mutter Michael (:rmutter) from comment #5)
> Quarantined t-yosemite-r7-449 once again. Seems like it went out of the
> quarantine
We've had a few other cases where it seemed like systems were coming out of quarantine unexpectedly; we should investigate.
Comment 7•6 years ago
t-yosemite-r7-449 went out of quarantine once again. Leaving it as-is for further Taskcluster investigation.
Comment 8•6 years ago
I'm guessing this was the quarantine operation, three days ago:
May 17 18:49:51 taskcluster-queue app/web.23: Authorized mozilla-auth0/ad|Mozilla-LDAP|rmutter for PUT access to /v1/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc1/t-yosemite-r7-449
May 17 18:49:51 taskcluster-queue heroku/router: at=info method=PUT path="/v1/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc1/t-yosemite-r7-449" host=queue.taskcluster.net request_id=43d00136-e4c0-4283-836e-6dae55651d36 fwd="86.127.174.35" dyno=web.23 connect=1ms service=511ms status=200 bytes=3129 protocol=https
and it didn't do much until yesterday:
May 19 14:09:41 taskcluster-queue app/web.25: Authorized task-client/CYU7hHuPSFK7N_TnoEeu-Q/0/on/mdc1/t-yosemite-r7-449/until/1526740179.339 for POST access to /v1/task/CYU7hHuPSFK7N_TnoEeu-Q/runs/0/artifacts/public%2Flogs%2Flive_backing.log
I'm trying to see if I can figure out what the quarantineUntil value was set to on the 17th.
Comment 9•6 years ago
quarantineUntil: 2018-05-19T13:33:47.463Z,
so it looks like the host started accepting jobs at the time its quarantine expired.
Comment 10•6 years ago
As agreed with Dustin, I have quarantined the worker t-yosemite-r7-410.
The quarantine record is set to 3018-05-21T00:17:36.000Z.
We will keep monitoring this worker to check for any changes over the next few days.
Comment 11•6 years ago
$ curl -s https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc1/t-yosemite-r7-410 | jq .quarantineUntil
"3018-05-21T00:17:36.000Z"
Comment 12•6 years ago
Still no change:
$ curl -s https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc1/t-yosemite-r7-410 | jq .quarantineUntil
"3018-05-21T00:17:36.000Z"
Comment 13•6 years ago
Moved to bug 1463473.
Updated•6 years ago
Assignee: relops → dcrisan
Assignee
Updated•6 years ago
Status: NEW → ASSIGNED
Assignee
Comment 14•6 years ago
Added a node-scope variable $is_beta_worker to determine whether this host is a beta worker.
If the host is a beta worker, we append "beta" to the end of the worker type string.
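A minimal sketch of how this might look, assuming the variable and worker-type names mentioned in this bug (the manifest layout and hostname are illustrative, not the actual patch):

# manifests/moco-nodes.pp - hypothetical node entry
node 't-yosemite-r7-380.test.releng.mdc1.mozilla.com' {
    $slave_trustlevel = 'try'
    # marks this host as a beta (staging) worker
    $is_beta_worker   = true
}

# modules/generic_worker/manifests/init.pp - choose the worker type
if $is_beta_worker {
    # staging hosts report under a separate beta worker type
    $worker_type = "gecko-t-osx-${macos_version}-beta"
} else {
    $worker_type = "gecko-t-osx-${macos_version}"
}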
Attachment #8981409 -
Flags: review?(pmoore)
Attachment #8981409 -
Flags: review?(jwatkins)
Assignee
Comment 15•6 years ago
Added a node-scope variable $is_beta_worker to determine whether this host is a beta worker.
If the host is a beta worker, we append "beta" to the end of the worker type string.
Attachment #8981409 -
Attachment is obsolete: true
Attachment #8981409 -
Flags: review?(pmoore)
Attachment #8981409 -
Flags: review?(jwatkins)
Attachment #8981410 -
Flags: review?(pmoore)
Attachment #8981410 -
Flags: review?(jwatkins)
Assignee
Comment 16•6 years ago
Added a node-scope variable $is_beta_worker to determine whether this host is a beta worker.
If the host is a beta worker, we append "beta" to the end of the worker type string.
Attachment #8981410 -
Attachment is obsolete: true
Attachment #8981410 -
Flags: review?(pmoore)
Attachment #8981410 -
Flags: review?(jwatkins)
Attachment #8981421 -
Flags: review?(pmoore)
Attachment #8981421 -
Flags: review?(jwatkins)
Comment 17•6 years ago
Comment on attachment 8981421 [details] [diff] [review]
Bug_1460446_segregate_mac_workers.patch
Review of attachment 8981421 [details] [diff] [review]:
-----------------------------------------------------------------
::: manifests/moco-nodes.pp
@@ +1181,5 @@
> + $aspects = [ 'low-security' ]
> + $slave_trustlevel = 'try'
> + $pin_puppet_server = 'releng-puppet2.srv.releng.scl3.mozilla.com'
> + $pin_puppet_env = 'dcrisan'
> + $is_beta_worker = true
I think the node should declare which environment pool it is in (production / staging / ...), and the associated logic should set the worker type name based on that pool. Using an enum ("production", "staging", ...) rather than a bool prevents a node from ending up in two pools, and means only one variable is needed (i.e. we avoid ending up with $is_production, $is_dev, $is_staging, ...).
::: modules/generic_worker/manifests/init.pp
@@ +18,5 @@
> + }
> + else {
> + $worker_type = "gecko-t-osx-${macos_version}"
> + }
> +
This would then be something like: if ($env == 'staging') { .... }
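A sketch of that enum-based selection in modules/generic_worker/manifests/init.pp, assuming a node-scope pool variable named $env as in the example above (the variable name and the default branch are assumptions, not part of the attached patch):

# modules/generic_worker/manifests/init.pp - worker type chosen from the
# node's environment pool rather than a boolean flag
case $env {
    'staging': {
        # staging hosts get the separate beta worker type, so production
        # alerting can ignore them
        $worker_type = "gecko-t-osx-${macos_version}-beta"
    }
    default: {
        # anything else is treated as production
        $worker_type = "gecko-t-osx-${macos_version}"
    }
}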
Attachment #8981421 -
Flags: review?(pmoore)
Comment 18•6 years ago
Dustin, do you know if the concept of an environment pool (staging / prod / dev / test, etc.) already exists in our puppet setup? I'm no puppet expert, and curious whether you agree with my review comment above.
Flags: needinfo?(dustin)
Comment 19•6 years ago
Alternatively, we could just set $worker_type explicitly in manifests/moco-nodes.pp and avoid the extra level of abstraction.
This may also be more future-proof, making it easier to configure many different worker type names.
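A minimal sketch of that alternative, with hypothetical node entries (the hostnames and the regex for the remaining production pool are illustrative):

# manifests/moco-nodes.pp - worker type named directly on the node
node 't-yosemite-r7-380.test.releng.mdc1.mozilla.com' {
    # staging host: explicit beta worker type, no extra flag needed
    $worker_type = 'gecko-t-osx-1010-beta'
}

node /^t-yosemite-r7-\d+\.test\.releng\.mdc1\.mozilla\.com$/ {
    # remaining production hosts keep the standard worker type
    $worker_type = 'gecko-t-osx-1010'
}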
Comment 20•6 years ago
I think you're thinking of trustlevel, which already exists here.
Flags: needinfo?(dustin)
Comment 21•6 years ago
Do you mean creating something like a 'try-staging' trust level? I think at the moment we only have 'try' which currently is a shared trust level across production and staging, isn't it?
Flags: needinfo?(dustin)
Comment 23•6 years ago
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #22)
> OK, I don't know what you mean, then.
We have different environments:
* production
* staging
And we have different trust levels:
* try
* core
I'm not sure what 'core' is - I assume everything that isn't try. I inferred this from the puppet repo.
We also have a concept of scm level, which I didn't see referenced in the puppet repo:
* scm level 1
* scm level 2
* scm level 3
These three concepts seem to be partially orthogonal, e.g. a production machine can have a 'core' trust level and be used for scm level 2 tasks, but a production machine can also have a 'try' trust level and run only scm level 1 tasks.
Perhaps 'core' just means scm level 2/3 and try just means scm level 1. I'm not sure - I'm trying to understand the number of degrees of freedom that exist, and the possible combinations (multi-dimension matrix) that are allowed.
The point is we should tag servers according to the environment they belong to (production / staging / ....) independently of whether they are try / core or scm level 1 / 2 / 3.
So in summary - does anyone know how trust level relates to scm level and to what I've coined "environment"? If we can draw up a list of allowed combinations, we can see whether there are two or three degrees of freedom at play here, and choose appropriate tags to use in the puppet repo.
Thanks!
Comment 24•6 years ago
Options
=======
1) Just add a single 'staging' trust level that covers all scm levels in the staging environment:
Environment | SCM Level | $slave_trustlevel
===========================================
staging     | 1         | 'staging'
staging     | 2         | 'staging'
staging     | 3         | 'staging'
production  | 1         | 'try'
production  | 2         | 'core'
production  | 3         | 'core'
2) Add two new trust levels 'staging-try' and 'staging-core':
Environment | SCM Level | $slave_trustlevel
===========================================
staging     | 1         | 'staging-try'
staging     | 2         | 'staging-core'
staging     | 3         | 'staging-core'
production  | 1         | 'try'
production  | 2         | 'core'
production  | 3         | 'core'
3) Leave $slave_trustlevel alone, and add '$environment':
Environment | SCM Level | $slave_trustlevel | $environment
==========================================================
staging     | 1         | 'try'             | staging
staging     | 2         | 'core'            | staging
staging     | 3         | 'core'            | staging
production  | 1         | 'try'             | production
production  | 2         | 'core'            | production
production  | 3         | 'core'            | production
I tend to favour option 1 for simplicity (a sketch of how it might look in Puppet follows at the end of this comment). Option 2 is probably a little better at reflecting the production environment (in production we don't have a shared trust level between try and core machines). Option 3 strikes me as dangerous, because code that looks at $slave_trustlevel isn't currently evaluating it together with the environment, so staging core/try trust levels could easily get confused with production ones. I think 3 is a big no-no.
Jake, what are your thoughts, or do you know who would be a good person to review/decide (is there a module owner for this)?
Thanks!
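A rough sketch of option 1, assuming the staging trust level would then also drive the worker type name (everything except $slave_trustlevel is illustrative):

# manifests/moco-nodes.pp - staging hosts get their own trust level
node 't-yosemite-r7-380.test.releng.mdc1.mozilla.com' {
    $slave_trustlevel = 'staging'
}

# modules/generic_worker/manifests/init.pp - derive the worker type from it
if $slave_trustlevel == 'staging' {
    $worker_type = "gecko-t-osx-${macos_version}-beta"
} else {
    $worker_type = "gecko-t-osx-${macos_version}"
}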
Flags: needinfo?(jwatkins)
Comment 25•6 years ago
Option 4) Just use the existing 'try' trust level for staging, but add an $environment variable to help track that a worker is a staging one:
Environment | SCM Level | $slave_trustlevel | $environment
==========================================================
staging     | 1         | 'try'             | staging
staging     | 2         | 'try'             | staging
staging     | 3         | 'try'             | staging
production  | 1         | 'try'             | production
production  | 2         | 'core'            | production
production  | 3         | 'core'            | production
Maybe this is the simplest and best. The only downside is that it could be confusing/dangerous that a staging scm level 3 worker has a try trust level. But no production keys/secrets should leak onto it, as it has a try trust level.
Also, is $environment a good enum name to distinguish between production and staging?
Assignee
Updated•6 years ago
Flags: needinfo?(jwatkins)
Assignee
Updated•6 years ago
Attachment #8981421 -
Flags: review?(jwatkins)
Assignee
Comment 26•6 years ago
Segregate Macs used for staging
Attachment #8981421 -
Attachment is obsolete: true
Assignee
Comment 27•6 years ago
Comment on attachment 8986696 [details]
Segregate Macs used for staging
<html>
<head>
<meta http-equiv="Refresh" content="2; url=https://github.com/mozilla-releng/build-puppet/pull/77">
</head>
<body>
Redirect to pull request 77
</body>
</html>
Assignee
Comment 28•6 years ago
Comment on attachment 8986696 [details]
Segregate Macs used for staging
><html>
> <head>
> <meta http-equiv="Refresh" content="2;
>url='https://github.com/mozilla-releng/build-puppet/pull/77'">
> </head>
> <body>
> Redirect to pull request 77
> </body>
></html>
Assignee
Comment 29•6 years ago
Comment on attachment 8986696 [details]
Segregate Macs used for staging
<html>
<head>
<meta http-equiv="refresh" content="2; URL='https://github.com/mozilla-releng/build-puppet/pull/77'">
</head>
<body>
Redirect to pull request 77
</body>
</html>
Assignee
Comment 30•6 years ago
Segregate Macs used for staging
Attachment #8986696 -
Attachment is obsolete: true
Assignee
Updated•6 years ago
Attachment #8986700 -
Flags: checked-in+
Assignee
Updated•6 years ago
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED