Closed Bug 1460446 Opened 7 years ago Closed 6 years ago

Segregate Macs used for staging

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: dragrom)

References

Details

Attachments

(1 file, 4 obsolete files)

Dragos is currently doing a bunch of work on t-yosemite-r7-380 to upgrade the generic-worker. This machine is the de facto Mac staging "pool." Unfortunately, because there's no official segregation of staging Macs from those running in production, the same alerting rules apply to both. Since changes in staging have a much higher chance of going wrong and causing alerts, we've been getting *lots* of alerts via Papertrail from t-yosemite-r7-380: over 3,000 in the past week alone. The recent errors are for worker crashes, e.g.:

May 08 03:25:11 t-yosemite-r7-380.test.releng.mdc1.mozilla.com generic-worker: 2018/05/07 20:25:11 goroutine 1 [running]:
May 08 03:25:34 t-yosemite-r7-380.test.releng.mdc1.mozilla.com generic-worker: 2018/05/07 20:25:34 goroutine 1 [running]:
May 08 03:26:02 t-yosemite-r7-380.test.releng.mdc1.mozilla.com generic-worker: 2018/05/07 20:26:02 goroutine 1 [running]:

I don't have access to the Papertrail search that reports these alerts, but the alerts link here: https://papertrailapp.com/searches/21789221

Can we figure out a way to segregate the staging Macs, if only for alerting purposes? Alternatively, if we know t-yosemite-r7-380 will always be for staging, can we simply exclude it from production alerts?
Pete rightly suggested over in bug 1452095#c42 that we set up a staging pool (gecko-t-osx-1010-beta), so we should definitely exclude it from production alerts. We'll have a look and sort it out asap.
t-yosemite-r7-449 is alerting via papertrail email in a similar way today. Is it also in staging?
I have quarantined t-yosemite-r7-449 on Taskcluster, as it had 10+ tasks resolved as exception. The alerting via Papertrail has currently stopped.
See Also: → 1461914
(In reply to Adrian Pop from comment #3)
> I have quarantined t-yosemite-r7-449 on Taskcluster, as it had 10+ tasks
> resolved as exception. The alerting via Papertrail has currently stopped.

Thanks Adrian. I've created bug 1461914 for that worker - we probably just need to delete /Users/cltbld/file-caches.json on that worker.

For dealing with the general problem of macOS workers getting into a bad state and burning through tasks and/or alerting at high levels and/or logging excessively from boot looping, I've created bug 1461913. In short, there is no reason they should do this: once the generic-worker has detected it is running in a bad environment (disk/memory problems, etc.), the launch daemon should be disabled and a bug should be raised for a human to investigate manually.
See Also: → 1461913
Quarantined t-yosemite-r7-449 once again. It seems like it went out of quarantine and started reporting via Papertrail again. Pete opened bug 1461914 for this machine (https://bugzilla.mozilla.org/show_bug.cgi?id=1461914).
(In reply to Roland Mutter Michael (:rmutter) from comment #5)
> Quarantined t-yosemite-r7-449 once again. It seems like it went out of
> quarantine

We've had a few other cases where it seemed like systems are coming out of quarantine unexpectedly; we should investigate.
t-yosemite-r7-449 went out of quarantine once again. Leaving it like that for further TC investigation.
I'm guessing this was the quarantine operation, three days ago:

May 17 18:49:51 taskcluster-queue app/web.23: Authorized mozilla-auth0/ad|Mozilla-LDAP|rmutter for PUT access to /v1/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc1/t-yosemite-r7-449
May 17 18:49:51 taskcluster-queue heroku/router: at=info method=PUT path="/v1/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc1/t-yosemite-r7-449" host=queue.taskcluster.net request_id=43d00136-e4c0-4283-836e-6dae55651d36 fwd="86.127.174.35" dyno=web.23 connect=1ms service=511ms status=200 bytes=3129 protocol=https

and it didn't do much until yesterday:

May 19 14:09:41 taskcluster-queue app/web.25: Authorized task-client/CYU7hHuPSFK7N_TnoEeu-Q/0/on/mdc1/t-yosemite-r7-449/until/1526740179.339 for POST access to /v1/task/CYU7hHuPSFK7N_TnoEeu-Q/runs/0/artifacts/public%2Flogs%2Flive_backing.log

I'm trying to see if I can figure out what the quarantineUntil value was set to on the 17th.
quarantineUntil was set to 2018-05-19T13:33:47.463Z, so it looks like the host started accepting jobs at the time its quarantine expired.
As agreed with Dustin, I have quarantined the worker t-yosemite-r7-410; the quarantine record is set to 3018-05-21T00:17:36.000Z. We will keep monitoring this worker over the next few days to check for any changes.
Assignee: relops → dcrisan
Status: NEW → ASSIGNED
Added a node-scope variable $is_beta_worker to determine whether this host is a beta worker. If it is, we add 'beta' at the end of the worker type string.
Attachment #8981409 - Flags: review?(pmoore)
Attachment #8981409 - Flags: review?(jwatkins)
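For reference, a minimal sketch of the approach this patch describes (not the actual attachment contents); it assumes $macos_version is already defined in the generic_worker module and $is_beta_worker is set at node scope:

# modules/generic_worker/manifests/init.pp (hypothetical sketch, not the real patch)
# $is_beta_worker comes from node scope in manifests/moco-nodes.pp;
# staging ("beta") hosts get a worker type with a '-beta' suffix.
if $is_beta_worker {
  $worker_type = "gecko-t-osx-${macos_version}-beta"
} else {
  $worker_type = "gecko-t-osx-${macos_version}"
}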
Added a node-scope variable $is_beta_worker to determine whether this host is a beta worker. If it is, we add 'beta' at the end of the worker type string.
Attachment #8981409 - Attachment is obsolete: true
Attachment #8981409 - Flags: review?(pmoore)
Attachment #8981409 - Flags: review?(jwatkins)
Attachment #8981410 - Flags: review?(pmoore)
Attachment #8981410 - Flags: review?(jwatkins)
Added a node-scope variable $is_beta_worker to determine whether this host is a beta worker. If it is, we add 'beta' at the end of the worker type string.
Attachment #8981410 - Attachment is obsolete: true
Attachment #8981410 - Flags: review?(pmoore)
Attachment #8981410 - Flags: review?(jwatkins)
Attachment #8981421 - Flags: review?(pmoore)
Attachment #8981421 - Flags: review?(jwatkins)
Comment on attachment 8981421 [details] [diff] [review]
Bug_1460446_segregate_mac_workers.patch

Review of attachment 8981421 [details] [diff] [review]:
-----------------------------------------------------------------

::: manifests/moco-nodes.pp
@@ +1181,5 @@
> +    $aspects = [ 'low-security' ]
> +    $slave_trustlevel = 'try'
> +    $pin_puppet_server = 'releng-puppet2.srv.releng.scl3.mozilla.com'
> +    $pin_puppet_env = 'dcrisan'
> +    $is_beta_worker = true

I think the node should declare which environment pool it is in (production / staging / ...) and associated logic should set the worker type name based on the environment pool it is in.

Using an enum rather than a bool ("production", "staging", ...) avoids a node ending up in two pools, and means only one variable is needed (i.e. it avoids us eventually introducing $is_production, $is_dev, $is_staging, ...).

::: modules/generic_worker/manifests/init.pp
@@ +18,5 @@
> +  }
> +  else {
> +    $worker_type = "gecko-t-osx-${macos_version}"
> +  }
> +

This would then be something like:

if ($env == 'staging') {
  ....
}
Attachment #8981421 - Flags: review?(pmoore)
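To make the review suggestion concrete, the enum-based selection logic might look roughly like this (a sketch only; $env and its values are placeholders taken from the comment above, not settled naming):

# modules/generic_worker/manifests/init.pp (sketch of the enum-based suggestion)
# $env is assumed to be set at node scope to 'production', 'staging', etc.
if $env == 'staging' {
  $worker_type = "gecko-t-osx-${macos_version}-beta"
} else {
  $worker_type = "gecko-t-osx-${macos_version}"
}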
Dustin, do you know if the environment pool (staging / prod / dev / test, etc) already exists in our puppet setup? I'm no puppet expert, curious if you agree with my review comment above.
Flags: needinfo?(dustin)
Alternatively we could just explicitly set $worker_type directly in manifests/moco-nodes.pp and avoid the extra level of abstraction. This may also be more future-proof, making it easier to configure many different worker type names.
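That alternative might look something like the following (a sketch; the hostname and worker type name are taken from earlier comments, and the rest of the node body is elided):

# manifests/moco-nodes.pp (hypothetical sketch of setting $worker_type per node)
node 't-yosemite-r7-380.test.releng.mdc1.mozilla.com' {
  $worker_type = 'gecko-t-osx-1010-beta'
  # ...rest of the existing node definition unchanged...
}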
I think you're thinking of trustlevel, which already exists here.
Flags: needinfo?(dustin)
Do you mean creating something like a 'try-staging' trust level? At the moment we only have 'try', which is currently a shared trust level across production and staging, isn't it?
Flags: needinfo?(dustin)
OK, I don't know what you mean, then.
Flags: needinfo?(dustin)
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #22)
> OK, I don't know what you mean, then.

We have different environments:
* production
* staging

And we have different trust levels:
* try
* core

I'm not sure what 'core' is - I assume everything that isn't try. I inferred this from the puppet repo.

We also have a concept of scm level, which I didn't see referenced in the puppet repo:
* scm level 1
* scm level 2
* scm level 3

These three concepts seem to be partially orthogonal, e.g. an environment in production can have a 'core' trust level and be used for scm level 2 tasks. But a production environment can also have a 'try' trust level and run only 'scm level 1' tasks. Perhaps 'core' just means scm level 2/3 and try just means scm level 1. I'm not sure - I'm trying to understand the number of degrees of freedom that exist, and the possible combinations (multi-dimensional matrix) that are allowed.

The point is we should tag servers according to the environment they belong to (production / staging / ...) independently of whether they are try / core or scm level 1 / 2 / 3.

So in summary - does anyone know how trust level relates to scm level and what I've coined "environment"? If we can draw up a list of allowed combinations, we can see if there are two or three degrees of freedom at play here, and choose appropriate tags to use in the puppet repo.

Thanks!
Options
=======

1) Just add a single 'staging' trust level that covers all scm levels on the staging environment:

Environment | SCM Level | $slave_trustlevel
===========================================
staging     | 1         | 'staging'
staging     | 2         | 'staging'
staging     | 3         | 'staging'
production  | 1         | 'try'
production  | 2         | 'core'
production  | 3         | 'core'

2) Add two new trust levels 'staging-try' and 'staging-core':

Environment | SCM Level | $slave_trustlevel
===========================================
staging     | 1         | 'staging-try'
staging     | 2         | 'staging-core'
staging     | 3         | 'staging-core'
production  | 1         | 'try'
production  | 2         | 'core'
production  | 3         | 'core'

3) Leave $slave_trustlevel alone, and add '$environment':

Environment | SCM Level | $slave_trustlevel | $environment
==========================================================
staging     | 1         | 'try'             | staging
staging     | 2         | 'core'            | staging
staging     | 3         | 'core'            | staging
production  | 1         | 'try'             | production
production  | 2         | 'core'            | production
production  | 3         | 'core'            | production

I tend to favour option 1 for simplicity. Option 2 is probably a little better at reflecting the production environment (in production we don't have a shared trust level between try and core machines). Option 3 strikes me as dangerous, because code that looks at $slave_trustlevel isn't currently evaluating this together with the environment, so staging core/try trust levels could easily get confused with production ones. I think 3 is a big no-no.

Jake, what are your thoughts, or do you know who would be a good person to review/decide (is there a module owner for this)?

Thanks!
Flags: needinfo?(jwatkins)
Option 4) Just use the existing 'try' trust level for staging, but add an $environment to help track that a worker is a staging one:

Environment | SCM Level | $slave_trustlevel | $environment
==========================================================
staging     | 1         | 'try'             | staging
staging     | 2         | 'try'             | staging
staging     | 3         | 'try'             | staging
production  | 1         | 'try'             | production
production  | 2         | 'core'            | production
production  | 3         | 'core'            | production

Maybe this is the simplest and best. The only downside is that it could be confusing/dangerous that a staging scm level 3 worker has a try trust level. But no production keys/secrets should leak into it, as it has a try trust level.

Also, is $environment a good enum name to distinguish between production and staging?
Flags: needinfo?(jwatkins)
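Under option 4, a staging node declaration would just gain one extra variable alongside the existing ones; a minimal sketch, reusing the variables quoted in the earlier review (values are illustrative):

# manifests/moco-nodes.pp (sketch of option 4 for a staging Mac)
# Note: Puppet has a built-in top-scope $environment (the Puppet environment),
# so a non-clashing name may be preferable in practice.
$aspects          = [ 'low-security' ]
$slave_trustlevel = 'try'        # staging keeps the existing 'try' trust level
$environment      = 'staging'    # new enum: 'staging' or 'production'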
Attachment #8981421 - Flags: review?(jwatkins)
Attached file Segregate Macs used for staging (obsolete)
Segregate Macs used for staging
Attachment #8981421 - Attachment is obsolete: true
Comment on attachment 8986696 [details]
Segregate Macs used for staging

<html>
  <head>
    <meta http-equiv="Refresh" content="2; url=https://github.com/mozilla-releng/build-puppet/pull/77">
  </head>
  <body>
    Redirect to pull request 77
  </body>
</html>
Comment on attachment 8986696 [details]
Segregate Macs used for staging

><html>
>  <head>
>    <meta http-equiv="Refresh" content="2;
>url='https://github.com/mozilla-releng/build-puppet/pull/77'">
>  </head>
>  <body>
>    Redirect to pull request 77
>  </body>
></html>
Comment on attachment 8986696 [details]
Segregate Macs used for staging

<html>
  <head>
    <meta http-equiv="refresh" content="2; URL='https://github.com/mozilla-releng/build-puppet/pull/77'">
  </head>
  <body>
    Redirect to pull request 77
  </body>
</html>
Segregate Macs used for staging
Attachment #8986696 - Attachment is obsolete: true
Attachment #8986700 - Flags: checked-in+
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED