Closed Bug 1460446 Opened 7 years ago Closed 6 years ago
Segregate Macs used for staging
Categories: (Infrastructure & Operations :: RelOps: General, task)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: coop, Assigned: dragrom)
Attachments
(1 file, 4 obsolete files)
202 bytes, text/plain | dragrom: checked-in+
Dragos is currently doing a bunch of work on t-yosemite-r7-380 to upgrade the generic-worker. This machine is the de facto Mac staging "pool."
Unfortunately, because there's no official segregation of staging Macs from those running in production, the same alerting rules apply to both. Since changes in staging have a much higher chance of going wrong and causing alerts, we've been getting *lots* of alerts via Papertrail from t-yosemite-r7-380: over 3,000 in the past week alone.
The recent errors are for worker crashes, e.g.:
May 08 03:25:11 t-yosemite-r7-380.test.releng.mdc1.mozilla.com generic-worker: 2018/05/07 20:25:11 goroutine 1 [running]:
May 08 03:25:34 t-yosemite-r7-380.test.releng.mdc1.mozilla.com generic-worker: 2018/05/07 20:25:34 goroutine 1 [running]:
May 08 03:26:02 t-yosemite-r7-380.test.releng.mdc1.mozilla.com generic-worker: 2018/05/07 20:26:02 goroutine 1 [running]:
I don't have access to the Papertrail search that reports these alerts, but the alerts link here:
https://papertrailapp.com/searches/21789221
Can we figure out a way to segregate the staging Macs, if only for alerting purposes?
Alternately, if we know t-yosemite-r7-380 will always be for staging, can we simply exclude it from production alerts?
Comment 1•7 years ago
Pete rightly suggested over in bug 1452095#c42 that we set up a staging pool (gecko-t-osx-1010-beta), so we should definitely exclude it from production alerts. We'll have a look and sort it out asap.
Reporter
Comment 2•7 years ago
t-yosemite-r7-449 is alerting via Papertrail email in a similar way today. Is it also in staging?
Comment 3•7 years ago
I have quarantined t-yosemite-r7-449 on Taskcluster as it had 10+ tasks resolved as exception.
The alerting via Papertrail has now stopped.
Comment 4•7 years ago
(In reply to Adrian Pop from comment #3)
> I have quarantined t-yosemite-r7-449 on Taskcluster as it had 10+ tasks
> resolved as exception.
> The alerting via Papertrail has now stopped.
Thanks Adrian.
I've created bug 1461914 for that worker - we probably just need to delete /Users/cltbld/file-caches.json on it.
For dealing with the general problem of macOS workers getting into a bad state - burning through tasks, alerting at high levels, and/or logging excessively from boot looping - I've created bug 1461913. In short, there is no reason they should do this: once the generic-worker has detected it is running in a bad environment (disk/memory problems, etc.), the launch daemon should be disabled and a bug raised for a human to investigate manually.
See Also: → 1461913
Comment 5•7 years ago
Quarantined t-yosemite-r7-449 once again. Seems like it went out of the quarantine and started reporting via Papertrail again. Pete opened a bug for this machine (https://bugzilla.mozilla.org/show_bug.cgi?id=1461914).
Comment 6•7 years ago
(In reply to Roland Mutter Michael (:rmutter) from comment #5)
> Quarantined t-yosemite-r7-449 once again. Seems like it went out of the
> quarantine
We've had a few other cases where it seemed like systems were coming out of quarantine unexpectedly; we should investigate.
Comment 7•6 years ago
t-yosemite-r7-449 went out of quarantine once again. Leaving it as-is for further Taskcluster investigation.
Comment 8•6 years ago
I'm guessing this was the quarantine operation, three days ago:
May 17 18:49:51 taskcluster-queue app/web.23: Authorized mozilla-auth0/ad|Mozilla-LDAP|rmutter for PUT access to /v1/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc1/t-yosemite-r7-449
May 17 18:49:51 taskcluster-queue heroku/router: at=info method=PUT path="/v1/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc1/t-yosemite-r7-449" host=queue.taskcluster.net request_id=43d00136-e4c0-4283-836e-6dae55651d36 fwd="86.127.174.35" dyno=web.23 connect=1ms service=511ms status=200 bytes=3129 protocol=https
and it didn't do much until yesterday:
May 19 14:09:41 taskcluster-queue app/web.25: Authorized task-client/CYU7hHuPSFK7N_TnoEeu-Q/0/on/mdc1/t-yosemite-r7-449/until/1526740179.339 for POST access to /v1/task/CYU7hHuPSFK7N_TnoEeu-Q/runs/0/artifacts/public%2Flogs%2Flive_backing.log
I'm trying to see if I can figure out what the quarantineUntil value was set to on the 17th.
Comment 9•6 years ago
quarantineUntil: 2018-05-19T13:33:47.463Z,
so it looks like the host started accepting jobs at the time its quarantine expired.
Comment 10•6 years ago
As agreed with Dustin, I have quarantined the worker t-yosemite-r7-410.
The quarantine record is set to 3018-05-21T00:17:36.000Z.
We will keep monitoring this worker to check for any changes over the next few days.
Comment 11•6 years ago
$ curl -s https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc1/t-yosemite-r7-410 | jq .quarantineUntil
"3018-05-21T00:17:36.000Z"
Comment 12•6 years ago
Still no change:
$ curl -s https://queue.taskcluster.net/v1/provisioners/releng-hardware/worker-types/gecko-t-osx-1010/workers/mdc1/t-yosemite-r7-410 | jq .quarantineUntil
"3018-05-21T00:17:36.000Z"
Comment 13•6 years ago
Moved to bug 1463473.
Updated•6 years ago
Assignee: relops → dcrisan
Assignee
Updated•6 years ago
Status: NEW → ASSIGNED
Assignee
Comment 14•6 years ago
Added a node-scope variable $is_beta_worker to determine whether this host is a beta worker.
If the host is a beta worker, we append "beta" to the end of the worker type string.
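A minimal sketch of how this might look, assuming the variable and worker-type names mentioned in this bug (the manifest layout and hostname are illustrative, not the actual patch):

# manifests/moco-nodes.pp - hypothetical node entry
node 't-yosemite-r7-380.test.releng.mdc1.mozilla.com' {
    $slave_trustlevel = 'try'
    # marks this host as a beta (staging) worker
    $is_beta_worker   = true
}

# modules/generic_worker/manifests/init.pp - choose the worker type
if $is_beta_worker {
    # staging hosts report under a separate beta worker type
    $worker_type = "gecko-t-osx-${macos_version}-beta"
} else {
    $worker_type = "gecko-t-osx-${macos_version}"
}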
Attachment #8981409 -
Flags: review?(pmoore)
Attachment #8981409 -
Flags: review?(jwatkins)
Assignee
Comment 15•6 years ago
Added a node-scope variable $is_beta_worker to determine whether this host is a beta worker.
If the host is a beta worker, we append "beta" to the end of the worker type string.
Attachment #8981409 -
Attachment is obsolete: true
Attachment #8981409 -
Flags: review?(pmoore)
Attachment #8981409 -
Flags: review?(jwatkins)
Attachment #8981410 -
Flags: review?(pmoore)
Attachment #8981410 -
Flags: review?(jwatkins)
Assignee
Comment 16•6 years ago
Added a node-scope variable $is_beta_worker to determine whether this host is a beta worker.
If the host is a beta worker, we append "beta" to the end of the worker type string.
Attachment #8981410 -
Attachment is obsolete: true
Attachment #8981410 -
Flags: review?(pmoore)
Attachment #8981410 -
Flags: review?(jwatkins)
Attachment #8981421 -
Flags: review?(pmoore)
Attachment #8981421 -
Flags: review?(jwatkins)
Comment 17•6 years ago
Comment on attachment 8981421 [details] [diff] [review]
Bug_1460446_segregate_mac_workers.patch
Review of attachment 8981421 [details] [diff] [review]:
-----------------------------------------------------------------
::: manifests/moco-nodes.pp
@@ +1181,5 @@
> + $aspects = [ 'low-security' ]
> + $slave_trustlevel = 'try'
> + $pin_puppet_server = 'releng-puppet2.srv.releng.scl3.mozilla.com'
> + $pin_puppet_env = 'dcrisan'
> + $is_beta_worker = true
I think the node should declare which environment pool it is in (production / staging / ...), and the associated logic should set the worker type name based on that pool. Using an enum ("production", "staging", ...) rather than a bool prevents a node from ending up in two pools, and means only one variable is needed (i.e. we avoid ending up with $is_production, $is_dev, $is_staging, ...).
::: modules/generic_worker/manifests/init.pp
@@ +18,5 @@
> + }
> + else {
> + $worker_type = "gecko-t-osx-${macos_version}"
> + }
> +
This would then be something like: if ($env == 'staging') { .... }
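A sketch of that enum-based selection in modules/generic_worker/manifests/init.pp, assuming a node-scope pool variable named $env as in the example above (the variable name and the default branch are assumptions, not part of the attached patch):

# modules/generic_worker/manifests/init.pp - worker type chosen from the
# node's environment pool rather than a boolean flag
case $env {
    'staging': {
        # staging hosts get the separate beta worker type, so production
        # alerting can ignore them
        $worker_type = "gecko-t-osx-${macos_version}-beta"
    }
    default: {
        # anything else is treated as production
        $worker_type = "gecko-t-osx-${macos_version}"
    }
}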
Attachment #8981421 -
Flags: review?(pmoore)
Comment 18•6 years ago
Dustin, do you know if the concept of an environment pool (staging / prod / dev / test, etc.) already exists in our puppet setup? I'm no puppet expert, and curious whether you agree with my review comment above.
Flags: needinfo?(dustin)
Comment 19•6 years ago
Alternatively, we could just set $worker_type explicitly in manifests/moco-nodes.pp and avoid the extra level of abstraction.
This may also be more future-proof, making it easier to configure many different worker type names.
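A minimal sketch of that alternative, with hypothetical node entries (the hostnames and the regex for the remaining production pool are illustrative):

# manifests/moco-nodes.pp - worker type named directly on the node
node 't-yosemite-r7-380.test.releng.mdc1.mozilla.com' {
    # staging host: explicit beta worker type, no extra flag needed
    $worker_type = 'gecko-t-osx-1010-beta'
}

node /^t-yosemite-r7-\d+\.test\.releng\.mdc1\.mozilla\.com$/ {
    # remaining production hosts keep the standard worker type
    $worker_type = 'gecko-t-osx-1010'
}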
Comment 20•6 years ago
I think you're thinking of trustlevel, which already exists here.
Flags: needinfo?(dustin)
Comment 21•6 years ago
Do you mean creating something like a 'try-staging' trust level? I think at the moment we only have 'try' which currently is a shared trust level across production and staging, isn't it?
Flags: needinfo?(dustin)
Comment 23•6 years ago
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #22)
> OK, I don't know what you mean, then.
We have different environments:
* production
* staging
And we have different trust levels:
* try
* core
I'm not sure what 'core' is - I assume everything that isn't try. I inferred this from the puppet repo.
We also have a concept of scm level, which I didn't see referenced in the puppet repo:
* scm level 1
* scm level 2
* scm level 3
These three concepts seem to be partially orthogonal, e.g. a production machine can have a 'core' trust level and be used for scm level 2 tasks, but a production machine can also have a 'try' trust level and run only scm level 1 tasks.
Perhaps 'core' just means scm level 2/3 and try just means scm level 1. I'm not sure - I'm trying to understand the number of degrees of freedom that exist, and the possible combinations (multi-dimension matrix) that are allowed.
The point is we should tag servers according to the environment they belong to (production / staging / ....) independently of whether they are try / core or scm level 1 / 2 / 3.
So in summary - does anyone know how trust level relates to scm level and to what I've coined "environment"? If we can draw up a list of allowed combinations, we can see whether there are two or three degrees of freedom at play here, and choose appropriate tags to use in the puppet repo.
Thanks!
Comment 24•6 years ago
Options
=======
1) Just add a single 'staging' trust level that covers all scm levels in the staging environment:
Environment | SCM Level | $slave_trustlevel
===========================================
staging     | 1         | 'staging'
staging     | 2         | 'staging'
staging     | 3         | 'staging'
production  | 1         | 'try'
production  | 2         | 'core'
production  | 3         | 'core'
2) Add two new trust levels 'staging-try' and 'staging-core':
Environment | SCM Level | $slave_trustlevel
===========================================
staging     | 1         | 'staging-try'
staging     | 2         | 'staging-core'
staging     | 3         | 'staging-core'
production  | 1         | 'try'
production  | 2         | 'core'
production  | 3         | 'core'
3) Leave $slave_trustlevel alone, and add '$environment':
Environment | SCM Level | $slave_trustlevel | $environment
==========================================================
staging     | 1         | 'try'             | staging
staging     | 2         | 'core'            | staging
staging     | 3         | 'core'            | staging
production  | 1         | 'try'             | production
production  | 2         | 'core'            | production
production  | 3         | 'core'            | production
I tend to favour option 1 for simplicity (a sketch of how it might look in Puppet follows at the end of this comment). Option 2 is probably a little better at reflecting the production environment (in production we don't have a shared trust level between try and core machines). Option 3 strikes me as dangerous, because code that looks at $slave_trustlevel isn't currently evaluating it together with the environment, so staging core/try trust levels could easily get confused with production ones. I think 3 is a big no-no.
Jake, what are your thoughts, or do you know who would be a good person to review/decide (is there a module owner for this)?
Thanks!
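A rough sketch of option 1, assuming the staging trust level would then also drive the worker type name (everything except $slave_trustlevel is illustrative):

# manifests/moco-nodes.pp - staging hosts get their own trust level
node 't-yosemite-r7-380.test.releng.mdc1.mozilla.com' {
    $slave_trustlevel = 'staging'
}

# modules/generic_worker/manifests/init.pp - derive the worker type from it
if $slave_trustlevel == 'staging' {
    $worker_type = "gecko-t-osx-${macos_version}-beta"
} else {
    $worker_type = "gecko-t-osx-${macos_version}"
}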
Flags: needinfo?(jwatkins)
Comment 25•6 years ago
Option 4) Just use the existing 'try' trust level for staging, but add an $environment variable to help track that a worker is a staging one:
Environment | SCM Level | $slave_trustlevel | $environment
==========================================================
staging     | 1         | 'try'             | staging
staging     | 2         | 'try'             | staging
staging     | 3         | 'try'             | staging
production  | 1         | 'try'             | production
production  | 2         | 'core'            | production
production  | 3         | 'core'            | production
Maybe this is the simplest and best. The only downside is that it could be confusing/dangerous that a staging scm level 3 worker has a try trust level. But no production keys/secrets should leak onto it, as it has a try trust level.
Also, is $environment a good enum name to distinguish between production and staging?
Assignee
Updated•6 years ago
Flags: needinfo?(jwatkins)
Assignee
Updated•6 years ago
Attachment #8981421 -
Flags: review?(jwatkins)
Assignee
Comment 26•6 years ago
Segregate Macs used for staging
Attachment #8981421 -
Attachment is obsolete: true
Assignee
Comment 27•6 years ago
Comment on attachment 8986696 [details]
Segregate Macs used for staging
<html>
<head>
<meta http-equiv="Refresh" content="2; url=https://github.com/mozilla-releng/build-puppet/pull/77">
</head>
<body>
Redirect to pull request 77
</body>
</html>
Assignee
Comment 28•6 years ago
Comment on attachment 8986696 [details]
Segregate Macs used for staging
><html>
> <head>
> <meta http-equiv="Refresh" content="2;
>url='https://github.com/mozilla-releng/build-puppet/pull/77'">
> </head>
> <body>
> Redirect to pull request 77
> </body>
></html>
Assignee
Comment 29•6 years ago
Comment on attachment 8986696 [details]
Segregate Macs used for staging
<html>
<head>
<meta http-equiv="refresh" content="2; URL='https://github.com/mozilla-releng/build-puppet/pull/77'">
</head>
<body>
Redirect to pull request 77
</body>
</html>
Assignee
Comment 30•6 years ago
Segregate Macs used for staging
Attachment #8986696 -
Attachment is obsolete: true
Assignee
Updated•6 years ago
Attachment #8986700 -
Flags: checked-in+
Assignee
Updated•6 years ago
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED