firefoxci cluster appears to not be spinning up new workers
Categories
(Cloud Services :: Operations: Taskcluster, defect, P1)
Tracking
(Not tracked)
People
(Reporter: mozilla, Unassigned)
References
Details
Attachments
(1 file)
311.68 KB,
image/png
(Thread in Slack #taskcluster-cloudops.)
Our worker pools appear to be a couple hours backlogged, with no errors related to spinning up workers, but no workers spinning up.
- 62 pending gecko-3/decision tasks
- most recent task started 2 hours ago
- 0 capacity, 62 pending -- and other pools have hundreds of pending tasks.
https://status.aws.amazon.com/ shows no errors. The gecko-3/decision errors page shows no errors. It seems like worker-manager may not be attempting to spin up workers, or is unable to?
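A pool's pending count like the one above can be checked against the queue service. This is a hedged sketch, assuming the v1 REST layout of the queue's pendingTasks endpoint and that the pool name maps to provisionerId/workerType as `gecko-3`/`decision`:

```javascript
// Build the URL for the queue service's pendingTasks endpoint.
// Assumption: the deployment exposes it at /api/queue/v1/pending/...
// (names and path are illustrative, not confirmed from this bug).
function pendingTasksUrl(rootUrl, provisionerId, workerType) {
  return `${rootUrl}/api/queue/v1/pending/` +
    `${encodeURIComponent(provisionerId)}/${encodeURIComponent(workerType)}`;
}

// e.g. fetch(pendingTasksUrl('https://firefox-ci-tc.services.mozilla.com',
//                            'gecko-3', 'decision'))
//        .then(r => r.json())
//        .then(body => console.log(body.pendingTasks));
```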
Reporter
Comment 1•3 years ago
This GCP outage could be the culprit? (Ty :Owlish for the find!)
Reporter
Comment 2•3 years ago
It's looking like we're terminating the worker scanner on too many timeouts; same with the provisioner.
(Same is true of the queue claim resolver and deadline resolver, so it isn't specific to the worker manager)
I'm wondering if the process.exit(1), combined with the GCE API timeouts or errors from the gcloud status page, may have resulted in the worker-manager process(es) exiting and not restarting. Is this our desired behavior?
(if so, I wonder if we should separate AWS from Azure from GCP in terms of total number of errors?)
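The behavior being described (terminating a service after too many consecutive timeouts) can be sketched roughly as follows. This is illustrative only, not the actual worker-manager code; `makeErrorTracker`, `maxFailures`, and `onFatal` are invented names for the sketch:

```javascript
// Hedged sketch: count consecutive iteration failures and give up
// (e.g. via process.exit(1)) once a threshold is crossed, resetting
// the count on any successful iteration.
function makeErrorTracker(maxFailures, onFatal) {
  let consecutive = 0; // consecutive failures seen so far
  return {
    failure() {
      consecutive += 1;
      if (consecutive >= maxFailures) {
        onFatal(consecutive); // in production this might be process.exit(1)
      }
    },
    success() {
      consecutive = 0; // a good iteration clears the streak
    },
  };
}
```

If the threshold counts errors from all clouds together, a sustained outage in one provider (as suggested above for GCE) could push the combined count over the limit, which is why separating AWS/Azure/GCP counters is worth considering.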
Comment 3•3 years ago
I don't think the process exited, unless I'm misunderstanding how this fits together. The worker-manager-provisioner pod kept running after that iterate error, but it was only emitting what looks like heartbeat messages every 10 seconds.
One is "loop-interference":
{
"jsonPayload": {
"Type": "loop-interference",
"Severity": 5,
"Fields": {},
"serviceContext": {
"service": "worker-manager",
"version": "43.2.0"
},
"Hostname": "taskcluster-worker-manager-provisioner-55f6b8f7b6-n94gl",
"Pid": 1,
"Logger": "taskcluster.worker-manager.provisioner",
"Timestamp": 1633734424389000000,
"EnvVersion": "2.0"
},
"timestamp": "2021-10-08T23:07:04.389764392Z",
"severity": "NOTICE",
"logName": "projects/moz-fx-taskcluster-prod-4b87/logs/stdout",
"receiveTimestamp": "2021-10-08T23:07:08.766177735Z"
}
The other is monitor.periodic:
{
"jsonPayload": {
"Logger": "taskcluster.worker-manager.provisioner",
"Hostname": "taskcluster-worker-manager-provisioner-55f6b8f7b6-n94gl",
"Timestamp": 1633734424389000000,
"Pid": 1,
"serviceContext": {
"version": "43.2.0",
"service": "worker-manager"
},
"Type": "monitor.periodic",
"EnvVersion": "2.0",
"Severity": 5,
"Fields": {
"status": "success",
"name": "provisioner",
"duration": 0.181549,
"v": 1
}
},
"timestamp": "2021-10-08T23:07:04.390032326Z",
"severity": "NOTICE",
"logName": "projects/moz-fx-taskcluster-prod-4b87/logs/stdout",
"receiveTimestamp": "2021-10-08T23:07:08.766177735Z"
}
Those repeat every 10 seconds, and a db-pool-counts message repeated every minute. Nothing else until I deleted the pod.
Comment 4•3 years ago
The issue got resolved at 02:40 UTC, thank you. Leaving the bug open for TC CloudOps to resolve.
Updated•3 years ago
Reporter
Comment 5•3 years ago
(In reply to chris valaas [:cvalaas] from comment #3)
> I don't think the process exited, unless I'm misunderstanding how this fits together. The worker-manager-provisioner pod kept running after that iterate error, but it was only emitting what looks like heartbeat messages every 10 seconds.
Could this mean that the worker-manager daemon exited, but the pod kept running? If so,
- should the pod exit when the daemon exits, or
- should the pod restart the daemon via supervisor or other, or
- should we get a louder notification somewhere so people know what happened?
If that's not what happened, I'm curious what happened.
Reporter
Updated•3 years ago
Comment 6•3 years ago
In Matrix #taskcluster, Aki says:
I was looking for a retro, and possibly a followup fix if we determine we need one. We have a thread in #taskcluster-cloudops.
It's not clear to me what happened, and whether we want the code/configs to be as they currently are.
Since Matrix and Slack are temporary, I'm copying context here. My summary of that thread:
- bwong agrees to a retro, driven by ops. He suggests "notify louder with telescope and then automate recovery when we see a pattern of service disruption"
- cvalaas agrees, and is setting up a post-mortem for Oct 14. He was hoping for some more research from me (jwhitlock)
The original debugging thread had this additional data:
- The gecko-3/decision pool provisions workers in AWS, and https://status.aws.amazon.com/ showed no failures.
- owlish found several issues on https://status.cloud.google.com/. Aki posted the screenshot, showing three issues:
- Multi region: Some Google Compute Engine API methods might be delayed or return errors. Mitigation is underway. The following GCE API methods might return errors or be unresponsive: compute.instanceGroupManagers.get, compute.instanceGroupManagers.listManagedInstances. For failed API calls, retries may complete successfully.
- This message is being shown today as well, making it less likely to be an issue.
- US Multi-region. Slowness in Cloud SQL federated queries. Queries against federated data sources via CloudSQL are exhibiting slowdown in US Multi-region.
- This message is no longer shown. Taskcluster services use PostgreSQL, but I do not know if this is a federated data source in Google Cloud.
- Global: Build failures impacting a subset of Google App Engine customers, due to the "Let's Encrypt" certificate expiration. Mitigation work by our engineering team is currently in progress. Impacted customers may be experiencing build failures and receive an error message related to "certificate verification failed" or "SSL connection could not be established".
- The Let's Encrypt root cert expired on September 30th, 8 days previously, making it less likely to be the root cause of this incident. This warning is no longer shown.
- owlish found a GKE warning message, FailedScheduling 0/7 nodes are available: 7 Insufficient cpu., around 6 PM (unknown timezone, assuming PST), but no other indications
- cvalaas found that taskcluster-worker-manager-provisioner spit out an error and then was silent for 3 hours. Restarting the pod appeared to work, and workers began to be provisioned again, but then reported the same error:
Error: Iteration exceeded maximum time allowed
    at Timeout.<anonymous> (/app/libraries/iterate/src/index.js:96:16)
    at listOnTimeout (internal/timers.js:554:17)
    at processTimers (internal/timers.js:497:7)
- aki identified the code that raised the warning as coming from the iterate library, used in several places in the Taskcluster code.
I believe comments #2 and #3 above were after cvalaas restarted the pod and workers were provisioned again.
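For context, the pattern that produces "Iteration exceeded maximum time allowed" can be sketched as a watchdog timer armed around each iteration of a long-running loop. This is a simplified illustration, not the actual taskcluster iterate library; `watchIteration` and `maxMs` are invented names:

```javascript
// Hedged sketch of an iteration watchdog: run one iteration body (fn),
// but reject with the familiar error if it takes longer than maxMs.
function watchIteration(fn, maxMs) {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error('Iteration exceeded maximum time allowed')),
      maxMs);
    Promise.resolve()
      .then(fn)
      .then(
        value => { clearTimeout(timer); resolve(value); },
        err => { clearTimeout(timer); reject(err); });
  });
}
```

Under this pattern, a hung GCE API call inside the iteration body would surface as exactly this timeout error rather than as a cloud-provider error, which matches the symptom of no errors on the provider status pages.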
In developer instances, the worker-manager service runs in GCP as three Kubernetes deployments:
- taskcluster-worker-manager-provisioner: the provisioner. It determines if workers should be started due to low capacity versus high pending tasks in the queue.
- taskcluster-worker-manager-web: the API server for worker manager, called by newly provisioned workers for configuration.
- taskcluster-worker-manager-workerscanner: the worker scanner. It determines if workers have self-terminated.
The "Iteration exceeded maximum time allowed" error was recorded in Sentry, in the taskcluster-firefox-ci project:
https://sentry.prod.mozaws.net/share/issue/65ac23dd995c4a68b0a89c4f62e27810/
There is not enough context to determine which code raised the error, but all recent errors (last 30 days) were from the worker-manager app, runProvisioner process. It occurs 0-5 times a day, and occurred once on October 8th.
Two years ago, this was tracked as bug 1595844 and closed as an annoyance rather than an issue affecting provisioning.
Updated•3 years ago
Comment 7•3 years ago
The GCP logs are in the taskcluster-prod project of firefox.gcp.mozilla.com, and can be viewed with this query in Logs Explorer:
resource.type="k8s_container"
resource.labels.project_id="moz-fx-taskcluster-prod-4b87"
resource.labels.cluster_name="taskcluster-firefoxcitc-v1"
resource.labels.namespace_name="firefoxcitc-taskcluster"
labels.k8s-pod/app_kubernetes_io/component="taskcluster-worker-manager-provisioner"
labels.k8s-pod/app_kubernetes_io/instance="firefoxcitc-taskcluster-taskcluster-1"
labels.k8s-pod/app_kubernetes_io/name="taskcluster-worker-manager"
labels.k8s-pod/app_kubernetes_io/part-of="taskcluster"
It is useful to add jsonPayload.Type to the summary line to see the log types. Also, GCP seems to have no mechanism for sharing log views or copying.
The log entries are steady at Oct 8, 11 PM GMT, but quickly fall off:
- Last simple-estimate notice at 23:00:54.323. These stop during the incident. Previously, several a second.
- Last worker-pool-provisioned notice at 23:00:54.323. Previously, several a second, directly after a simple-estimate. These stop during the incident.
- Last monitor.periodic notice at 23:00:54.323, then a monitor.periodic error 6 minutes later at 23:06:04.342, followed by the monitor.error warning of "Error: Iteration exceeded maximum time allowed at Timeout". Before the incident, they are 4 times a minute, and after they are 6 times a minute.
- Last db-function-call notice at 23:01:04.566. Previously, there are about 20 a minute.
- There is a db-pool-counts notice at 23:05:47.619, after a sequence of 10. The next message is the monitor.periodic error. These continue at their normal rate of 2 a minute after the incident.
- The loop-interference notice begins at 23:06:24.366 (during the incident) and continues at a rate of 6 per minute.
I think we can say the incident began at 23:01 GMT.
The loop-interference log message only appeared during this incident from Oct 1 to Oct 11. It is generated by this code:
async provision() {
  if (this.provisioningLoopAlive) {
    this.monitor.notice('loop-interference', {});
    return;
  }
  // ...
}
This code should prevent two provision() loops from running at the same time, but appears to have gotten stuck. Restarting the pod reset it. The last loop-interference notice was at 2021-10-09 02:11:46.465 GMT.
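One way such a boolean guard wedges is if an exception or a never-resolving await skips the code that clears the flag. A minimal sketch of the failure mode and the usual try/finally remedy (illustrative names; `notices` stands in for this.monitor.notice, and `provisionAll` for the real per-pool work):

```javascript
// Hedged sketch of a re-entrancy guard around an async loop body.
class Provisioner {
  constructor() {
    this.provisioningLoopAlive = false;
    this.notices = []; // stand-in for monitor.notice()
  }

  async provision() {
    if (this.provisioningLoopAlive) {
      // A second invocation while one is "alive" logs loop-interference.
      this.notices.push('loop-interference');
      return;
    }
    this.provisioningLoopAlive = true;
    try {
      await this.provisionAll();
    } finally {
      // Without this finally, a thrown or hung provisionAll() leaves the
      // flag set forever, and every later tick logs loop-interference.
      this.provisioningLoopAlive = false;
    }
  }

  async provisionAll() {} // stand-in for the real provisioning work
}
```

Note that even with try/finally, a provisionAll() that never settles keeps the flag set, which matches the symptom here: repeated loop-interference notices until the pod was restarted.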
There were three incidences of the monitor.periodic error followed by the monitor.error warning from Oct 1 to Oct 11:
- 2021-10-07 11:14:31 GMT. No loop-interference notices. I do not recall any reported break in provisioning.
- 2021-10-08 23:06:04 GMT. Followed by many loop-interference notices.
- 2021-10-09 02:09:06 GMT. Followed by loop-interference notices for 2 minutes, which then stopped. It is possible the loop-interference messages were from the terminating pod.
I focused on these by adding this query term:
(jsonPayload.Type=("monitor.error" OR "loop-interference")) OR (jsonPayload.Type="monitor.periodic" AND severity=ERROR)
Comment 8•3 years ago
Retro published at https://github.com/taskcluster/taskcluster-retrospectives/blob/master/retrospectives/2021-10-08-worker-manager-underprovisioning.md, with next actions identified.
Aki, are you ready to close this bug?
Reporter
Comment 9•3 years ago
Yes, thank you all!
Comment 10•3 years ago
This broke again on Tuesday, December 7th, in both community-tc and firefox-ci-tc. The instigating issue was some AWS issues in US-EAST-1, and a timeout in the worker-provisioner led to rapid loops with the loop-interference message. Community-tc recovered with a forced cycle of taskcluster-worker-manager-provisioner. Firefox-ci-tc repeatedly got stuck until US-EAST-1 was back to normal.