Closed Bug 1734959 Opened 3 years ago Closed 3 years ago

firefoxci cluster appears to not be spinning up new workers

Categories

(Cloud Services :: Operations: Taskcluster, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mozilla, Unassigned)

References

Details

Attachments

(1 file)

(Thread in Slack #taskcluster-cloudops.)

Our worker pools appear to be a couple of hours backlogged: there are no errors related to spinning up workers, but no workers are spinning up either.

https://status.aws.amazon.com/ shows no errors. The gecko-3/decision errors page shows no errors. It seems like worker-manager may not be attempting to spin up workers, or is unable to?

Attached image GCP-outage.png

This GCP outage could be the culprit? (Ty :Owlish for the find!)

It's looking like we're terminating the worker scanner on too many timeouts; same with the provisioner.

(Same is true of the queue claim resolver and deadline resolver, so it isn't specific to the worker manager)

I'm wondering if the process.exit(1), combined with the GCE API timeouts or errors noted on the Google Cloud status page, may have resulted in the worker-manager process(es) exiting and not restarting. Is this our desired behavior?

(If so, I wonder if we should track error totals separately for AWS, Azure, and GCP?)
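For reference, the fail-fast pattern I'm describing looks roughly like this (a minimal sketch; the function names, threshold, and restart assumption are illustrative, not the actual worker-manager code):

  // Hypothetical sketch: after too many consecutive iteration failures the
  // process exits, relying on the orchestrator (Kubernetes) to restart it.
  const MAX_CONSECUTIVE_FAILURES = 5;

  async function runLoop(iterationFn, intervalMs) {
    let failures = 0;
    while (true) {
      try {
        await iterationFn();
        failures = 0; // reset the counter on any successful iteration
      } catch (err) {
        failures += 1;
        console.error(`iteration failed (${failures}/${MAX_CONSECUTIVE_FAILURES})`, err);
        if (failures >= MAX_CONSECUTIVE_FAILURES) {
          process.exit(1); // fail fast; assumes something restarts the pod
        }
      }
      await new Promise(resolve => setTimeout(resolve, intervalMs));
    }
  }

If the restart assumption does not hold, or if API timeouts from every cloud provider count against one shared threshold, an outage in a single provider could take the whole loop down, which is why splitting the counts per provider might help.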

Flags: needinfo?(jwhitlock)

I don't think the process exited, unless I'm misunderstanding how this fits together. The worker-manager-provisioner pod kept running after that iterate error, but it was only emitting what looks like heartbeat messages every 10 seconds.
One is "loop-interference":

  "jsonPayload": {
    "Type": "loop-interference",
    "Severity": 5,
    "Fields": {},
    "serviceContext": {
      "service": "worker-manager",
      "version": "43.2.0"
    },
    "Hostname": "taskcluster-worker-manager-provisioner-55f6b8f7b6-n94gl",
    "Pid": 1,
    "Logger": "taskcluster.worker-manager.provisioner",
    "Timestamp": 1633734424389000000,
    "EnvVersion": "2.0"
  },
  "timestamp": "2021-10-08T23:07:04.389764392Z",
  "severity": "NOTICE",
  "logName": "projects/moz-fx-taskcluster-prod-4b87/logs/stdout",
  "receiveTimestamp": "2021-10-08T23:07:08.766177735Z"
}

The other is monitor.periodic:

  "jsonPayload": {
    "Logger": "taskcluster.worker-manager.provisioner",
    "Hostname": "taskcluster-worker-manager-provisioner-55f6b8f7b6-n94gl",
    "Timestamp": 1633734424389000000,
    "Pid": 1,
    "serviceContext": {
      "version": "43.2.0",
      "service": "worker-manager"
    },
    "Type": "monitor.periodic",
    "EnvVersion": "2.0",
    "Severity": 5,
    "Fields": {
      "status": "success",
      "name": "provisioner",
      "duration": 0.181549,
      "v": 1
    }
  },
  "timestamp": "2021-10-08T23:07:04.390032326Z",
  "severity": "NOTICE",
  "logName": "projects/moz-fx-taskcluster-prod-4b87/logs/stdout",
  "receiveTimestamp": "2021-10-08T23:07:08.766177735Z"
}

Those repeated every 10 seconds, and a db-pool-counts message repeated every minute. Nothing else until I deleted the pod.

The issue was resolved at 02:40 UTC, thank you. Leaving the bug open for TC CloudOps to close out.

Flags: needinfo?(jwhitlock)

(In reply to chris valaas [:cvalaas] from comment #3)

I don't think the process exited, unless I'm misunderstanding how this fits together. The worker-manager-provisioner pod kept running after that iterate error, but it was only emitting what looks like heartbeat messages every 10 seconds.

Could this mean that the worker-manager daemon exited, but the pod kept running? If so,

  • should the pod exit when the daemon exits, or
  • should the pod restart the daemon via supervisor or other, or
  • should we get a louder notification somewhere so people know what happened?

If that's not what happened, I'm curious what happened.
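One possible shape for the second and third options, purely as an illustration and not how worker-manager is currently wired: the daemon could expose a small liveness endpoint that reports whether the provisioning loop has made progress recently, so a Kubernetes liveness probe could restart the pod when it has not. Names and thresholds below are hypothetical.

  // Illustrative sketch only: a liveness endpoint the provisioning loop
  // updates, so a Kubernetes liveness probe can restart the pod when the
  // loop stops making progress.
  const http = require('http');

  const MAX_SILENCE_MS = 10 * 60 * 1000; // treat the loop as stalled after 10 minutes
  let lastIterationAt = Date.now();

  // the provisioning loop would call this at the end of each iteration
  function markIteration() {
    lastIterationAt = Date.now();
  }

  http.createServer((req, res) => {
    const healthy = Date.now() - lastIterationAt < MAX_SILENCE_MS;
    res.writeHead(healthy ? 200 : 500);
    res.end(healthy ? 'ok' : 'provisioning loop stalled');
  }).listen(8080);

  module.exports = { markIteration };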

Flags: needinfo?(bwong)

In Matrix #taskcluster, Aki says:

I was looking for a retro, and possibly a followup fix if we determine we need one. We have a thread in #taskcluster-cloudops.
It's not clear to me what happened, and whether we want the code/configs to be as they currently are.

Since Matrix and Slack are temporary, I'm copying context here. My summary of that thread:

  • bwong agrees to a retro, driven by ops. He suggests "notify louder with telescope and then automate recovery when we see a pattern of service disruption"
  • cvalaas agrees, and is setting up a post-mortem for Oct 14. He was hoping for some more research from me (jwhitlock)

The original debugging thread had this additional data:

  • The gecko-3/decision pool provisions workers in AWS, and https://status.aws.amazon.com/ showed no failures.
  • owlish found several issues on https://status.cloud.google.com/. Aki posted the screenshot, showing three issues:
    • Multi region: Some Google Compute Engine API methods might be delayed or return errors. Mitigation is underway. The following GCE API methods might return errors or be unresponsive: compute.instanceGroupManagers.get, compute.instanceGroupManagers.listManagedInstances. For failed API calls, retries may complete successfully.
      • This message is being shown today as well, making it less likely to be an issue.
    • US Multi-region. Slowness in Cloud SQL federated queries. Queries against federated data sources via CloudSQL are exhibiting slowdown in US Multi-region.
      • This message is no longer shown. Taskcluster services use PostgreSQL, but I do not know if this is a federated data source in Google Cloud.
    • Global: Build failures impacting a subset of Google App Engine customers, due to the "Let's Encrypt" certificate expiration. Mitigation work by our engineering team is currently in progress. Impacted customers may be experiencing build failures and receive an error message related to "certificate verification failed" or "SSL connection could not be established".
      • The Let's Encrypt root cert expired on September 30th, 8 days before this incident, making it less likely to be the root cause. This warning is no longer shown.
  • owlish found a GKE warning message, "FailedScheduling 0/7 nodes are available: 7 Insufficient cpu.", around 6 PM (unknown timezone, assuming PST), but no other indications.
  • cvalaas found that taskcluster-worker-manager-provisioner spit out an error and then was silent for 3 hours. Restarting the pod appeared to work, and workers began to be provisioned again, but the provisioner later reported the same error:
    Error: Iteration exceeded maximum time allowed
        at Timeout.<anonymous> (/app/libraries/iterate/src/index.js:96:16)
        at listOnTimeout (internal/timers.js:554:17)
        at processTimers (internal/timers.js:497:7)
    
  • aki identified the code that raises the warning as coming from the iterate library, which is used in several places in the Taskcluster code (a simplified sketch of its timeout behavior follows below).
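For anyone unfamiliar with that library, the watchdog behind "Iteration exceeded maximum time allowed" boils down to racing each iteration against a timer, roughly like this simplified sketch (function and option names here are illustrative, not the library's real API; see libraries/iterate in the Taskcluster repo for the actual implementation):

  // Simplified sketch of the watchdog pattern that produces
  // "Iteration exceeded maximum time allowed".
  function withIterationTimeout(handler, maxIterationTimeMs) {
    return async (...args) => {
      let timer;
      const timeout = new Promise((resolve, reject) => {
        timer = setTimeout(
          () => reject(new Error('Iteration exceeded maximum time allowed')),
          maxIterationTimeMs,
        );
      });
      try {
        // whichever settles first wins; a hung handler trips the timeout
        return await Promise.race([handler(...args), timeout]);
      } finally {
        clearTimeout(timer);
      }
    };
  }

Note that a rejection from the timer does not cancel the underlying handler; a hung iteration keeps running (or hanging) in the background, which matters for the stuck provisioningLoopAlive flag discussed below.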

I believe comments #2 and #3 above were after cvalaas restarted the pod and workers were provisioned again.

In developer instances, the worker-manager service runs in GCP as three Kubernetes deployments:

  • taskcluster-worker-manager-provisioner - The provisioner. It determines if workers should be started due to low capacity versus high pending tasks in the queue.
  • taskcluster-worker-manager-web - The API server for worker manager, called by newly provisioned workers for configuration
  • taskcluster-worker-manager-workerscanner - The worker scanner. It determines if workers have self-terminated.

The "Iteration exceeded maximum time allowed" was recorded in Sentry, taskcluster-firefox-ci project:

https://sentry.prod.mozaws.net/share/issue/65ac23dd995c4a68b0a89c4f62e27810/

There is not enough context to determine which code raised the error, but all recent errors (last 30 days) were from the worker-manager app, runProvisioner process. It occurs 0-5 times a day, and once on October 8th.

Two years ago, this was tracked as bug 1595844 and closed as an annoyance rather than an issue affecting provisioning.

See Also: → 1595844
Flags: needinfo?(bwong)

The GCP logs are in the taskcluster-prod project of firefox.gcp.mozilla.com, and can be viewed with this query in Logs Explorer:

resource.type="k8s_container"
resource.labels.project_id="moz-fx-taskcluster-prod-4b87"
resource.labels.cluster_name="taskcluster-firefoxcitc-v1"
resource.labels.namespace_name="firefoxcitc-taskcluster"
labels.k8s-pod/app_kubernetes_io/component="taskcluster-worker-manager-provisioner"
labels.k8s-pod/app_kubernetes_io/instance="firefoxcitc-taskcluster-taskcluster-1"
labels.k8s-pod/app_kubernetes_io/name="taskcluster-worker-manager"
labels.k8s-pod/app_kubernetes_io/part-of="taskcluster"

It is useful to add jsonPayload.Type to the summary line to see the log types. Also, GCP seems to have no mechanism for sharing or copying log views.

The log entries are steady up to Oct 8, 11 PM GMT, then quickly fall off:

  • Last simple-estimate notice at 23:00:54.323. These stop during the incident. Previously there were several a second.
  • Last worker-pool-provisioned notice at 23:00:54.323. Previously there were several a second, each directly after a simple-estimate. These stop during the incident.
  • A monitor.periodic notice at 23:00:54.323, then a monitor.periodic error about five minutes later at 23:06:04.342, followed by the monitor.error warning of "Error: Iteration exceeded maximum time allowed" (at Timeout). Before the incident these appear 4 times a minute, and afterwards 6 times a minute.
  • Last db-function-call notice at 23:01:04.566. Previously there were about 20 a minute.
  • There is a db-pool-counts notice at 23:05:47.619, after a sequence of 10. The next message is the monitor.periodic error. These continue at their normal rate of 2 a minute after the incident.
  • The loop-interference notices begin at 23:06:24.366 (during the incident) and continue at a rate of 6 per minute.

I think we can say the incident began at 23:01 GMT.

Between Oct 1 and Oct 11, the loop-interference log message appeared only during this incident. It is generated by this code:

https://github.com/taskcluster/taskcluster/blob/66941613e4242d69dd2aff3fb560359eb633d59c/services/worker-manager/src/provisioner.js#L66-L68

  async provision() {
    if (this.provisioningLoopAlive) {
      this.monitor.notice('loop-interference', {});
      return;
    }

This code should prevent two provision() runs from overlapping, but the provisioningLoopAlive flag appears to have gotten stuck set. Restarting the pod reset it.
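A sketch of how the guard is presumably intended to work (illustrative only; doProvisioning() is a hypothetical stand-in for the real provisioning work, not the actual method name):

  // Illustrative sketch, not the real provisioner: the flag blocks overlapping
  // provision() runs, and try/finally clears it even if an iteration throws.
  class ProvisionerSketch {
    constructor(monitor) {
      this.monitor = monitor;
      this.provisioningLoopAlive = false;
    }

    async doProvisioning() {
      // hypothetical stand-in for the real provisioning work
    }

    async provision() {
      if (this.provisioningLoopAlive) {
        this.monitor.notice('loop-interference', {});
        return;
      }
      this.provisioningLoopAlive = true;
      try {
        await this.doProvisioning();
      } finally {
        this.provisioningLoopAlive = false;
      }
    }
  }

Even with a finally block, the flag only clears when the awaited work settles; if an iteration hangs indefinitely (for example on an unresponsive cloud API call), the flag stays set and every subsequent iteration logs loop-interference, which matches what the logs show here.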

The last loop-interference notice was at 2021-10-09 02:11:46.465 GMT.

There were three occurrences of the monitor.periodic error followed by the monitor.error warning between Oct 1 and Oct 11:

  • 2021-10-07 11:14:31 GMT. No loop-interference notices. I do not recall any reported break in provisioning.
  • 2021-10-08 23:06:04 GMT. Followed by many loop-interference notices.
  • 2021-10-09 02:09:06 GMT. Followed by loop-interference notices for 2 minutes, which then stopped. It is possible those loop-interference messages were from the terminating pod.

I focused on these by adding this query term:

(jsonPayload.Type=("monitor.error" OR "loop-interference")) OR (jsonPayload.Type="monitor.periodic" AND severity=ERROR)

Retro published at https://github.com/taskcluster/taskcluster-retrospectives/blob/master/retrospectives/2021-10-08-worker-manager-underprovisioning.md, with next actions identified.

Aki, are you ready to close this bug?

Flags: needinfo?(aki)

Yes, thank you all!

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(aki)
Resolution: --- → FIXED

This broke again on Tuesday, December 7th, in both community-tc and firefox-ci-tc. The instigating issue was a set of AWS problems in US-EAST-1; a timeout in the worker-provisioner led to rapid loops with the loop-interference message. Community-tc recovered with a forced cycle of taskcluster-worker-manager-provisioner. Firefox-ci-tc repeatedly got stuck until US-EAST-1 was back to normal.
