"rolling" updates of the auth service do not seem to roll
Categories
(Cloud Services :: Operations: Taskcluster, defect)
Tracking
(Not tracked)
People
(Reporter: dustin, Assigned: brian)
References
Details
Earlier today we had about a 10-second spate of 500 errors from the Auth service to the Queue service during the update to 29.4.1, and it seems there was another on deploying the fix for that release, from 2020-05-13 19:21:14.026 GMT to 2020-05-13 19:21:25.772 GMT. In both cases I see errors from the Queue service when trying to call the Auth service, but I don't see any errors or 500 responses in the Auth service's logs.
Is there something happening with the load balancer or nginx that is causing these brief outages? Is it something that could be remedied reasonably quickly?
Updated•5 years ago

Comment 1•5 years ago (Assignee)
The traffic flow is lb->nginx->services.
It's possible something is wrong with the deployment strategy or health checks and we don't have enough healthy pods in service. https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment explains how the rollingUpdate strategy works and can be tweaked.
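For reference, the knobs the linked doc describes live under the Deployment's `strategy` stanza. A sketch of a conservative configuration (names and the health endpoint here are illustrative, not our actual manifests):

```yaml
# Illustrative Deployment fragment, not the real Taskcluster manifest:
# never remove a healthy pod until its replacement is Ready.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: taskcluster-auth        # hypothetical name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0         # keep all existing pods serving during rollout
      maxSurge: 1               # bring up one extra pod at a time
  template:
    spec:
      containers:
        - name: auth
          readinessProbe:       # rollout only proceeds once this passes
            httpGet:
              path: /api/auth/v1/ping   # hypothetical health endpoint
              port: 80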
It's also possible the iprepd-nginx containers or services themselves are being killed while there are ongoing requests. https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods explains how individual pods are removed from serving traffic and killed.
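On the termination side, the usual mitigation sketched in that doc is a `preStop` hook plus a long enough grace period, so the pod's endpoint is withdrawn from the Service before the process actually receives SIGTERM (again illustrative, not our manifests):

```yaml
# Illustrative pod-spec fragment: delay SIGTERM so endpoint removal
# can propagate to the load balancer first.
spec:
  terminationGracePeriodSeconds: 60   # default is 30
  containers:
    - name: nginx
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "10"]  # brief pause before SIGTERM is delivered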
https://github.com/mozilla-services/cloudops-infra/pull/2073 is an attempt to fix this for iprepd-nginx. That actually isn't restarted during a typical taskcluster deploy, though.
Dustin, do taskcluster services have a concept of a graceful shutdown, i.e. no longer accepting new requests but finishing ones in flight? Could they be taught to do that when they receive SIGTERM?
I need to narrow it down to the time ranges in question, but here's a first pass at looking at load balancer 5xx errors and all nginx logs
SELECT
  httpRequest.status,
  jsonpayload_type_loadbalancerlogentry.statusdetails,
  count(httpRequest.status)
FROM `moz-fx-taskcluster-prod-4b87.log_storage.requests_20200513`
WHERE httpRequest.status >= 500
  AND resource.labels.url_map_name = "k8s-um-firefoxcitc-taskcluster-taskcluster-ingress--300229e3580"
GROUP BY httpRequest.status, jsonpayload_type_loadbalancerlogentry.statusdetails
LIMIT 1000;
status  statusdetails                                          count (f0_)
502     backend_connection_closed_before_data_sent_to_client   35662
502     failed_to_connect_to_backend                           5
502     response_sent_by_backend                               2785
500     websocket_handshake_failed                             4
500     response_sent_by_backend                               5634
502     websocket_handshake_failed                             14
502     backend_timeout                                        606
SELECT
  jsonPayload.proxy_host,
  jsonPayload.status,
  count(jsonPayload.status)
FROM `moz-fx-taskcluster-prod-4b87.log_storage.stdout_20200513`
WHERE resource.labels.container_name = "nginx"
  AND resource.labels.namespace_name = "firefoxcitc-taskcluster"
GROUP BY jsonPayload.proxy_host, jsonPayload.status
ORDER BY jsonPayload.proxy_host, jsonPayload.status
LIMIT 1000
proxy_host status count (f0_)
200 201664
302 206
auth 200 5586252
auth 204 3
auth 304 16
auth 403 180
auth 500 1040
auth 502 617
github 200 11
github 204 372
github 302 1
github 304 2
github 400 94
github 500 4
hooks 200 2804
hooks 403 21
hooks 404 24
index 200 371505
index 303 127860
index 400 3
index 401 108
index 403 9
index 404 8715
index 500 21
index 502 9
notify 200 1701
notify 400 238
purge-cache 200 95157
purge-cache 500 5
purge-cache 502 49
queue 200 18530916
queue 303 3162507
queue 304 226
queue 400 755
queue 401 4
queue 403 18
queue 404 507573
queue 409 592
queue 424 72
queue 499 370
queue 500 3383
queue 502 1904
references 200 3
secrets 200 94469
secrets 403 2
secrets 404 29342
secrets 500 22
secrets 502 28
ui 200 117128
ui 206 14
ui 304 791
ui 405 17
ui 500 2
ui 502 1
web-server 101 12297
web-server 200 57960
web-server 204 112
web-server 302 190
web-server 400 24
web-server 403 3
web-server 404 26
web-server 499 4
web-server 500 6
web-server 502 15
worker-manager 200 48657
worker-manager 400 41
worker-manager 403 1
worker-manager 499 2
worker-manager 500 10
worker-manager 502 12
Comment 2•5 years ago (Reporter)
> Dustin, do taskcluster services have a concept of a graceful shutdown, i.e. no longer accepting new requests but finishing ones in flight? Could they be taught to do that when they receive SIGTERM?
I don't know whether Express supports this kind of thing. It would be pretty neat! But I would have expected 502s in that case, and based on the logs I think these were 500s. That would need to be confirmed, though. I see both "Internal Server Error" and "Unknown Server Error" logged by the Queue when talking to Auth.
Comment 4•5 years ago (Assignee)
https://expressjs.com/en/advanced/healthcheck-graceful-shutdown.html suggests some avenues for getting Express to handle SIGTERM gracefully.
I'm inclined to say this isn't worth investigating further, since even if it recurs the window of errors is so short and clients should be resilient to them.
Comment hidden (Intermittent Failures Robot) (4 comments)