Closed Bug 1637725 Opened 5 years ago Closed 5 years ago

"rolling" updates of the auth service do not seem to roll

Categories

(Cloud Services :: Operations: Taskcluster, defect)

defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: dustin, Assigned: brian)

References

Details

Earlier today we had a roughly 10-second spate of 500 errors returned by the Auth service to the Queue service during the update to 29.4.1, and it seems another occurred while deploying the fix for that release, from 2020-05-13 19:21:14.026 GMT to 2020-05-13 19:21:25.772 GMT. In both cases I see errors from the Queue service when trying to call the Auth service, but I don't see any errors or 500 responses in the Auth service's logs.

Is there something happening with the load balancer or nginx that is causing these brief outages? Is it something that could be remedied reasonably quickly?

Assignee: edunham → bpitts

The traffic flow is lb->nginx->services.

It's possible something is wrong with the deployment strategy or health checks and we don't have enough healthy pods in service. https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-update-deployment explains how the rollingUpdate strategy works and can be tweaked.
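For reference, the rollingUpdate strategy can be tightened so capacity is never reduced during a rollout. This is a hypothetical Deployment fragment, not our actual manifest:

```yaml
# Hypothetical Deployment fragment (illustrative names/values).
# maxUnavailable: 0 keeps every old pod serving until its replacement
# passes its readiness probe; maxSurge: 1 allows one extra pod to be
# created during the transition.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
```

With these values, Kubernetes only removes an old pod after a new one is ready, so a rollout should never leave the Service with fewer healthy endpoints than before.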

It's also possible the iprepd-nginx containers or services themselves are being killed while there are ongoing requests. https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods explains how individual pods are removed from serving traffic and killed.
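If pods are being killed mid-request, one common mitigation is a preStop sleep so the endpoint controller and load balancer stop routing to the pod before it receives SIGTERM. A hypothetical pod spec fragment (the sleep duration is an assumption, not a measured value):

```yaml
# Hypothetical container spec fragment. The preStop hook runs before
# SIGTERM is sent, giving the load balancer time to drain the pod from
# its backend list while existing requests complete.
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: nginx
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "15"]
```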

https://github.com/mozilla-services/cloudops-infra/pull/2073 is an attempt to fix this for iprepd-nginx. That actually isn't restarted during a typical taskcluster deploy, though.

Dustin, do taskcluster services have a concept of a graceful shutdown, i.e., no longer accepting new requests but finishing ones in flight? Could they be taught to do that when they receive SIGTERM?

I need to narrow it down to the time ranges in question, but here's a first pass at looking at load balancer 5xx errors and all nginx logs

SELECT
  httpRequest.status,
  jsonpayload_type_loadbalancerlogentry.statusdetails,
  count(httpRequest.status)
FROM `moz-fx-taskcluster-prod-4b87.log_storage.requests_20200513`
WHERE httpRequest.status >= 500
  AND resource.labels.url_map_name = "k8s-um-firefoxcitc-taskcluster-taskcluster-ingress--300229e3580"
GROUP BY httpRequest.status, jsonpayload_type_loadbalancerlogentry.statusdetails
LIMIT 1000;

status	statusdetails	count
502	backend_connection_closed_before_data_sent_to_client	35662
502	failed_to_connect_to_backend	5
502	response_sent_by_backend	2785
500	websocket_handshake_failed	4
500	response_sent_by_backend	5634
502	websocket_handshake_failed	14
502	backend_timeout	606

SELECT
  jsonPayload.proxy_host,
  jsonPayload.status,
  count(jsonPayload.status)
FROM `moz-fx-taskcluster-prod-4b87.log_storage.stdout_20200513`
WHERE resource.labels.container_name = "nginx"
  AND resource.labels.namespace_name = "firefoxcitc-taskcluster"
GROUP BY jsonPayload.proxy_host, jsonPayload.status
ORDER BY jsonPayload.proxy_host, jsonPayload.status
LIMIT 1000

proxy_host	status	count
	200	201664
	302	206
auth	200	5586252
auth	204	3
auth	304	16
auth	403	180
auth	500	1040
auth	502	617
github	200	11
github	204	372
github	302	1
github	304	2
github	400	94
github	500	4
hooks	200	2804
hooks	403	21
hooks	404	24
index	200	371505
index	303	127860
index	400	3
index	401	108
index	403	9
index	404	8715
index	500	21
index	502	9
notify	200	1701
notify	400	238
purge-cache	200	95157
purge-cache	500	5
purge-cache	502	49
queue	200	18530916
queue	303	3162507
queue	304	226
queue	400	755
queue	401	4
queue	403	18
queue	404	507573
queue	409	592
queue	424	72
queue	499	370
queue	500	3383
queue	502	1904
references	200	3
secrets	200	94469
secrets	403	2
secrets	404	29342
secrets	500	22
secrets	502	28
ui	200	117128
ui	206	14
ui	304	791
ui	405	17
ui	500	2
ui	502	1
web-server	101	12297
web-server	200	57960
web-server	204	112
web-server	302	190
web-server	400	24
web-server	403	3
web-server	404	26
web-server	499	4
web-server	500	6
web-server	502	15
worker-manager	200	48657
worker-manager	400	41
worker-manager	403	1
worker-manager	499	2
worker-manager	500	10
worker-manager	502	12
Flags: needinfo?(dustin)

Dustin, do taskcluster services have a concept of a graceful shutdown, i.e., no longer accepting new requests but finishing ones in flight? Could they be taught to do that when they receive SIGTERM?

I don't know if Express supports this kind of thing. It would be pretty neat! But I would have expected 502s in that case, and I think, based on the logs, that these were 500s. That'd need to be confirmed, though. I see both "Internal Server Error" and "Unknown Server Error" logged by the Queue when talking to Auth.

Flags: needinfo?(dustin)

https://expressjs.com/en/advanced/healthcheck-graceful-shutdown.html has some avenues for investigating to get Express to handle SIGTERM gracefully.

I'm inclined to say this isn't worth investigating further, since even if it recurs the window of errors is so short and clients should be resilient to them.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
See Also: → 1674882