Open Bug 1936349 Opened 2 months ago Updated 2 months ago

LandoAPI celery worker deploys failing frequently due to health check failures

Categories

(Conduit :: Lando, defect, P3)

Tracking

(Not tracked)

People

(Reporter: sheehan, Unassigned)

Details

During the last few LandoAPI deploys, Lando's Celery worker has failed to start in time because the newly spun-up pods are taking longer and longer to pass health checks. The resulting failures cause the deployment to be restarted, and Lando is unusable while that happens.

A similar issue with Lando's Celery worker resulted in bug 1935648, where emails weren't sent for a period of time due to the Celery worker being idle.

:emaydeck ran the readiness probe command lando-cli celery inspect ping and noticed it is taking ~35s to complete. The Celery pods have a 20s timeout on their readiness/liveness probes, so the probe times out before the command returns. We are going to extend the timeout so that k8s can see the pod is online and ready.

We extended the liveness probe timeout to 50 seconds and the worker deploy step completed in ~2m, instead of failing after ~25m.
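
For anyone who wants to reproduce the timing check, here is a minimal sketch in Python. It runs a probe command under a deadline and reports how long it took and whether it would have passed. The helper name time_probe is hypothetical and not part of Lando. Treating a timeout or a missing binary as a failed probe is an assumption about how an exec-style probe behaves, not something taken from the Lando deployment config.

```python
import subprocess
import time


def time_probe(command, timeout_seconds):
    """Run a probe command under a deadline.

    Returns (elapsed_seconds, passed). The probe counts as passed only if
    the command exits with status 0 before the deadline. A timeout or a
    missing executable is treated as a failure (assumption, see above).
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            command,
            capture_output=True,      # keep probe output off the terminal
            timeout=timeout_seconds,  # child is killed when this expires
        )
        passed = result.returncode == 0
    except subprocess.TimeoutExpired:
        passed = False
    except FileNotFoundError:
        passed = False
    elapsed = time.monotonic() - start
    return elapsed, passed
```

Run inside a worker pod, time_probe(["lando-cli", "celery", "inspect", "ping"], 20) compares the command against the original 20s timeout, and the same call with 50 compares it against the extended one. With the ~35s runtime reported above, the first would be expected to fail and the second to pass.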

We need to look into the cause of the slowness in Celery.
