For the last several weeks, when Airflow has kicked off the weekly jobs, a large number of tasks have initially failed with a "killed as zombie" error message. This causes duplicate runs in some cases, as described in bug 1321424. We should determine why this is happening and fix it. One theory: when the weekly jobs run in addition to the usual daily jobs, they exceed the available resources at the Docker, ECS, or Airflow level (per the maximum allowed concurrent tasks).
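If the resource-ceiling theory is right, one place to check is Airflow's own concurrency limits in `airflow.cfg`. A sketch of the relevant knobs (the values below are illustrative, not our current settings):

```ini
[core]
# Maximum number of task instances that can run concurrently
# across the whole installation
parallelism = 32

# Maximum number of task instances allowed to run concurrently
# within a single DAG
dag_concurrency = 16
```

If the combined weekly-plus-daily task count pushes past these limits (or past the Docker/ECS memory limits on the workers), workers can get killed mid-task, which would match the zombie symptom.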
Here are the logs from each of the containers from midnight UTC through the time I logged on (~3:26 UTC): https://gist.github.com/sunahsuh/885d5ad29348150a91715b405b57e989. The broken pipe error in the scheduler during the heartbeat check coincides with the unexpectedly closed TCP connection in the RabbitMQ logs. Swapping RabbitMQ out for SQS would be pretty easy and would be the quickest way to eliminate that overhead, imo.
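For a Celery-backed Airflow deployment, that broker swap would mostly be a config change. A sketch, assuming worker credentials come from the instance's IAM role rather than keys embedded in the URL:

```ini
[celery]
# Point Celery at SQS instead of RabbitMQ; with no access key in the
# URL, credentials are picked up from the environment / IAM role
broker_url = sqs://
```

Note that SQS can only serve as the message broker, not as a Celery result backend, so results would still need to go to a separate store (e.g. the existing metadata database).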
Severity: normal → critical
Priority: -- → P2
It certainly looks like upping the resource limits fixed the immediate "killed as zombie" problem -- other "don't run jobs twice" remediation items are being tracked in bug 1321424.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED