Closed Bug 1324805 Opened 8 years ago Closed 8 years ago

Fix Airflow "killed as zombie" problem

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Assigned: bugzilla)

References

Details

For the last several weeks, when Airflow kicked off the weekly jobs, a large number of tasks initially failed with a "killed as zombie" error message. In some cases this caused duplicate runs, as described in bug 1321424. We should determine why this is happening and fix it. One theory: when the weekly jobs run in addition to the usual daily jobs, they exceed the available resources at the level of Docker, ECS, or Airflow itself (via the maximum allowed number of concurrent tasks).
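If the bottleneck is Airflow's own concurrency ceiling, the relevant knobs live in airflow.cfg. A minimal sketch of the settings we'd want to audit; the values below are hypothetical placeholders, not our current configuration:

```ini
[core]
# Maximum number of task instances the executor will run at once, cluster-wide.
parallelism = 64
# Maximum number of task instances allowed to run concurrently within one DAG.
dag_concurrency = 16
# Maximum active runs per DAG; keeps a late weekly run from stacking on the next one.
max_active_runs_per_dag = 1
```

If the weekly DAGs push the total task count past `parallelism` while daily DAGs are still running, tasks can sit queued long enough for the scheduler's heartbeat logic to flag them as zombies.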
Blocks: 1269754
See Also: → 1321424
Here are the logs from each of the containers from midnight UTC through the time I logged on (~3:26 UTC): https://gist.github.com/sunahsuh/885d5ad29348150a91715b405b57e989. The broken pipe error in the scheduler during the heartbeat check coincides with the unexpectedly closed TCP connection in the rabbitmq logs. Switching from rabbitmq to SQS would be pretty easy and would be the quickest way to kill that overhead, imo.
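For reference, Celery (which backs Airflow's CeleryExecutor) supports SQS as a broker via an `sqs://` broker URL, so the swap would mostly be a config change. A rough sketch of what it might look like in airflow.cfg, assuming credentials come from the instance's IAM role; the result backend URL is a hypothetical example:

```ini
[celery]
# Empty credentials in the URL let boto fall back to the instance role.
broker_url = sqs://
# SQS cannot act as a result backend, so results stay in the metadata DB.
result_backend = db+postgresql://airflow:airflow@localhost/airflow
```

The upside is that there is no broker process of our own to keep alive (no heartbeat/TCP failures like the ones in the rabbitmq logs above); the tradeoff is SQS's at-least-once delivery, so tasks should stay idempotent.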
Severity: normal → critical
Priority: -- → P2
Assignee: nobody → ssuh
Priority: P2 → P1
It certainly looks like upping the resource limits fixed the immediate "killed as zombie" problem -- other "don't run jobs twice" remediation items are being tracked in bug 1321424.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard