Closed
Bug 1324805
Opened 8 years ago
Closed 8 years ago
Fix Airflow "killed as zombie" problem
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mreid, Assigned: bugzilla)
References
Details
For the last several weeks, when Airflow kicked off the weekly jobs, a large number of tasks initially failed with a "killed as zombie" error message.
This causes duplicate runs in some cases, as described in bug 1321424.
We should determine why this is happening and fix it.
One theory: when the weekly jobs run in addition to the usual daily jobs, they exceed the available resources at the Docker, ECS, or Airflow level (e.g. the maximum number of concurrent tasks Airflow allows).
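If the concurrency theory holds, one mitigation on the Airflow side is to tune the scheduler and worker concurrency caps. A minimal sketch of the relevant airflow.cfg knobs for an Airflow 1.x CeleryExecutor deployment (the values here are hypothetical, not this pipeline's actual settings):

```ini
# airflow.cfg -- illustrative values only
[core]
# Max task instances running across the whole installation
parallelism = 32
# Max task instances allowed to run concurrently per DAG
dag_concurrency = 16
# Max active runs per DAG; keeps weekly and daily runs from piling up
max_active_runs_per_dag = 1

[celery]
# Max tasks each Celery worker will run at once
celeryd_concurrency = 16
```

Lowering these trades throughput for headroom: fewer tasks run at once, so a weekly-plus-daily spike is less likely to exhaust the workers.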
Here are the logs from each of the containers from midnight UTC through the time I logged on (~3:26 UTC):
https://gist.github.com/sunahsuh/885d5ad29348150a91715b405b57e989
The broken pipe error in the scheduler during the heartbeat check coincides with the unexpectedly closed TCP connection in the RabbitMQ logs.
Switching out RabbitMQ for SQS would be pretty easy and would be the quickest way to kill that overhead, IMO.
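For a CeleryExecutor setup, the SQS swap would mostly be a broker URL change. A hedged sketch, assuming AWS credentials come from the instance role (the empty-credential URL form is Celery's convention for that):

```ini
# airflow.cfg -- hypothetical sketch of the SQS swap
[celery]
# Use SQS as the Celery broker instead of RabbitMQ; with no access
# key embedded in the URL, Celery picks up AWS credentials from the
# environment or instance role
broker_url = sqs://
```

Note that SQS cannot act as a Celery result backend, so task results would still need to go to a database or similar.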
Updated•8 years ago
Severity: normal → critical
Priority: -- → P2
Updated•8 years ago
Priority: P2 → P1
It certainly looks like upping the resource limits fixed the immediate "killed as zombie" problem; other "don't run jobs twice" remediation items are being tracked in bug 1321424.
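For reference, the resource limits in question would typically live in the ECS task definition; a container that exceeds its hard memory limit is killed by ECS, and a worker dying mid-task is exactly the kind of event Airflow reports as a zombie. A hypothetical fragment (family, names, and numbers are illustrative only, not the actual deployment's values):

```json
{
  "family": "airflow-worker",
  "containerDefinitions": [
    {
      "name": "worker",
      "image": "example/airflow-worker:latest",
      "cpu": 1024,
      "memory": 4096,
      "memoryReservation": 2048
    }
  ]
}
```

Here "memory" is the hard limit (the container is killed if it exceeds it) while "memoryReservation" is the soft amount ECS reserves for placement, so raising "memory" is what buys headroom against out-of-memory kills.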
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Cloud Services → Cloud Services Graveyard