Fix Airflow "killed as zombie" problem


Status

RESOLVED FIXED
Priority: P1
Severity: critical
Reported: 2 years ago
Last modified: 2 days ago

People

(Reporter: mreid, Assigned: sunahsuh)

Tracking

(Blocks: 1 bug)


(Reporter)

Description

2 years ago
For the last several weeks, when Airflow has kicked off the weekly jobs, a large number of tasks have initially failed with a "killed as zombie" error message.

This causes duplicate runs in some cases, as described in bug 1321424.

We should determine why this is happening and fix it.

One theory: when the weekly jobs run on top of the usual daily jobs, they exceed the available resources at the Docker, ECS, or Airflow level (e.g. the maximum number of concurrent tasks Airflow allows).
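
If the bottleneck is at the Airflow level, the relevant knobs are the cluster-wide `parallelism` / `dag_concurrency` settings in airflow.cfg and the per-DAG limits set in the DAG definitions. A minimal sketch of the per-DAG side, assuming the Airflow 1.x Python API; the dag_id, dates, and numbers are placeholders, not our real settings:

```python
from datetime import datetime, timedelta

from airflow import DAG

# Hedged sketch of the per-DAG concurrency limits (Airflow 1.x API).
# The dag_id, dates, and numbers are placeholders, not our real settings.
dag = DAG(
    dag_id="weekly_rollup",              # hypothetical DAG id
    start_date=datetime(2017, 1, 1),
    schedule_interval="@weekly",
    concurrency=4,                       # max task instances running for this DAG at once
    max_active_runs=1,                   # never let two weekly runs overlap
    dagrun_timeout=timedelta(hours=12),  # time out a stuck run rather than letting it linger
)
```

Capping `max_active_runs` would also keep a delayed weekly run from overlapping the next one, which is one way duplicate work can sneak in.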
(Reporter)

Updated

2 years ago
Blocks: 1269754
See Also: → bug 1321424
(Assignee)

Comment 1

2 years ago
Here are the logs from each of the containers from midnight UTC through the time I logged on (~3:26 UTC):
https://gist.github.com/sunahsuh/885d5ad29348150a91715b405b57e989

The broken pipe error in the scheduler during the heartbeat check coincides with the unexpectedly closed TCP connection in the RabbitMQ logs.

Switching RabbitMQ out for SQS would be pretty easy and would be the quickest way to eliminate that overhead, in my opinion.
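
For what it's worth, a hedged sketch of the general shape of that change: Celery (which the CeleryExecutor sits on top of) can talk to SQS directly, so there is no broker process of our own to babysit. The app name, region, and queue prefix below are placeholders, and the SQS transport needs Celery's SQS extras installed (e.g. `celery[sqs]`):

```python
from celery import Celery

# Hedged sketch: Celery can use SQS as its broker instead of RabbitMQ, so there is no
# broker process of our own to keep alive. Requires Celery's SQS extras (celery[sqs]).
# The app name, region, and queue prefix are placeholders.
app = Celery(
    "airflow_tasks",
    broker="sqs://",  # AWS credentials come from the environment / instance role
)
app.conf.broker_transport_options = {
    "region": "us-east-1",            # hypothetical AWS region
    "queue_name_prefix": "airflow-",  # hypothetical prefix for the SQS queues Celery creates
}
```

In Airflow itself this would amount to pointing the CeleryExecutor's broker URL (the [celery] section of airflow.cfg) at SQS instead of the RabbitMQ host.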
Severity: normal → critical
Priority: -- → P2
(Assignee)

Updated

2 years ago
Assignee: nobody → ssuh
Priority: P2 → P1
(Assignee)

Comment 2

2 years ago
It certainly looks like upping the resource limits fixed the immediate "killed as zombie" problem; the other "don't run jobs twice" remediation items are being tracked in bug 1321424.
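
For context, if the limits in question were the ECS container CPU/memory reservations, that kind of bump looks roughly like the boto3 sketch below; the task family, container name, image, and sizes are placeholders, not what we actually run:

```python
import boto3

# Hedged sketch: register a new revision of the worker task definition with higher
# CPU/memory. Family, container name, image, and sizes are placeholders.
ecs = boto3.client("ecs")
ecs.register_task_definition(
    family="airflow-worker",  # hypothetical task definition family
    containerDefinitions=[
        {
            "name": "worker",
            "image": "example/airflow-worker:latest",  # hypothetical image
            "cpu": 2048,      # CPU units (2 vCPUs)
            "memory": 8192,   # hard memory limit in MiB
            "essential": True,
        }
    ],
)
```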
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED

Updated

2 days ago
Product: Cloud Services → Cloud Services Graveyard