Closed Bug 1594720 Opened 6 years ago Closed 6 years ago

nightly balrogworkers dying with `python exited with signal -15`

Categories

(Release Engineering :: Release Automation, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mtabara, Assigned: mtabara)

References

Details

Attachments

(2 files)

In https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&selectedJob=275060958&searchStr=shippable%2Cmac&revision=d271c572a9bcd008ed14bf104b2eb81949952e4c we're seeing lots of balrogworkers dying with `python exited with signal -15`.

Ben suspects we might be getting killed for hitting the max run time or something similar, given that the number of attempts has now increased. We should look into that.

E.g. this task says `worker-shutdown`: https://tools.taskcluster.net/groups/TioxwyvvR3WUqVpbUrze9A/tasks/SgjgfuEWQHyDA6qLVkN-eg/runs/0

signal -15 is SIGTERM, so yeah, looks like we're probably getting killed off.
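
For reference, a minimal sketch of why a negative exit status maps to a signal. It assumes a POSIX host and uses only Python's standard library: `subprocess` reports a child killed by signal N as return code -N. The helper name `describe_exit` is made up for illustration and is not part of scriptworker or balrogscript.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a readable description.

    On POSIX, a negative return code -N means the child was killed by signal N.
    """
    if returncode < 0:
        return signal.Signals(-returncode).name
    return f"exit code {returncode}"


def demo():
    # Start a child that would sleep for a minute, then send it SIGTERM,
    # which is what a container or worker shutdown typically sends first.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.terminate()
    child.wait()
    print(child.returncode, describe_exit(child.returncode))


if __name__ == "__main__":
    demo()
```

So a -15 here means something outside the script sent SIGTERM, rather than the script failing on its own.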

there are a lot of retries in there, and a mix of 400 and 502 errors.

this task ran for 28 minutes before failing :\

For now, let's drop the number of balrogworkers from 45 to 25, and also reduce the number of attempts from 20 to 12.
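
A rough sketch of why fewer attempts should help, assuming exponential backoff between attempts and a per-request timeout. The function name and every number passed to it (60 s timeout, 1 s base delay, 60 s delay cap) are hypothetical placeholders, not the values balrogscript actually uses; only the attempt counts 20 and 12 come from this bug.

```python
def worst_case_seconds(attempts, request_timeout, base_delay, max_delay):
    """Upper bound on wall time if every attempt times out.

    Each attempt costs one request_timeout; between attempts we sleep with
    exponential backoff (base_delay * 2**n), capped at max_delay.
    """
    total = 0.0
    for attempt in range(1, attempts + 1):
        total += request_timeout
        if attempt < attempts:  # no sleep after the final attempt
            total += min(base_delay * 2 ** (attempt - 1), max_delay)
    return total


if __name__ == "__main__":
    # Hypothetical timeout and backoff values, for comparison only.
    for attempts in (20, 12):
        minutes = worst_case_seconds(attempts, 60, 1, 60) / 60
        print(f"{attempts} attempts: up to {minutes:.1f} minutes")
```

If the worst case for a single submission can exceed the task's max run time, the worker gets killed mid-retry, which would match what we see.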

These two PRs track this:
https://github.com/mozilla-releng/scriptworker-scripts/pull/68
https://github.com/mozilla-releng/k8s-autoscale/pull/65

Attachment #9109015 - Attachment description: [scriptworker-scripts → [scriptworker-scripts] Reduce the no of attempts to avoid getting the container killer
See Also: → 1591373

We should stop seeing this type of error now that the changes are deployed to all workers.
Please re-open if this happens again.
If it does, we might need to look into extending max-run-time for balrogworkers or beefing up the instances.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Release Automation: Updates → Release Automation