All trees closed due to high AWS pending test backlog

RESOLVED FIXED

Status

Release Engineering
Buildduty
--
blocker
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: RyanVM, Assigned: Callek)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

3 years ago
I'm seeing high numbers of pending AWS tests across all trees, and Nagios is alerting in #buildduty about high numbers of pending jobs.

All trees closed, including Gaia.

Comment 1

3 years ago
New instances are failing to start buildbot because runner can't find hg via hgtool.py. We're bleeding capacity as old instances terminate themselves to pick up the new image.
Clarifying: this is a configuration issue on the AWS machines (they can't find the hgtool.py script), not an issue interacting with hg.m.o.
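A pre-flight check along these lines could catch this kind of misconfiguration before an instance tries to start buildbot. This is only a sketch: the function name `find_missing_tools` and the tool list are assumptions for illustration, and the real runner configuration is not shown in this bug.

```python
import shutil


def find_missing_tools(required, search_path=None):
    """Return the names in `required` that cannot be resolved to an executable.

    `search_path` is an optional os.pathsep-separated directory list; when it
    is None, the PATH environment variable is used.
    """
    missing = []
    for name in required:
        # shutil.which returns None when no executable with this name is found.
        if shutil.which(name, path=search_path) is None:
            missing.append(name)
    return missing


# Hypothetical tool list; prints whichever of these names this machine lacks.
print(find_missing_tools(["hg", "hgtool.py"]))
```

A startup task could refuse to launch buildbot whenever the returned list is non-empty, so a bad image fails loudly instead of silently dropping out of the pool.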
(Assignee)

Comment 3

3 years ago
Created attachment 8534536 [details] [diff] [review]
fix

This patch should fix the root of the problem

In parallel, :rail is reverting the golden AMIs to yesterday's ones, which will avoid this bustage altogether.
Assignee: nobody → bugspam.Callek
Status: NEW → ASSIGNED
Attachment #8534536 - Flags: review?(winter2718)
Attachment #8534536 - Flags: review+
(Reporter)

Comment 4

3 years ago
As an update, we're just waiting for the current pending backlog to come down before reopening. Rail's revert is working for getting the line moving in the right direction :)
(Reporter)

Comment 5

3 years ago
Backlog is looking better and new Linux test jobs appear to be starting reasonably fast now. I'm reopening everything.
(Assignee)

Updated

3 years ago
Attachment #8534536 - Flags: review?(winter2718)
(Assignee)

Comment 7

3 years ago
Cautiously optimistic here, marking as fixed.

We'll know for sure after tomorrow's AMIs get generated.

A link that showed the problem today: https://www.hostedgraphite.com/da5c920d/grafana/#/dashboard/temp/e5db589335c850ef95f52b85c2585442aa61c401?panelId=5&fullscreen
(Assignee)

Updated

3 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(Assignee)

Comment 8

3 years ago
and I landed an unsaved version, and tested said version -- thus I caused bustage.

The fix:
https://hg.mozilla.org/build/puppet/rev/01a37f44eafe
https://hg.mozilla.org/build/puppet/rev/521aa8dd8a02
I don't like the conditional here: depending on install order, /usr/bin/hg may end up pointing to either the releng hg or the system hg.

Maybe the two packages should explicitly conflict, so that only one can be installed?
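To make the install-order concern checkable on a given machine, something like the following could report which installation /usr/bin/hg actually resolves to. It is a sketch only: `classify_hg` is a hypothetical helper, and the releng install prefix is an assumption the caller must supply, since this bug does not say where the releng hg lives.

```python
import os


def classify_hg(hg_path, releng_prefix):
    """Report which installation `hg_path` resolves to.

    Returns "missing", "releng", or "system". `releng_prefix` is the directory
    the releng hg package installs under (an assumption; not given in this bug).
    """
    # os.path.exists is False for a dangling symlink as well as a missing file.
    if not os.path.exists(hg_path):
        return "missing"
    # realpath follows symlinks, so a symlinked hg is traced to its real target.
    target = os.path.realpath(hg_path)
    prefix = os.path.realpath(releng_prefix)
    if os.path.commonpath([target, prefix]) == prefix:
        return "releng"
    return "system"
```

Calling `classify_hg("/usr/bin/hg", prefix)` with the real releng prefix on a freshly built image would show which package won, regardless of the order in which the two were installed.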
Flags: needinfo?(bugspam.Callek)