Closed Bug 915457 Opened 12 years ago Closed 12 years ago

Triage tegras with no completed jobs within last 24 hours

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P1)

ARM
Android
task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: coop)

Details

Attachments

(1 file)

We currently have about 145 tegras that have not run or completed a job within the last 24 hours, and a large pending count (>1100) for tegra-run jobs. We need to triage this list, look for any systemic problems, and nurse these devices back to life as soon as possible.
Slave health report: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=tegra
Coop recovered most of these this past evening, so I concentrated on other related work instead of this bug. Handing it to him, pending when I get up ;-)

Coop managed the recovery by running a script loop across foopies/devices:
* Kill any hung/idle buildbot procs
* Force reboot the device

As of a few hours ago we had all but 40 devices up!
Assignee: bugspam.Callek → coop
Attached file resurrect_tegras.py
Here's the script I ran last night to resurrect the hung tegras. 'tegra_list' contained a list of hung tegras, as reported by slave_health. I'll get this script added to braindump today. I think we should turn back on kittenherder reboots for tegras in the short-term.
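The attachment itself is not reproduced in this bug, so the following is only a minimal sketch of the loop described in the comments (kill hung/idle buildbot procs, then force a reboot, for each device on its foopy). It is not the contents of resurrect_tegras.py. The command strings, the `foopy_for` mapping, the `run` callback, and the device/host names in the usage note are all hypothetical placeholders; only the name `tegra_list` comes from the comment above.

```python
from typing import Callable, Dict, Iterable

# Hypothetical placeholder commands: the real commands used by the attached
# script are not shown in this bug, so these strings are assumptions.
KILL_CMD = "kill-hung-buildbot {device}"
REBOOT_CMD = "force-reboot {device}"

# A runner takes (foopy_host, command) and returns True on success.
# How the command reaches the foopy (ssh or otherwise) is left to the caller.
Runner = Callable[[str, str], bool]


def resurrect(tegra_list: Iterable[str],
              foopy_for: Dict[str, str],
              run: Runner) -> Dict[str, str]:
    """Walk the list of hung tegras and try to recover each one.

    Returns a mapping of device name to a short outcome string.
    """
    results: Dict[str, str] = {}
    for device in tegra_list:
        foopy = foopy_for.get(device)
        if foopy is None:
            # Without a known foopy there is nowhere to send the commands.
            results[device] = "no foopy known"
            continue
        # Step 1: kill any hung/idle buildbot process for this device.
        # The result is ignored: there may simply be nothing to kill.
        run(foopy, KILL_CMD.format(device=device))
        # Step 2: force a reboot of the device.
        ok = run(foopy, REBOOT_CMD.format(device=device))
        results[device] = "rebooted" if ok else "reboot failed"
    return results


def dry_run(foopy: str, command: str) -> bool:
    """Runner that only prints what would be executed."""
    print(f"[{foopy}] {command}")
    return True
```

For a dry run, something like `resurrect(["tegra-001"], {"tegra-001": "foopy01"}, dry_run)` prints the two commands that would be sent and reports the device as rebooted. Passing the runner in as a callback keeps the loop itself free of any assumption about how foopies are reached.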
The tegra problem tracking bugs are a mess. I'll take some time today to try to resolve any open tegra bugs that shouldn't still be open, provided we don't hit another try-nado or similar apocalypse.
This is mostly cleaned up now. Tegras that were in the buildduty queue have all been nudged to their next state, whether that's recovery or production.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard