Closed Bug 781159 Opened 8 years ago Closed 5 years ago
mark tegras as dead when they timeout on 2 or more steps during an automation run
We continue to see a lot of reds in our automation. One of the highest-frequency problems is a device that is truly dead but that we keep scheduling jobs on. In fact, 20 of the 51 red jobs that I looked at on Monday came from two devices that we continually scheduled jobs on. My proposal here is that if a tegra hits the timeout on 2 or more steps, we push it into a dead pool, which we can then try to remediate manually or automatically. By a timeout I don't mean a failure; I mean hitting the really long maximum time:
verify.py: 1200 seconds
reboot.py: 1800 seconds
mochitest/reftest: 2400 seconds
Since these steps time out and are terminated by buildbot, I am unable to detect them in the harness or in things like sut_tools.
Unfortunately, I don't know a good way to identify this in our normal/current automation paths. Dustin, is there a way to identify with buildbot that "previous step(s) timed out", even if we have to check with specific tests, or some other way to do this with buildbot? For clarity, the current way to take tegras out of production is to create a file (error.flg) on the foopy; the file should contain a human-readable string that identifies why the device has an error.
A status listener could do this - think of how MailNotifier works.
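A rough sketch of the rule proposed above, in Python. It only shows the decision and the error.flg write, not how buildbot reports a timeout. The function names, the (step name, timed_out) pairs, and the per-device directory layout are all assumptions made for illustration; nothing here is an existing sut_tools or buildbot API.

```python
import os

# Threshold from the proposal: 2 or more timed-out steps in one run.
TIMEOUT_THRESHOLD = 2


def timed_out_steps(steps):
    """Return the names of the steps flagged as timed out.

    `steps` is a list of (name, timed_out) pairs. How the flag is
    obtained from buildbot is left open (hypothetical input format).
    """
    return [name for name, timed_out in steps if timed_out]


def mark_dead_if_needed(device_dir, steps, threshold=TIMEOUT_THRESHOLD):
    """Write error.flg into device_dir if enough steps timed out.

    Returns True if the device was marked dead, False otherwise.
    device_dir is assumed to be the device's directory on the foopy.
    """
    names = timed_out_steps(steps)
    if len(names) < threshold:
        return False
    # error.flg should hold a human-readable reason for the error.
    reason = "Marked dead: %d steps hit the maximum timeout (%s)" % (
        len(names), ", ".join(names))
    with open(os.path.join(device_dir, "error.flg"), "w") as flag:
        flag.write(reason + "\n")
    return True
```

Something like a status listener could call mark_dead_if_needed at the end of a run, once it knows which steps were terminated for exceeding their maximum time.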
8 years ago
Priority: -- → P3
Product: mozilla.org → Release Engineering
Both tegras and pandas are dead.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Component: General Automation → General