Closed Bug 781159 Opened 10 years ago Closed 7 years ago

mark tegras as dead when they timeout on 2 or more steps during an automation run


(Release Engineering :: General, defect, P3)



(Not tracked)



(Reporter: jmaher, Unassigned)



(Whiteboard: [tegra])

Our automation continues to produce a lot of red jobs. One of the highest-frequency problems is a device that is truly dead but keeps getting jobs scheduled on it. In fact, 20 of the 51 red jobs that I looked at on Monday came from two devices that we kept scheduling jobs on.

My proposal is that if a run hits the timeout on 2 or more steps, we push the tegra into a dead pool, which we can then try to remediate manually or automatically.
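
A minimal sketch of the proposed rule, in Python. The `StepResult` record and the `should_mark_dead` function are hypothetical names used only for illustration; nothing here is taken from the existing automation, and how `timed_out` gets populated is left open.

```python
from dataclasses import dataclass
from typing import Iterable

# Threshold from the proposal: 2 or more timed-out steps in one run.
TIMEOUT_STEP_THRESHOLD = 2


@dataclass
class StepResult:
    """Hypothetical record of one automation step."""
    name: str
    # True only when the step hit its maximum time, not on an ordinary failure.
    timed_out: bool


def should_mark_dead(steps: Iterable[StepResult],
                     threshold: int = TIMEOUT_STEP_THRESHOLD) -> bool:
    """Return True when enough steps in one run hit the timeout."""
    timed_out_steps = sum(1 for step in steps if step.timed_out)
    return timed_out_steps >= threshold
```

Counting only timeouts, rather than all failures, keeps ordinary test failures from pushing a healthy device into the dead pool.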

By a timeout, I don't mean a failure; I mean a step hitting its really long maximum time: 1200 seconds or 1800 seconds, depending on the step, and 2400 seconds for mochitest/reftest.

Since these steps time out and are terminated by buildbot, I am unable to detect them in the harness or in tools like sut_tools.
Blocks: 781162
Unfortunately I don't know a good way to identify this in our normal/current automation paths.

Dustin, is there a way to identify with buildbot that "previous step(s) timed out", even if we have to check specific tests, or some other way to do this with buildbot?

For clarity, the current way to take a tegra out of production is to create a file (error.flg) on the foopy. The file should contain a human-readable string that identifies why the device has an error.
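
As a rough illustration of that mechanism, the sketch below writes such a flag file. The function name `write_error_flag` and the idea of a per-device directory on the foopy are assumptions; the comment above only says that the file is named error.flg and holds a human-readable reason.

```python
import os


def write_error_flag(device_dir: str, reason: str,
                     filename: str = "error.flg") -> str:
    """Create the flag file with a human-readable reason and return its path.

    device_dir is a hypothetical per-device directory on the foopy.
    """
    path = os.path.join(device_dir, filename)
    with open(path, "w") as flag:
        # One line of text explaining why the device was taken out of production.
        flag.write(reason.rstrip("\n") + "\n")
    return path
```

A call might look like `write_error_flag(device_dir, "timed out on 2 or more steps in one run")`, where `device_dir` is whatever directory the foopy uses for that tegra.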
A status listener could do this - think of how MailNotifier works.
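
To make the listener idea concrete, here is a self-contained sketch of the bookkeeping such a listener would need. It deliberately does not use buildbot's API: the class `TimeoutListener`, its method names, and the `on_dead` callback are all hypothetical, and how a real listener would learn that a step timed out is not answered in this bug.

```python
from collections import defaultdict
from typing import Callable, Dict


class TimeoutListener:
    """Hypothetical listener that counts timed-out steps per device run."""

    def __init__(self, on_dead: Callable[[str, int], None],
                 threshold: int = 2) -> None:
        self._on_dead = on_dead          # called with (device, timeout count)
        self._threshold = threshold      # 2 or more, per the proposal
        self._timeouts: Dict[str, int] = defaultdict(int)

    def run_started(self, device: str) -> None:
        # Reset the counter so earlier runs do not carry over.
        self._timeouts[device] = 0

    def step_finished(self, device: str, timed_out: bool) -> None:
        if timed_out:
            self._timeouts[device] += 1

    def run_finished(self, device: str) -> bool:
        """Report the device as dead if the run reached the threshold."""
        count = self._timeouts.pop(device, 0)
        if count >= self._threshold:
            self._on_dead(device, count)
            return True
        return False
```

The `on_dead` callback is where the device would be taken out of production, for example by creating the error.flg file mentioned above.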
Priority: -- → P3
Whiteboard: [tegra]
Product: → Release Engineering
Both tegras and pandas are dead.
Closed: 7 years ago
Resolution: --- → WONTFIX
Component: General Automation → General