Closed Bug 716800 Opened 13 years ago Closed 11 years ago

"talosError: Found processes still running: .*. Please close them before running talos" should set RETRY

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: philor, Unassigned)

Details

(Whiteboard: [automation])

Whether it's dwwin (bug 703996), firefox (bug 704380), or plugin-container (bug 714655), for releng's purposes the real meaning of "Found processes still running" is "something broke the last run so it failed to reboot", to which the solution is "this time we'll reboot, and when the run is manually retriggered it will be fine". So instead of manually retriggering, we should be automatically RETRYing.
I'm fuzzy on the details here: does setting RETRY on the releng side cause the entire build to be re-run (implying a reboot), or is only the step in question retried? philor: do you still want the individual bugs mentioned in comment #0 left open for tracking frequency (vs. DUPing them to this bug)?
Priority: -- → P3
Whiteboard: [orange][automation]
Nobody is fuzzier than me, nobody! But what I meant was http://mxr.mozilla.org/build/source/buildbotcustom/status/errors.py#5, since this is pretty much the same sort of thing as those Tegra failures: they are bugs, each their own separate snowflake of failure, but when you are talking about a particular run that hit them, that slave should go reboot, and another slave should be given the job to do it right. As for the individual bugs: dwwin is certainly an entirely separate bug, since under no circumstances should any slave taking a job have it running. 704380 seems to me to be a bug in the script that runs Jetpack, or in the Talos process-finder, or in hdiutil (hello pain), but still probably a bug that we want to stop at the source, rather than just sweep it away by trying another slave and hoping that another reboot will make it go away. The plugin-container one I have absolutely no feeling about, and no idea where that came from.
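For illustration only, here is a minimal sketch of the behaviour being requested: scan a job's log for the "Found processes still running" message and report RETRY instead of a plain failure, so that the job is handed to another slave. This is not the contents of errors.py. The names `RETRY_PATTERNS` and `classify_log` and the string result constants are hypothetical, and the pattern accepts both the older `FAIL:` prefix and the newer `talosError:` prefix.

```python
import re

# Hypothetical result constants. Real buildbot defines its own result
# codes (including RETRY); plain strings keep this sketch self-contained.
SUCCESS, FAILURE, RETRY = "SUCCESS", "FAILURE", "RETRY"

# Assumed pattern list: log lines that mean "this slave was left in a bad
# state, so reboot it and give the job to another slave".
RETRY_PATTERNS = [
    re.compile(
        r"(?:FAIL|talosError): Found processes still running: .*\. "
        r"Please close them before running talos"
    ),
]


def classify_log(log_text, exit_status):
    """Return RETRY if any log line matches a retry pattern.

    Otherwise fall back to the exit status: 0 is SUCCESS, anything
    else is FAILURE.
    """
    for line in log_text.splitlines():
        if any(pattern.search(line) for pattern in RETRY_PATTERNS):
            return RETRY
    return SUCCESS if exit_status == 0 else FAILURE
```

With a check like this, a run that hits the message would be rescheduled automatically, while the underlying causes (dwwin, firefox, plugin-container) stay tracked in their own bugs.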
Blocks: 438871
Mass marking whiteboard:[orange] bugs WFM (to clean up TBPL bug suggestions) that:
* Haven't changed in > 6 months
* Whose whiteboard contains none of the strings: {disabled,marked,random,fuzzy,todo,fails,failing,annotated,leave open,time-bomb}
* Passed a (quick) manual inspection of bug summary/whiteboard to ensure they weren't a false positive.

I've also gone through and searched for cases where the whiteboard wasn't labelled correctly after test disabling, by using attachment description & basic comment searches. However, if the test this bug was about has in fact been disabled/annotated/..., please accept my apologies & reopen/mark the whiteboard appropriately so this doesn't get re-closed in the future (and please ping me via IRC or email so I can try to tweak the saved searches to avoid more edge cases).

Sorry for the spam! Filter on: #FFA500
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
No longer blocks: 438871
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Whiteboard: [orange][automation] → [automation]
At least I think the message has now changed from FAIL: to talosError:.
Summary: "FAIL: Found processes still running: .*. Please close them before running talos" should set RETRY → "talosError: Found processes still running: .*. Please close them before running talos" should set RETRY
Depends on: 797324
Product: mozilla.org → Release Engineering
Is this still a valid bug?
Flags: needinfo?(philringnalda)
The error still exists in talos code, and given a situation where it would be raised we should set retry, so it's valid in that sense, but either talos is broken so it doesn't notice running processes, or we've gotten to the point where we really never do let an unrebooted slave take a job, so there hasn't been anything to retry for months.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Flags: needinfo?(philringnalda)
Resolution: --- → WONTFIX