716800 - "talosError: Found processes still running: .*. Please close them before running talos" should set RETRY

Reporter

Description

13 years ago

Whether it's dwwin (bug 703996), firefox (bug 704380), or plugin-container (bug 714655), for releng's purposes the real meaning of "Found processes still running" is "something broke the last run so it failed to reboot." The solution to that is "this time we'll reboot, and when the run is manually retriggered it will be fine." So instead of retriggering manually, we should be RETRYing automatically.

Chris Cooper [:coop] (he/him)

Comment 1

13 years ago

I'm fuzzy on the details here: does setting RETRY on the releng side cause the entire build to be re-run (implying a reboot), or is only the step in question retried? philor: do you still want the individual bugs mentioned in comment #0 left open for tracking frequency (vs. DUPing them to this bug)?

Priority: -- → P3

Whiteboard: [orange][automation]

Phil Ringnalda (:philor)

Reporter

Comment 2

13 years ago

Nobody is fuzzier than me, nobody! But what I meant was http://mxr.mozilla.org/build/source/buildbotcustom/status/errors.py#5, since this is pretty much the same sort of thing as those Tegra failures: they are bugs, each their own separate snowflake of failure, but when you are talking about a particular run that hit them, that slave should go reboot, and another slave should be given the job to do it right. dwwin is certainly an entirely separate bug: under no circumstances should any slave taking a job have it running. 704380 seems to me to be a bug in the script that runs Jetpack, or in the Talos process-finder, or in hdiutil (hello pain), but still probably a bug that we want to stop at the source, rather than just sweep it away by trying another slave and hoping that another reboot will make it go away. The plugin-container one I have absolutely no feeling about; I have no idea where that came from.
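For illustration only, here is a minimal sketch of the kind of log-pattern check being asked for. It is not the contents of errors.py: the names `RETRY_PATTERNS` and `classify_log` and the string result codes are hypothetical stand-ins, and the exact wording of the log line is assumed from this bug's summary (both the older "FAIL:" and the newer "talosError:" prefixes).

```python
import re

# Stand-in result codes for this sketch only; the real harness has its own constants.
FAILURE = "FAILURE"
RETRY = "RETRY"

# Hypothetical pattern list, modelled on the message quoted in the bug summary.
# Both prefixes are accepted because the message prefix changed over time.
RETRY_PATTERNS = (
    re.compile(
        r"(?:FAIL|talosError): .*Found processes still running: .*\. "
        r"Please close them before running talos"
    ),
)


def classify_log(log_text, default=FAILURE):
    """Return RETRY if any log line matches a retry pattern, else `default`."""
    for line in log_text.splitlines():
        if any(pattern.search(line) for pattern in RETRY_PATTERNS):
            return RETRY
    return default
```

With something like this in place, a run whose log contains the "still running" message would be marked for an automatic retry on another slave instead of being reported as an ordinary failure that someone has to retrigger by hand.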

(no longer active)

Updated

13 years ago

Blocks: 438871

Ed Morley [:emorley]

Comment 3

12 years ago

Mass marking whiteboard:[orange] bugs WFM (to clean up TBPL bug suggestions) that:
* Haven't changed in > 6 months
* Whose whiteboard contains none of the strings: {disabled,marked,random,fuzzy,todo,fails,failing,annotated,leave open,time-bomb}
* Passed a (quick) manual inspection of bug summary/whiteboard to ensure they weren't a false positive.

I've also gone through and searched for cases where the whiteboard wasn't labelled correctly after test disabling, by using attachment description & basic comment searches. However, if the test this bug was about has in fact been disabled/annotated/..., please accept my apologies & reopen/mark the whiteboard appropriately so this doesn't get re-closed in the future (and please ping me via IRC or email so I can try to tweak the saved searches to avoid more edge cases).

Sorry for the spam! Filter on: #FFA500

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → WORKSFORME

Phil Ringnalda (:philor)

Reporter

Updated

12 years ago

No longer blocks: 438871

Status: RESOLVED → REOPENED

Resolution: WORKSFORME → ---

Whiteboard: [orange][automation] → [automation]

Phil Ringnalda (:philor)

Reporter

Comment 4

12 years ago

At least I think the message has now changed from FAIL: to talosError:.

Summary: "FAIL: Found processes still running: .*. Please close them before running talos" should set RETRY → "talosError: Found processes still running: .*. Please close them before running talos" should set RETRY

Ed Morley [:emorley]

Updated

12 years ago

Depends on: 797324

Ed Morley [:emorley]

Comment 5

12 years ago

https://tbpl.mozilla.org/php/getParsedLog.php?id=20333116&tree=Mozilla-Inbound

Ryan VanderMeulen [:RyanVM]

Comment 6

12 years ago

https://tbpl.mozilla.org/php/getParsedLog.php?id=25487564&tree=Mozilla-Central

Ryan VanderMeulen [:RyanVM]

Comment 7

12 years ago

https://tbpl.mozilla.org/php/getParsedLog.php?id=25500776&tree=Mozilla-Central

Nobody; OK to take it and work on it

Assignee

Updated

12 years ago

Product: mozilla.org → Release Engineering

bhearsum@mozilla.com (:bhearsum)

Comment 8

11 years ago

Is this still a valid bug?

Flags: needinfo?(philringnalda)

Phil Ringnalda (:philor)

Reporter

Comment 9

11 years ago

The error still exists in talos code, and given a situation where it would be raised we should set retry, so it's valid in that sense, but either talos is broken so it doesn't notice running processes, or we've gotten to the point where we really never do let an unrebooted slave take a job, so there hasn't been anything to retry for months.

Status: REOPENED → RESOLVED

Closed: 12 years ago → 11 years ago

Flags: needinfo?(philringnalda)

Resolution: --- → WONTFIX

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

People

(Reporter: philor, Unassigned)

Details

(Whiteboard: [automation])

Security

(public)
