Closed Bug 630578 Opened 9 years ago Closed 4 years ago

Jobs get hung but the slaves somehow keep on taking jobs

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2581] [buildmasters][automation])

The way to get to this point is the following:
* I find hung jobs on buildapi/running
* I sort by duration
* I open the job's page
* I reboot the machine

Then what happens is odd:
* The job is marked with all steps as done
* When I look at the slave's page I can see the job as still running
* I notice that the slave has taken other jobs *after* the talos job got hung and *before* I rebooted it

dustin guesses it's because the build status pickle has been thrown out already

This type of problem does not seem to prevent the master from being reconfigured.

Check the slave pages of the following three slaves to see what I mean.

mozilla-central	e09b598992e8 	Rev3 WINNT 5.1 mozilla-central talos scroll 	2011-01-24 10:06:38 	2011-01-24 10:06:55 	7 days, 23:55:54 	buildbot-master1:8011
mozilla-central	102d318965db 	Rev3 WINNT 5.1 mozilla-central talos tp4 	2011-01-28 03:16:40 	2011-01-28 03:16:53 	4 days, 6:45:56 	buildbot-master1:8012
try	461b12e362da 	Rev3 WINNT 5.1 tryserver talos chrome 	2011-01-27 20:46:35 	2011-01-27 20:47:15 	4 days, 13:15:34 	buildbot-master2:8012
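
The "sort by duration" step can be sketched in Python as follows. This is only an illustration: the rows are copied from the listing above, while `parse_duration`, `longest_running`, and the 12-hour threshold are hypothetical names and values chosen for this example, not part of buildapi.

```python
import re
from datetime import timedelta

# (builder, duration) pairs copied from the listing above.
RUNNING = [
    ("Rev3 WINNT 5.1 mozilla-central talos scroll", "7 days, 23:55:54"),
    ("Rev3 WINNT 5.1 mozilla-central talos tp4", "4 days, 6:45:56"),
    ("Rev3 WINNT 5.1 tryserver talos chrome", "4 days, 13:15:34"),
]

# Matches "H:MM:SS" with an optional leading "N day(s), " part.
_DURATION = re.compile(r"(?:(\d+) days?, )?(\d+):(\d+):(\d+)")


def parse_duration(text):
    """Turn a string such as '4 days, 6:45:56' into a timedelta."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError("unrecognised duration: %r" % text)
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def longest_running(rows, threshold=timedelta(hours=12)):
    """Return (builder, duration) pairs above the threshold, longest first."""
    parsed = [(builder, parse_duration(duration)) for builder, duration in rows]
    suspects = [row for row in parsed if row[1] > threshold]
    return sorted(suspects, key=lambda row: row[1], reverse=True)


for builder, duration in longest_running(RUNNING):
    print(duration, builder)
```

A talos job that has been running for days is almost certainly hung, so any threshold of a few hours would flag all three rows here.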
OS: Mac OS X → All
Priority: -- → P3
Hardware: x86 → All
Whiteboard: [buildmasters][automation]
After today's restart of the masters this got cleared out so there is no more evidence of what happened.

I vote for INVALID or WORKSFORME.

What do you say?
I've seen jobs with the reboot step still running even though the slave has gone on to other jobs, so something in the master state isn't getting updated properly, and hence neither is the db. Presumably all running jobs for a slave should be marked completed when that slave reconnects to a master. Is this an upstream problem, or are we doing something silly?
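
A minimal sketch of the behaviour suggested above (close out a slave's running jobs when it reconnects). The `Job` and `MasterState` classes, the `on_slave_reconnect` hook, and the slave names are all hypothetical and only model the idea; they are not Buildbot's actual API or data model.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    slave: str
    status: str = "running"  # hypothetical states: "running" or "completed"


@dataclass
class MasterState:
    jobs: list = field(default_factory=list)

    def on_slave_reconnect(self, slave):
        """Mark every job still 'running' on a reconnecting slave as completed.

        A slave that has just reconnected cannot still be executing a job
        from its previous connection, so any such record is stale.
        Returns the ids of the jobs that were closed.
        """
        closed = []
        for job in self.jobs:
            if job.slave == slave and job.status == "running":
                job.status = "completed"
                closed.append(job.job_id)
        return closed


state = MasterState([Job(1, "slave-a"), Job(2, "slave-b")])
print(state.on_slave_reconnect("slave-a"))  # ids of the stale jobs that were closed
```

In a real master the same update would also have to reach the status db, which is the part that appears not to happen in this bug.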
I have seen that too.
I recently added "-f" to the shutdown step for Windows so we don't get blocked by "cmd.exe" or any prompt.
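
For reference, a small sketch of what a forced Windows reboot command might look like. Only the `-f` switch comes from this bug; the `-r` and `-t` switches and the `windows_reboot_command` helper are assumptions for illustration, since the actual command line of the shutdown step is not shown here.

```python
def windows_reboot_command(force=True, delay_seconds=0):
    """Build the argument list for a Windows reboot without executing it."""
    # -r asks for a restart, -t sets the delay in seconds (both assumed here).
    argv = ["shutdown", "-r", "-t", str(delay_seconds)]
    if force:
        # -f forces running applications to close, so an open cmd.exe
        # window or a prompt cannot block the reboot.
        argv.append("-f")
    return argv


print(" ".join(windows_reboot_command()))
```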

In this case I forced the reboot manually and suddenly all steps were marked as run (rather than the job reaching the reboot step on its own and hanging there).
What you describe is similar: the slave keeps on grabbing new jobs even though the earlier job is still marked as running.

Not sure if it is upstream or we are doing something silly.
Product: mozilla.org → Release Engineering
Found in triage, and moving to component that feels closest. Please bounce onwards/bounce back, if I guessed wrong.
Component: Other → General Automation
Whiteboard: [buildmasters][automation] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2581] [buildmasters][automation]
Probably not an issue anymore.
Status: NEW → RESOLVED
Closed: 4 years ago
QA Contact: catlee
Resolution: --- → FIXED
Component: General Automation → General