Closed Bug 630578 Opened 9 years ago Closed 4 years ago
Jobs get hung but the slaves somehow keep taking jobs
The way to get to this point is the following:
* I find hung jobs on buildapi/running
* I sort by duration
* I open the job's page
* I reboot the machine

Then what happens is odd:
* The job is marked with all steps as done
* When I look at the slave's page, the job still shows as running
* The slave has taken jobs *after* the talos job got hung and *before* I rebooted it

dustin guesses it's because the build status pickle has already been thrown out.

This type of problem does not seem to prevent the master from being reconfigured.

You can check the slave pages for the following three jobs to see what I mean.

mozilla-central | e09b598992e8 | Rev3 WINNT 5.1 mozilla-central talos scroll | 2011-01-24 10:06:38 | 2011-01-24 10:06:55 | 7 days, 23:55:54 | buildbot-master1:8011
mozilla-central | 102d318965db | Rev3 WINNT 5.1 mozilla-central talos tp4 | 2011-01-28 03:16:40 | 2011-01-28 03:16:53 | 4 days, 6:45:56 | buildbot-master1:8012
try | 461b12e362da | Rev3 WINNT 5.1 tryserver talos chrome | 2011-01-27 20:46:35 | 2011-01-27 20:47:15 | 4 days, 13:15:34 | buildbot-master2:8012
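The symptom above (a job still marked as running while the same slave has since started newer jobs) could be detected automatically rather than by eye. The following is only a minimal sketch under stated assumptions: the `Job` record and `find_stale_running_jobs` are hypothetical names, not part of buildapi or buildbot, and the job list is assumed to have been exported from the running/recent views by some other means.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Job:
    """Hypothetical record for one job; not a buildapi or buildbot type."""
    slave: str
    builder: str
    started: datetime
    finished: Optional[datetime] = None  # None means still marked as running


def find_stale_running_jobs(jobs: List[Job]) -> List[Job]:
    """Return jobs still marked as running whose slave started a later job.

    The result is sorted by start time, oldest first, which matches
    sorting the running list by duration (longest first).
    """
    # Record the most recent start time seen for each slave.
    latest_start: Dict[str, datetime] = {}
    for job in jobs:
        if job.slave not in latest_start or job.started > latest_start[job.slave]:
            latest_start[job.slave] = job.started

    # A running job is suspicious if its slave has started something newer.
    stale = [
        job for job in jobs
        if job.finished is None and job.started < latest_start[job.slave]
    ]
    return sorted(stale, key=lambda job: job.started)
```

A slave with a single unfinished job and nothing newer is not reported, since that job may simply be long-running.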
9 years ago
OS: Mac OS X → All
Priority: -- → P3
Hardware: x86 → All
After today's restart of the masters this got cleared out so there is no more evidence of what happened. I vote for INVALID or WORKSFORME. What do you say?
I've seen jobs with the reboot step still running while the slave has gone on to other jobs, so something in the master state isn't getting updated properly, and hence neither is the db. Presumably all running jobs for a slave should be marked completed when that slave reconnects to a master. Is this upstream, or are we doing something silly?
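The suggestion above, closing out a slave's running jobs when it reconnects, can be sketched as follows. This is a toy model and not buildbot's code or API: `MasterState`, `Build`, `on_slave_reconnect`, and the state names are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Build:
    """Hypothetical build record held by the master."""
    number: int
    slave: str
    state: str = "running"  # "running" or "completed"


class MasterState:
    """Toy stand-in for the master's view of builds (not buildbot's API)."""

    def __init__(self) -> None:
        self.builds: List[Build] = []

    def start_build(self, number: int, slave: str) -> None:
        self.builds.append(Build(number=number, slave=slave))

    def running_for(self, slave: str) -> List[int]:
        """Build numbers still marked as running for the given slave."""
        return [b.number for b in self.builds
                if b.slave == slave and b.state == "running"]

    def on_slave_reconnect(self, slave: str) -> List[int]:
        """Mark every build still running on this slave as completed.

        A slave that has just reconnected cannot still be executing a
        build it started before the disconnect, so those records are stale.
        Returns the build numbers that were closed out.
        """
        closed = []
        for build in self.builds:
            if build.slave == slave and build.state == "running":
                build.state = "completed"
                closed.append(build.number)
        return closed
```

In a real master the same reconciliation would also have to be written through to the database, since the thread suggests the db is populated from the master's state.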
I have seen that too. I recently added "-f" to the shutdown step for Windows so we don't get blocked by "cmd.exe" or any prompt. In this case I forced the reboot manually and suddenly all steps were marked as run (rather than the job reaching the reboot step on its own and hanging there). What you describe is similar in that the slave keeps grabbing new jobs even though the old job is still marked as running. Not sure if it is upstream or we are doing something silly.
Product: mozilla.org → Release Engineering
Found in triage, and moving to component that feels closest. Please bounce onwards/bounce back, if I guessed wrong.
Component: Other → General Automation
Whiteboard: [buildmasters][automation] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2581] [buildmasters][automation]
Probably not an issue anymore.
Status: NEW → RESOLVED
Closed: 4 years ago
QA Contact: catlee
Resolution: --- → FIXED
Component: General Automation → General