Bug 437257 (Closed) · Opened 16 years ago · Closed 16 years ago

talos slaves lose connectivity with buildbot master (qm-rhel02)

Categories: Release Engineering :: General, defect
Priority: Not set
Severity: major
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: anodelman
Assignee: Unassigned

Details

An outage occurred at 10:30:05am, causing multiple talos machines to lose connectivity to the buildbot master (qm-rhel02). The machines reconnected at 10:32:52am.

Affected the following machines:
qm-mini-xp01/02/03/04/05
qm-pxp-fast01/02
qm-pxp-jss01/02/03
qm-mini-vista01/02/03/04/05
qm-mini-ubuntu02/03/04/05
qm-plinux-fast02
qm-pmac02/03/04
qm-pmac-trunk04/05
qm-pmac-fast02
qm-pxp-trunk01/02/03/04/05/06
qm-plinux-trunk01/02/04/05/06
qm-pvista-trunk01/02/03

Machines unaffected:
qm-pmac-trunk01/02/03/07/08/09
qm-pleopard-trunk01/02/03
qm-plinux-trunk03
qm-pmac-fast01
qm-pmac01/05
qm-plinux-fast01
qm-mini-ubuntu01


Machines on the try perfmaster appeared unaffected, as did those on stage (qm-buildbot01).
What did they lose connectivity to?  Any error logs?
The master itself stayed up and the logs are pretty full of messages about slaves disconnecting:

<snip>
2008/06/04 10:32 PDT [Broker,21] <Builder 'WINNT 6.0 talos trunk' at -1215589556>.detached qm-mini-vista03
2008/06/04 10:32 PDT [Broker,21] Buildslave qm-mini-vista03 detached from WINNT 6.0 talos trunk
2008/06/04 10:32 PDT [Broker,21] BotPerspective.detached(qm-mini-vista03)
2008/06/04 10:32 PDT [Broker,21] <Build WINNT 6.0 talos trunk>.lostRemote
2008/06/04 10:32 PDT [Broker,21]  stopping currentStep <perfrunner.MozillaRunPerfTests instance at 0xb016eb8c>
2008/06/04 10:32 PDT [Broker,21] addCompleteLog(interrupt)
2008/06/04 10:32 PDT [Broker,21] RemoteCommand.interrupt <RemoteShellCommand '['python', 'run_tests.py', '--noisy', '20080604_0911_config.yml']'> [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
        ]
</snip>
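
For triaging an outage like this, the master's twistd.log is the main source of truth. Below is a minimal, hypothetical helper (not part of this bug) that tallies the "Buildslave ... detached from ..." lines per slave, assuming the log line format quoted in the snippet above; the regex and the output format are my own.

<snip>
#!/usr/bin/env python
# Hypothetical sketch: count slave detach events in a buildbot master's
# twistd.log, assuming lines like:
# 2008/06/04 10:32 PDT [Broker,21] Buildslave qm-mini-vista03 detached from WINNT 6.0 talos trunk
import re
import sys
from collections import defaultdict

DETACH_RE = re.compile(
    r'^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2} \w+) \[[^\]]+\] '
    r'Buildslave (?P<slave>\S+) detached from (?P<builder>.+)$'
)

def summarize(log_path):
    # slave name -> list of (timestamp, builder) detach events
    detaches = defaultdict(list)
    with open(log_path) as log:
        for line in log:
            m = DETACH_RE.match(line.rstrip('\n'))
            if m:
                detaches[m.group('slave')].append((m.group('ts'), m.group('builder')))
    return detaches

if __name__ == '__main__':
    for slave, events in sorted(summarize(sys.argv[1]).items()):
        print('%-24s %d detach(es), first at %s' % (slave, len(events), events[0][0]))
</snip>

Running it against the master log for the outage window should reproduce the affected/unaffected split listed above.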

The affected slaves all have the same error:

remoteFailed: [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
]
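
Since the slave side only reports the generic twisted ConnectionLost failure, one crude way to narrow down whether the network path to the master actually dropped is to run a periodic TCP probe from an affected slave. This is a hypothetical sketch only; the master hostname and PB port below are placeholders, not values taken from this bug.

<snip>
#!/usr/bin/env python
# Hypothetical connectivity probe: periodically attempt a TCP connection to
# the master's slave port and log failures with a timestamp.
import socket
import time

MASTER_HOST = 'qm-rhel02.example.com'   # assumption: replace with the real master hostname
MASTER_PORT = 9989                      # assumption: replace with the master's actual slaveport
INTERVAL = 30                           # seconds between probes

def probe():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(10)
    try:
        s.connect((MASTER_HOST, MASTER_PORT))
        return None
    except socket.error as e:
        return e
    finally:
        s.close()

while True:
    err = probe()
    stamp = time.strftime('%Y/%m/%d %H:%M:%S')
    if err is None:
        print('%s OK' % stamp)
    else:
        print('%s FAILED: %s' % (stamp, err))
    time.sleep(INTERVAL)
</snip>

A gap in the "OK" lines that matches the 10:30-10:32 window would point at the network or the VM host rather than buildbot itself.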
Assignee: server-ops → mrz
I'm seeing similar things occasionally with the moz2 buildbot. This morning, it was between moz2-win32-slave1 and production-master. The slave just dropped, causing the build to go red.
Just saw the same thing between bm-xserve16 and production-master.
I'd like to move qm-rhel02 off netapp-d-fcal1 as a stab at fixing this. I'm hoping this is related to the netapp perf issues and the misconfigured LUNs.

Most moves are failing midway with read errors and taking the VM offline. We've had better luck moving powered-off VMs.

bhearsum says this needs to be scheduled, though.
Depends on: 435134
Moved, tossing back to RE - this may well have been fixed along with the netapp issues.
Assignee: mrz → nobody
Component: Server Operations → Release Engineering: Talos
QA Contact: justin → release
I believe this is fixed as a result of the netapp fixes. Please reopen if it happens again.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Release Engineering: Talos → Release Engineering
Product: mozilla.org → Release Engineering