Outage occurred at 10:30:05am, causing multiple talos machines to lose connectivity. Machines reconnected at 10:32:52am.

Affected machines:
qm-mini-xp01/02/03/04/05
qm-pxp-fast01/02
qm-pxp-jss01/02/03
qm-mini-vista01/02/03/04/05
qm-mini-ubuntu02/03/04/05
qm-plinux-fast02
qm-pmac02/03/04
qm-pmac-trunk04/05
qm-pmac-fast02
qm-pxp-trunk01/02/03/04/05/06
qm-plinux-trunk01/02/04/05/06
qm-pvista-trunk01/02/03

Unaffected machines:
qm-pmac-trunk01/02/03/07/08/09
qm-pleopard-trunk01/02/03
qm-plinux-trunk03
qm-pmac-fast01
qm-pmac01/05
qm-plinux-fast01
qm-mini-ubuntu01

Machines on the try perfmaster appeared unaffected, as did those on stage (qm-buildbot01).
What did they lose connectivity to? Any error logs?
The master itself stayed up, and the logs are pretty full of messages about slaves disconnecting:

<snip>
2008/06/04 10:32 PDT [Broker,21] <Builder 'WINNT 6.0 talos trunk' at -1215589556>.detached qm-mini-vista03
2008/06/04 10:32 PDT [Broker,21] Buildslave qm-mini-vista03 detached from WINNT 6.0 talos trunk
2008/06/04 10:32 PDT [Broker,21] BotPerspective.detached(qm-mini-vista03)
2008/06/04 10:32 PDT [Broker,21] <Build WINNT 6.0 talos trunk>.lostRemote
2008/06/04 10:32 PDT [Broker,21] stopping currentStep <perfrunner.MozillaRunPerfTests instance at 0xb016eb8c>
2008/06/04 10:32 PDT [Broker,21] addCompleteLog(interrupt)
2008/06/04 10:32 PDT [Broker,21] RemoteCommand.interrupt <RemoteShellCommand '['python', 'run_tests.py', '--noisy', '20080604_0911_config.yml']'>
[Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
]
</snip>

The affected slaves all show the same error:

remoteFailed: [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
]
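For anyone triaging a recurrence: since every disconnect leaves a "Buildslave ... detached from ..." line in the master's twistd.log, a quick script can pull out which slaves dropped and when. This is just a sketch I put together against the log format shown above, not an official buildbot tool; the function name and sample input are my own.

```python
import re
from collections import defaultdict

# Matches master log lines of the form seen in this bug, e.g.:
#   2008/06/04 10:32 PDT [Broker,21] Buildslave qm-mini-vista03 detached from WINNT 6.0 talos trunk
DETACH_RE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}) \S+ \[[^\]]+\] "
    r"Buildslave (?P<slave>\S+) detached from (?P<builder>.+)$"
)

def detached_slaves(log_lines):
    """Return {slave: [(timestamp, builder), ...]} for every detach event."""
    events = defaultdict(list)
    for line in log_lines:
        m = DETACH_RE.match(line.strip())
        if m:
            events[m.group("slave")].append((m.group("ts"), m.group("builder")))
    return dict(events)

if __name__ == "__main__":
    # Hypothetical sample line; in practice, pass open("twistd.log") instead.
    sample = [
        "2008/06/04 10:32 PDT [Broker,21] Buildslave qm-mini-vista03 "
        "detached from WINNT 6.0 talos trunk",
    ]
    print(detached_slaves(sample))
```

Running it over the full twistd.log should give a per-slave list of detach times, which makes it easy to confirm whether all the drops clustered in the same two-minute window.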
I'm seeing similar things occasionally with the moz2 buildbot. This morning, it was between moz2-win32-slave1 and production-master. The slave just dropped, causing the build to go red.
Just saw the same thing between bm-xserve16 and production-master.
Would like to move qm-rhel02 off netapp-d-fcal1 as a stab at fixing this. I'm hoping this is related to the netapp perf issues and the misconfigured LUNs. Most moves are failing midway with read errors and taking the VM offline; we've had better luck moving powered-off VMs. bhearsum says this needs to be scheduled, though.
Moved; tossing this back to RE. This may very well be fixed along with the netapp issues.
Assignee: mrz → nobody
Component: Server Operations → Release Engineering: Talos
QA Contact: justin → release
I believe this is fixed as a result of the netapp fixes. Please reopen if it happens again.
Status: NEW → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering