talos slaves lose connectivity with buildbot master (qm-rhel02)

RESOLVED FIXED

Status

--
major
RESOLVED FIXED
10 years ago
5 years ago

People

(Reporter: anodelman, Unassigned)

Details

(Reporter)

Description

10 years ago
An outage at 10:30:05am caused multiple talos machines to lose connectivity with the buildbot master. The machines reconnected at 10:32:52am, about 2 minutes 47 seconds later.

Affected the following machines:
qm-mini-xp01/02/03/04/05
qm-pxp-fast01/02
qm-pxp-jss01/02/03
qm-mini-vista01/02/03/04/05
qm-mini-ubuntu02/03/04/05
qm-plinux-fast02
qm-pmac02/03/04
qm-pmac-trunk04/05
qm-pmac-fast02
qm-pxp-trunk01/02/03/04/05/06
qm-plinux-trunk01/02/04/05/06
qm-pvista-trunk01/02/03

Machines unaffected:
qm-pmac-trunk01/02/03/07/08/09
qm-pleopard-trunk01/02/03
qm-plinux-trunk03
qm-pmac-fast01
qm-pmac01/05
qm-plinux-fast01
qm-mini-ubuntu01
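
The host lists above use a shorthand in which the first entry carries the full name and each following `/NN` repeats only the numeric suffix. For anyone cross-checking these lists against the master's slave list, a minimal sketch of expanding that shorthand follows. The helper name `expand_hosts` is hypothetical and not part of buildbot; it assumes every suffix after a slash replaces the trailing digits of the first name.

```python
import re


def expand_hosts(shorthand):
    """Expand 'qm-pmac01/05' into ['qm-pmac01', 'qm-pmac05'].

    Assumption: the first entry ends in digits, and each entry after
    a '/' replaces only those trailing digits.
    """
    text = shorthand.strip()
    first, *rest = text.split("/")
    match = re.match(r"^(.*?)(\d+)$", first)
    if match is None:
        # No numeric suffix to substitute, so return the entry unchanged.
        return [text]
    prefix = match.group(1)
    return [first] + [prefix + suffix for suffix in rest]


if __name__ == "__main__":
    for entry in ("qm-mini-xp01/02/03/04/05", "qm-plinux-fast02"):
        print(expand_hosts(entry))
```

An entry with no slash, such as qm-plinux-fast02, comes back as a single-element list.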


Machines on the try perfmaster appeared unaffected, as did those on stage (qm-buildbot01).

Comment 1

10 years ago
What did they lose connectivity to?  Any error logs?
(Reporter)

Comment 2

10 years ago
The master itself stayed up and the logs are pretty full of messages about slaves disconnecting:

<snip>
2008/06/04 10:32 PDT [Broker,21] <Builder 'WINNT 6.0 talos trunk' at -1215589556>.detached qm-mini-vista03
2008/06/04 10:32 PDT [Broker,21] Buildslave qm-mini-vista03 detached from WINNT 6.0 talos trunk
2008/06/04 10:32 PDT [Broker,21] BotPerspective.detached(qm-mini-vista03)
2008/06/04 10:32 PDT [Broker,21] <Build WINNT 6.0 talos trunk>.lostRemote
2008/06/04 10:32 PDT [Broker,21]  stopping currentStep <perfrunner.MozillaRunPerfTests instance at 0xb016eb8c>
2008/06/04 10:32 PDT [Broker,21] addCompleteLog(interrupt)
2008/06/04 10:32 PDT [Broker,21] RemoteCommand.interrupt <RemoteShellCommand '['python', 'run_tests.py', '--noisy', '20080604_0911_config.yml']'> [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
        ]
</snip>

The affected slaves all have the same error:

remoteFailed: [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
]
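
To build a list of affected slaves from the master log rather than by hand, the "Buildslave ... detached from ..." lines can be filtered out. The following is a rough sketch only: the line layout (timestamp, time zone, bracketed tag, message) is inferred solely from the excerpt above, and the names `DETACHED` and `detached_slaves` are hypothetical rather than anything buildbot provides.

```python
import re

# Pattern inferred from the excerpt above, e.g.
# "2008/06/04 10:32 PDT [Broker,21] Buildslave qm-mini-vista03 detached from WINNT 6.0 talos trunk"
DETACHED = re.compile(
    r"^(?P<stamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}) \w+ \[[^\]]*\] "
    r"Buildslave (?P<slave>\S+) detached from (?P<builder>.+)$"
)


def detached_slaves(lines):
    """Return (timestamp, slave, builder) tuples for each detach line."""
    events = []
    for line in lines:
        match = DETACHED.match(line.strip())
        if match:
            events.append(
                (match.group("stamp"), match.group("slave"), match.group("builder"))
            )
    return events


if __name__ == "__main__":
    sample = [
        "2008/06/04 10:32 PDT [Broker,21] Buildslave qm-mini-vista03 "
        "detached from WINNT 6.0 talos trunk",
        "2008/06/04 10:32 PDT [Broker,21] BotPerspective.detached(qm-mini-vista03)",
    ]
    for event in detached_slaves(sample):
        print(event)
```

Only the explicit "Buildslave ... detached from ..." lines are matched; the other lines in the excerpt repeat the same event and are ignored.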

Updated

10 years ago
Assignee: server-ops → mrz
I'm seeing similar things occasionally with the moz2 buildbot. This morning, it was between moz2-win32-slave1 and production-master. The slave just dropped, causing the build to go red.
Just saw the same thing between bm-xserve16 and production-master.

Comment 5

10 years ago
Would like to move qm-rhel02 off netapp-d-fcal1 as a stab at fixing this.  I'm hoping this is related to the netapp perf issues and the misconfigured LUNs.  

Most moves are failing midway with read errors and taking the VM offline. We've had better luck moving powered-off VMs.

bhearsum says this needs to be scheduled though.

Updated

10 years ago
Depends on: 435134

Comment 6

10 years ago
Moved; tossing back to RE. This might very well be fixed along with the netapp issues.
Assignee: mrz → nobody
Component: Server Operations → Release Engineering: Talos
QA Contact: justin → release

Comment 7

10 years ago
I believe this is fixed as a result of the netapp fixes. Please reopen if it happens again.
Status: NEW → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED

Updated

9 years ago
Component: Release Engineering: Talos → Release Engineering
(Assignee)

Updated

5 years ago
Product: mozilla.org → Release Engineering