Closed Bug 1180877 Opened 9 years ago Closed 9 years ago

spike in win 8 test slave retries - losing connection

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Assigned: coop)

- across trees
- only win 8 testers

example:
  - https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=t-w864-ix-032
  - remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.

not just buildbot timeouts but hg.m.o and pypi timeouts too:
12:48 <philor> and a read timeout hitting hg.m.o, and something that looks like a pypi timeout
On the local machine I am seeing both "network connection lost" warnings and some DNS warnings.

Do we have a list of machines that are being affected?
Non-exhaustively,

t-w864-ix-032
t-w864-ix-031
t-w864-ix-021
t-w864-ix-024
t-w864-ix-017
t-w864-ix-003
t-w864-ix-030
t-w864-ix-008
t-w864-ix-016
t-w864-ix-096
t-w864-ix-014
t-w864-ix-019

which is just from clicking on the blue/purple letters in https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-resultStatus=exception&filter-resultStatus=retry&filter-searchStr=win&fromchange=7c9a34b615aa and getting the "Machine: " out of the lower left panel.

The skew toward just 024/031/032 near the top of that list makes me wonder whether it was a single event a while back, and those three got caught by it at a bad time and were left broken, so I've rebooted those three.
And I'd add in 025 and 017 as feeling the effects, since they each did one job that retried and haven't taken one since. 024 failed to actually reboot through slaverebooter (looks like it hit the bug which keeps it from filing a problem tracking bug before it files an unreachable bug, which it otherwise would have), 031 took a job and failed it by timing out reading from pypi.pvt.build.mozilla.org, and 032 has failed multiple jobs after multiple reboots.
031 and 032 stayed busted after reboots, so I disabled them both; 024 claimed to be unable to reboot, and looks recovered after Van kicked it; 025 currently claims to be unable to reboot; 017 has probably recovered on its own.
This seems like a network issue. I was not able to connect to 031 and 032 by ssh or vnc, and they were unpingable. Then a few minutes later they came back.
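For anyone wanting to check that skew without clicking through each job, a rough sketch of the tally follows. It assumes the job results have already been exported as a list of dicts with `machine` and `result` keys; that record shape and the sample rows are made up for illustration and are not Treeherder's actual data format.

```python
from collections import Counter


def tally_retries(jobs):
    """Count retry/exception jobs per machine, most affected first.

    `jobs` is an iterable of dicts with hypothetical keys
    "machine" and "result" (an assumed shape, not a real export format).
    """
    counts = Counter(
        job["machine"]
        for job in jobs
        if job["result"] in ("retry", "exception")
    )
    return counts.most_common()


# Made-up sample records, only to show the shape of the input.
sample_jobs = [
    {"machine": "t-w864-ix-032", "result": "retry"},
    {"machine": "t-w864-ix-031", "result": "retry"},
    {"machine": "t-w864-ix-032", "result": "retry"},
    {"machine": "t-w864-ix-017", "result": "success"},
    {"machine": "t-w864-ix-032", "result": "exception"},
    {"machine": "t-w864-ix-031", "result": "retry"},
]

for machine, count in tally_retries(sample_jobs):
    print(machine, count)
```

A machine that dominates the top of this output is a candidate for a reboot or a closer look; a flat distribution would point more toward a shared cause.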

The handful I checked, according to inventory, are connected to switch1.r101-18.console.scl3.mozilla.net. Maybe we should ask netops to check it out.
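A quick way to see which slaves are dropping off the network at a given moment is to try a TCP connection to each one. The sketch below is only an illustration: it assumes the hosts accept connections on the ssh port (22) and that the names passed in resolve from wherever the script runs. The helper names are hypothetical, and a closed port is reported the same way as an unreachable host.

```python
import socket


def is_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts alike.
        return False


def unreachable_hosts(hosts, probe=is_reachable):
    """Return the hosts for which the probe fails, in input order."""
    return [host for host in hosts if not probe(host)]


if __name__ == "__main__":
    # Short names from this bug; the fully qualified names are not given
    # here, so these may need a domain suffix to resolve.
    suspects = ["t-w864-ix-031", "t-w864-ix-032"]
    for host in unreachable_hosts(suspects):
        print("unreachable:", host)
```

Since 031 and 032 came back after a few minutes, running this repeatedly and logging the timestamps would give netops a clearer picture than a single check.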
Depends on: 1181615
Filed bug 1181615.

Anyone can file a follow-up bug, you know. Doesn't just have to be buildduty! ;)
t-w864-ix-024 is still unreachable (and disabled in slavealloc), everything else in the dep list is re-enabled.
(In reply to Nick Thomas [:nthomas] from comment #8)
> t-w864-ix-024 is still unreachable (and disabled in slavealloc), everything
> else in the dep list is re-enabled.

We'll handle the individual slaves in follow-ups.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Assignee: nobody → coop
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard