Bug 1180877
Opened 9 years ago
Closed 9 years ago
spike in win 8 test slave retries - losing connection
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jlund, Assigned: coop)
Details
- across trees
- only win 8 testers

example:
- https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=t-w864-ix-032
- remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.

not just buildbot timeouts but hg.m.o and pypi timeouts too:

12:48 <philor> and a read timeout hitting hg.m.o, and something that looks like a pypi timeout
Comment 1•9 years ago
On the local machine I am seeing both network connection lost warnings and some DNS warnings. Do we have a list of machines that are being affected?
Comment 2•9 years ago
Non-exhaustively,

t-w864-ix-032
t-w864-ix-031
t-w864-ix-021
t-w864-ix-024
t-w864-ix-017
t-w864-ix-003
t-w864-ix-030
t-w864-ix-008
t-w864-ix-016
t-w864-ix-096
t-w864-ix-014
t-w864-ix-019

which is just from clicking on the blue/purple letters in https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-resultStatus=exception&filter-resultStatus=retry&filter-searchStr=win&fromchange=7c9a34b615aa and getting the "Machine: " out of the lower left panel.

The skew toward just 024/031/032 toward the top of that makes me wonder if it was an event, a while back, and those three got caught by it at a bad time and got left broken, so I've rebooted those three.
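The "skew" observation above could be checked less manually by tallying retried and exception jobs per machine. The following is only a minimal sketch: the `(machine, status)` record shape, the function name `retry_counts`, and the sample host names are assumptions made for illustration, not a Treeherder export format or real data from this bug.

```python
from collections import Counter


def retry_counts(jobs):
    """Count retry/exception results per machine, most affected first.

    `jobs` is an iterable of (machine_name, result_status) pairs. This
    record shape is hypothetical; adapt it to however the job list is
    actually obtained.
    """
    counts = Counter(
        machine
        for machine, status in jobs
        if status in ("retry", "exception")
    )
    # most_common() sorts by descending count, so a skew toward a few
    # machines shows up at the top of the list.
    return counts.most_common()


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample_jobs = [
        ("tester-a", "retry"),
        ("tester-b", "success"),
        ("tester-a", "exception"),
        ("tester-c", "retry"),
    ]
    for machine, count in retry_counts(sample_jobs):
        print(machine, count)
```

A few machines dominating the top of the output would be consistent with a one-off event that left those hosts broken, as opposed to a uniform problem across the pool.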
Comment 3•9 years ago
And I'd add in 025 and 017 as feeling the effects, since they both did one job that retried and haven't taken one since. 024 failed to actually reboot through slaverebooter (looks like it hit the bug which keeps it from filing a problem tracking bug before it files an unreachable bug which it otherwise would have), 031 took a job and failed it by timing out reading from pypi.pvt.build.mozilla.org, and 032 has failed multiple jobs after multiple reboots.
Comment hidden (Legacy TBPL/Treeherder Robot)
Updated•9 years ago
Blocks: t-w864-ix-031
Updated•9 years ago
Blocks: t-w864-ix-032
Comment 5•9 years ago
031 and 032 stayed busted after reboots, so I disabled them both; 024 claimed to be unable to reboot, and looks recovered after Van kicked it; 025 currently claims to be unable to reboot; 017 has probably recovered on its own.
Updated•9 years ago
Blocks: t-w864-ix-022
Updated•9 years ago
Blocks: t-w864-ix-025
Updated•9 years ago
Blocks: t-w864-ix-021
Updated•9 years ago
Blocks: t-w864-ix-024
Comment 6•9 years ago
This seems like a network issue. With 31 and 32, I was not able to connect to them by ssh or vnc, and they were unpingable. Then a few minutes later they came back. The handful I checked, according to inventory, are connected to switch1.r101-18.console.scl3.mozilla.net. Maybe we should ask netops to check it out.
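For anyone repeating this kind of check across several machines, a small TCP probe can show whether the ssh and vnc ports answer at a given moment. This is a rough sketch only: ports 22 and 5900 are the conventional ssh and VNC defaults and are assumed here rather than confirmed for these slaves, the function names are made up for illustration, and ICMP ping is left out because it normally needs raw-socket privileges.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


def probe(hosts, ports=(22, 5900), timeout=3.0):
    """Map each host to a {port: reachable} dict.

    The default ports are an assumption (standard ssh and VNC); pass the
    ports the machines actually listen on.
    """
    return {
        host: {port: port_open(host, port, timeout) for port in ports}
        for host in hosts
    }
```

Running `probe([...])` a few times, some minutes apart, would help distinguish hosts that drop off the network intermittently from hosts that are down for good.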
Comment 7•9 years ago
Filed bug 1181615. Anyone can file a follow-up bug, you know. Doesn't just have to be buildduty! ;)
Comment 8•9 years ago
t-w864-ix-024 is still unreachable (and disabled in slavealloc), everything else in the dep list is re-enabled.
Comment 9•9 years ago
(In reply to Nick Thomas [:nthomas] from comment #8)
> t-w864-ix-024 is still unreachable (and disabled in slavealloc), everything
> else in the dep list is re-enabled.

We'll handle the individual slaves in follow-ups.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•9 years ago
Assignee: nobody → coop
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard