Nightly job disconnected

RESOLVED INVALID

Severity: major
Opened: 2 years ago
Last modified: 5 months ago

Reporter: armenzg; Assignee: Unassigned


Description (Reporter, 2 years ago)
The Windows XP job in [1] disconnected.
The log is here [2].

Can we please investigate why it happened and what we can do to improve this?

It says this:
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
========= Finished 'c:/mozilla-build/python27/python -u ...' interrupted (results: 5, elapsed: 1 hrs, 28 mins, 20 secs) (at 2016-07-11 09:08:26.388879) =========

[1] https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=214884d507ee369c1cf14edb26527c4f9a97bf48&filter-searchStr=nightly&group_state=expanded&selectedJob=4307623
[2] http://archive.mozilla.org/pub/firefox/nightly/2016/07/2016-07-11-14-37-37-mozilla-central/mozilla-central-win32-nightly-bm72-build1-build16.txt.gz
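For triage, the useful part of the log is the final "Finished" line. The following is a minimal Python sketch that extracts the step status, result code and elapsed time from it. It assumes buildbot's standard result codes, under which 5 means RETRY (consistent with these jobs being retried automatically); the function and constant names are illustrative and not part of any existing tool.

```python
import re
from datetime import timedelta

# Assumption: buildbot's standard result codes, where 5 means RETRY.
RESULT_NAMES = {
    0: "SUCCESS",
    1: "WARNINGS",
    2: "FAILURE",
    3: "SKIPPED",
    4: "EXCEPTION",
    5: "RETRY",
    6: "CANCELLED",
}

# Matches a line such as:
# ===== Finished '<cmd>' interrupted (results: 5, elapsed: 1 hrs, 28 mins, 20 secs) ...
FINISHED_RE = re.compile(
    r"=+ Finished '(?P<cmd>.*)' (?P<status>\w+) "
    r"\(results: (?P<code>\d+), elapsed: (?P<elapsed>[^)]*)\)"
)

# Matches each "<number> <unit>" chunk of the elapsed field.
ELAPSED_PART_RE = re.compile(r"(\d+) (hrs?|mins?|secs?)")
UNIT_TO_KWARG = {"hr": "hours", "min": "minutes", "sec": "seconds"}


def parse_elapsed(text):
    """Convert e.g. '1 hrs, 28 mins, 20 secs' into a timedelta."""
    kwargs = {}
    for amount, unit in ELAPSED_PART_RE.findall(text):
        # Strip the plural 's' so 'hrs'/'hr' both map to 'hours', etc.
        kwargs[UNIT_TO_KWARG[unit.rstrip("s")]] = int(amount)
    return timedelta(**kwargs)


def parse_finished_line(line):
    """Return status, result name and elapsed time, or None if no match."""
    match = FINISHED_RE.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "status": match.group("status"),
        "result": RESULT_NAMES.get(code, f"UNKNOWN({code})"),
        "elapsed": parse_elapsed(match.group("elapsed")),
    }


line = ("========= Finished 'c:/mozilla-build/python27/python -u ...' interrupted "
        "(results: 5, elapsed: 1 hrs, 28 mins, 20 secs) "
        "(at 2016-07-11 09:08:26.388879) =========")
info = parse_finished_line(line)
print(info["status"], info["result"], info["elapsed"])  # interrupted RETRY 1:28:20
```

Mapping the numeric code to a name makes it easy to tell a lost-connection retry apart from a genuine build failure when scanning many logs.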
Comment 1 (Reporter, 2 years ago)
It seems that a second Windows nightly build disconnected [1][2].
I retried the job.

I'm not liking how this is going.

Can someone please investigate whether there's something we can do to prevent this from happening, or whether something is wrong with the network?

[1]
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.

[2] http://archive.mozilla.org/pub/firefox/nightly/2016/07/2016-07-11-14-37-57-mozilla-central/mozilla-central-win64-nightly-bm71-build1-build3.txt.gz
Severity: normal → major
Flags: needinfo?(bugspam.Callek)
Comment 2

This has always been a normal (and sad) thing with buildbot, especially coupled with XP.

It can happen when someone (say, a sheriff) reboots a batch of Windows machines because they appear idle, depending on how they do it and how much due diligence they do to make sure a machine isn't actively running a job.
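The due-diligence check described above can be sketched as a filter over the machine inventory: a machine qualifies for a reboot only if it has been idle past a threshold and has no job assigned. This is a minimal sketch; the `Machine` record, its fields, the 360-minute threshold and the machine names are hypothetical and not taken from any real slave-health tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    # Hypothetical inventory record; real data would come from whatever
    # CI/slave-health tooling is in use, which this bug does not describe.
    name: str
    idle_minutes: int                  # minutes since last reported activity
    running_job: Optional[str] = None  # None means no job is assigned


def machines_safe_to_reboot(machines: List[Machine],
                            idle_threshold: int = 360) -> List[str]:
    """Return names of machines that look idle AND are not running a job."""
    safe = []
    for m in machines:
        if m.idle_minutes < idle_threshold:
            continue  # recently active; leave it alone
        if m.running_job is not None:
            continue  # rebooting now would kill the job mid-run
        safe.append(m.name)
    return safe


fleet = [
    Machine("xp-ix-001", idle_minutes=600),
    Machine("xp-ix-002", idle_minutes=600, running_job="win32-nightly build"),
    Machine("xp-ix-003", idle_minutes=30),
]
print(machines_safe_to_reboot(fleet))  # ['xp-ix-001']
```

The explicit job check is the important part: idleness alone does not prove a machine is free, and skipping that check is one way a build ends with a lost connection.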

Philor has been rebooting roughly 20-40 XP slaves a day because they fall off the grid. He's now on community PTO for a few weeks, so others have been taking that on.

I'm redirecting the needinfo to Q, though, since you clearly think this is a problem and I haven't been knee-deep in Windows issues in a while.
Flags: needinfo?(bugspam.Callek) → needinfo?(q)
Comment 3 (Reporter, 2 years ago)
My apologies.

In both cases these are build jobs; Treeherder (TH) shows build jobs under "XP".

The jobs were retried, but the retries took a long time to show up.

To reword my original comment:
* I'm happy to find out that we retry these jobs automatically.
* I filed the bug as an FYI that some build jobs are disconnecting, in case anyone wants to investigate.

I'm happy to close this bug if it is not worth the effort to investigate.
Flags: needinfo?(q)
Comment 4

We've also been rebooting idle XP machines, but only after checking whether any job was actually running. As Armen mentioned, these were build jobs, and they were running on b-2008-spot instances.
Is there any reason to keep this bug open? I don't think this is an issue at the moment.
Flags: needinfo?(armenzg)
Updated (Reporter, 2 years ago)
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Flags: needinfo?(armenzg)
Resolution: --- → INVALID

Updated (5 months ago)
Product: Release Engineering → Infrastructure & Operations