Occasional disconnects from stage.m.o ("ssh_exchange_identification: Connection closed by remote host")

RESOLVED FIXED

Status

mozilla.org Graveyard
Server Operations
--
minor
RESOLVED FIXED
7 years ago
3 years ago

People

(Reporter: catlee, Assigned: cshields)

(Reporter)

Description

7 years ago
A few of our buildbot masters hit an error trying to scp files to stage.m.o at 05:29 this morning.  The error message is "ssh_exchange_identification: Connection closed by remote host".

buildbot-master2 hit it at 2010-11-08 05:29
talos-master02 hit it at 2010-11-08 05:29
buildbot-master1 hit it at 2010-11-08 05:29

Updated

7 years ago
See Also: → bug 589542
Summary: Occasional disconnects from stage.m.o → Occasional disconnects from stage.m.o ("ssh_exchange_identification: Connection closed by remote host")

Comment 1

7 years ago
These are all VMs @ 650 Castro.  That ESX host can be overloaded at times.

Comment 2

7 years ago
Could this be related to what dmoore said in bug 589542 about simultaneous connection attempts?  See comment 12.

Updated

7 years ago
Assignee: server-ops → network-operations
Component: Server Operations → Server Operations: Netops

Comment 3

7 years ago
Not sure why this is assigned to netops.

Did you look at the ssh log files on the host in question?

Comment 4

7 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1293000709.1293005632.4282.gz
Linux x86-64 mozilla-central leak test build on 2010/12/21 22:51:49
s: moz2-linux64-slave03

firefox-4.0b9pre.en-US.linux-x86_64.crashrepo   0%    0     0.0KB/s   --:-- ETA
firefox-4.0b9pre.en-US.linux-x86_64.crashrepo 100%   24MB  24.3MB/s   00:01    
ssh_exchange_identification: Connection closed by remote host

ssh_exchange_identification: Connection closed by remote host

Command ['ssh', '-o', 'IdentityFile=~/.ssh/ffxbld_dsa', 'ffxbld@stage.mozilla.org', 'rm -rf /tmp/tmp.leldE14175/'] returned non-zero exit code: 255
make[1]: *** [upload] Error 2
make[1]: Leaving directory `/builds/slave/cen-lnx64-dbg/build/obj-firefox/browser/installer'
make: *** [upload] Error 2
program finished with exit code 2
elapsedTime=9.418832
=== Output ended ===

Updated

7 years ago
Assignee: network-operations → server-ops
Component: Server Operations: Netops → Server Operations
(Assignee)

Comment 5

7 years ago
Is this problem still occurring?  I did a quick check of the logs and don't see any anomalies for today.  If you can get me a specific time of occurrence that would help too.

Comment 6

7 years ago
I had this problem when I was doing a staging release last week:
> bash -c ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org clone he l10n-central/he
> ssh_exchange_identification: Connection closed by remote host

I hit it 47 times in just one second:
from: Thu Jan 6 13:32:02 2011
to:   Thu Jan 6 13:32:03 2011

I am surprised that Axel was not already CCed on this bug.
The L10n repacks are the jobs most likely to hit this problem, since within a 10-30 minute window we have 80 repacks being uploaded separately *per platform*. Even so, which jobs actually fail seems to be a game of Russian roulette.

Corey, I still believe that this is related to ssh refusing connections as mentioned in comment 2.

I believe we could reproduce this by triggering "repo_setup" on staging looping on deletion/creation of repos. The problem happened to me after I triggered it a 3rd time in less than 20 minutes.

job#0 - 12:49 - 74 repos deleted from users/stage-ffxbld
job#1 - 13:09 - 74 repos deleted from users/stage-ffxbld
job#2 - 13:11 - 74 repos deleted from users/stage-ffxbld
      - 13:12 - wait 10 minutes for hg before recreating repos
      - 13:22 - 27 recreated repos in users/stage-ffxbld
      - 13:32 - 47 *FAILED* *HERE* to recreate repos in users/stage-ffxbld
job#3 - 13:57 - 74 repos deleted from users/stage-ffxbld
      - 13:59 - wait 10 minutes for hg before recreating repos
      - 14:09 - 74 recreated repos in users/stage-ffxbld

I hope this info helps.

Comment 7

7 years ago
l10n nightlies hit this multiple times pretty much every day. Details on failures, and estimates on parallel uploads at least from moco's releng side would be in builddb.
(Assignee)

Comment 8

7 years ago
I've increased sshd's MaxStartups from the default (10) to 50 and reloaded sshd on stage.

The postponed key messages in the logs went away immediately, so I think this is a good sign.  Axel, please verify tonight that this is working for you.
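
For anyone reading this later, here is a rough sketch of why raising the limit helps. It models the connection-refusal rule that sshd_config(5) documents for the MaxStartups option (start:rate:full): below "start" unauthenticated connections nothing is refused, at "start" the refusal chance is rate/100, and it rises linearly to 100% at "full". The function name, the assumption that a single value N behaves as a hard limit at N, and the burst size of 40 are illustrative only; none of this is taken from the configuration on stage.

```python
def drop_probability(unauthenticated, start, rate=100, full=None):
    """Approximate chance (0.0-1.0) that sshd refuses a new connection.

    Sketch of the MaxStartups start:rate:full rule as described in
    sshd_config(5). Not copied from the sshd source.
    """
    if full is None:
        # Assumption: a single value N acts as a hard limit at N.
        full = start
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full or rate >= 100:
        return 1.0
    # Linear ramp from rate% at "start" to 100% at "full".
    extra = (100 - rate) * (unauthenticated - start) / (full - start)
    return (rate + extra) / 100.0


if __name__ == "__main__":
    burst = 40  # hypothetical number of pending unauthenticated connections
    for limit in (10, 50):
        print("MaxStartups", limit, "->", drop_probability(burst, limit))
```

Under these assumptions, a burst of 40 pending connections is always refused at a limit of 10 and never refused at a limit of 50, which would be consistent with the failures clustering around the parallel l10n uploads.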
(Assignee)

Comment 9

7 years ago
No complaints, so I'm closing this out.  Please feel free to reopen if the problem comes back.
Assignee: server-ops → cshields
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard