610399 - Occasional disconnects from stage.m.o ("ssh_exchange_identification: Connection closed by remote host")

Reporter

Description

•

14 years ago

A few of our buildbot masters hit an error trying to scp files to stage.m.o at 05:29 this morning. The error message is "ssh_exchange_identification: Connection closed by remote host".

buildbot-master2 hit it at 2010-11-08 05:29
talos-master02 hit it at 2010-11-08 05:29
buildbot-master1 hit it at 2010-11-08 05:29

Armen [:armenzg]

Updated

•

14 years ago

Comment 1

•

14 years ago

These are all VMs at 650 Castro. That ESX host can be overloaded at times.

Ben Kero [:bkero]

Comment 2

•

14 years ago

Could this be related to what dmoore said in bug 589542 about simultaneous connection attempts?  See comment 12.

Phong Tran [:phong]

Updated

•

14 years ago

Assignee: server-ops → network-operations

Component: Server Operations → Server Operations: Netops

Ravi Pina [:ravi]

Comment 3

•

14 years ago

Not sure why this is assigned to netops.

Did you look at the ssh log files on the host in question?

Justin Wood (:Callek)

Comment 4

•

14 years ago

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1293000709.1293005632.4282.gz
Linux x86-64 mozilla-central leak test build on 2010/12/21 22:51:49
s: moz2-linux64-slave03

firefox-4.0b9pre.en-US.linux-x86_64.crashrepo   0%    0     0.0KB/s   --:-- ETA
firefox-4.0b9pre.en-US.linux-x86_64.crashrepo 100%   24MB  24.3MB/s   00:01    
ssh_exchange_identification: Connection closed by remote host

ssh_exchange_identification: Connection closed by remote host

Command ['ssh', '-o', 'IdentityFile=~/.ssh/ffxbld_dsa', 'ffxbld@stage.mozilla.org', 'rm -rf /tmp/tmp.leldE14175/'] returned non-zero exit code: 255
make[1]: *** [upload] Error 2
make[1]: Leaving directory `/builds/slave/cen-lnx64-dbg/build/obj-firefox/browser/installer'
make: *** [upload] Error 2
program finished with exit code 2
elapsedTime=9.418832
=== Output ended ===

matthew zeier [:mrz]

Updated

•

14 years ago

Assignee: network-operations → server-ops

Component: Server Operations: Netops → Server Operations

Corey Shields [:cshields]

Assignee

Comment 5

•

14 years ago

Is this problem still occurring?  I did a quick check of the logs and don't see any anomalies for today.  If you can get me a specific time of occurrence that would help too.

Armen [:armenzg]

Comment 6

•

14 years ago

I had this problem when I was doing a staging release last week:
> bash -c ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org clone he l10n-central/he
> ssh_exchange_identification: Connection closed by remote host

I hit it 47 times in just one second:
from: Thu Jan 6 13:32:02 2011
to:   Thu Jan 6 13:32:03 2011

I am surprised that Axel was not already CCed on this bug.
L10n repacks in general are the jobs most likely to hit this problem: during a 10-30 minute window we have 80 repacks being uploaded separately *per platform*. Even so, which jobs actually fail seems to be a game of Russian roulette.

Corey, I still believe this is related to ssh refusing connections, as mentioned in comment 2.

I believe we could reproduce this by triggering "repo_setup" on staging in a loop, repeatedly deleting and recreating the repos. The problem happened to me after I triggered it a third time in less than 20 minutes.

job#0 - 12:49 - 74 repos deleted from users/stage-ffxbld
job#1 - 13:09 - 74 repos deleted from users/stage-ffxbld
job#2 - 13:11 - 74 repos deleted from users/stage-ffxbld
      - 13:12 - wait 10 minutes for hg before recreating repos
      - 13:22 - 27 recreated repos in users/stage-ffxbld
      - 13:32 - 47 *FAILED* *HERE* to recreate repos in users/stage-ffxbld
job#3 - 13:57 - 74 repos deleted from users/stage-ffxbld
      - 13:59 - wait 10 minutes for hg before recreating repos
      - 14:09 - 74 recreated repos in users/stage-ffxbld

I hope this info helps.
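
Editor's note: the reproduction idea above (many near-simultaneous connections) can be approximated without running the real jobs. "ssh_exchange_identification: Connection closed by remote host" means the server closed the TCP connection before sending its SSH identification string, so a probe only needs to open connections at the same moment and check whether a line starting with "SSH-" arrives. The following is a minimal sketch under that assumption; `probe_banner` and `burst` are hypothetical helper names, not part of buildbot or any Mozilla tooling, and the host, port, and count are whatever the caller supplies.

```python
import socket
import threading


def probe_banner(host, port, timeout=5.0):
    """Return True if the server sends an SSH identification string.

    A connection that is refused, reset, times out, or is closed before
    any banner arrives counts as a failure (False).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            data = sock.recv(64)
    except OSError:
        return False
    return data.startswith(b"SSH-")


def burst(host, port, count, timeout=5.0):
    """Open `count` connections at (nearly) the same instant.

    Returns a tuple (banners_received, failures).
    """
    if count < 1:
        return (0, 0)

    results = [False] * count
    # The barrier releases all workers together so the connection
    # attempts overlap instead of being spread out by thread startup.
    barrier = threading.Barrier(count)

    def worker(index):
        barrier.wait()
        results[index] = probe_banner(host, port, timeout)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    ok = results.count(True)
    return (ok, count - ok)
```

Calling `burst(host, 22, n)` against a test sshd with increasing `n` would show roughly where connections start being closed. It should only be pointed at a host you are allowed to load-test.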

Axel Hecht [:Pike]

Comment 7

•

14 years ago

L10n nightlies hit this multiple times pretty much every day. Details on the failures, and estimates of the number of parallel uploads (at least from MoCo's releng side), should be in builddb.

Corey Shields [:cshields]

Assignee

Comment 8

•

14 years ago

I've increased sshd's MaxStartups setting from the default (10) to 50 and reloaded sshd on stage.

The postponed key messages in the logs went away immediately, so I think this is a good sign. Axel, please verify that this is working for you tonight.
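
Editor's note: the setting changed here is `MaxStartups` in sshd_config, which caps the number of concurrent unauthenticated connections. The sshd_config manual describes two forms: a single number, above which every new connection is refused, and `start:rate:full`, where refusal begins with probability rate/100 at `start` unauthenticated connections and rises linearly to 100% at `full`. The sketch below models that documented policy to show why a burst of parallel uploads can fail under a limit of 10 and succeed under 50. `drop_probability` is a hypothetical name used only for illustration, and the model is a reading of the manual page, not a copy of the OpenSSH source.

```python
def drop_probability(unauthenticated, start, rate=100, full=None):
    """Probability (0.0 to 1.0) that sshd refuses a new connection.

    unauthenticated: connections currently in the pre-auth phase
    start, rate, full: the three MaxStartups fields; a single-value
    setting such as "MaxStartups 10" is modelled as start == full.
    """
    if full is None:
        full = start
    if unauthenticated < start:
        return 0.0  # below the threshold, nothing is refused
    if unauthenticated >= full:
        return 1.0  # at or above the hard limit, everything is refused
    # Between start and full the refusal rate grows linearly from
    # rate% up to 100%.
    span = full - start
    percent = rate + (100 - rate) * (unauthenticated - start) / span
    return percent / 100.0
```

With a single-value limit of 10, `drop_probability(10, 10)` is 1.0, so the eleventh simultaneous pre-auth connection is always closed. With the limit at 50, `drop_probability(10, 50)` is 0.0.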

Corey Shields [:cshields]

Assignee

Comment 9

•

13 years ago

No complaints, so I'm closing this out. Please feel free to reopen if the problem comes back.

Assignee: server-ops → cshields

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

9 years ago

Product: mozilla.org → mozilla.org Graveyard


Categories

(mozilla.org Graveyard :: Server Operations, task)

Tracking

(Not tracked)

People

(Reporter: catlee, Assigned: cshields)

