Closed Bug 610399 Opened 14 years ago Closed 13 years ago

Occasional disconnects from stage.m.o ("ssh_exchange_identification: Connection closed by remote host")

Categories

(mozilla.org Graveyard :: Server Operations, task)

All
Other
task
Not set
minor

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: cshields)

A few of our buildbot masters hit an error trying to scp files to stage.m.o at 05:29 this morning.  The error message is "ssh_exchange_identification: Connection closed by remote host".

buildbot-master2 hit it at 2010-11-08 05:29
talos-master02 hit it at 2010-11-08 05:29
buildbot-master1 hit it at 2010-11-08 05:29
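A quick way to check whether failures like these cluster in time is to bucket them by minute. This is only a minimal sketch: the three reports are copied from the list above, while the function names and the `threshold` cutoff are hypothetical and chosen for illustration.

```python
from collections import Counter
from datetime import datetime

# Failure reports copied from the comment above: (host, timestamp).
REPORTS = [
    ("buildbot-master2", "2010-11-08 05:29"),
    ("talos-master02", "2010-11-08 05:29"),
    ("buildbot-master1", "2010-11-08 05:29"),
]


def failures_per_minute(reports):
    """Count how many failure reports fall into each minute."""
    counts = Counter()
    for _host, stamp in reports:
        minute = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        counts[minute] += 1
    return counts


def bursts(reports, threshold=2):
    """Return the minutes with at least `threshold` reports.

    `threshold` is an arbitrary illustrative cutoff, not a measured value.
    """
    counts = failures_per_minute(reports)
    return sorted(minute for minute, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    for minute in bursts(REPORTS):
        print(minute, failures_per_minute(REPORTS)[minute])
```

Several hosts failing within the same minute points at something on the server side rather than at any one client.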
See Also: → 589542
Summary: Occasional disconnects from stage.m.o → Occasional disconnects from stage.m.o ("ssh_exchange_identification: Connection closed by remote host")
These are all VMs @ 650 Castro.  That ESX host can be overloaded at times.
Could this be related to what dmoore said in bug 589542 about simultaneous connection attempts?  See comment 12.
Assignee: server-ops → network-operations
Component: Server Operations → Server Operations: Netops
Not sure why this is assigned to netops.

Did you look at the ssh log files on the host in question?
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1293000709.1293005632.4282.gz
Linux x86-64 mozilla-central leak test build on 2010/12/21 22:51:49
s: moz2-linux64-slave03

firefox-4.0b9pre.en-US.linux-x86_64.crashrepo   0%    0     0.0KB/s   --:-- ETA
firefox-4.0b9pre.en-US.linux-x86_64.crashrepo 100%   24MB  24.3MB/s   00:01    
ssh_exchange_identification: Connection closed by remote host

ssh_exchange_identification: Connection closed by remote host

Command ['ssh', '-o', 'IdentityFile=~/.ssh/ffxbld_dsa', 'ffxbld@stage.mozilla.org', 'rm -rf /tmp/tmp.leldE14175/'] returned non-zero exit code: 255
make[1]: *** [upload] Error 2
make[1]: Leaving directory `/builds/slave/cen-lnx64-dbg/build/obj-firefox/browser/installer'
make: *** [upload] Error 2
program finished with exit code 2
elapsedTime=9.418832
=== Output ended ===
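For reference, ssh itself exits with status 255 when the connection fails, which is what the log above shows. A client-side mitigation could retry on that status with a jittered backoff. This is a hypothetical sketch, not something that was deployed for this bug; `run_with_retry` and its parameters are made up for illustration.

```python
import random
import subprocess
import time

# ssh exits with 255 when ssh itself hits an error (e.g. connection closed).
# Caveat: a remote command could also exit with 255, so this check is a heuristic.
SSH_ERROR_EXIT = 255


def run_with_retry(cmd, attempts=5, base_delay=1.0,
                   runner=subprocess.call, sleep=time.sleep):
    """Run `cmd`, retrying with jittered exponential backoff on exit code 255.

    `runner` and `sleep` are injectable so the logic can be exercised
    without a network. Returns the exit code of the last attempt.
    """
    code = SSH_ERROR_EXIT
    for attempt in range(attempts):
        code = runner(cmd)
        if code != SSH_ERROR_EXIT:
            return code
        if attempt < attempts - 1:
            delay = base_delay * (2 ** attempt)
            # Random jitter keeps parallel jobs from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))
    return code
```

Only idempotent commands (such as the `rm -rf` cleanup above) are safe to retry blindly. The jitter matters here because many uploads start at nearly the same time.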
Assignee: network-operations → server-ops
Component: Server Operations: Netops → Server Operations
Is this problem still occurring?  I did a quick check of the logs and don't see any anomalies for today.  If you can get me a specific time of occurrence that would help too.
I had this problem when I was doing a staging release last week:
> bash -c ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org clone he l10n-central/he
> ssh_exchange_identification: Connection closed by remote host

I hit it 47 times in just one second:
from: Thu Jan 6 13:32:02 2011
to:   Thu Jan 6 13:32:03 2011

I am surprised that Axel was not already CCed on this bug.
The L10n repacks are the jobs most likely to hit this problem, because within a 10-30 minute window we have 80 repacks being uploaded separately *per platform*. Even so, which jobs actually fail seems to be a game of Russian roulette.

Corey, I still believe that this is related to ssh refusing connections, as mentioned in comment 2.

I believe we could reproduce this by triggering "repo_setup" on staging in a loop, so that it repeatedly deletes and recreates the repos. The problem happened to me after I triggered it a third time in less than 20 minutes.

job#0 - 12:49 - 74 repos deleted from users/stage-ffxbld
job#1 - 13:09 - 74 repos deleted from users/stage-ffxbld
job#2 - 13:11 - 74 repos deleted from users/stage-ffxbld
      - 13:12 - wait 10 minutes for hg before recreating repos
      - 13:22 - 27 recreated repos in users/stage-ffxbld
      - 13:32 - 47 *FAILED* *HERE* to recreate repos in users/stage-ffxbld
job#3 - 13:57 - 74 repos deleted from users/stage-ffxbld
      - 13:59 - wait 10 minutes for hg before recreating repos
      - 14:09 - 74 recreated repos in users/stage-ffxbld

I hope this info helps.
l10n nightlies hit this multiple times pretty much every day. Details on failures, and estimates on parallel uploads at least from moco's releng side would be in builddb.
I've increased MaxStartups in sshd_config from the default (10) to 50 and reloaded sshd on stage.

The postponed key messages in the logs went away immediately, so I think this is a good sign.  Axel, please verify that this is working for you tonight.
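The setting changed above is sshd's `MaxStartups` option, which caps concurrent unauthenticated connections. As a rough model of why raising it helps, the sketch below follows the behaviour described in the sshd_config man page (a single number is a hard cutoff, and `start:rate:full` enables random early drop). It is a simplified illustration, not sshd's actual code; the function names and the figure of 30 pending connections are hypothetical.

```python
def parse_max_startups(value):
    """Parse a MaxStartups value into a (start, rate, full) tuple."""
    parts = [int(p) for p in str(value).split(":")]
    if len(parts) == 1:
        # A single number is a hard cutoff: nothing is dropped below it,
        # everything is dropped at or above it.
        return parts[0], 100, parts[0]
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    raise ValueError("expected N or start:rate:full, got %r" % (value,))


def drop_percent(unauthenticated, start, rate, full):
    """Approximate chance (in percent) that a new connection is refused.

    `unauthenticated` is the number of connections that have not yet
    finished authenticating when the new one arrives.
    """
    if unauthenticated < start:
        return 0
    if unauthenticated >= full or rate == 100:
        return 100
    # Linear ramp from `rate` percent at `start` up to 100 percent at `full`.
    return rate + (100 - rate) * (unauthenticated - start) // (full - start)


if __name__ == "__main__":
    pending = 30  # hypothetical burst of simultaneous uploads
    for setting in ("10", "50"):
        print(setting, drop_percent(pending, *parse_max_startups(setting)))
```

Under this model a burst of 30 pending connections is refused outright with a limit of 10 and accepted with a limit of 50, which is consistent with the errors stopping after the change.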
No complaints, so I'm closing this out.  Please feel free to reopen if the problem comes back.
Assignee: server-ops → cshields
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard