Closed
Bug 610399
Opened 15 years ago
Closed 15 years ago
Occasional disconnects from stage.m.o ("ssh_exchange_identification: Connection closed by remote host")
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Assigned: cshields)
A few of our buildbot masters hit an error trying to scp files to stage.m.o at 05:29 this morning. The error message is "ssh_exchange_identification: Connection closed by remote host".
buildbot-master2 hit it at 2010-11-08 05:29
talos-master02 hit it at 2010-11-08 05:29
buildbot-master1 hit it at 2010-11-08 05:29
Updated•15 years ago
See Also: → 589542
Summary: Occasional disconnects from stage.m.o → Occasional disconnects from stage.m.o ("ssh_exchange_identification: Connection closed by remote host")
Comment 1•15 years ago
These are all VMs at 650 Castro. That ESX host can be overloaded at times.
Comment 2•15 years ago
Could this be related to what dmoore said in bug 589542 about simultaneous connection attempts? See comment 12.
Updated•15 years ago
Assignee: server-ops → network-operations
Component: Server Operations → Server Operations: Netops
Comment 3•15 years ago
Not sure why this is assigned to netops.
Did you look at the ssh log files on the host in question?
Comment 4•15 years ago
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1293000709.1293005632.4282.gz
Linux x86-64 mozilla-central leak test build on 2010/12/21 22:51:49
s: moz2-linux64-slave03
firefox-4.0b9pre.en-US.linux-x86_64.crashrepo 0% 0 0.0KB/s --:-- ETA
firefox-4.0b9pre.en-US.linux-x86_64.crashrepo 100% 24MB 24.3MB/s 00:01
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host
Command ['ssh', '-o', 'IdentityFile=~/.ssh/ffxbld_dsa', 'ffxbld@stage.mozilla.org', 'rm -rf /tmp/tmp.leldE14175/'] returned non-zero exit code: 255
make[1]: *** [upload] Error 2
make[1]: Leaving directory `/builds/slave/cen-lnx64-dbg/build/obj-firefox/browser/installer'
make: *** [upload] Error 2
program finished with exit code 2
elapsedTime=9.418832
=== Output ended ===
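The log above shows a single refused connection (ssh exit code 255) failing the whole upload step. As a rough client-side mitigation, a retry wrapper along these lines might help. This is only a sketch: `run_with_retry`, its parameters, and the delays are hypothetical and not part of the build scripts. It relies on ssh exiting with 255 on connection-level errors, so a remote command that itself exits 255 would also be retried.

```python
import random
import subprocess
import time

# ssh exits with 255 when the connection itself fails
# (as in the "rm -rf /tmp/tmp.leldE14175/" command in the log above).
SSH_CONNECTION_ERROR = 255


def run_with_retry(cmd, attempts=5, base_delay=1.0,
                   run=subprocess.call, sleep=time.sleep):
    """Run cmd, retrying with jittered exponential backoff on exit code 255.

    `run` and `sleep` are injectable so the logic can be exercised
    without opening real connections (hypothetical helper, not buildbot API).
    """
    rc = SSH_CONNECTION_ERROR
    for attempt in range(attempts):
        rc = run(cmd)
        if rc != SSH_CONNECTION_ERROR:
            return rc  # success, or a failure of the remote command itself
        if attempt < attempts - 1:
            # Jitter keeps parallel uploaders from retrying in lockstep.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    return rc
```

Retrying only hides the symptom on the client side; it does not explain why the server closes the connection.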
Updated•15 years ago
Assignee: network-operations → server-ops
Component: Server Operations: Netops → Server Operations
Comment 5•15 years ago
Is this problem still occurring? I did a quick check of the logs and don't see any anomalies for today. If you can get me a specific time of occurrence that would help too.
Comment 6•15 years ago
I had this problem when I was doing a staging release last week:
> bash -c ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org clone he l10n-central/he
> ssh_exchange_identification: Connection closed by remote host
I hit it 47 times in just one second:
from: Thu Jan 6 13:32:02 2011
to: Thu Jan 6 13:32:03 2011
I am surprised that Axel was not already CCed on this bug.
The L10n repacks in general are the jobs most likely to hit this problem, since within a 10-30 minute window we have 80 repacks being uploaded separately *per platform*. Even so, which job actually fails seems to be Russian roulette.
Corey, I still believe that this is related to ssh refusing connections, as mentioned in comment 2.
I believe we could reproduce this by triggering "repo_setup" on staging in a loop that deletes and recreates the repos. The problem happened to me after I triggered it a 3rd time in less than 20 minutes.
job#0 - 12:49 - 74 repos deleted from users/stage-ffxbld
job#1 - 13:09 - 74 repos deleted from users/stage-ffxbld
job#2 - 13:11 - 74 repos deleted from users/stage-ffxbld
- 13:12 - wait 10 minutes for hg before recreating repos
- 13:22 - 27 recreated repos in users/stage-ffxbld
- 13:32 - 47 *FAILED* *HERE* to recreate repos in users/stage-ffxbld
job#3 - 13:57 - 74 repos deleted from users/stage-ffxbld
- 13:59 - wait 10 minutes for hg before recreating repos
- 14:09 - 74 recreated repos in users/stage-ffxbld
I hope this info helps.
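If the burst of simultaneous connections is the trigger, one client-side option would be to cap how many ssh sessions are started at once. The sketch below assumes the repo creations are launched from a single process, which may not match how repo_setup actually runs; `run_bounded` and `max_parallel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tasks, max_parallel=5):
    """Run zero-argument callables with at most max_parallel in flight.

    Results are returned in the same order as `tasks`.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]
```

Each task would wrap one ssh invocation (for example one clone per locale). Keeping `max_parallel` below the server's limit on pending connections should avoid tripping it from this one client, although other clients connecting at the same moment still count against the same limit.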
Comment 7•15 years ago
l10n nightlies hit this multiple times pretty much every day. Details on failures, and estimates on parallel uploads at least from moco's releng side would be in builddb.
Comment 8•15 years ago
I've increased the MaxStartups count from the default (10) to 50 and reloaded sshd on stage.
The postponed key messages in the logs went away immediately, so I think this is a good sign. Please verify that this is working for you tonight, Axel.
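For anyone wondering why this setting matters: as I read the sshd_config man page, MaxStartups limits the number of concurrent unauthenticated connections, and accepts either a single number or a start:rate:full triple. The sketch below is a simplified model of that rule, not sshd's actual code. Treating a single number as start = full (a hard cap) is my reading of the documentation, and `drop_probability` is a hypothetical name.

```python
def drop_probability(unauthenticated, start=10, rate=100, full=10):
    """Percent chance that a new connection is refused under MaxStartups.

    `unauthenticated` is the number of connections that have not yet
    finished authenticating. A bare "MaxStartups N" is modelled here as
    start = full = N (assumption based on the man page).
    """
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full or rate >= 100:
        return 100.0
    # Linear ramp from `rate` percent at `start` up to 100 percent at `full`.
    return rate + (100 - rate) * (unauthenticated - start) / (full - start)


# Illustrative load only: 47 is borrowed from the failure count in comment 6,
# not a measured number of pending connections.
before = drop_probability(47, start=10, full=10)  # old setting
after = drop_probability(47, start=50, full=50)   # new setting
```

Under this model the old setting refuses every connection once 10 are pending, while the new one accepts all of them until 50 are pending.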
Comment 9•15 years ago
No complaints, so I'm closing this out. Please feel free to reopen if the problem comes back.
Assignee: server-ops → cshields
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard