A few of our buildbot masters hit an error while trying to scp files to stage.m.o at 05:29 this morning. The error message is:

ssh_exchange_identification: Connection closed by remote host

buildbot-master2 hit it at 2010-11-08 05:29
talos-master02 hit it at 2010-11-08 05:29
buildbot-master1 hit it at 2010-11-08 05:29
These are all VMs at 650 Castro. That ESX host can be overloaded at times.
Could this be related to what dmoore said in bug 589542 about simultaneous connection attempts? See comment 12.
Not sure why this is assigned to netops. Did you look at the ssh log files on the host in question?
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1293000709.1293005632.4282.gz
Linux x86-64 mozilla-central leak test build on 2010/12/21 22:51:49
s: moz2-linux64-slave03

firefox-4.0b9pre.en-US.linux-x86_64.crashrepo   0%    0     0.0KB/s --:-- ETA
firefox-4.0b9pre.en-US.linux-x86_64.crashrepo 100%   24MB  24.3MB/s   00:01
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host
Command ['ssh', '-o', 'IdentityFile=~/.ssh/ffxbld_dsa', 'email@example.com', 'rm -rf /tmp/tmp.leldE14175/'] returned non-zero exit code: 255
make: *** [upload] Error 2
make: Leaving directory `/builds/slave/cen-lnx64-dbg/build/obj-firefox/browser/installer'
make: *** [upload] Error 2
program finished with exit code 2
elapsedTime=9.418832
=== Output ended ===
Is this problem still occurring? I did a quick check of the logs and don't see any anomalies for today. If you can get me a specific time of occurrence, that would help too.
I hit this problem while doing a staging release last week:

> bash -c ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org clone he l10n-central/he
> ssh_exchange_identification: Connection closed by remote host

I hit it 47 times in just one second:
from: Thu Jan  6 13:32:02 2011
to:   Thu Jan  6 13:32:03 2011

I am surprised that Axel was not already CCed on this bug. The L10n repacks are the jobs most likely to hit this problem, since within a 10-30 minute window we have 80 repacks being uploaded separately *per platform*. That makes L10n repacks the most frequent victims, but really every job is playing Russian roulette.

Corey, I still believe this is related to sshd refusing connections, as mentioned in comment 2. We could probably reproduce it by triggering "repo_setup" on staging in a loop, deleting and recreating the repos. The problem happened to me after I triggered it a 3rd time in less than 20 minutes:

job#0 - 12:49 - 74 repos deleted from users/stage-ffxbld
job#1 - 13:09 - 74 repos deleted from users/stage-ffxbld
job#2 - 13:11 - 74 repos deleted from users/stage-ffxbld
      - 13:12 - wait 10 minutes for hg before recreating repos
      - 13:22 - 27 recreated repos in users/stage-ffxbld
      - 13:32 - 47 *FAILED* *HERE* to recreate repos in users/stage-ffxbld
job#3 - 13:57 - 74 repos deleted from users/stage-ffxbld
      - 13:59 - wait 10 minutes for hg before recreating repos
      - 14:09 - 74 recreated repos in users/stage-ffxbld

I hope this info helps.
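For confirming one-second bursts like the one above from the server side, something along these lines against the sshd log would group refusals by timestamp. This is only a sketch: the log lines here are fabricated placeholders written to a temp file for illustration (real OpenSSH MaxStartups refusals log roughly as "drop connection #N from ..."; adjust the pattern and fields to the actual log format on stage).

```shell
# Fabricated sample of sshd refusal lines (placeholder data, not real logs):
cat > /tmp/sshd_sample.log <<'EOF'
Jan  6 13:32:02 stage sshd[100]: drop connection #11 from 10.0.0.5
Jan  6 13:32:02 stage sshd[101]: drop connection #12 from 10.0.0.5
Jan  6 13:32:03 stage sshd[102]: drop connection #13 from 10.0.0.5
EOF

# Count refusals per second: pull out the timestamp fields, then tally
# identical timestamps to expose one-second bursts.
awk '/drop connection/ {print $1, $2, $3}' /tmp/sshd_sample.log | sort | uniq -c
```

On the sample above this prints a count of 2 for 13:32:02 and 1 for 13:32:03; a burst of 47 in one second would stand out the same way.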
l10n nightlies hit this multiple times pretty much every day. Details on failures, and estimates of parallel uploads (at least on moco's releng side), would be in builddb.
I've increased the MaxStartups value from the default (10) to 50 and reloaded sshd on stage. The postponed-key messages in the logs went away immediately, so I think this is a good sign. Please verify that this is working for you tonight, Axel.
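For reference, the change amounts to a one-line edit in sshd_config followed by a reload; the value below mirrors what was described above, but treat it as a sketch (older OpenSSH takes a single number, while newer versions also accept a begin:rate:full triple for probabilistic early drops):

```
# /etc/ssh/sshd_config (excerpt)
# Maximum concurrent unauthenticated connections sshd will hold open.
# The old default of 10 meant bursts of parallel scp/ssh attempts got
# "ssh_exchange_identification: Connection closed by remote host".
MaxStartups 50
```

Reloading (rather than restarting) sshd applies this without dropping established sessions.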
No complaints, so I'm closing this out. Please feel free to reopen if the problem comes back.