Closed Bug 1038478 Opened 11 years ago Closed 11 years ago

releases/l10n/mozilla-beta/ repos are inconsistent across hgweb*

Categories

(Developer Services :: General, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: fubar)

Attachments

(2 files)

Two of our three android builds for 31.0b10 worked ok, but one failed to get rev 50b3a539bd85 from https://hg.mozilla.org/releases/l10n/mozilla-beta/ms/. hgweb1/3/7 have it, 2/4/5/6/8 don't.
Summary: releases/l10n/mozilla-beta/ms repo is inconsistent → releases/l10n/mozilla-beta/ms repo is inconsistent across hgweb*
While digging for the possible problem I found these errors on hgssh1:

Jul 14 15:39:11 hgssh1 repo-push.sh to hgweb6.dmz.scl3.mozilla.com releases/l10n/mozilla-beta/ms from ffxbld): remote: ssh_exchange_identification: Connection closed by remote host#015
Jul 14 15:39:11 hgssh1 repo-push.sh to hgweb6.dmz.scl3.mozilla.com releases/l10n/mozilla-beta/ms from ffxbld): abort: no suitable response from remote hg!

There were no relevant logs on hgweb6 (the PAM error came later):

[root@hgweb6.dmz.scl3 log]# grep 15:39 secure messages|grep -v puppet
secure:Jul 13 15:39:05 hgweb6 sshd[6355]: Set /proc/self/oom_score_adj to 0
secure:Jul 13 15:39:05 hgweb6 sshd[6355]: Connection from 10.22.75.42 port 35459
secure:Jul 13 15:39:05 hgweb6 sshd[6356]: Connection closed by 10.22.75.42
secure:Jul 14 15:39:27 hgweb6 sudo: PAM unable to dlopen(/lib64/security/pam_fprintd.so): /lib64/security/pam_fprintd.so: cannot open shared object file: No such file or directory
secure:Jul 14 15:39:27 hgweb6 sudo: PAM adding faulty module: /lib64/security/pam_fprintd.so
(the repository was synced out correctly when running the syncing script manually)
Handing off to general dev services for whoever has time to investigate this
Assignee: bkero → server-ops-webops
Assignee: server-ops-webops → server-ops-devservices
Component: WebOps: Source Control → Server Operations: Developer Services
Product: Infrastructure & Operations → mozilla.org
Thanks for the fix Ben.
Severity: critical → normal
hgssh1.dmz.scl3# grep repo-push messages* | grep 'Connection closed by remote host' | wc -l
328

So, this has hit us 328 times in the last month, but we only noticed now. I have greatly mixed feelings about that. I've manually synced up everything; there were only a few outstanding repos.

I think we have two paths from here:

1) figure out why the sync is failing - because it's all in the same DC, this has some merit, in case there are larger issues, but overall I feel this is a lower priority than..

2) make the sync process more robust - currently, it's just a bash for loop that runs logger and ssh (that dumps output to logger). at the very least, it's probably worth setting pipefail and checking the return code for the ssh pipeline. ideal might be checking the output and attempting to re-run the push to failed nodes.
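To illustrate the pipefail point in option 2: without pipefail, bash reports the exit status of the last command in a pipeline, so a failing ssh piped into logger looks like a success. The following is a minimal sketch, not the actual repo-push.sh; `false` and `cat` are stand-ins for the failing ssh and for logger.

```shell
#!/bin/bash
# Stand-ins (assumptions): `false` plays a failing ssh, `cat` plays logger.

# Default behaviour: the pipeline reports the status of its last command,
# so the failure of the first command is hidden.
set +o pipefail
status_without=0
false | cat || status_without=$?

# With pipefail: the pipeline reports the last non-zero status of any
# command in it, so the failure becomes visible to the caller.
set -o pipefail
status_with=0
false | cat || status_with=$?

echo "without pipefail: $status_without, with pipefail: $status_with"
# prints "without pipefail: 0, with pipefail: 1"
```

With pipefail on, an `if ! ssh ... | logger; then` check in the loop would actually see the ssh failure.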
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb6.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): remote: ssh_exchange_identification: Connection closed by remote host#015
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb6.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): abort: no suitable response from remote hg!
Jul 18 09:54:07 hgssh1 sshd[11734]: Connection from 10.22.74.212 port 14693
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb7.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): remote: ssh_exchange_identification: Connection closed by remote host#015
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb7.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): abort: no suitable response from remote hg!
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb5.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): remote: ssh_exchange_identification: Connection closed by remote host#015
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb5.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): abort: no suitable response from remote hg!
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb8.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): remote: ssh_exchange_identification: Connection closed by remote host#015
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb8.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): abort: no suitable response from remote hg!
list of similar sync failures since the 15th:

Jul 16 11:05:14 hgweb3.dmz.scl3.mozilla.com users/mgervasini_mozilla.com/en-GB
Jul 16 11:06:48 hgweb6.dmz.scl3.mozilla.com try
Jul 16 18:09:40 hgweb7.dmz.scl3.mozilla.com try
Jul 16 18:09:40 hgweb8.dmz.scl3.mozilla.com try
Jul 16 18:09:40 hgweb3.dmz.scl3.mozilla.com try
Jul 17 21:10:30 hgweb4.dmz.scl3.mozilla.com releases/mozilla-aurora
Jul 17 23:27:10 hgweb6.dmz.scl3.mozilla.com releases/gaia-l10n/v1_3/zh-TW
Jul 17 23:30:31 hgweb8.dmz.scl3.mozilla.com releases/mozilla-aurora
Jul 18 09:54:07 hgweb6.dmz.scl3.mozilla.com build/buildbot-configs
Jul 18 09:54:07 hgweb7.dmz.scl3.mozilla.com build/buildbot-configs
Jul 18 09:54:07 hgweb5.dmz.scl3.mozilla.com build/buildbot-configs
Jul 18 09:54:07 hgweb8.dmz.scl3.mozilla.com build/buildbot-configs
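A per-host tally of these failures can be pulled straight out of syslog. This is a rough sketch that assumes the log line format quoted earlier in this bug ("repo-push.sh to HOST REPO from USER"); `summarize_failures` is a hypothetical helper name, not something in repo-push.sh.

```shell
#!/bin/bash
# summarize_failures: read syslog lines on stdin and print "COUNT HOST"
# for every web head that closed the ssh connection during a push.
# Assumes the "repo-push.sh to HOST REPO from USER" line format.
summarize_failures() {
    grep 'Connection closed by remote host' |
        sed -n 's/.*repo-push\.sh to \([^ ]*\) .*/\1/p' |
        sort | uniq -c | sort -rn
}
```

Run on hgssh1, this could be fed with something like `cat messages* | summarize_failures`. The later log format (with the PID in brackets and the repo before the host) would need a different sed pattern.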
Blocks: 1042210
Added ssh option 'ConnectionAttempts=3' to help with the ssh errors. Default is 1. The bitbucket thread discussing similar issues noted that an immediate retry usually (always?) got around the problem. (ssh options also moved to variable for easier reading)

Failing that... exit status of ssh now checked; on error, waits a second and then tries again, with additional logging. Because bash, this was turned into a function that we can background, while still checking return codes, etc.

I can still elicit ssh errors, but it requires multiple simultaneous invocations. It's possible that they may still crop up, particularly on bigger/slower repos, but the connect attempts and retry should help with that.
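The general shape of that change might look like the sketch below. It is not the attached script: the function names, the `log` stand-in for logger, the ConnectTimeout value, and the remote command passed to ssh are all assumptions for illustration. Only the `-o ConnectionAttempts=3` option and the wait-a-second-then-retry behaviour come from the description above.

```shell
#!/bin/bash
# Hypothetical sketch; names and the remote command are assumptions.

# ssh options kept in one variable for readability.
SSH_OPTS="-o ConnectionAttempts=3 -o ConnectTimeout=3"

# Stand-in for logger(1) so the sketch runs anywhere.
log() { echo "repo-push: $*" >&2; }

# run_push HOST REPO: one push attempt (remote command is an assumption).
run_push() {
    # shellcheck disable=SC2086  # SSH_OPTS is split into words on purpose
    ssh $SSH_OPTS "$1" "mirror-pull $2"
}

# push_with_retry HOST REPO: try once; on error wait a second and retry.
push_with_retry() {
    local host=$1 repo=$2
    if run_push "$host" "$repo"; then
        return 0
    fi
    log "push of $repo to $host failed; retrying"
    sleep 1
    if run_push "$host" "$repo"; then
        return 0
    fi
    log "retry of $repo to $host failed"
    return 1
}

# push_all REPO HOST...: start one background push per host, then wait
# on each PID so that every exit status is still checked.
push_all() {
    local repo=$1 host i failed=0
    local pids=() hosts=()
    shift
    for host in "$@"; do
        push_with_retry "$host" "$repo" &
        pids+=("$!")
        hosts+=("$host")
    done
    for i in "${!pids[@]}"; do
        if ! wait "${pids[$i]}"; then
            log "giving up on ${hosts[$i]} for $repo"
            failed=1
        fi
    done
    return "$failed"
}
```

Wrapping the attempt in a function is what makes backgrounding compatible with status checks: `wait PID` returns the exit status of that particular background job.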
Attachment #8460548 - Flags: review?(bkero)
Attachment #8460548 - Flags: feedback?(chris.lonnen)
Comment on attachment 8460548: updated repo-push.sh to better handle/report errors

This is better than what we have -- klibby's tests show that this does detect and log a class of ssh error which was "invisible" before. Let's go with it for now.
Attachment #8460548 - Flags: review+
Comment on attachment 8460548: updated repo-push.sh to better handle/report errors

lgtm
Attachment #8460548 - Flags: review?(bkero) → review+
Comment on attachment 8460548: updated repo-push.sh to better handle/report errors

Lines {8,9} and {14,15} could DRY out a little, but this will do the job.
Attachment #8460548 - Flags: feedback?(chris.lonnen) → feedback+
Added "-o ServerAliveInterval=5" to ssh options after conversation in #vcs. Script committed to puppet and is in production.
update: no new out-of-sync issues reported since deployment
checked all repos updated since Jul 1, and we're still all good.
Assignee: server-ops-devservices → klibby
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
we missed adding the ssh ServerAlive* and ConnectionAttempts to the hg user on hgweb* (for hgweb*->hg.m.o:22 pulls), and ran into a hangup with hgweb6 pulling a try update. added in r91124.
See Also: → 1036998
high load on the hg web heads this morning caused issues, e.g.:

Jul 31 04:39:00 hgssh1 repo-push.sh[17062] integration/gaia-central to hgweb1.dmz.scl3.mozilla.com for vcs-sync@mozilla.com: Connection timed out during banner exchange#015
Jul 31 04:39:04 hgssh1 repo-push.sh[17062] retry integration/gaia-central to hgweb3.dmz.scl3.mozilla.com for vcs-sync@mozilla.com: Connection timed out during banner exchange#015
Jul 31 04:39:04 hgssh1 repo-push.sh[17062] retry integration/gaia-central to hgweb4.dmz.scl3.mozilla.com for vcs-sync@mozilla.com: Connection timed out during banner exchange#015

integration/gaia-central, integration/gaia, try-comm-central, and one user repo were affected. I've increased the ConnectTimeout in repo-push.sh from 3s to 10s.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
We had another case with https://hg.mozilla.org/releases/l10n/mozilla-beta/si in the Firefox 32.0b4 release. The build machine did a fresh clone, then tried to update to the tag FIREFOX_32_0b4_RELEASE. This failed because 23eeb52560a7 was the last changeset in the repo, i.e. it was missing d6bdca917766.
Same events in the logs:

Aug 4 16:27:27 hgssh1 repo-push.sh[28452] releases/l10n/mozilla-beta/si to hgweb3.dmz.scl3.mozilla.com for ffxbld: remote: ssh_exchange_identification: Connection closed by remote host#015
Aug 4 16:27:27 hgssh1 repo-push.sh[28452] releases/l10n/mozilla-beta/si to hgweb3.dmz.scl3.mozilla.com for ffxbld: abort: no suitable response from remote hg!
Same again with https://hg.mozilla.org/releases/l10n/mozilla-beta/kn today. Do we have any more ideas about tweaking the config? This release automation seems to tickle this bug every beta.
Summary: releases/l10n/mozilla-beta/ms repo is inconsistent across hgweb* → releases/l10n/mozilla-beta/ repos are inconsistent across hgweb*
Repos resynced. If this is blocking stuff overnight, let the MOC know and they can use the docs at https://mana.mozilla.org/wiki/display/SYSADMIN/Mercurial+-+Common+Repository+Operations#Mercurial-CommonRepositoryOperations-Verifyingandre-syncingwebheads to fix it. Need to go and look through the logs to see where these syncs fell apart. I know we can make a few more tweaks in one part of the process, but I'm not sure we've seen it break there yet.
The zeus connection mgmt settings for the hgssh pool were kinda low - 4s connect timeout and 30s no-reply timeout. Increased to 30s and 45s.

mirror-pull still needs to be changed to retry pulls on failure.

Also, we may have glossed over it, but we currently set ConnectionAttempts=3 for ssh, which is actually only three attempts one second apart, rather than 3*N seconds. We might also want to increase the ConnectTimeout to match zeus; currently set to 15s.
unherped my derp and found the easy way to have mirror-pull retry pulls/clones and scp's of pushlog. also makes the pushlog swap contingent on scp's success.
Attachment #8472534 - Flags: review?(bkero)
Attachment #8472534 - Flags: feedback?(hwine)
Comment on attachment 8472534: add retries to mirror-pull

lgtm as long as the retry function is also fetching the commands after && and honoring the redirects
Attachment #8472534 - Flags: review?(bkero) → review+
it does if I put quotes around it. thx.
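The quoting matters because the calling shell parses `&&` and redirects before the function ever sees its arguments. The sketch below assumes a retry function that evaluates its argument string with `eval`; the real function in the attachment is not shown in this bug, so treat the body as a guess at the technique rather than the actual patch.

```shell
#!/bin/bash
# retry CMD-STRING: evaluate the string; if it fails, pause and evaluate
# it once more. The eval-based body is an assumption for illustration.
retry() {
    local _cmd="$*"
    if eval "$_cmd"; then
        return 0
    fi
    echo "retrying $_cmd" >&2
    sleep 1
    eval "$_cmd"
}

# Unquoted, the calling shell splits at && and applies the redirect
# itself, so only "hg pull" is ever retried:
#   retry hg pull && hg update > /dev/null
#
# Quoted, the whole chain, including the redirect, reaches retry as one
# string and is re-run as a unit:
#   retry "hg pull && hg update > /dev/null"
```

One trade-off of `eval` is that the string is parsed again by the shell, so any variables interpolated into it need careful quoting.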
At least a dozen failed syncs over the course of last night. Committed changes to mirror-pull in r92016. Rolling out to hgweb heads now.
success!

Aug 14 07:00:43 hgssh1 repo-push.sh[17266] -e try to hgweb9.dmz.scl3.mozilla.com for pvanderbeken@mozilla.com: remote: ssh_exchange_identification: Connection closed by remote host#015
Aug 14 07:00:45 hgssh1 repo-push.sh[17266] -e try to hgweb9.dmz.scl3.mozilla.com for pvanderbeken@mozilla.com: retrying hg pull --config hooks.pretxnchangegroup.z_linearhistory= --config hooks.pretxnchangegroup.z_loghistory= --config trusted.users=root,hg --config paths.default=ssh://hg.mozilla.org/try

still unknown what's causing the intermittent ssh failures on pull, though.
We didn't have any problems with 32.0b7 today, there was much rejoicing!
Increased sshd's MaxSessions and MaxStartups to 50 yesterday after continued failures and (successful) retries. No unexpected issues since.
Attachment #8472534 - Attachment is patch: true
Comment on attachment 8472534: add retries to mirror-pull

Review of attachment 8472534:
-----------------------------------------------------------------

::: mirror-pull.erb
@@ +115,5 @@
>
> cd $REPO_TARGET || die "$REPO_TARGET does not exist, cannot create repositories there"
>
> +retry() {
> + local _cmd=$*

nit - _cmd="$@" catches some cases that show up as more mac/windows folks name things :) But functionally equivalent for sane (old school) unix ;)
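For context on the nit: in a plain scalar assignment, both `$*` and `"$@"` join the arguments into one string, so argument boundaries (for example a file name containing a space) are lost when the string is expanded again. Keeping them intact takes an array. A small sketch, with hypothetical helper names:

```shell
#!/bin/bash
# count_args: print how many arguments were received.
count_args() { echo "$#"; }

# Scalar copy: the arguments are joined into one string, then word-split
# again on expansion, so "my file.txt" becomes two words.
as_string() {
    local _cmd="$*"
    # shellcheck disable=SC2086  # unquoted on purpose to show the split
    count_args $_cmd
}

# Array copy: each argument stays a separate element.
as_array() {
    local _cmd=("$@")
    count_args "${_cmd[@]}"
}

as_string "my file.txt" other   # prints 3
as_array  "my file.txt" other   # prints 2
```

For a retry helper that takes one quoted command string and evaluates it, the scalar form is sufficient; the array form is the one to reach for when the arguments are executed directly.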
Attachment #8472534 - Flags: feedback?(hwine) → feedback+
There was one mis-sync during one of the last super high load episodes, but nothing since. Calling it good.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services