Closed
Bug 1038478
Opened 11 years ago
Closed 11 years ago
releases/l10n/mozilla-beta/ repos are inconsistent across hgweb*
Categories
(Developer Services :: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: fubar)
References
Details
Attachments
(2 files)
Two of our three android builds for 31.0b10 worked ok, but one failed to get rev 50b3a539bd85 from https://hg.mozilla.org/releases/l10n/mozilla-beta/ms/.
hgweb1/3/7 have it, 2/4/5/6/8 don't.
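A per-head check like the one behind that list can be scripted. The following is a minimal sketch, not the tooling actually used here: `find_heads_missing_rev`, `has_rev_via_ssh`, and `REPO_ROOT` are hypothetical names, and the repository path on each head is a placeholder assumption.

```shell
#!/usr/bin/env bash
# Sketch only: report which web heads lack a given changeset.

# Assumption: placeholder for wherever the repos live on each head.
REPO_ROOT=${REPO_ROOT:-/path/to/repos}

# Hypothetical checker: exits 0 if $host has $rev in $repo.
# "hg identify -r REV" aborts with a non-zero status for an unknown revision.
has_rev_via_ssh() {
    local host=$1 repo=$2 rev=$3
    ssh -n "$host" hg -R "$REPO_ROOT/$repo" identify -r "$rev"
}

# Print every host for which the checker fails.
# Usage: find_heads_missing_rev CHECKER REPO REV HOST...
find_heads_missing_rev() {
    local check_cmd=$1 repo=$2 rev=$3
    shift 3
    local host
    for host in "$@"; do
        if ! "$check_cmd" "$host" "$repo" "$rev" >/dev/null 2>&1; then
            echo "$host"
        fi
    done
}
```

Called with `has_rev_via_ssh`, the repo `releases/l10n/mozilla-beta/ms`, the rev `50b3a539bd85`, and the eight head names, it would print the heads that still need a resync. The checker is passed in as an argument so the loop can be exercised without network access.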
Reporter
Updated•11 years ago
Summary: releases/l10n/mozilla-beta/ms repo is inconsistent → releases/l10n/mozilla-beta/ms repo is inconsistent across hgweb*
Comment 1•11 years ago
While digging for the possible problem I found these errors:
on hgssh1:
Jul 14 15:39:11 hgssh1 repo-push.sh to hgweb6.dmz.scl3.mozilla.com releases/l10n/mozilla-beta/ms from ffxbld): remote: ssh_exchange_identification: Connection closed by remote host#015
Jul 14 15:39:11 hgssh1 repo-push.sh to hgweb6.dmz.scl3.mozilla.com releases/l10n/mozilla-beta/ms from ffxbld): abort: no suitable response from remote hg!
There were no relevant logs on hgweb6 (PAM error came later):
[root@hgweb6.dmz.scl3 log]# grep 15:39 secure messages|grep -v puppet
secure:Jul 13 15:39:05 hgweb6 sshd[6355]: Set /proc/self/oom_score_adj to 0
secure:Jul 13 15:39:05 hgweb6 sshd[6355]: Connection from 10.22.75.42 port 35459
secure:Jul 13 15:39:05 hgweb6 sshd[6356]: Connection closed by 10.22.75.42
secure:Jul 14 15:39:27 hgweb6 sudo: PAM unable to dlopen(/lib64/security/pam_fprintd.so): /lib64/security/pam_fprintd.so: cannot open shared object file: No such file or directory
secure:Jul 14 15:39:27 hgweb6 sudo: PAM adding faulty module: /lib64/security/pam_fprintd.so
Comment 2•11 years ago
(the repository was synced out correctly when running the syncing script manually)
Comment 3•11 years ago
Handing off to general dev services for whoever has time to investigate this
Assignee: bkero → server-ops-webops
Updated•11 years ago
Assignee: server-ops-webops → server-ops-devservices
Component: WebOps: Source Control → Server Operations: Developer Services
Product: Infrastructure & Operations → mozilla.org
Assignee
Comment 5•11 years ago
hgssh1.dmz.scl3# grep repo-push messages* | grep 'Connection closed by remote host' | wc -l
328
So, this has hit us 328 times in the last month, but we only noticed now. I have greatly mixed feelings about that.
I've manually synced up everything; there were only a few outstanding repos.
I think we have two paths from here:
1) Figure out why the sync is failing. Because it's all in the same DC, this has some merit in case there are larger issues, but overall I feel this is a lower priority than...
2) Make the sync process more robust. Currently, it's just a bash for loop that runs logger and ssh (which dumps its output to logger). At the very least, it's probably worth setting pipefail and checking the return code of the ssh pipeline. Ideal might be checking the output and attempting to re-run the push to failed nodes.
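The pipefail idea in option 2 can be sketched as follows. This is an illustration under stated assumptions, not the contents of repo-push.sh: `run_logged` is a hypothetical helper, and the log sink is passed in as an argument so it can be swapped for `logger`.

```shell
#!/usr/bin/env bash
# Without pipefail, "ssh ... | logger" reports only logger's exit status,
# so a failed ssh goes unnoticed. With it, the pipeline fails if any stage fails.
set -o pipefail

# Run a command with its output piped to a log sink and return non-zero
# when the command itself fails. $sink is deliberately unquoted so that a
# value such as "logger -t repo-push" splits into a command plus arguments.
run_logged() {
    local sink=$1
    shift
    "$@" 2>&1 | $sink
}
```

A caller could then write something like `run_logged "logger -t repo-push" ssh "$host" ... || echo "push failed"` and act on the failure, for example by re-running the push to that node.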
Assignee
Comment 6•11 years ago
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb6.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): remote: ssh_exchange_identification: Connection closed by remote host#015
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb6.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): abort: no suitable response from remote hg!
Jul 18 09:54:07 hgssh1 sshd[11734]: Connection from 10.22.74.212 port 14693
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb7.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): remote: ssh_exchange_identification: Connection closed by remote host#015
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb7.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): abort: no suitable response from remote hg!
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb5.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): remote: ssh_exchange_identification: Connection closed by remote host#015
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb5.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): abort: no suitable response from remote hg!
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb8.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): remote: ssh_exchange_identification: Connection closed by remote host#015
Jul 18 09:54:07 hgssh1 repo-push.sh to hgweb8.dmz.scl3.mozilla.com build/buildbot-configs from asasaki@mozilla.com): abort: no suitable response from remote hg!
Assignee
Comment 7•11 years ago
List of similar sync failures since the 15th:
Jul 16 11:05:14 hgweb3.dmz.scl3.mozilla.com users/mgervasini_mozilla.com/en-GB
Jul 16 11:06:48 hgweb6.dmz.scl3.mozilla.com try
Jul 16 18:09:40 hgweb7.dmz.scl3.mozilla.com try
Jul 16 18:09:40 hgweb8.dmz.scl3.mozilla.com try
Jul 16 18:09:40 hgweb3.dmz.scl3.mozilla.com try
Jul 17 21:10:30 hgweb4.dmz.scl3.mozilla.com releases/mozilla-aurora
Jul 17 23:27:10 hgweb6.dmz.scl3.mozilla.com releases/gaia-l10n/v1_3/zh-TW
Jul 17 23:30:31 hgweb8.dmz.scl3.mozilla.com releases/mozilla-aurora
Jul 18 09:54:07 hgweb6.dmz.scl3.mozilla.com build/buildbot-configs
Jul 18 09:54:07 hgweb7.dmz.scl3.mozilla.com build/buildbot-configs
Jul 18 09:54:07 hgweb5.dmz.scl3.mozilla.com build/buildbot-configs
Jul 18 09:54:07 hgweb8.dmz.scl3.mozilla.com build/buildbot-configs
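A list in this shape can be derived from the hgssh1 syslog. The sketch below assumes the log line layout quoted in comments 1 and 6 (timestamp, `hgssh1`, `repo-push.sh to HOST REPO ...`); `list_sync_failures` is a hypothetical name, and the field positions would need adjusting for any other log format.

```shell
#!/usr/bin/env bash
# Sketch only: summarise failed pushes from syslog lines read on stdin.
# Fields in the assumed format:
#   $1-$3 = timestamp, $4 = hgssh1, $5 = repo-push.sh, $6 = "to",
#   $7 = target web head, $8 = repository
list_sync_failures() {
    awk '/repo-push\.sh to / && /Connection closed by remote host/ {
        print $1, $2, $3, $7, $8
    }'
}
```

For example, `list_sync_failures < messages` would print one line per failed push, with timestamp, web head, and repository.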
Assignee
Comment 8•11 years ago
Added ssh option 'ConnectionAttempts=3' to help with the ssh errors. Default is 1. The bitbucket thread discussing similar issues noted that an immediate retry usually (always?) got around the problem. (ssh options also moved to variable for easier reading)
Failing that... exit status of ssh now checked; on error, waits a second and then tries again, with additional logging. Because bash, this was turned into a function that we can background, while still checking return codes, etc.
I can still elicit ssh errors, but it requires multiple simultaneous invocations. It's possible that they may still crop up, particularly on bigger/slower repos, but the connect attempts and retry should help with that.
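The shape described in this comment (ssh options in a variable, an exit-status check with one delayed retry, and a function that can be backgrounded per head) might look roughly like the sketch below. It is not the attached script: every function name, `REMOTE_SYNC_CMD`, and `RETRY_DELAY` are assumptions made for illustration, and logging goes to stderr here rather than to logger.

```shell
#!/usr/bin/env bash
# Sketch only; names are illustrative, not taken from repo-push.sh.

SSH_OPTS="-o ConnectionAttempts=3"               # option named in this comment
RETRY_DELAY=${RETRY_DELAY:-1}                    # "waits a second"
REMOTE_SYNC_CMD=${REMOTE_SYNC_CMD:-mirror-pull}  # assumption about the remote side

log() { echo "repo-push sketch: $*" >&2; }

# Hypothetical push of one repo to one head.
# SSH_OPTS is intentionally unquoted so it splits into separate options.
ssh_push() {
    local host=$1 repo=$2
    ssh $SSH_OPTS "$host" "$REMOTE_SYNC_CMD" "$repo"
}

# Try once; on failure log it, wait, and try one more time.
push_with_retry() {
    local push_fn=$1 host=$2 repo=$3
    "$push_fn" "$host" "$repo" && return 0
    log "retry $repo to $host"
    sleep "$RETRY_DELAY"
    "$push_fn" "$host" "$repo"
}

# Push to every head in parallel; return non-zero if any head still failed.
# Usage: push_all PUSH_FN REPO HOST...
push_all() {
    local push_fn=$1 repo=$2
    shift 2
    local host pid pids="" failed=0
    for host in "$@"; do
        push_with_retry "$push_fn" "$host" "$repo" &
        pids="$pids $!"
    done
    for pid in $pids; do
        wait "$pid" || failed=1
    done
    return "$failed"
}
```

Collecting the background PIDs and calling `wait` on each one is what lets the return codes survive backgrounding; a bare `wait` with no arguments would discard them.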
Attachment #8460548 -
Flags: review?(bkero)
Attachment #8460548 -
Flags: feedback?(chris.lonnen)
Comment on attachment 8460548 [details]
updated repo-push.sh to better handle/report errors
This is better than what we have -- klibby's tests show that this does detect and log a class of ssh error which was "invisible" before.
Let's go with it for now.
Attachment #8460548 -
Flags: review+
Comment 10•11 years ago
Comment on attachment 8460548 [details]
updated repo-push.sh to better handle/report errors
lgtm
Attachment #8460548 -
Flags: review?(bkero) → review+
Comment 11•11 years ago
Comment on attachment 8460548 [details]
updated repo-push.sh to better handle/report errors
Lines {8,9} and {14,15} could DRY out a little, but this will do the job.
Attachment #8460548 -
Flags: feedback?(chris.lonnen) → feedback+
Assignee
Comment 12•11 years ago
Added "-o ServerAliveInterval=5" to ssh options after conversation in #vcs. Script committed to puppet and is in production.
Comment 13•11 years ago
update: no new out-of-sync issues reported since deployment
Assignee
Comment 14•11 years ago
checked all repos updated since Jul 1, and we're still all good.
Assignee: server-ops-devservices → klibby
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee
Comment 15•11 years ago
we missed adding the ssh ServerAlive* and ConnectionAttempts to the hg user on hgweb* (for hgweb*->hg.m.o:22 pulls), and ran into a hangup with hgweb6 pulling a try update.
added in r91124.
Comment 16•11 years ago
See also https://bugzilla.mozilla.org/show_bug.cgi?id=1036998#c2 (comments 2 - 4)
Assignee
Comment 17•11 years ago
high load on the hg web heads this morning caused issues, e.g.:
Jul 31 04:39:00 hgssh1 repo-push.sh[17062] integration/gaia-central to hgweb1.dmz.scl3.mozilla.com for vcs-sync@mozilla.com: Connection timed out during banner exchange#015
Jul 31 04:39:04 hgssh1 repo-push.sh[17062] retry integration/gaia-central to hgweb3.dmz.scl3.mozilla.com for vcs-sync@mozilla.com: Connection timed out during banner exchange#015
Jul 31 04:39:04 hgssh1 repo-push.sh[17062] retry integration/gaia-central to hgweb4.dmz.scl3.mozilla.com for vcs-sync@mozilla.com: Connection timed out during banner exchange#015
integration/gaia-central, integration/gaia, try-comm-central, and one user repo were affected.
I've increased the ConnectTimeout in repo-push.sh from 3s to 10s.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 18•11 years ago
We had another case with https://hg.mozilla.org/releases/l10n/mozilla-beta/si in the Firefox 32.0b4 release. The build machine did a fresh clone, then tried to update to the tag FIREFOX_32_0b4_RELEASE. This failed because 23eeb52560a7 was the last changeset in the repo, i.e. it's missing d6bdca917766.
Reporter
Comment 19•11 years ago
Same story for https://hg.mozilla.org/releases/l10n/mozilla-beta/nn-NO
Comment 20•11 years ago
Same events in the logs:
Aug 4 16:27:27 hgssh1 repo-push.sh[28452] releases/l10n/mozilla-beta/si to hgweb3.dmz.scl3.mozilla.com for ffxbld: remote: ssh_exchange_identification: Connection closed by remote host#015
Aug 4 16:27:27 hgssh1 repo-push.sh[28452] releases/l10n/mozilla-beta/si to hgweb3.dmz.scl3.mozilla.com for ffxbld: abort: no suitable response from remote hg!
Reporter
Comment 21•11 years ago
https://hg.mozilla.org/releases/l10n/mozilla-beta/kn today. Do we have any more ideas about tweaking the config? This release automation seems to tickle this bug every beta.
Summary: releases/l10n/mozilla-beta/ms repo is inconsistent across hgweb* → releases/l10n/mozilla-beta/ repos are inconsistent across hgweb*
Assignee
Comment 22•11 years ago
Repos resynced. If this is blocking stuff overnight, let the MOC know and they can use the docs at https://mana.mozilla.org/wiki/display/SYSADMIN/Mercurial+-+Common+Repository+Operations#Mercurial-CommonRepositoryOperations-Verifyingandre-syncingwebheads to fix it.
Need to go and look through the logs to see where these syncs fell apart. I know we can make a few more tweaks in one part of the process, but I'm not sure we've seen it break there yet.
Assignee
Comment 23•11 years ago
The zeus connection mgmt settings for the hgssh pool were kinda low - 4s connect timeout and 30s no-reply timeout. Increased to 30s and 45s.
mirror-pull still needs to be changed to retry pulls on failure. Also, we may have glossed over it, but we currently set ConnectionAttempts=3 for ssh, which is actually only three attempts one second apart, rather than 3*N seconds. We might also want to increase the ConnectTimeout to match zeus; currently set to 15s.
Assignee
Comment 24•11 years ago
unherped my derp and found the easy way to have mirror-pull retry pulls/clones and scp's of pushlog. also makes the pushlog swap contingent on scp's success.
Attachment #8472534 -
Flags: review?(bkero)
Attachment #8472534 -
Flags: feedback?(hwine)
Comment 25•11 years ago
Comment on attachment 8472534 [details] [diff] [review]
add retries to mirror-pull
lgtm as long as the retry function is also fetching the commands after && and honoring the redirects
Attachment #8472534 -
Flags: review?(bkero) → review+
Assignee
Comment 26•11 years ago
it does if I put quotes around it. thx.
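The quoting point in this exchange can be illustrated with a small sketch. Only the first line of the real `retry()` is visible in this bug (`local _cmd=$*`); the rest of the body below, including the use of `eval`, the try count, and the `RETRY_TRIES` and `RETRY_DELAY` variables, is an assumption for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: re-run a command string until it succeeds or tries run out.
retry() {
    local _cmd=$*
    local tries=${RETRY_TRIES:-3} n=1
    # eval re-parses the string, so "&&" and redirections inside it are
    # part of every attempt.
    until eval "$_cmd"; do
        if [ "$n" -ge "$tries" ]; then
            return 1
        fi
        n=$((n + 1))
        echo "retrying $_cmd" >&2
        sleep "${RETRY_DELAY:-1}"
    done
}

# Quoted: the whole chain, including the redirect, is retried as one unit.
#   retry "step_one && step_two > out.txt"
# Unquoted: the calling shell splits at "&&" first, so only step_one is
# retried and step_two runs once, outside retry.
#   retry step_one && step_two > out.txt
```

Because the string is passed through `eval`, anything interpolated into it is re-parsed by the shell, so this pattern is only reasonable for trusted, script-internal commands.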
Assignee
Comment 27•11 years ago
At least a dozen failed syncs over the course of last night. Committed changes to mirror-pull in r92016. Rolling out to hgweb heads now.
Assignee
Comment 28•11 years ago
success!
Aug 14 07:00:43 hgssh1 repo-push.sh[17266] -e try to hgweb9.dmz.scl3.mozilla.com for pvanderbeken@mozilla.com: remote: ssh_exchange_identification: Connection closed by remote host#015
Aug 14 07:00:45 hgssh1 repo-push.sh[17266] -e try to hgweb9.dmz.scl3.mozilla.com for pvanderbeken@mozilla.com: retrying hg pull --config hooks.pretxnchangegroup.z_linearhistory= --config hooks.pretxnchangegroup.z_loghistory= --config trusted.users=root,hg --config paths.default=ssh://hg.mozilla.org/try
still unknown what's causing the intermittent ssh failures on pull, though.
Reporter
Comment 29•11 years ago
We didn't have any problems with 32.0b7 today, there was much rejoicing!
Assignee
Comment 30•11 years ago
Increased sshd's MaxSessions and MaxStartups to 50 yesterday after continued failures and (successful) retries. No unexpected issues since.
Attachment #8472534 -
Attachment is patch: true
Comment 31•11 years ago
Comment on attachment 8472534 [details] [diff] [review]
add retries to mirror-pull
Review of attachment 8472534 [details] [diff] [review]:
-----------------------------------------------------------------
::: mirror-pull.erb
@@ +115,5 @@
>
> cd $REPO_TARGET || die "$REPO_TARGET does not exist, cannot create repositories there"
>
> +retry() {
> + local _cmd=$*
nit - _cmd="$@" catches some cases that show up as more mac/windows folks name things :)
But functionally equivalent for sane (old school) unix ;)
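The general difference between the two expansions is easiest to see when arguments are forwarded to another command rather than assigned. In a plain bash assignment such as the one in this patch, both forms appear to collapse the arguments into one space-joined string (with the default IFS), which fits the "functionally equivalent" remark. The helper names below are hypothetical and exist only for this demonstration.

```shell
#!/usr/bin/env bash
# Sketch only: count how many arguments survive each style of forwarding.
count_args() { echo "$#"; }

forward_star() { count_args $*; }     # unquoted $* is re-split on whitespace
forward_at()   { count_args "$@"; }   # "$@" keeps each argument intact

forward_star "my repo" other   # prints 3
forward_at   "my repo" other   # prints 2
```

A path containing a space, as in the first argument here, is the kind of name the review nit is warning about.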
Attachment #8472534 -
Flags: feedback?(hwine) → feedback+
Assignee
Comment 32•11 years ago
There was one mis-sync during one of the last super high load episodes, but nothing since. Calling it good.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services