Closed
Bug 1030759
Opened 10 years ago
Closed 10 years ago
hgweb* sync issues
Categories
(Developer Services :: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: fubar, Assigned: fubar)
References
Details
Reports in #it of pushes hanging, even after they appeared on hg.m.o. Noticed my own push to hg_templates was hanging. ~15 pushes were stuck attempting to sync to hgweb8.

I'd noticed a huge bout of latency on hgweb8 at the same time; I'd had a terminal open that suddenly stopped responding, but chalked it up to wireless/vpn, as I see small bits of that semi-regularly.

Am currently in the process of manually terminating the pushes and syncing each repo. The data's already in the repo, so there shouldn't be any issues there. Users will get an ssh error, though. Also, hgweb8 was set to draining in zeus, so there was no chance of it sending out-of-date data to anyone. Color us lucky, there.
Assignee
Comment 1•10 years ago
List of repos that were waiting on hgweb:

mozilla-central
build/puppet
projects/cedar
users/lwagner_mozilla.com/inbound-patches
releases/mozilla-esr31
build/buildbot-configs
integration/mozilla-inbound
private/fuzzing
releases/mozilla-b2g28_v1_3t
hgcustom/hg_templates
integration/fx-team
projects/cypress
integration/gaia-central
try
integration/gaia-try
integration/b2g-inbound
Assignee
Comment 3•10 years ago
Killed the cypress push and immediately all of the others finished. Not sure if that's related or an odd coincidence; stopping other pushes had no effect on the rest, afaict.
Assignee
Comment 4•10 years ago
Definitely related. hgweb8 had another wedge on Friday (June 27) that blocked vcs-sync and user pushes. We freed most manually but left a handful around to debug:

hg 15818 15793 0 Jun27 ? 00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- users/mgervasini_mozilla.com/mozharness
hg 22885 22861 0 Jun27 ? 00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- build/relengapi
hg 24585 24561 0 Jun27 ? 00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- users/asasaki_mozilla.com/ash-mozharness
hg 31817 31793 0 Jun27 ? 00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- users/jlund_mozilla.com/slaveapi

Started by stracing the matching ssh/hg processes on hgweb8 and found an unusual twist with fds:

# ps -Af | grep relengapi
hg 24541 24506 0 Jun27 ? 00:00:00 /usr/bin/python /usr/bin/hg pull --config hooks.pretxnchangegroup.z_linearhistory= --config hooks.pretxnchangegroup.z_loghistory= --config trusted.users=root,hg --config paths.default=ssh://hg.mozilla.org/build/relengapi
hg 24542 24541 0 Jun27 ? 00:00:00 ssh hg.mozilla.org hg -R build/relengapi serve --stdio

# strace -fae -p 24541
Process 24541 attached - interrupt to quit
read(5, ^C <unfinished ...>
Process 24541 detached

# ls -AlF /proc/24541/fd
total 0
dr-x------ 2 hg hg 0 Jun 30 17:38 ./
dr-xr-xr-x 8 hg hg 0 Jun 27 20:48 ../
lr-x------ 1 hg hg 64 Jun 30 17:39 0 -> pipe:[112261711]
l-wx------ 1 hg hg 64 Jun 30 17:39 1 -> pipe:[112261712]
l-wx------ 1 hg hg 64 Jun 30 17:39 2 -> pipe:[112261713]
l-wx------ 1 hg hg 64 Jun 30 17:39 4 -> pipe:[112261755]
lr-x------ 1 hg hg 64 Jun 30 17:39 5 -> pipe:[112261756]
lr-x------ 1 hg hg 64 Jun 30 17:39 7 -> pipe:[112261757]

# lsof | grep 112261756
ssh 21776 hg 26w FIFO 0,8 0t0 112261756 pipe
hg 24541 hg 5r FIFO 0,8 0t0 112261756 pipe
ssh 24542 hg 1w FIFO 0,8 0t0 112261756 pipe

# ps -Af | grep 21776
hg 21776 21775 0 Jun27 ? 00:00:00 ssh hg.mozilla.org hg -R try serve --stdio

... wait, what?

Checked again with another repo (users/mgervasini_mozilla.com/mozharness) and found the same thing:

hg 22343 22308 0 Jun27 ? 00:00:00 /usr/bin/python /usr/bin/hg pull --config hooks.pretxnchangegroup.z_linearhistory= --config hooks.pretxnchangegroup.z_loghistory= --config trusted.users=root,hg --config paths.default=ssh://hg.mozilla.org/users/mgervasini_mozilla.com/mozharness

# ll /proc/22343/fd/5
lr-x------ 1 hg hg 64 Jun 30 17:39 5 -> pipe:[112252724]

# lsof | grep 112252724
ssh 21776 hg 22w FIFO 0,8 0t0 112252724 pipe
hg 22343 hg 5r FIFO 0,8 0t0 112252724 pipe
ssh 22344 hg 1w FIFO 0,8 0t0 112252724 pipe

And like cedar before it, killing off that try serve freed up all of the other hung procs. I'm not sure what's crossing the streams.
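The lsof output above shows a single long-lived ssh (pid 21776, the `try serve` process) holding a write end of each stuck pull's pipe. A minimal Python sketch of why that wedges the reader (illustrative only, not the original processes): a pipe reader only sees EOF once *every* copy of the write end is closed.

```python
import os

# Mechanism demo: if a long-lived process inherits a duplicate of a
# pipe's write end and never closes it, the reader blocks forever
# waiting for EOF, even after the "real" writer has exited.
r, w = os.pipe()
leaked = os.dup(w)   # stands in for the fd held by the unrelated ssh

os.write(w, b"payload")
os.close(w)          # the real writer goes away...

data = os.read(r, 1024)   # buffered data still arrives: b"payload"
# ...but another read would now block, because `leaked` is still open.
# Only closing the last write end delivers EOF to the reader:
os.close(leaked)
assert os.read(r, 1024) == b""   # EOF at last
```

This matches the observed behavior: killing the one process holding the stray write ends (the try serve) immediately unblocked every other hung pull.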
Comment 5•10 years ago
Another instance of this on hgweb3 occurring nowish.

From #releng:

12:49 < KWierso|sheriffduty> https://tbpl.mozilla.org/php/getParsedLog.php?id=42876378&tree=Mozilla-Inbound#error0
12:49 < KWierso|sheriffduty> 12:38:40 INFO - ERROR: We tried to download the talos.json file but something failed.
12:49 < KWierso|sheriffduty> 12:38:40 INFO - ERROR: HTTP Error 404: Not Found
12:50 < jmaher> KWierso|sheriffduty: that isn't good

From vcs email:

Unknown delay: pipe_w S+ /opt/vcs2vcs/venv/hg/bin/python /opt/vcs2vcs/bin/hg --cwd /opt/vcs2vcs/repos/integration-gaia-central push --force hg.m.o
0 S vcs2vcs 25317 25316 0 80 0 - 58941 pipe_w 19:16 pts/3 00:00:00 /opt/vcs2vcs/venv/hg/bin/python /opt/vcs2vcs/bin/hg --cwd /opt/vcs2vcs/repos/integration-gaia-central push --force hg.m.o

Investigating for 30 minutes before reset.
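The `pipe_w` in the ps output above is the kernel wait channel for a process blocked on a pipe. A rough way to surface such processes (a sketch, not the monitoring actually in use; the `ps` field syntax is Linux procps and the awk pattern is an assumption):

```shell
# List processes whose wait channel mentions "pipe", i.e. blocked
# reading from or writing to a pipe, like the hung hg push above.
# Keeps the header row (NR==1) for readability.
ps -eo pid,stat,wchan:20,etime,args | awk 'NR==1 || $3 ~ /pipe/'
```

Long `etime` values on such rows are the giveaway that a sync has wedged rather than merely being slow.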
Updated•10 years ago
Summary: hgweb8 sync issues → hgweb* sync issues
Assignee
Comment 6•10 years ago
Suspected culprit is the ssh ControlMaster setup. About halfway down http://www.anchor.com.au/blog/2010/02/ssh-controlmaster-the-good-the-bad-the-ugly/, the paragraph starting "Finally, there is something of an annoyance with ControlMaster..." matches what we were seeing.

Additionally, at one point there had been a screen session on hgssh1 with persistent ssh connections to each web head; that has since been killed, fairly recently (I'm certain it was there last week), though not all ssh connections were still there.

Also, also, from mpm on #mercurial:

< mpm_> controlmaster is broken.
< mpm_> The 'master' session doesn't properly close fds or something. Everyone doing automation over controlmaster hits this.

I've commented out the ControlMaster bits on the hg web heads. On the off chance it happens again, we'll need to try again. Also, manual intervention is required to unwedge and resync the stuck repos.
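For context, the stanza being disabled presumably looked something like the following (an illustrative ssh_config fragment; the exact Host patterns and ControlPath used on the web heads are assumptions, not taken from this bug):

```
Host hgweb*
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
```

With `ControlMaster auto`, the first ssh to a host becomes a long-lived master that multiplexes later sessions over one TCP connection; subsequent sessions hand their stdio fds to that master, which is how a single wedged or leaky master can end up holding other sessions' pipes open, exactly as the lsof output in comment 4 showed.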
Comment 7•10 years ago
(In reply to Kendall Libby [:fubar] from comment #6)
> I've commented out the ControlMaster bits on the hg web heads.

I think we need to do this on the hgssh* masters as well. Currently some web heads are configured from there:

[root@hgssh1.dmz.scl3 ~]# grep -i controlmaster ~hg/.ssh/config
ControlMaster auto
ControlMaster auto
ControlMaster auto
ControlMaster auto
[root@hgssh1.dmz.scl3 ~]#
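Commenting those lines out could be done with something like this (illustrative only, run against a demo file here; the real config was ~hg/.ssh/config and this exact command is an assumption, not what was actually run):

```shell
# Demo: comment out every ControlMaster line in an ssh config,
# keeping a .bak backup. Uses a throwaway file, not the live config.
cfg=./ssh_config_demo
printf 'Host hgweb1\n  ControlMaster auto\nHost hgweb2\n  ControlMaster auto\n' > "$cfg"

sed -i.bak 's/^\([[:space:]]*\)ControlMaster/\1#ControlMaster/' "$cfg"

grep -c '#ControlMaster' "$cfg"   # both lines are now commented
```

Existing master connections keep running after the config change, so any already-wedged masters would still need to be killed by hand.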
Assignee
Comment 8•10 years ago
I'm not sure if that would have any effect, but it's been commented out. It wasn't in puppet.
Assignee: server-ops-webops → klibby
Assignee
Comment 9•10 years ago
No recurrences, calling this fixed.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•10 years ago
Component: WebOps: Source Control → General
Product: Infrastructure & Operations → Developer Services