Bug 1030759: hgweb* sync issues
Status: RESOLVED FIXED (opened 11 years ago, closed 11 years ago)
Component: Developer Services :: General (task)
Tracking: not tracked
Reporter: fubar
Assignee: fubar

Description
Reports in #it of pushes hanging, even after they appeared on hg.m.o. Noticed my own push to hg_templates was hanging. ~15 pushes were stuck attempting to sync to hgweb8. I'd noticed a huge bout of latency on hgweb8 at the same time; I'd had a terminal open that suddenly stopped responding, but chalked it up to wireless/vpn, as I see small bits of that semi-regularly.
Am currently in the process of manually terminating the pushes and syncing each repo. The data's already in the repo, so there shouldn't be any issues there. Users will get an ssh error, though.
Also, hgweb8 was set to draining in zeus, so there was no chance of it sending out-of-date data to anyone. Color us lucky, there.
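For reference, a rough sketch of the manual cleanup described above, run on the ssh master. The grep pattern matches the mirror ssh processes shown later in this bug; the resync itself is done by the mirroring scripts and isn't shown here, so this only covers finding and terminating the wedged pushes.
# Find mirror pushes that are wedged against hgweb8 (pattern taken from the
# process listings in comment 4).
ps -Af | grep '[/]usr/bin/ssh -l hg' | grep hgweb8.dmz.scl3.mozilla.com
# Terminate them. The pusher sees an ssh error, but the changesets are
# already in the master repo, so no data is lost.
for pid in $(ps -Af | grep '[/]usr/bin/ssh -l hg' | grep hgweb8.dmz.scl3.mozilla.com | awk '{print $2}'); do
    kill "$pid"
done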
Comment 1 (Assignee) • 11 years ago
List of repos that were waiting on hgweb:
mozilla-central
build/puppet
projects/cedar
users/lwagner_mozilla.com/inbound-patches
releases/mozilla-esr31
build/buildbot-configs
integration/mozilla-inbound
private/fuzzing
releases/mozilla-b2g28_v1_3t
hgcustom/hg_templates
integration/fx-team
projects/cypress
integration/gaia-central
try
integration/gaia-try
integration/b2g-inbound
Comment 3 (Assignee) • 11 years ago
Killed the cypress push and immediately all of the others finished. Not sure if that's related or an odd coincidence; stopping other pushes had no effect on the rest, afaict.
Comment 4 (Assignee) • 11 years ago
Definitely related. hgweb8 had another wedge on Friday (June 27) that blocked vcs-sync and user pushes. We freed most manually but left a handful around to debug:
hg 15818 15793 0 Jun27 ? 00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- users/mgervasini_mozilla.com/mozharness
hg 22885 22861 0 Jun27 ? 00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- build/relengapi
hg 24585 24561 0 Jun27 ? 00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- users/asasaki_mozilla.com/ash-mozharness
hg 31817 31793 0 Jun27 ? 00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- users/jlund_mozilla.com/slaveapi
Started by stracing the matching ssh/hg processes on hgweb8 and found an unusual twist with fds:
# ps -Af | grep relengapi
hg 24541 24506 0 Jun27 ? 00:00:00 /usr/bin/python /usr/bin/hg pull --config hooks.pretxnchangegroup.z_linearhistory= --config hooks.pretxnchangegroup.z_loghistory= --config trusted.users=root,hg --config paths.default=ssh://hg.mozilla.org/build/relengapi
hg 24542 24541 0 Jun27 ? 00:00:00 ssh hg.mozilla.org hg -R build/relengapi serve --stdio
# strace -fae -p 24541
Process 24541 attached - interrupt to quit
read(5, ^C <unfinished ...>
Process 24541 detached
# ls -AlF /proc/24541/fd
total 0
dr-x------ 2 hg hg 0 Jun 30 17:38 ./
dr-xr-xr-x 8 hg hg 0 Jun 27 20:48 ../
lr-x------ 1 hg hg 64 Jun 30 17:39 0 -> pipe:[112261711]
l-wx------ 1 hg hg 64 Jun 30 17:39 1 -> pipe:[112261712]
l-wx------ 1 hg hg 64 Jun 30 17:39 2 -> pipe:[112261713]
l-wx------ 1 hg hg 64 Jun 30 17:39 4 -> pipe:[112261755]
lr-x------ 1 hg hg 64 Jun 30 17:39 5 -> pipe:[112261756]
lr-x------ 1 hg hg 64 Jun 30 17:39 7 -> pipe:[112261757]
# lsof | grep 112261756
ssh 21776 hg 26w FIFO 0,8 0t0 112261756 pipe
hg 24541 hg 5r FIFO 0,8 0t0 112261756 pipe
ssh 24542 hg 1w FIFO 0,8 0t0 112261756 pipe
# ps -Af |grep 21776
hg 21776 21775 0 Jun27 ? 00:00:00 ssh hg.mozilla.org hg -R try serve --stdio
... wait, what? Checked again with another repo (users/mgervasini_mozilla.com/mozharness) and found the same thing:
hg 22343 22308 0 Jun27 ? 00:00:00 /usr/bin/python /usr/bin/hg pull --config hooks.pretxnchangegroup.z_linearhistory= --config hooks.pretxnchangegroup.z_loghistory= --config trusted.users=root,hg --config paths.default=ssh://hg.mozilla.org/users/mgervasini_mozilla.com/mozharness
# ll /proc/22343/fd/5
lr-x------ 1 hg hg 64 Jun 30 17:39 5 -> pipe:[112252724]
# lsof | grep 112252724
ssh 21776 hg 22w FIFO 0,8 0t0 112252724 pipe
hg 22343 hg 5r FIFO 0,8 0t0 112252724 pipe
ssh 22344 hg 1w FIFO 0,8 0t0 112252724 pipe
And like cedar before it, killing off that try serve freed up all of the other hung procs. I'm not sure what's crossing the streams.
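For anyone retracing this, a condensed sketch of the steps above, using PID 24541 from this comment as the example:
# The wedged `hg pull`; strace shows it blocked in read() on fd 5
# (interrupt with ^C once the pending read is printed).
pid=24541
strace -f -e trace=read -p "$pid"
# Resolve fd 5 to its pipe inode ...
inode=$(readlink /proc/"$pid"/fd/5 | tr -dc '0-9')
# ... and list every process holding that pipe. An unrelated ssh session
# holding the write end (here, the long-lived `try serve` connection) is
# the smoking gun.
lsof | grep "$inode"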
Another instance of this is occurring on hgweb3 nowish.
From #releng:
12:49 < KWierso|sheriffduty> https://tbpl.mozilla.org/php/getParsedLog.php?id=42876378&tree=Mozilla-Inbound#error0
12:49 < KWierso|sheriffduty> 12:38:40 INFO - ERROR: We tried to download the talos.json file but something failed.
12:49 < KWierso|sheriffduty> 12:38:40 INFO - ERROR: HTTP Error 404: Not Found
12:50 < jmaher> KWierso|sheriffduty: that isn't good
From vcs email:
Unknown delay: pipe_w S+ /opt/vcs2vcs/venv/hg/bin/python /opt/vcs2vcs/bin/hg --cwd /opt/vcs2vcs/repos/integration-gaia-central push --force hg.m.o
0 S vcs2vcs 25317 25316 0 80 0 - 58941 pipe_w 19:16 pts/3 00:00:00 /opt/vcs2vcs/venv/hg/bin/python /opt/vcs2vcs/bin/hg --cwd /opt/vcs2vcs/repos/integration-gaia-central push --force hg.m.o
Investigating for 30 minutes before reset.
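A quick sanity check that fits here (a sketch, not part of the vcs2vcs tooling): before resetting anything, confirm the push really is wedged on the mirror pipe rather than doing work, by looking at its wait channel and elapsed time.
# wchan=pipe_w together with a large elapsed time means the push is blocked
# on the pipe to/from the mirror ssh, not making progress.
ps -eo pid,etime,wchan:14,args | grep '[v]cs2vcs.*push'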
Comment 6 (Assignee) • 11 years ago
The suspected culprit is the ssh ControlMaster setup. See about halfway down http://www.anchor.com.au/blog/2010/02/ssh-controlmaster-the-good-the-bad-the-ugly/, the paragraph starting with "Finally, there is something of an annoyance with ControlMaster..."
His description matches what we were seeing. Additionally, at one point there had been a screen session on hgssh1 with persistent ssh connections to each web head; that was killed fairly recently (I'm certain it was there last week), though not all of the ssh connections were still there.
Also, from mpm on #mercurial:
< mpm_> controlmaster is broken.
< mpm_> The 'master' session doesn't properly close fds or something. Everyone doing automation over controlmaster hits this.
I've commented out the ControlMaster bits on the hg web heads; on the off chance it happens again, we'll need to revisit this. Also, manual intervention is required to unwedge and resync the stuck repos.
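For reference, the shape of that change (a sketch; the exact stanza on the web heads isn't shown in this bug, only the hgssh1 config grepped in the next comment, and the ControlPath/ControlPersist values below are assumptions):
# In the relevant ~/.ssh/config, comment out the multiplexing directives:
#   # ControlMaster auto
#   # ControlPath ~/.ssh/cm-%r@%h:%p
#   # ControlPersist yes
# Or force multiplexing off for a single invocation without touching the
# config, e.g. for the mirror pull command seen earlier:
ssh -o ControlMaster=no -o ControlPath=none hg.mozilla.org hg -R build/relengapi serve --stdio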
(In reply to Kendall Libby [:fubar] from comment #6)
> I've commented out the ControlMaster bits on the hg web heads.
I think we need to do this on the hgssh* masters as well. The ssh config there currently sets ControlMaster for connections to some of the web heads:
[root@hgssh1.dmz.scl3 ~]# grep -i controlmaster ~hg/.ssh/config
ControlMaster auto
ControlMaster auto
ControlMaster auto
ControlMaster auto
[root@hgssh1.dmz.scl3 ~]#
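Worth noting (a sketch; whether a control socket exists depends on the ControlPath in the hg user's config, which isn't shown in this bug): masters that are already established keep multiplexing until they exit, so after commenting out the config it's worth checking for and shutting down any live ones.
# Run as the hg user on hgssh1. "Master running" means a control connection
# for that host is still alive; -O exit asks it to shut down.
sudo -H -u hg ssh -O check hgweb8.dmz.scl3.mozilla.com
sudo -H -u hg ssh -O exit hgweb8.dmz.scl3.mozilla.com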
Comment 8 (Assignee) • 11 years ago
I'm not sure if that would have any effect, but it's been commented out. It wasn't in puppet.
Assignee: server-ops-webops → klibby
Comment 9 (Assignee) • 11 years ago
No recurrences; calling this fixed.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Component: WebOps: Source Control → General
Product: Infrastructure & Operations → Developer Services