Status: RESOLVED FIXED

(Reporter: fubar, Assigned: fubar)

(Assignee)
Description

4 years ago
Reports in #it of pushes hanging, even after they appeared on hg.m.o. Noticed my own push to hg_templates was hanging; ~15 pushes were stuck attempting to sync to hgweb8. I'd noticed a huge bout of latency on hgweb8 at the same time: a terminal I had open suddenly stopped responding, but I chalked it up to wireless/VPN, since I see small bits of that semi-regularly.

Am currently in the process of manually terminating the pushes and syncing each repo. The data's already in the repo, so there shouldn't be any issues there. Users will get an ssh error, though.

Also, hgweb8 was set to draining in zeus, so there was no chance of it sending out-of-date data to anyone. Color us lucky, there.
(Assignee)

Comment 1

4 years ago
List of repos that were waiting on hgweb8:

mozilla-central
build/puppet
projects/cedar
users/lwagner_mozilla.com/inbound-patches
releases/mozilla-esr31
build/buildbot-configs
integration/mozilla-inbound
private/fuzzing
releases/mozilla-b2g28_v1_3t
hgcustom/hg_templates
integration/fx-team
projects/cypress
integration/gaia-central
try
integration/gaia-try
integration/b2g-inbound

Updated

4 years ago
Duplicate of this bug: 1030744
(Assignee)

Comment 3

4 years ago
Killed the cypress push and all of the others immediately finished. Not sure if that's related or an odd coincidence; killing other pushes had no effect on the rest, afaict.
(Assignee)

Comment 4

4 years ago
Definitely related. hgweb8 had another wedge on Friday (June 27) that blocked vcs-sync and user pushes. We freed most manually but left a handful around to debug:

hg       15818 15793  0 Jun27 ?        00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- users/mgervasini_mozilla.com/mozharness

hg       22885 22861  0 Jun27 ?        00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- build/relengapi

hg       24585 24561  0 Jun27 ?        00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- users/asasaki_mozilla.com/ash-mozharness

hg       31817 31793  0 Jun27 ?        00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- users/jlund_mozilla.com/slaveapi


Started by stracing the matching ssh/hg processes on hgweb8 and found an unusual twist with the fds:

# ps -Af | grep relengapi
hg       24541 24506  0 Jun27 ?        00:00:00 /usr/bin/python /usr/bin/hg pull --config hooks.pretxnchangegroup.z_linearhistory= --config hooks.pretxnchangegroup.z_loghistory= --config trusted.users=root,hg --config paths.default=ssh://hg.mozilla.org/build/relengapi
hg       24542 24541  0 Jun27 ?        00:00:00 ssh hg.mozilla.org hg -R build/relengapi serve --stdio

# strace -fae -p 24541
Process 24541 attached - interrupt to quit
read(5, ^C <unfinished ...>
Process 24541 detached

# ls -AlF /proc/24541/fd
total 0
dr-x------ 2 hg hg  0 Jun 30 17:38 ./
dr-xr-xr-x 8 hg hg  0 Jun 27 20:48 ../
lr-x------ 1 hg hg 64 Jun 30 17:39 0 -> pipe:[112261711]
l-wx------ 1 hg hg 64 Jun 30 17:39 1 -> pipe:[112261712]
l-wx------ 1 hg hg 64 Jun 30 17:39 2 -> pipe:[112261713]
l-wx------ 1 hg hg 64 Jun 30 17:39 4 -> pipe:[112261755]
lr-x------ 1 hg hg 64 Jun 30 17:39 5 -> pipe:[112261756]
lr-x------ 1 hg hg 64 Jun 30 17:39 7 -> pipe:[112261757]

# lsof | grep 112261756
ssh       21776        hg   26w     FIFO                0,8       0t0  112261756 pipe
hg        24541        hg    5r     FIFO                0,8       0t0  112261756 pipe
ssh       24542        hg    1w     FIFO                0,8       0t0  112261756 pipe

# ps -Af |grep 21776
hg       21776 21775  0 Jun27 ?        00:00:00 ssh hg.mozilla.org hg -R try serve --stdio

... wait, what? Checked again with another repo (users/mgervasini_mozilla.com/mozharness) and found the same thing:

hg       22343 22308  0 Jun27 ?        00:00:00 /usr/bin/python /usr/bin/hg pull --config hooks.pretxnchangegroup.z_linearhistory= --config hooks.pretxnchangegroup.z_loghistory= --config trusted.users=root,hg --config paths.default=ssh://hg.mozilla.org/users/mgervasini_mozilla.com/mozharness

# ll /proc/22343/fd/5
lr-x------ 1 hg hg 64 Jun 30 17:39 5 -> pipe:[112252724]

# lsof | grep 112252724
ssh       21776        hg   22w     FIFO                0,8       0t0  112252724 pipe
hg        22343        hg    5r     FIFO                0,8       0t0  112252724 pipe
ssh       22344        hg    1w     FIFO                0,8       0t0  112252724 pipe
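(As an aside, the same inode cross-check can be done without lsof by walking /proc directly. A rough, Linux-only sketch; `procs_sharing_pipe` is a hypothetical helper, not part of our tooling:)

```python
import glob
import os

def procs_sharing_pipe(inode):
    """Return PIDs holding any fd on pipe:[inode] -- roughly what
    `lsof | grep <inode>` shows, minus the per-fd detail."""
    target = "pipe:[%d]" % inode
    pids = set()
    for fd_path in glob.glob("/proc/[0-9]*/fd/*"):
        try:
            if os.readlink(fd_path) == target:
                # path looks like /proc/<pid>/fd/<n>; element 2 is the pid
                pids.add(int(fd_path.split("/")[2]))
        except OSError:
            continue  # process exited or fd closed mid-scan; also
                      # covers fds of other users we can't read
    return sorted(pids)
```

(Without root it only sees your own processes, which is why lsof as root is the easier tool on the web heads.)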

And like cypress before it, killing off that try serve freed up all of the other hung procs. I'm not sure
what's crossing the streams.
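What's likely crossing the streams is plain fd inheritance: a long-lived process that inherited the write end of another session's pipe keeps that pipe from ever delivering EOF, so the reader blocks forever on read(). A minimal sketch of the mechanism, assuming ordinary POSIX pipe semantics, with `sleep` standing in for the stray ssh master:

```python
import os
import subprocess

# A pipe whose reader expects EOF once the writer goes away.
r, w = os.pipe()

# Long-lived "master" process spawned while the write end is inheritable
# (stand-in for the stray `ssh hg.mozilla.org hg -R try serve --stdio`).
os.set_inheritable(w, True)
master = subprocess.Popen(["sleep", "30"], close_fds=False)

# The intended writer sends its data and closes its copy of the fd.
os.write(w, b"data")
os.close(w)

# The reader gets the data fine...
assert os.read(r, 4) == b"data"

# ...but never sees EOF, because the master still holds a write fd.
# Use a non-blocking read to show that a blocking reader would hang here.
os.set_blocking(r, False)
try:
    os.read(r, 1)
    blocked = False
except BlockingIOError:
    blocked = True  # pipe still "open": a blocking reader waits forever

# Killing the master (as we did with the try serve process) drops the
# last write fd, and the reader finally gets EOF.
master.kill()
master.wait()
os.set_blocking(r, True)
assert os.read(r, 1) == b""  # EOF at last
```

This matches the lsof output above: each hung `hg pull` was reading fd 5, whose other write end lived inside the unrelated pid 21776.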
Comment 5

Another instance of this on hgweb3, occurring nowish.

From #releng:

    12:49 < KWierso|sheriffduty> https://tbpl.mozilla.org/php/getParsedLog.php?id=42876378&tree=Mozilla-Inbound#error0
    12:49 < KWierso|sheriffduty> 12:38:40     INFO -  ERROR: We tried to download the talos.json file but something failed.
    12:49 < KWierso|sheriffduty> 12:38:40     INFO -  ERROR: HTTP Error 404: Not Found
    12:50 < jmaher> KWierso|sheriffduty: that isn't good

From vcs email:
    Unknown delay: pipe_w S+ /opt/vcs2vcs/venv/hg/bin/python /opt/vcs2vcs/bin/hg --cwd /opt/vcs2vcs/repos/integration-gaia-central push --force hg.m.o
    0 S vcs2vcs  25317 25316  0  80   0 - 58941 pipe_w 19:16 pts/3    00:00:00 /opt/vcs2vcs/venv/hg/bin/python /opt/vcs2vcs/bin/hg --cwd /opt/vcs2vcs/repos/integration-gaia-central push --force hg.m.o

Investigating for 30 minutes before resetting.
Summary: hgweb8 sync issues → hgweb* sync issues
(Assignee)

Comment 6

4 years ago
Suspected culprit is the ssh ControlMaster setup. About halfway down http://www.anchor.com.au/blog/2010/02/ssh-controlmaster-the-good-the-bad-the-ugly/ is a paragraph starting with "Finally, there is something of an annoyance with ControlMaster..."

His description matches what we were seeing. Additionally, at one point there was a screen session on hgssh1 holding persistent ssh connections to each web head; it was killed fairly recently (I'm certain it was still there last week), though by then not all of its ssh connections were still alive.

Also, from mpm on #mercurial:
< mpm_> controlmaster is broken.
< mpm_> The 'master' session doesn't properly close fds or something. Everyone doing automation over controlmaster hits this.

I've commented out the ControlMaster bits on the hg web heads. If it happens again, we'll need to look for another cause. Note that manual intervention is still required to unwedge and resync any stuck repos.
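For reference, the sort of stanza that was disabled looks roughly like this; the host pattern and ControlPath value are illustrative, not the actual hg config:

```
Host hgweb*.dmz.scl3.mozilla.com
    # ControlMaster auto
    # ControlPath ~/.ssh/master-%r@%h:%p
```

With ControlMaster off, each mirror push gets its own ssh connection, so no long-lived master process is around to inherit and hold open other sessions' pipe fds.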
Comment 7

(In reply to Kendall Libby [:fubar] from comment #6)
> I've commented out the ControlMaster bits on the hg web heads. 

I think we need to do this on the hgssh* masters as well. Currently some web heads are configured from there:

    [root@hgssh1.dmz.scl3 ~]# grep -i controlmaster ~hg/.ssh/config
    ControlMaster auto
    ControlMaster auto
    ControlMaster auto
    ControlMaster auto
    [root@hgssh1.dmz.scl3 ~]#
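Since the directive shows up in several Host stanzas, a tiny script can comment them all out uniformly. A hedged sketch of such a helper (hypothetical; not what was actually run on hgssh1, where the edit was done by hand):

```python
# Hypothetical helper: comment out every ControlMaster directive in an
# ssh config, leaving all other lines (including indentation) untouched.
def disable_controlmaster(config_text):
    out = []
    for line in config_text.splitlines():
        # Match the directive case-insensitively, wherever it's indented.
        if line.strip().lower().startswith("controlmaster"):
            out.append("#" + line)  # keep the original text for the record
        else:
            out.append(line)
    return "\n".join(out)

if __name__ == "__main__":
    sample = "Host hgweb8\n    ControlMaster auto\n    User hg"
    print(disable_controlmaster(sample))
```

Running it over a config like the one grepped above would leave four `#ControlMaster auto` lines, which is easy to verify with the same grep.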
(Assignee)

Comment 8

4 years ago
I'm not sure if that will have any effect, but it's been commented out. It wasn't managed in puppet.
Assignee: server-ops-webops → klibby
(Assignee)

Comment 9

4 years ago
No recurrences; calling this fixed.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Component: WebOps: Source Control → General
Product: Infrastructure & Operations → Developer Services