Closed Bug 1030759 Opened 10 years ago Closed 10 years ago

hgweb* sync issues

Categories

(Developer Services :: General, task)

Platform: x86 macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: fubar, Assigned: fubar)

References

Details

Reports in #it of pushes hanging, even after they appeared on hg.m.o. Noticed my own push to hg_templates was hanging. ~15 pushes were stuck attempting to sync to hgweb8. I'd noticed a huge bout of latency on hgweb8 at the same time; I'd had a terminal open that suddenly stopped responding, but chalked it up to wireless/vpn, as I see small bits of that semi-regularly.

I'm currently in the process of manually terminating the pushes and syncing each repo. The data is already in the repo, so there shouldn't be any issues there; users whose pushes get killed will see an ssh error, though.

Also, hgweb8 was set to draining in zeus, so there was no chance of it sending out-of-date data to anyone. Color us lucky, there.
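For posterity, the manual cleanup goes roughly like this. This is a sketch, not the literal commands: repo names, paths, and PIDs are placeholders, and I'm assuming the mirror sync can simply be re-driven with a plain pull on the web head.

    # on hgssh: find the mirror ssh processes wedged against hgweb8
    ps -Af | grep '/etc/mercurial/mirror' | grep hgweb8

    # kill the wedged one for a given repo (PID from the ps output above);
    # the pushing user sees an ssh error, but their data is already in the master repo
    kill <pid>

    # then re-sync that repo on the web head, something along the lines of:
    ssh hgweb8.dmz.scl3.mozilla.com "hg -R <local-repo-path> pull ssh://hg.mozilla.org/<repo>"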
List of repos that were waiting on hgweb8:

mozilla-central
build/puppet
projects/cedar
users/lwagner_mozilla.com/inbound-patches
releases/mozilla-esr31
build/buildbot-configs
integration/mozilla-inbound
private/fuzzing
releases/mozilla-b2g28_v1_3t
hgcustom/hg_templates
integration/fx-team
projects/cypress
integration/gaia-central
try
integration/gaia-try
integration/b2g-inbound
Killed the cypress push and immediately all the others finished. Not sure if that's related or just an odd coincidence; killing other pushes had no effect on the rest, afaict.
Definitely related. hgweb8 had another wedge on Friday (June 27) that blocked vcs-sync and user pushes. We freed most manually but left a handful around to debug:

hg       15818 15793  0 Jun27 ?        00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- users/mgervasini_mozilla.com/mozharness

hg       22885 22861  0 Jun27 ?        00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- build/relengapi

hg       24585 24561  0 Jun27 ?        00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- users/asasaki_mozilla.com/ash-mozharness

hg       31817 31793  0 Jun27 ?        00:00:00 /usr/bin/ssh -l hg -i /etc/mercurial/mirror -o StrictHostKeyChecking=no -o ConnectTimeout=3s -o PasswordAuthentication=no -o PreferredAuthentications=publickey -o UserKnownHostsFile=/etc/mercurial/known_hosts hgweb8.dmz.scl3.mozilla.com -- users/jlund_mozilla.com/slaveapi


Started by stracing the matching ssh/hg processes on hgweb8 and found an unusual twist with the fds:

# ps -Af | grep relengapi
hg       24541 24506  0 Jun27 ?        00:00:00 /usr/bin/python /usr/bin/hg pull --config hooks.pretxnchangegroup.z_linearhistory= --config hooks.pretxnchangegroup.z_loghistory= --config trusted.users=root,hg --config paths.default=ssh://hg.mozilla.org/build/relengapi
hg       24542 24541  0 Jun27 ?        00:00:00 ssh hg.mozilla.org hg -R build/relengapi serve --stdio

# strace -fae -p 24541
Process 24541 attached - interrupt to quit
read(5, ^C <unfinished ...>
Process 24541 detached

# ls -AlF /proc/24541/fd
total 0
dr-x------ 2 hg hg  0 Jun 30 17:38 ./
dr-xr-xr-x 8 hg hg  0 Jun 27 20:48 ../
lr-x------ 1 hg hg 64 Jun 30 17:39 0 -> pipe:[112261711]
l-wx------ 1 hg hg 64 Jun 30 17:39 1 -> pipe:[112261712]
l-wx------ 1 hg hg 64 Jun 30 17:39 2 -> pipe:[112261713]
l-wx------ 1 hg hg 64 Jun 30 17:39 4 -> pipe:[112261755]
lr-x------ 1 hg hg 64 Jun 30 17:39 5 -> pipe:[112261756]
lr-x------ 1 hg hg 64 Jun 30 17:39 7 -> pipe:[112261757]

# lsof | grep 112261756
ssh       21776        hg   26w     FIFO                0,8       0t0  112261756 pipe
hg        24541        hg    5r     FIFO                0,8       0t0  112261756 pipe
ssh       24542        hg    1w     FIFO                0,8       0t0  112261756 pipe

# ps -Af |grep 21776
hg       21776 21775  0 Jun27 ?        00:00:00 ssh hg.mozilla.org hg -R try serve --stdio

... wait, what? Checked again with another repo (users/mgervasini_mozilla.com/mozharness) and found the same thing:

hg       22343 22308  0 Jun27 ?        00:00:00 /usr/bin/python /usr/bin/hg pull --config hooks.pretxnchangegroup.z_linearhistory= --config hooks.pretxnchangegroup.z_loghistory= --config trusted.users=root,hg --config paths.default=ssh://hg.mozilla.org/users/mgervasini_mozilla.com/mozharness

# ll /proc/22343/fd/5
lr-x------ 1 hg hg 64 Jun 30 17:39 5 -> pipe:[112252724]

# lsof | grep 112252724
ssh       21776        hg   22w     FIFO                0,8       0t0  112252724 pipe
hg        22343        hg    5r     FIFO                0,8       0t0  112252724 pipe
ssh       22344        hg    1w     FIFO                0,8       0t0  112252724 pipe

And as with cypress before it, killing off that try serve freed up all of the other hung procs. I'm not sure what's crossing the streams.
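To spell out the cross-reference (a condensed sketch, using the PIDs and fd numbers from this instance): take the fd the hung hg pull is blocked reading, resolve it to a pipe inode via /proc, then ask lsof which other processes have that pipe open.

    # strace shows the hung 'hg pull' (pid 24541) blocked in read() on fd 5
    readlink /proc/24541/fd/5        # -> pipe:[112261756]

    # list every process holding that pipe inode
    lsof | grep 112261756

    # expected: only the pull and its own 'ssh ... serve --stdio' child;
    # actual: the long-lived 'ssh hg.mozilla.org hg -R try serve --stdio' (pid 21776)
    # also holds a write end, which is why killing it unblocks everything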
Another instance of this is occurring on hgweb3 right now.

From #releng:

    12:49 < KWierso|sheriffduty> https://tbpl.mozilla.org/php/getParsedLog.php?id=42876378&tree=Mozilla-Inbound#error0
    12:49 < KWierso|sheriffduty> 12:38:40     INFO -  ERROR: We tried to download the talos.json file but something failed.
    12:49 < KWierso|sheriffduty> 12:38:40     INFO -  ERROR: HTTP Error 404: Not Found
    12:50 < jmaher> KWierso|sheriffduty: that isn't good

From vcs email:
    Unknown delay: pipe_w S+ /opt/vcs2vcs/venv/hg/bin/python /opt/vcs2vcs/bin/hg --cwd /opt/vcs2vcs/repos/integration-gaia-central push --force hg.m.o
    0 S vcs2vcs  25317 25316  0  80   0 - 58941 pipe_w 19:16 pts/3    00:00:00 /opt/vcs2vcs/venv/hg/bin/python /opt/vcs2vcs/bin/hg --cwd /opt/vcs2vcs/repos/integration-gaia-central push --force hg.m.o

Investigating for 30 minutes before resetting.
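(Side note: the "pipe_w" in that ps output is the tell that the push is blocked waiting on a pipe. A quick way to spot such processes is to pull the wait channel out of ps, sketched below; the exact wchan name varies by kernel.)

    # list processes blocked on a pipe (wchan shows pipe_w / pipe_wait)
    ps -eo pid,user,stat,wchan:20,args | grep -i pipe_w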
Summary: hgweb8 sync issues → hgweb* sync issues
The suspected culprit is the ssh ControlMaster setup. See about halfway down http://www.anchor.com.au/blog/2010/02/ssh-controlmaster-the-good-the-bad-the-ugly/, the paragraph starting with "Finally, there is something of an annoyance with ControlMaster..."

His description matches what we were seeing. Additionally, at one point there had been a screen session on hgssh1 with persistent ssh connections to each web head; that has since been killed, fairly recently (I'm certain it was still there last week), though by then not all of the ssh connections were still alive.

Also, also, from mpm on #mercurial:
< mpm_> controlmaster is broken.
< mpm_> The 'master' session doesn't properly close fds or something. Everyone doing automation over controlmaster hits this.

I've commented out the ControlMaster bits on the hg web heads. On the off chance it happens again, we'll have to go through this again: manual intervention is required to unwedge and resync the stuck repos.
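For reference, the bits in question are the ssh connection-sharing directives in the hg user's ssh config on the web heads. The actual stanza isn't pasted in this bug; it's the usual shape, something like the below (socket path illustrative), now commented out:

    Host hg.mozilla.org
        # ControlMaster auto                   # disabled: the shared master session
        # ControlPath ~/.ssh/master-%r@%h:%p   # ends up holding pipe fds from other sessions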
(In reply to Kendall Libby [:fubar] from comment #6)
> I've commented out the ControlMaster bits on the hg web heads. 

I think we need to do this on the hgssh* masters as well. ControlMaster is currently configured there for connections to some of the web heads:

    [root@hgssh1.dmz.scl3 ~]# grep -i controlmaster ~hg/.ssh/config
    ControlMaster auto
    ControlMaster auto
    ControlMaster auto
    ControlMaster auto
    [root@hgssh1.dmz.scl3 ~]#
I'm not sure if that would have any effect, but it's been commented out. It wasn't in puppet.
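(One general note, not something recorded in this bug: commenting out the config only prevents new masters; any existing master connection keeps serving until it exits. So it's worth checking for leftovers and closing them, roughly:)

    # any long-lived ssh sessions from the hg user to the web heads still around?
    ps -fu hg | grep 'ssh .*hgweb'

    # if a control socket is still on disk (path is whatever ControlPath was set to),
    # confirm the master is live and then ask it to exit
    ssh -S <control-socket-path> -O check hgweb8.dmz.scl3.mozilla.com
    ssh -S <control-socket-path> -O exit hgweb8.dmz.scl3.mozilla.com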
Assignee: server-ops-webops → klibby
No recurrences, calling this fixed.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: WebOps: Source Control → General
Product: Infrastructure & Operations → Developer Services