Closed Bug 636462 Opened 14 years ago Closed 14 years ago

MTV slaves unable to clone

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dmoore)

References

Details

(Whiteboard: [slaveduty])

(possibly related to bug 636342) We're seeing very long clone times for cloning build/tools (which is less than 10s on a working linux system):

  C:\tmp>hg clone http://hg.mozilla.org/build/tools tools
  requesting all changes
  adding changesets
  (this hasn't finished yet, so I don't have a time for you)

On mw32-ix-slave23, bsmedberg saw:

  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
  ** unknown exception encountered, please report by visiting
  ** http://mercurial.selenic.com/wiki/BugTracker
  ** Python 2.6.6 (r266:84297, Aug 24 2010, 18:46:32) [MSC v.1500 32 bit (Intel)]
  ** Mercurial Distributed SCM (version 1.7.5)
  ** Extensions loaded: win32text, graphlog, share, purge
  Traceback (most recent call last):
    File "hg", line 38, in <module>
    File "mercurial\dispatch.pyc", line 16, in run
    File "mercurial\dispatch.pyc", line 36, in dispatch
    File "mercurial\dispatch.pyc", line 58, in _runcatch
    File "mercurial\dispatch.pyc", line 593, in _dispatch
    File "mercurial\dispatch.pyc", line 401, in runcommand
    File "mercurial\dispatch.pyc", line 644, in _runcommand
    File "mercurial\dispatch.pyc", line 598, in checkargs
    File "mercurial\dispatch.pyc", line 591, in <lambda>
    File "mercurial\util.pyc", line 426, in check
    File "mercurial\commands.pyc", line 736, in clone
    File "mercurial\hg.pyc", line 337, in clone
    File "mercurial\localrepo.pyc", line 1886, in clone
    File "mercurial\localrepo.pyc", line 1295, in pull
    File "mercurial\localrepo.pyc", line 1739, in addchangegroup
    File "mercurial\revlog.pyc", line 1381, in addgroup
    File "mercurial\revlog.pyc", line 1220, in _addrevision
  mpatch.mpatchError: patch cannot be decoded
Is this related to the problems we had a few days ago in bug 635501?
This seems to affect more than just Windows.
Summary: MTV windows slaves unable to clone → MTV slaves unable to clone
We hit this exact same hg error yesterday on slaves because of the firewall change in 650castro. Details in https://bugzilla.mozilla.org/show_bug.cgi?id=636342#c3. Looping Ravi in case it's something in his court.
302 https://bugzilla.mozilla.org/show_bug.cgi?id=636342#c23

I'll also note that the config is rolled back to before you were experiencing the problems -- by that I mean it is gone completely. Did this problem manifest itself overnight, or was it broken since ~1720 yesterday, or even after dmoore rolled back the changes at ~2200?
I don't have a good way to determine when this started. We currently have most of our mtv slaves disabled. I've been running

  cd /tmp && time hg clone http://hg.mozilla.org/build/tools && rm -rf /tmp/tools

on slaves to verify that they can do a checkout quickly. I cannot get it to hang in scl, but it often hangs in mtv, and the hang doesn't seem to be specific to the host machine. So this is still a problem, and it has a significant chunk of our slave architecture down at the moment.
This also seems to work fine in mpt. I'm running the hg command above in a while loop on 5 slaves in each datacenter, and it works predictably in scl and mpt and fails reliably in mtv, even if only run on one slave. This was not the case an hour or so ago, when I was running this command successfully on (the same) mtv slaves.
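For reference, a minimal sketch of the kind of repeated-clone check described above; the loop structure, log path, and use of GNU timeout are assumptions for illustration, not the exact commands run on the slaves (the 20-minute figure matches the job timeout mentioned a few comments below):

  # Time a clone of build/tools over and over; anything that fails or
  # doesn't finish within 20 minutes gets logged as a failure.
  while true; do
    cd /tmp
    time timeout 1200 hg clone http://hg.mozilla.org/build/tools tools \
      || echo "$(date): clone failed or timed out" >> /tmp/clone-failures.log
    rm -rf /tmp/tools
    sleep 60
  done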
I had a quick look and found failures at the following times with the same symptoms. There may be more; I haven't done a complete look.

  Feb 23 16:31
  Feb 23 23:41
  Feb 24 05:11
  Feb 24 05:39
  Feb 24 05:52
  Feb 24 08:03
  Feb 24 08:40
What constitutes a failure here? Completely unable to do `hg clone http://hg.mozilla.org/build/tools`? Who maintains hg? Can someone look into that?
A failure is the clone failing to complete before the 20-minute timeout. IT owns hg.m.o.
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Severity: normal → critical
Severity: critical → blocker
Severity: blocker → critical
Moving to get IT eyes on hg infra.
Severity: critical → blocker
(In reply to comment #8)
> What constitutes a failure here? Completely unable to do `hg clone
> http://hg.mozilla.org/build/tools`? Who maintains hg? Can someone look into
> that?

Aravind, can you look at hg today?
Assignee: server-ops → aravind
Severity: blocker → critical
Can the build machines that are failing even telnet to hg.m.o port 80?
(In reply to comment #12)
> Can the build machines that are failing even telnet to hg.m.o port 80?

Yup, and they even receive *some* data. Sometimes they time out after 20 minutes, sometimes they get a weird network issue, like in comment #0.
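For reference, the kind of basic reachability check being discussed looks like this (the request shown is only an illustration; any HTTP request against the host will do):

  # Check that the slave can open a TCP connection to hg.m.o on port 80
  telnet hg.mozilla.org 80
  # ...then type a minimal request by hand, e.g.:
  #   GET / HTTP/1.0
  #   (followed by a blank line)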
I'm starting to wonder if this is hg-specific -- I'm currently running:

  wget http://people.mozilla.com/~bhearsum/tegra-host-utils.zip

on bm-remote-talos-webhost-01 (a linux machine in MTV), and it's only getting 10kb/sec.
(In reply to comment #14)
> I'm starting to wonder if this is hg-specific -- I'm currently running:
> wget http://people.mozilla.com/~bhearsum/tegra-host-utils.zip

s/is/isn't/, of course.

> on bm-remote-talos-webhost-01 (a linux machine in MTV), and it's only getting
> 10kb/sec
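For anyone reproducing the throughput symptom, a quick spot-check along the same lines; using curl's --write-out reporting here is just one convenient way to get a number (the URL is the one from the comment above):

  # Download the same file and report the average transfer rate in bytes/sec
  curl -o /dev/null -w 'average speed: %{speed_download} bytes/sec\n' \
    http://people.mozilla.com/~bhearsum/tegra-host-utils.zip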
This seems to be causing issues for the tegras downloading builds/test files from stage.m.o.
Assignee: aravind → dmoore
Netops is taking this bug; it seems to be an interaction with the ethernet drivers on the servers in MPT. We'll follow up once we've worked with infra to gather more data. At this time, you should be seeing significantly improved throughput for connections to hg.
Our current fix is disabling the hardware-based TCP segmentation offloading on the *server* side:

  ethtool -K <dev> tso off

This is not a permanent fix, as this setting defaults to on and the change won't survive a reboot.
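A sketch of verifying the workaround on a given server; the interface name eth0 is illustrative, and this needs root:

  # Confirm TSO is now off for the interface
  ethtool -k eth0 | grep tcp-segmentation-offload
  # expected after the workaround:
  #   tcp-segmentation-offload: off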
Severity: critical → major
We've applied this change to hg (dm-hg02), people, and stage (surf).
Severity: major → blocker
Ravi applied the NAT workaround at 00:45 PST, so we can sleep.
Stage.m.o wasn't touched at 00:45, so the tegras are seeing issues.
Just applied a similar workaround to stage as with hg.
After some IRC discussion, it sounds like we should bring the remaining disabled mtv slaves (about 45 of them) back online. It's bear's call, but it's probably wisest to bring up some fraction of those 45 tomorrow (Tuesday) morning, then watch for problems throughout the day and, barring any failures, bring the rest up on Wednesday.
from discussion with dmoore, ravi, zandr, dustin:

1) IT do not (currently) believe the failures are load-related, and so the workarounds in place should continue to work even as RelEng brings machines in 650castro back into production.

RelEng is nervous of bringing these all back into production because
* it takes a long time to bring them into production
* if they fail, they burn builds in production (and cause tree closures)
* it takes a long time to take them all out of production again

To get out of this deadlock, RelEng will bring up some slaves Wed, watch with IT, and if all is still ok, then bring the rest back into production Thurs. (We're explicitly not doing anything tomorrow, Tues, because of release embargo in progress.)
oops, Tuesday's an embargo day, so we bring up a fraction on Wednesday and another fraction on Thursday.
Whiteboard: [slaveduty]
This shouldn't be blocking anything. We have had a work-around in place since Friday. We're in a holding pattern until slaves can be brought back to test and verify.
Severity: blocker → major
Bear brought 50% of the remainder up in the last hour or so. Now we wait to see what happens.
Shortly after bringing up that 50%, we saw more failures - bug 638309. It may be unrelated, but I would *much* appreciate it if the two of you could take a look at what we're seeing there. At this point, we're planning some significant changes to our architecture to work around the general area where things are failing (the mountain view firewalls), but we're doing so without full knowledge of the problem(s), so our plans are probably not optimal. We need help desperately.
For clarification, who is "you two"?
Sorry: ravi and dmoore, who were so helpful in chasing this bug down.
During triage: bug#639630 is tracking bringing the remaining ix machines in 650castro online in *staging*, to recreate the load issues and see if that helps IT debug.
See Also: → 639630
(In reply to comment #28)
> This shouldn't be blocking anything. We have had a work-around in place since
> Friday. We're in a holding pattern until slaves can be brought back to test
> and verify.

The remaining ix machines in 650castro have been running in production since Monday, and so far there have been no problems. Looks like the workarounds are doing the trick.

From email+irc with mrz and ravi: instead of spending more time on debugging this, let's close this bug and just get these ix machines moved to real homes. The curious can follow along in bug#642305 and bug#636743.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
See Also: → 642305
Product: mozilla.org → mozilla.org Graveyard