Closed Bug 636462 Opened 14 years ago Closed 14 years ago

MTV slaves unable to clone

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dmoore)

References

Details

(Whiteboard: [slaveduty])

(possibly related to bug 636342) We're seeing very long clone times for cloning build/tools (which is less than 10s on a working linux system):

  C:\tmp>hg clone http://hg.mozilla.org/build/tools tools
  requesting all changes
  adding changesets
  (this hasn't finished yet, so I don't have a time for you)

On mw32-ix-slave23, bsmedberg saw:

  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
  ** unknown exception encountered, please report by visiting
  ** http://mercurial.selenic.com/wiki/BugTracker
  ** Python 2.6.6 (r266:84297, Aug 24 2010, 18:46:32) [MSC v.1500 32 bit (Intel)]
  ** Mercurial Distributed SCM (version 1.7.5)
  ** Extensions loaded: win32text, graphlog, share, purge
  Traceback (most recent call last):
    File "hg", line 38, in <module>
    File "mercurial\dispatch.pyc", line 16, in run
    File "mercurial\dispatch.pyc", line 36, in dispatch
    File "mercurial\dispatch.pyc", line 58, in _runcatch
    File "mercurial\dispatch.pyc", line 593, in _dispatch
    File "mercurial\dispatch.pyc", line 401, in runcommand
    File "mercurial\dispatch.pyc", line 644, in _runcommand
    File "mercurial\dispatch.pyc", line 598, in checkargs
    File "mercurial\dispatch.pyc", line 591, in <lambda>
    File "mercurial\util.pyc", line 426, in check
    File "mercurial\commands.pyc", line 736, in clone
    File "mercurial\hg.pyc", line 337, in clone
    File "mercurial\localrepo.pyc", line 1886, in clone
    File "mercurial\localrepo.pyc", line 1295, in pull
    File "mercurial\localrepo.pyc", line 1739, in addchangegroup
    File "mercurial\revlog.pyc", line 1381, in addgroup
    File "mercurial\revlog.pyc", line 1220, in _addrevision
  mpatch.mpatchError: patch cannot be decoded
Is this related to the problems we had a few days ago in bug 635501?
This seems to affect more than just Windows.
Summary: MTV windows slaves unable to clone → MTV slaves unable to clone
We hit this exact same hg error yesterday on slaves because of the firewall change in 650castro. Details in https://bugzilla.mozilla.org/show_bug.cgi?id=636342#c3. Looping Ravi in case it's something in his court.
302 https://bugzilla.mozilla.org/show_bug.cgi?id=636342#c23

I'll also note that the config is rolled back to before you were experiencing the problems -- by that I mean it is gone completely. Did this problem manifest itself overnight, or was it broken since ~1720 yesterday, or even after dmoore rolled back the changes at ~2200?
I don't have a good way to determine when this started. We currently have most of our mtv slaves disabled. I've been running

  cd /tmp && time hg clone http://hg.mozilla.org/build/tools && rm -rf /tmp/tools

on slaves to verify that they can do a checkout quickly. I cannot get it to hang in scl, but it often hangs in mtv, and the hang doesn't seem to be specific to the host machine. So this is still a problem, and it has a significant chunk of our slave architecture down at the moment.
This also seems to work fine in mpt. I'm running the hg command above in a while loop on 5 slaves in each datacenter, and it works predictably in scl and mpt and fails reliably in mtv, even if only run on one slave. This was not the case an hour or so ago, when I was running this command successfully on (the same) mtv slaves.
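For reference, a minimal sketch of the kind of repeated-clone check described above; the loop structure, log path, and use of GNU timeout are assumptions for illustration, not the exact commands run on the slaves (the 20-minute figure matches the job timeout mentioned a few comments below):

  # Time a clone of build/tools over and over; anything that fails or
  # doesn't finish within 20 minutes gets logged as a failure.
  while true; do
    cd /tmp
    time timeout 1200 hg clone http://hg.mozilla.org/build/tools tools \
      || echo "$(date): clone failed or timed out" >> /tmp/clone-failures.log
    rm -rf /tmp/tools
    sleep 60
  done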
I had a quick look and found failures at the following times with the same symptoms. There may be more; I haven't done a complete look.

  Feb 23 16:31
  Feb 23 23:41
  Feb 24 05:11
  Feb 24 05:39
  Feb 24 05:52
  Feb 24 08:03
  Feb 24 08:40
What constitutes a failure here? Completely unable to do `hg clone http://hg.mozilla.org/build/tools`? Who maintains hg? Can someone look into that?
A failure is the clone failing to complete before the 20-minute timeout. IT owns hg.m.o.
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Severity: normal → critical
Severity: critical → blocker
Severity: blocker → critical
Moving to get IT eyes on hg infra.
Severity: critical → blocker
(In reply to comment #8)
> What constitutes a failure here? Completely unable to do `hg clone
> http://hg.mozilla.org/build/tools`? Who maintains hg? Can someone look into
> that?

Aravind, can you look at hg today?
Assignee: server-ops → aravind
Severity: blocker → critical
Can the build machines that are failing even telnet to hg.m.o port 80?
(In reply to comment #12)
> Can the build machines that are failing even telnet to hg.m.o port 80?

Yup, and they even receive *some* data. Sometimes they time out after 20 minutes, sometimes they get a weird network issue, like in comment #0.
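For reference, the kind of basic reachability check being discussed looks like this (the request shown is only an illustration; any HTTP request against the host will do):

  # Check that the slave can open a TCP connection to hg.m.o on port 80
  telnet hg.mozilla.org 80
  # ...then type a minimal request by hand, e.g.:
  #   GET / HTTP/1.0
  #   (followed by a blank line)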
I'm starting to wonder if this is hg-specific -- I'm currently running:

  wget http://people.mozilla.com/~bhearsum/tegra-host-utils.zip

on bm-remote-talos-webhost-01 (a linux machine in MTV), and it's only getting 10kb/sec.
(In reply to comment #14)
> I'm starting to wonder if this is hg-specific -- I'm currently running:
> wget http://people.mozilla.com/~bhearsum/tegra-host-utils.zip

s/is/isn't/, of course.

> on bm-remote-talos-webhost-01 (a linux machine in MTV), and it's only getting
> 10kb/sec
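For anyone reproducing the throughput symptom, a quick spot-check along the same lines; using curl's --write-out reporting here is just one convenient way to get a number (the URL is the one from the comment above):

  # Download the same file and report the average transfer rate in bytes/sec
  curl -o /dev/null -w 'average speed: %{speed_download} bytes/sec\n' \
    http://people.mozilla.com/~bhearsum/tegra-host-utils.zip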
This seems to be causing issues for the tegras downloading builds/test files from stage.m.o.
Assignee: aravind → dmoore
Netops is taking this bug; it seems to be an interaction with the ethernet drivers on the servers in MPT. We'll follow up once we've worked with infra to gather more data. At this time, you should be seeing significantly improved throughput for connections to hg.
Our current fix is disabling the hardware-based TCP segmentation offloading on the *server* side:

  ethtool -K <dev> tso off

This is not a permanent fix, as this setting defaults to on and the change won't survive a reboot.
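A sketch of verifying the workaround on a given server; the interface name eth0 is illustrative, and this needs root:

  # Confirm TSO is now off for the interface
  ethtool -k eth0 | grep tcp-segmentation-offload
  # expected after the workaround:
  #   tcp-segmentation-offload: off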
Severity: critical → major
We've applied this change to hg (dm-hg02), people, and stage (surf).
Severity: major → blocker
Ravi applied the NAT workaround at 00:45 PST, so we can sleep.
Stage.m.o wasn't touched at 00:45, so the tegras are seeing issues.
Just applied a similar workaround to stage as with hg.
After some IRC discussion, it sounds like we should bring the remaining disabled mtv slaves (about 45 of them) back online. It's bear's call, but it's probably wisest to bring up some fraction of those 45 tomorrow (Tuesday) morning, then watch for problems throughout the day and, barring any failures, bring the rest up on Wednesday.
from discussion with dmoore, ravi, zandr, dustin:

1) IT do not (currently) believe the failures are load-related, and so the workarounds in place should continue to work even as RelEng brings machines in 650castro back into production.

RelEng is nervous of bringing these all back into production because
* it takes a long time to bring them into production
* if they fail, they burn builds in production (and cause tree closures)
* it takes a long time to take them all out of production again

To get out of this deadlock, RelEng will bring up some slaves Wed, watch with IT, and if all is still ok, then bring the rest back into production Thurs. (We're explicitly not doing anything tomorrow, Tues, because of release embargo in progress.)
oops, Tuesday's an embargo day, so we bring up a fraction on Wednesday and another fraction on Thursday.
Whiteboard: [slaveduty]
This shouldn't be blocking anything. We have had a work-around in place since Friday. We're in a holding pattern until slaves can be brought back to test and verify.
Severity: blocker → major
Bear brought 50% of the remainder up in the last hour or so. Now we wait to see what happens.
Shortly after bringing up that 50%, we saw more failures - bug 638309. It may be unrelated, but I would *much* appreciate it if the two of you could take a look at what we're seeing there. At this point, we're planning some significant changes to our architecture to work around the general area where things are failing (the mountain view firewalls), but we're doing so without full knowledge of the problem(s), so our plans are probably not optimal. We need help desperately.
For clarification, who is "you two"?
Sorry: ravi and dmoore, who were so helpful in chasing this bug down.
During triage: bug#639630 is tracking bringing the remaining ix machines in 650castro online in *staging*, to recreate the load issues and see if that helps IT debug.
See Also: → 639630
(In reply to comment #28)
> This shouldn't be blocking anything. We have had a work-around in place since
> Friday. We're in a holding pattern until slaves can be brought back to test
> and verify.

The remaining ix machines in 650castro have been running in production since Monday, and so far there have been no problems. Looks like the workarounds are doing the trick.

From email+irc with mrz and ravi: instead of spending more time on debugging this, let's close this bug and just get these ix machines moved to real homes. The curious can follow along in bug#642305 and bug#636743.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
See Also: → 642305
Product: mozilla.org → mozilla.org Graveyard