Bug 636462 - MTV slaves unable to clone
Status: RESOLVED FIXED (opened 14 years ago; closed 14 years ago)
Product/Component: mozilla.org Graveyard :: Server Operations (task)
Reporter: dustin; Assignee: dmoore
Whiteboard: [slaveduty]
(possibly related to bug 636342)
Description (Reporter) • 14 years ago
We're seeing very long clone times for build/tools, which clones in less than 10s on a working Linux system:
C:\tmp>hg clone http://hg.mozilla.org/build/tools tools
requesting all changes
adding changesets
(this hasn't finished yet, so I don't have a time for you)
On mw32-ix-slave23, bsmedberg saw:
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
** unknown exception encountered, please report by visiting
** http://mercurial.selenic.com/wiki/BugTracker
** Python 2.6.6 (r266:84297, Aug 24 2010, 18:46:32) [MSC v.1500 32 bit (Intel)]
** Mercurial Distributed SCM (version 1.7.5)
** Extensions loaded: win32text, graphlog, share, purge
Traceback (most recent call last):
File "hg", line 38, in <module>
File "mercurial\dispatch.pyc", line 16, in run
File "mercurial\dispatch.pyc", line 36, in dispatch
File "mercurial\dispatch.pyc", line 58, in _runcatch
File "mercurial\dispatch.pyc", line 593, in _dispatch
File "mercurial\dispatch.pyc", line 401, in runcommand
File "mercurial\dispatch.pyc", line 644, in _runcommand
File "mercurial\dispatch.pyc", line 598, in checkargs
File "mercurial\dispatch.pyc", line 591, in <lambda>
File "mercurial\util.pyc", line 426, in check
File "mercurial\commands.pyc", line 736, in clone
File "mercurial\hg.pyc", line 337, in clone
File "mercurial\localrepo.pyc", line 1886, in clone
File "mercurial\localrepo.pyc", line 1295, in pull
File "mercurial\localrepo.pyc", line 1739, in addchangegroup
File "mercurial\revlog.pyc", line 1381, in addgroup
File "mercurial\revlog.pyc", line 1220, in _addrevision
mpatch.mpatchError: patch cannot be decoded
Comment 1 (Reporter) • 14 years ago
Is this related to the problems we had a few days ago in bug 635501?
Comment 2 • 14 years ago
This seems to affect more than just Windows.
Summary: MTV windows slaves unable to clone → MTV slaves unable to clone
Comment 3 • 14 years ago
We hit this exact same hg error yesterday on slaves because of the firewall change in 650castro. Details in https://bugzilla.mozilla.org/show_bug.cgi?id=636342#c3.
Looping Ravi in case it's something in his court.
Comment 4 • 14 years ago
302 https://bugzilla.mozilla.org/show_bug.cgi?id=636342#c23
I'll also note that the config has been rolled back to before you were experiencing the problems; that change is now gone completely.
Did this problem manifest itself overnight, was it broken since ~1720 yesterday, or did it persist even after dmoore rolled back the changes at ~2200?
Comment 5 (Reporter) • 14 years ago
I don't have a good way to determine when this started.
We currently have most of our mtv slaves disabled. I've been running
cd /tmp && time hg clone http://hg.mozilla.org/build/tools && rm -rf /tmp/tools
on slaves to verify that they can do a checkout quickly, and I cannot get it to hang in scl, but it often hangs in mtv. The hang doesn't seem to be specific to the host machine.
So this is still a problem, and has a significant chunk of our slave architecture down at the moment.
Comment 6 (Reporter) • 14 years ago
This also seems to work fine in mpt. I'm running the hg command above in a while loop on 5 slaves in each datacenter, and it works predictably in scl and mpt and fails reliably in mtv, even if only run on one slave.
This was not the case an hour or so ago, when I was running this command successfully on (the same) mtv slaves.
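For reference, a minimal sketch of the kind of repeated-clone loop described above; the exact loop isn't recorded in the bug, and the scratch path and log file here are assumptions:
# Hypothetical sketch of the repeated-clone test described above;
# the scratch path and log file are assumptions, not the exact loop used.
while true; do
    rm -rf /tmp/tools
    ( cd /tmp && time hg clone http://hg.mozilla.org/build/tools ) \
        || echo "clone failed at $(date)" >> /tmp/clone-failures.log
done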
Comment 7 • 14 years ago
I had a quick look and found failures at the following times with the same symptoms. There may be more, I haven't done a complete look.
Feb 23 16:31
Feb 23 23:41
Feb 24 05:11
Feb 24 05:39
Feb 24 05:52
Feb 24 08:03
Feb 24 08:40
Comment 8 • 14 years ago
What constitutes a failure here? Completely unable to do `hg clone http://hg.mozilla.org/build/tools`? Who maintains hg? Can someone look into that?
Comment 9 • 14 years ago
A failure is the clone failing to complete before the 20 minute timeout.
IT owns HG.m.o.
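A hedged illustration of that failure criterion; the `timeout` wrapper is an assumption about how one might codify the 20-minute limit, not what the slaves actually ran:
# Assumed sketch: treat any clone that takes longer than 20 minutes as a failure.
# Requires GNU coreutils `timeout`; exit status 124 means the limit was hit.
rm -rf /tmp/tools
if timeout 1200 hg clone http://hg.mozilla.org/build/tools /tmp/tools; then
    echo "clone completed within 20 minutes"
else
    echo "clone failed or timed out"
fi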
Updated • 14 years ago
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Updated • 14 years ago
Severity: normal → critical
Updated • 14 years ago
Severity: critical → blocker
Updated • 14 years ago
Severity: blocker → critical
Comment 10 • 14 years ago
Moving to get IT eyes on hg infra.
Updated • 14 years ago
Severity: critical → blocker
Comment 11 • 14 years ago
(In reply to comment #8)
> What constitutes a failure here? Completely unable to do `hg clone
> http://hg.mozilla.org/build/tools`? Who maintains hg? Can someone look into
> that?
Aravind, can you look at hg today?
Assignee: server-ops → aravind
Severity: blocker → critical
Comment 12 • 14 years ago
Can the build machines that are failing even telnet to hg.m.o port 80?
Comment 13 • 14 years ago
(In reply to comment #12)
> Can the build machines that are failing even telnet to hg.m.o port 80?
Yup, and they even receive *some* data. Sometimes they time out after 20 minutes, sometimes they get a weird network issue, like in comment #0.
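Illustrative versions of the checks discussed in the last two comments; the exact commands used on the slaves aren't recorded in the bug:
# Not the exact commands run on the slaves; shown only to make the check concrete.
# Basic TCP reachability to hg.m.o on port 80:
nc -z -w 10 hg.mozilla.org 80 && echo "port 80 reachable"
# Ask hg for the remote tip without pulling the full changegroup:
hg id http://hg.mozilla.org/build/tools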
Comment 14 • 14 years ago
I'm starting to wonder if this is hg-specific -- I'm currently running:
wget http://people.mozilla.com/~bhearsum/tegra-host-utils.zip
on bm-remote-talos-webhost-01 (a Linux machine in MTV), and it's only getting 10kb/sec
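A sketch of how that throughput observation could be repeated and quantified; the loop and the curl invocation are assumptions, and only the URL comes from the comment above:
# Assumed probe, not what was actually run; repeats the download a few times
# and reports the transfer speed curl measured for each attempt.
for i in 1 2 3; do
    curl -s -o /dev/null \
        -w "attempt $i: %{speed_download} bytes/sec in %{time_total}s\n" \
        http://people.mozilla.com/~bhearsum/tegra-host-utils.zip
done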
Comment 15 • 14 years ago
(In reply to comment #14)
> I'm starting to wonder if this is hg-specific -- I'm currently running:
> wget http://people.mozilla.com/~bhearsum/tegra-host-utils.zip
s/is/isn't/, of course.
> on bm-remote-talos-webhost-01 (a Linux machine in MTV), and it's only getting
> 10kb/sec
Comment 16 • 14 years ago
This seems to be causing issues for the tegras downloading builds/test files from stage.m.o.
Updated (Assignee) • 14 years ago
Assignee: aravind → dmoore
Comment 17 (Assignee) • 14 years ago
netops is taking this bug; it seems to be an interaction with the Ethernet drivers on the servers in MPT. We'll follow up once we've worked with infra to gather more data.
At this time, you should be seeing significantly improved throughput for connections to hg.
Comment 18 (Assignee) • 14 years ago
Our current fix is disabling the hardware-based TCP segmentation offloading on the *server* side:
ethtool -K <dev> tso off
This is not a permanent fix; the setting won't persist, since this value defaults to on.
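A hedged sketch of checking and applying that setting; the device name is a placeholder, and the persistence note is an assumption beyond what the comment states:
# eth0 is a placeholder for the server's actual interface.
# Show the current TCP segmentation offload state:
ethtool -k eth0 | grep -i segmentation
# Disable TSO as described above; this does not survive a reboot by itself:
ethtool -K eth0 tso off
# To make it persist, one option (an assumption, not what IT did) is to add the
# `ethtool -K eth0 tso off` line to a boot script such as /etc/rc.local.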
Updated (Assignee) • 14 years ago
Severity: critical → major
Comment 19 (Assignee) • 14 years ago
We've applied this change to hg (dm-hg02), people, and stage (surf).
Updated • 14 years ago
Severity: major → blocker
Comment 21 • 14 years ago
Ravi applied the NAT workaround at 00:45 PST, so we can sleep.
Comment 23 • 14 years ago
stage.m.o wasn't touched at 00:45, so the tegras are seeing issues.
Comment 24 • 14 years ago
Just applied a similar workaround to stage as with hg.
Comment 25 (Reporter) • 14 years ago
After some IRC discussion, it sounds like we should bring the remaining disabled mtv slaves (about 45 of them) back online.
It's bear's call, but the wisest course is probably to bring up some fraction of those 45 tomorrow (Tuesday) morning, watch for problems throughout the day, and, barring any failures, bring the rest up on Wednesday.
Comment 26 • 14 years ago
From discussion with dmoore, ravi, zandr, dustin:
1) IT does not (currently) believe the failures are load-related, so the workarounds in place should continue to work even as RelEng brings machines in 650castro back into production. RelEng is nervous about bringing these all back into production because:
* it takes a long time to bring them into production
* if they fail, they burn builds in production (and cause tree closures)
* it takes a long time to take them all out of production again
To get out of this deadlock, RelEng will bring up some slaves Wed, watch with IT, and if all is still OK, bring the rest back into production Thurs. (We're explicitly not doing anything tomorrow, Tues, because of a release embargo in progress.)
Comment 27 (Reporter) • 14 years ago
Oops, Tuesday's an embargo day, so we'll bring up a fraction on Wednesday and another fraction on Thursday.
Updated (Reporter) • 14 years ago
Whiteboard: [slaveduty]
Comment 28 • 14 years ago
This shouldn't be blocking anything. We have had a work-around in place since Friday. We're in a holding pattern until slaves can be brought back to test and verify.
Severity: blocker → major
Comment 29 (Reporter) • 14 years ago
Bear brought 50% of the remainder up in the last hour or so. Now we wait to see what happens.
Comment 30 (Reporter) • 14 years ago
Shortly after bringing up that 50%, we saw more failures (bug 638309). It may be unrelated, but I would *much* appreciate it if the two of you could take a look at what we're seeing there. At this point, we're planning some significant changes to our architecture to work around the general area where things are failing (the Mountain View firewalls), but we're doing so without full knowledge of the problem(s), so our plans are probably not optimal. We need help desperately.
Comment 31 • 14 years ago
For clarification, who is "you two"?
Comment 32 (Reporter) • 14 years ago
Sorry: ravi and dmoore, who were so helpful in chasing this bug down.
Comment 34 • 14 years ago
During triage: bug#639630 is tracking bringing the remaining ix machines in 650castro online in *staging* to recreate the load issues and see whether that helps IT debug.
See Also: → 639630
Comment 35 • 14 years ago
(In reply to comment #28)
> This shouldn't be blocking anything. We have had a work-around in place since
> Friday. We're in a holding pattern until slaves can be brought back to test
> and verify.
The remaining ix machines in 650castro have been running in production since Monday, and so far no problems. Looks like the workarounds are doing the trick.
From email+IRC with mrz and ravi: instead of spending more time debugging this, let's close this bug and just get these ix machines moved to real homes. The curious can follow along in bug#642305 and bug#636743.
Updated • 10 years ago
Product: mozilla.org → mozilla.org Graveyard