Closed Bug 661828 Opened 14 years ago Closed 13 years ago

HG clone via "ssh hg.m.o ..." times out after 1h

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
Linux
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: nmeyerhans)

References

Details

Releng preproduction system can't clone "releases/mozilla-beta" within 1h.

The command is:

ssh -l prepr-ffxbld '-oIdentityFile=~cltbld/.ssh/ffxbld_dsa' hg.mozilla.org clone mozilla-beta releases/mozilla-beta

command timed out: 3600 seconds without output, attempting to kill

Should we bump the timeout, or is something wrong with hg.m.o?
cc: noahm for comment
There's nothing unusual going on with hg. In what datacenter is this machine located? Transferring tens of thousands of files adding up to more than 700 MB over the internet, which is basically what you're doing when you clone from hg, is going to see performance variations. I suspect that's what you're running into here. There are plans under way to add hg mirrors near the releng hosts. That should help with problems like this. Until then, it's probably reasonable to raise your timeout.
Hmmm, I thought that cloning using "ssh hg.m.o clone to from" (not "hg clone url") doesn't transfer any files across the Internet. Do I misunderstand how it works?
Nope, you're quite right. Apologies for that; I misread your command. I'm not entirely clear on which repositories you're trying to clone to where. The canonical releases/mozilla-beta repo lives in /repo/hg/mozilla/releases/mozilla-beta on hg.m.o. From your command, it looks like you're trying to clone with that as the destination. If so, what is the source? I'm trying to understand what IO paths are involved.
same as bug 660514 ?
A workaround could be to increase the timeout to 60 * 60 * 10 (or 12, maybe, to be safe). However, taking ~10 hours to clone a repo on hg.m.o to another repo on hg.m.o seems like an indication that something is very wrong.
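If it helps while debugging, a manual probe along these lines would show whether the clone ever finishes when given, say, 12 hours instead of 1. This is only a sketch: it assumes GNU coreutils timeout(1) is available on the host, reuses the key/login from comment 0, and uses a wall-clock cap as a rough stand-in for buildbot's "seconds without output" timer.

# Hypothetical manual probe, not the releng automation itself: cap the clone at
# 12 hours of wall-clock time and report how it ended.
time timeout 12h \
  ssh -l prepr-ffxbld '-oIdentityFile=~cltbld/.ssh/ffxbld_dsa' \
      hg.mozilla.org clone mozilla-beta releases/mozilla-beta
echo "exit status: $?"   # 124 means timeout(1) killed it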
(In reply to comment #5)
> same as bug 660514 ?

Yeah.

(In reply to comment #4)
> I'm not entirely clear on which repositories you're trying to clone to
> where. The canonical releases/mozilla-beta repo lives in
> /repo/hg/mozilla/releases/mozilla-beta on hg.m.o. From your command, it
> looks like you're trying to clone with that as the destination. If so, what
> is the source? I'm trying to understand what IO paths are involved.

"ssh hg.m.o clone" uses reverse logic, destination first then source :)

So source is "releases/mozilla-beta" (/repo/hg/mozilla/releases/mozilla-beta as you said) and the destination is "mozilla-beta" in prepr-ffxbld user repository (no idea about its absolute location).

BTW, in the past this type of clone used to take less than 30 mins. This is a new problem.
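To spell out the argument order for anyone skimming, a sketch based on the commands in this bug (not official documentation):

# Server-side clone over ssh: destination (under your user repo) comes first,
# then the source repo path as it appears on hg.mozilla.org:
ssh hg.mozilla.org clone mozilla-beta releases/mozilla-beta
#                        ^ dest       ^ src
# Ordinary client-side clone: source first, then the local destination:
hg clone https://hg.mozilla.org/releases/mozilla-beta mozilla-beta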
Anything new here? It really hurts (read: prevents) preproduction release automation and mobile release automation development. :/
12h20m and it's still trying to clone :/ Bumping the importance to critical.
Severity: normal → critical
Severity: critical → normal
Thanks for the update. I've just fired off the preproduction release process and will update the bug with the results.
Severity: normal → critical
Can't update with the results because of bug 662760 :/
(In reply to comment #13)
> Releng preproduction system can't clone "releases/mozilla-beta" within 1h.
>
> The command is:
>
> ssh -l prepr-ffxbld '-oIdentityFile=~cltbld/.ssh/ffxbld_dsa' hg.mozilla.org
> clone mozilla-beta releases/mozilla-beta
>
> command timed out: 3600 seconds without output, attempting to kill
>
> Should we bump the timeout, or is something wrong with hg.m.o?

(In reply to comment #7)
> (In reply to comment #5)
> > same as bug 660514 ?
>
> Yeah.
>
> (In reply to comment #4)
> > I'm not entirely clear on which repositories you're trying to clone to
> > where. The canonical releases/mozilla-beta repo lives in
> > /repo/hg/mozilla/releases/mozilla-beta on hg.m.o. From your command, it
> > looks like you're trying to clone with that as the destination. If so, what
> > is the source? I'm trying to understand what IO paths are involved.
>
> "ssh hg.m.o clone" uses reverse logic, destination first then source :)
>
> So source is "releases/mozilla-beta" (/repo/hg/mozilla/releases/mozilla-beta
> as you said) and the destination is "mozilla-beta" in prepr-ffxbld user
> repository (no idea about its absolute location).
>
> BTW, in the past this type of clone used to take less than 30 mins. This is
> a new problem.

Rail, any idea when this type of clone was <30mins? That might help narrow down the regression window.
https://bugzilla.mozilla.org/show_bug.cgi?id=660514#c4 "Rail says this first started 3-4 weeks ago, and this doesn't seem to have resolved by itself over that time. Hoping that regression window is helpful."
(In reply to comment #13)
> Rail, any idea when this type of clone was <30mins? That might help narrow
> down the regression window.

It used to take < 30mins 3-4 weeks ago.

BTW, I have just cloned build/tools to users/prepr-ffxbld/tools repo and it took ~1.5h. Probably some of the destination directories are in some weird state?
(In reply to comment #15)
> BTW, I have just cloned build/tools to users/prepr-ffxbld/tools repo and it
> took ~1.5h. Probably some of the destination directories are in some weird
> state?

Yeah, that's really weird. It should take just a few seconds to clone tools:

khan:~$ time ssh -l nmeyerhans@mozilla.com hg.mozilla.org clone tools build/tools
Please wait.
Cloning /build/tools to /users/nmeyerhans_mozilla.com/tools
Clone complete.
Fixing permissions, don't interrupt.
ssh -l nmeyerhans@mozilla.com hg.mozilla.org clone tools build/tools  0.02s user 0.03s system 0% cpu 1:10.64 total

I'm not sure what's wrong at the moment...
deathduck:~/src/bear/buildbotcustom [11:39:43]
527$ ssh hg.mozilla.org clone mozilla-beta releases/mozilla-beta
Please wait.
Cloning /releases/mozilla-beta to /users/asasaki_mozilla.com/mozilla-beta

I didn't have an existing mozilla-beta repo, and this is still running at 15:43 (>4 hours later). I wonder if coming from the build network is causing any issues here.
deathduck:~/src/bear/buildbotcustom [11:39:43]
527$ ssh hg.mozilla.org clone mozilla-beta releases/mozilla-beta
Please wait.
Cloning /releases/mozilla-beta to /users/asasaki_mozilla.com/mozilla-beta
Read from remote host hg.mozilla.org: Operation timed out
Oops, missed the next prompt for timestamp:

Read from remote host hg.mozilla.org: Operation timed out
deathduck:~/src/bear/buildbotcustom [19:54:00]
528$

8:15:43 running time before timeout.
~10h and still cloning mozilla-beta :/
Wow. :(

Backups are currently disabled for the filesystem, so this problem isn't caused by contention with them. NFS read and write ops/s are not particularly high (typically <100), but getattr call volume is a fair bit higher. They average around 1000/s over the past 24 hours, but the volume is pretty constant. We also don't see any change in volume in the 3-4 week timeframe when this problem is reported to have begun.

I've got a clone running on the server right now, which I started at 10:10 this morning. Now, 40 minutes later, that process has spent 30 minutes on the CPU. From top:

22999 1547 23 0 118m 110m 3252 R 99.4 2.8 30:45.57 hg

This suggests to me that hg is very much CPU bound, and that maybe it's doing something inefficiently. It might be a good idea to get somebody familiar with hg internals to provide some insight.
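One way to double-check the CPU-bound theory on the server would be something like the following sketch; the destination path is illustrative, and pidstat assumes the sysstat package is installed.

# Time a clone with GNU time's verbose stats (user/sys CPU vs. elapsed wall time):
/usr/bin/time -v hg clone -U /repo/hg/mozilla/releases/mozilla-beta /tmp/mozilla-beta-cpu-test
# In another shell, sample the clone's CPU usage every 5 seconds
# (state R = running on CPU, D = uninterruptible sleep, usually NFS/disk waits):
pidstat -u -p "$(pgrep -n -f 'clone -U /repo/hg/mozilla/releases/mozilla-beta')" 5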
I forget: were we routing the build network to the external hg.m.o interface to speed things up, or was that something else? If it was hg.m.o, did that route get changed?
(In reply to comment #22)
> I forget: were we routing the build network to the external hg.m.o interface
> to speed things up, or was that something else?
>
> If it was hg.m.o, did that route get changed?

We're in the process of deploying hg mirrors close to the build network in scl1, which will speed up some operations considerably. There will be a fast network path to the mirror, and the mirror will be dedicated to the build network and not shared with the global internet.

However, this performance issue appears to be entirely within the hg infrastructure. Very little network traffic is flowing between your client and hg.
cc'ing djc, a Mercurial developer
What version of hg are we talking about? Can we run it with --debug to see where it's getting stuck? Is this cloning across a network file system boundary?
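In case it helps, a server-side run with Mercurial's standard debug/timing flags could look like this sketch; --debug and --time are global hg options, and the destination path is just an example.

hg --debug --time clone -U \
    /repo/hg/mozilla/releases/mozilla-beta \
    /tmp/mozilla-beta-debug-clone 2>&1 | tee /tmp/hg-clone-debug.log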
(In reply to comment #20)
> ~10h and still cloning mozilla-beta :/

Finally got it cloned after ~24h.
I just conducted a couple of experiments on the server: one where I cloned mozilla-beta into a directory on the local disk, and one where I cloned into NFS, similar to the operation you're trying to perform. The slower of the two clones took only 7m20s. System activity on the hg server and on the backend NetApp filer appears to be at roughly average levels. This suggests that IO can probably be ruled out as the primary source of this problem. I'm not sure why previous clones, when run via ssh, spent so much time on the CPU...
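For the record, the comparison was essentially of this shape (a sketch: the NFS destination path is illustrative, since the exact location of the user repos isn't spelled out in this bug, and these test clones were run without -U, per the next comment):

SRC=/repo/hg/mozilla/releases/mozilla-beta
time hg clone "$SRC" /tmp/mozilla-beta-localdisk                 # local-disk destination
time hg clone "$SRC" /mnt/netapp/users/test/mozilla-beta-nfs     # NFS-backed destination (path illustrative)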
(In reply to comment #26)
> What version of hg are we talking about? Can we run it with --debug to see
> where it's getting stuck? Is this cloning across a network file system
> boundary?

hg.mozilla.org runs version 1.5.4.

Repositories typically all live in an NFS filesystem. Comment 28 shows the results of some experimenting I've done on that front. The obvious differences between the experiments I did there and the clones that are experiencing the problems are:
1. My clones were running as root, not an LDAP user.
2. Normal users use -U, I did not.

I'll look into getting hg running with --debug. We use a wrapper script for most access. It'd be easy to change globally, but I'm not sure I want to do that. I'd rather customize the hg call for a single user...
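A per-user tweak in the wrapper could be as simple as the fragment below. The real wrapper lives in a private IT repository, so every name here (the variables and the log path) is hypothetical.

# Hypothetical fragment of the server-side wrapper: enable debug/timing output
# only for the user being investigated, and keep that output server-side.
EXTRA_HG_OPTS=""
if [ "$USER" = "prepr-ffxbld" ]; then
    EXTRA_HG_OPTS="--debug --time"
fi
# hg global options must precede the subcommand:
hg $EXTRA_HG_OPTS clone -U "$SOURCE_REPO" "$DEST_REPO" 2>> "/var/log/hg-wrapper/${USER}.log"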
Really? ISTR we were up to 1.7.x something. Perhaps we should look into upgrading soon; but holding off until after 1.9 might be nice. --debug info would probably be helpful to have for this sort of thing.
bug 551015 was the last update. I think there's a bug open on updating to a newer version, but I can't find it at the moment. We have a hg staging server, so it's definitely something we could do.
This is hurting my work a lot. I am trying to get prepared to do mobile staging releases for when aki and lsblakk are gone for the holidays (they are the only ones who have done them). I would rather be ready ahead of time for when the real releases happen.

Did we determine what changed 3-4 weeks ago? What can we do to get this back to normal?

Pasting from the other bug as it has value:

(In reply to comment #5)
> We've found a likely culprit: disk contention due to filesystem backups. It
> seems that backups weren't being made roughly 4-5 weeks ago due to a
> hardware failure affecting the backup host. The hardware was repaired
> roughly 3 to 4 weeks ago. Backups are apparently scheduled to start at 1 AM
> Pacific time and have recently been taking >7 hours to complete.
>
> We've cancelled the currently running backup job, which should free up IO
> capacity to let any current hg operations complete in a reasonable amount of
> time. We've got to revisit how we back up these filesystems, though. I
> suspect that if we can find a time window when releng isn't making heavy use
> of hg, we can at least reschedule the backups to run during that time.
Blocks: 664994
See Also: → 662760
No longer blocks: 664994
See Also: → 664994
Since there was some expressed hope that today's pulse fixes would have fixed this problem as well, I can report that the following clone:

dustin@Lorentz ~ $ time ssh hg.mozilla.org clone mozilla-mobile-5.0 releases/mozilla-mobile-5.0
Please wait.
Cloning /releases/mozilla-mobile-5.0 to /users/dmitchell_mozilla.com/mozilla-mobile-5.0

has taken 4.25 hours so far, and hasn't completed yet.
Assignee: server-ops-releng → nmeyerhans
Things are looking much better now:

---
dustin@Lorentz ~ $ date && time ssh hg.mozilla.org clone mc-tmp2 mozilla-central
Tue Jun 21 12:10:59 CDT 2011
Please wait.
Cloning /mozilla-central to /users/dmitchell_mozilla.com/mc-tmp2
Breaking after 8 lines
Clone complete.
Fixing permissions, don't interrupt.
Breaking after 0 lines
Breaking after 0 lines
Breaking after 0 lines
Breaking after 0 lines
Breaking after 0 lines
Breaking after 0 lines

real    33m41.285s
user    0m0.011s
sys     0m0.006s

--and--

dustin@Lorentz ~ $ date && time ssh hg.mozilla.org clone rmm50-tmp1 releases/mozilla-mobile-5.0
Tue Jun 21 12:49:11 CDT 2011
Please wait.
Cloning /releases/mozilla-mobile-5.0 to /users/dmitchell_mozilla.com/rmm50-tmp1
Breaking after 8 lines
Clone complete.
Fixing permissions, don't interrupt.
Breaking after 0 lines
Breaking after 0 lines
Breaking after 0 lines
Breaking after 0 lines
Breaking after 0 lines
Breaking after 0 lines

real    29m2.106s
user    0m0.011s
sys     0m0.008s
This is looking much better. Cloning under 20 mins.

Do we know what has changed? Can we assume it was unrelated to the pulse problems in the past few days? What can we do to prevent this from degrading again?

I see some debug code which I assume is from the hooks; where is that code living so we can understand this process? Once we run the cloning command, on which host does this happen? Do we have monitoring of the resources? I would like to understand the process so we can work out whether there is anything we can do to keep things under human control rather than leaving them to the machines.

BTW what do the "breaking lines" mean? It's new to me.

[cltbld@staging-master mozilla-mobile-5.0]$ time ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org edit mozilla-mobile-beta delete YES;
Breaking after 0 lines

real    4m7.526s
user    0m0.012s
sys     0m0.017s

[cltbld@staging-master mozilla-mobile-5.0]$ time ssh -l stage-ffxbld -i ~cltbld/.ssh/ffxbld_dsa hg.mozilla.org clone mozilla-mobile-beta releases/mozilla-mobile-5.0
Please wait.
Cloning /releases/mozilla-mobile-5.0 to /users/stage-ffxbld/mozilla-mobile-beta
Breaking after 8 lines
Clone complete.
Fixing permissions, don't interrupt.
Breaking after 0 lines
Breaking after 0 lines
Breaking after 0 lines
Breaking after 0 lines
Breaking after 0 lines
Breaking after 0 lines

real    19m46.047s
user    0m0.012s
sys     0m0.025s
(In reply to comment #35)
> This is looking much better. Cloning under 20 mins.
>
> Do we know what has changed? Can we assume it was unrelated to the pulse
> problems in the past few days?

It's not related to pulse. The pulse hook was only activated on 5/26, and hg was performing poorly long before then. Pulse was de-activated yesterday. Nothing else has changed, though. So we still have no better idea about why performance is better now than it has been recently, and no way of saying when it will degrade again.

I sort of wonder if today's firefox release is leading people to do less work that touches hg, and that's why it's faster. None of the metrics we're currently graphing indicate a significantly reduced activity level, but it's certainly possible that we're simply not graphing the right data.

> BTW what do the "breaking lines" mean? It's new to me.

Oops, that's only supposed to be shown to users for whom "verbose" mode is enabled. Server-side hg operations are performed within a wrapper script, which I've recently modified to allow us to collect some additional output from hg for performance diagnostics. The messages you're seeing are related to those modifications.
This is always an awkward position, when things magically fix themselves. We've discussed a few actions now:

- set up a nightly clone operation and time it, for trending and maybe even alerting. It sounds like this could run nightly on the hg box itself, rather than requiring ssh keys, since the blocking operation is entirely local (see the sketch below)
- testing the hg hook with a user repo to see if it affects the timing

But unless (until?) this recurs, the heat is off, at least!
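A minimal version of that nightly timing check could be cron'd on the hg box itself, along these lines (a sketch; the log path, threshold, and alert address are placeholders):

#!/bin/bash
# Hypothetical nightly clone-timing probe, run locally on the hg server.
REPO=/repo/hg/mozilla/releases/mozilla-beta
DEST=$(mktemp -d /tmp/hg-clone-trend.XXXXXX)
START=$(date +%s)
hg clone -U "$REPO" "$DEST/mozilla-beta" >/dev/null 2>&1
STATUS=$?
ELAPSED=$(( $(date +%s) - START ))
echo "$(date -Is) status=$STATUS seconds=$ELAPSED" >> /var/log/hg-clone-trend.log
rm -rf "$DEST"
# Crude alert if a clone that normally takes minutes runs longer than 30 minutes:
if [ "$ELAPSED" -gt 1800 ]; then
    echo "nightly hg clone took ${ELAPSED}s (status $STATUS)" \
        | mail -s "slow hg clone on $(hostname)" releng@example.invalid
fi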
As Dustin mentions, the "heat is off".

Thanks Noah for your answers. Do you know the answer to any of the following questions?

(In reply to comment #35)
> What can we do to prevent this from degrading again?
> I see some debug code which I assume is from the hooks; where is that code
> living so we can understand this process?
> Once we run the cloning command, on which host does this happen? Do we have
> monitoring of the resources?

############

With regard to pushes/load, here is a short analysis. From checking pushes we can see the following:
* from 5/22 to 6/22 - 2632 pushes [1]
* from 4/22 to 5/22 - 2489 pushes [2]

I can see an upward trend starting on May 9th. After that day we have many days with more than 100 pushes a day. The day I complained (6/17) there were only 28 pushes (which is also when the tree was closed for bug 665077). The day before that there were only 98 pushes.

[1] https://build.mozilla.org/buildapi/reports/pushes?starttime=1306036800&endtime=1308715200&int_size=86400
[2] https://build.mozilla.org/buildapi/reports/pushes?starttime=1303444800&endtime=1306036800&int_size=86400
Severity: critical → normal
(In reply to comment #38)
> Thanks Noah for your answers. Do you know the answer to any of the following
> questions?
>
> (In reply to comment #35)
> > What can we do to prevent this from degrading again?

Figure out why it got bad in the first place and prevent that from happening again.

> > I see some debug code which I assume is from the hooks; where is that code
> > living so we can understand this process?

The debug output you saw was from the wrapper scripts that take care of executing hg with the proper arguments and environment. This code lives in a repository that is restricted to IT. A lot of it is really brittle code with hardcoded paths, IP addresses, and hostnames. When cloning a repository, the command used is simply "hg clone -U <src> <dest>". The wrapper scripts don't interfere with the normal execution of hg.

The hooks themselves, with the current exception of pulse, live at https://hg.mozilla.org/hgcustom
Pulse lives on github somewhere, although I've asked legneato to move it to hg.m.o with the rest of them.

> > Once we run the cloning command, on which host does this happen? Do we have
> > monitoring of the resources?

Clones are performed on dm-svn02. There's a schematic at http://people.mozilla.com/~cshields/hg_svn_flow.svg

We graph numerous performance metrics for the various resources in this cluster at https://ganglia.mozilla.org/ganglia/?m=load_five&r=month&s=descending&c=Hg&h=&sh=1&hc=3&z=medium and http://nm-dash01.nms.mozilla.org/cacti/graph_view.php?action=preview&host_id=39&graph_template_id=0&filter= (access to these graphs may be restricted to IT)

Note that the entire hg infra is getting a lot of attention right now. There are several potential bottlenecks and single points of failure in it. I'm hoping to change that...
Severity: normal → critical
This is a lot of useful information, especially the schematic and the link to ganglia (which I can access).

We know that pulse was active from 5/26 to 6/20, and the CPU graph shows that since we disabled it the CPU/load spikes are less pronounced. Nevertheless, the CPU usage and load don't look dramatic (IIUC), even though I can see some "WAIT CPU" (orange in the CPU graph).

I see that tcp_established has grown since April. Is it worrisome to have a lot of tcp_closewait? IIUC that means the socket is waiting to be closed by the app.

I also see an NFS metrics section that shows up after April and does not exist before April. Does this mean that the bottleneck is on the R/O mounts on the NetApp?

Network usage seems to have been growing since April.

Do "hg pushes" (dev -> remote) and "ssh hg clone" (remote -> remote) both go through dm-svn02?

Forgive me if I am perhaps talking nonsense, misunderstanding everything, or just causing unnecessary noise. I am trying to understand all the concepts so that when you make your changes I have a chance to better understand where we are now and where we will be in the future.

Are there any actionable items left on this bug? Or should we just dupe it against the mercurial structural changes you mention?
The reason everything seems to jump as of mid-April is that everything was zero before then. We only started graphing data for these hosts in mid-April. :(

All hg write operations are performed on dm-svn02. dm-svn01 is capable of hosting the same operations, but it functions as a hot spare. This entire system was built before I started at Mozilla in March, and the person who previously ran it has left, but I believe that having only one host handle write operations is done to avoid having to keep ssh host keys in sync between multiple hosts.
still looking ok here? (still unknown what caused all this - was it the pulse hook?)
I'll run another check for speed, but AFAIK this is doing well now. Cause is still undetermined, but the timing doesn't work out to blame it on the pulse hook.
(In reply to comment #42)
> still looking ok here?
>
> (still unknown what caused all this - was it the pulse hook?)

Noah ruled out pulse as the cause in a previous comment, and the timing of when it was enabled confirms that.
My additional check shows clone times are still fine. Noah, are you monitoring clone times as in comment 37?
(In reply to comment #45)
> Noah, are you monitoring clone times as in comment 37?

We aren't currently. When I inherited hg.m.o, we weren't monitoring anything, so fixing that is an ongoing project...
Severity: critical → major
We haven't seen this same behavior in two months now as far as I know, and Noah is no longer with Mozilla. Unless there's more information to be gleaned here, I'd like to close this bug.
wfm
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations