Closed Bug 660514 Opened 13 years ago Closed 13 years ago

cloning hg.m.o/users/prepr-ffxbld/mozilla-2.0 taking over 4 hours, still not finished


(Graveyard :: Server Operations, task)

(Reporter: mozilla, Assigned: nmeyerhans)



I'm trying to run a preproduction release to make sure bug 557260 doesn't break Firefox releases when it lands Tuesday morning.

The first part of this is cloning a bunch of user repos in prepr-ffxbld.
This has been timing out (>3600 seconds) several times since Friday afternoon.

Today I decided to stop using the buildbot automation and do it manually:

[cltbld@moz2-linux-slave51 build]$ ssh -l prepr-ffxbld -oIdentityFile=~cltbld/.ssh/ffxbld_dsa clone mozilla-2.0 releases/mozilla-2.0
Please wait.  Cloning /releases/mozilla-2.0 to /users/prepr-ffxbld/mozilla-2.0

This has been running for 4-5 hours and still hasn't completed.
Determining and fixing the root cause would be ideal.
A short-term workaround would be cloning that user repo for me, at which point we can lower the priority on this bug.
Blocks: 557260
The clone finished overnight.

I went AFK around 3:30 am PDT, and I believe it wasn't done by that point.
I'd love to know why the clone took over 9 hours.
Severity: major → normal
Assignee: server-ops → nmeyerhans
Raising priority, as this (and bug 661828, which is probably a dup) is killing our ability to quickly port+test mobile releases, which we're trying to do by Friday.
Severity: normal → critical
Rail says this first started 3-4 weeks ago, and it doesn't seem to have resolved itself over that time.  Hoping that regression window is helpful.
We've found a likely culprit: disk contention due to filesystem backups.  It seems that backups weren't being made roughly 4-5 weeks ago due to a hardware failure affecting the backup host.  The hardware was repaired roughly 3-4 weeks ago.  Backups are apparently scheduled to start at 1 AM Pacific time and have recently been taking more than 7 hours to complete.

We've cancelled the currently running backup job, which should free up I/O capacity and let any in-progress hg operations complete in a reasonable amount of time.  We still need to revisit how we back up these filesystems, though.  If we can find a time window when releng isn't making heavy use of hg, we can at least reschedule the backups to run then.
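As a rough illustration of that rescheduling idea: the 1 AM Pacific start time comes from the comment above, but the script path and the proposed quieter window below are hypothetical, not the real backup configuration.

```
# Hypothetical crontab entry on the backup host (sketch only).
# The current job starts at 1 AM Pacific and overlaps releng's heavy
# hg usage; shifting it to an assumed quieter window, e.g. 10 PM,
# might avoid the disk contention described above.
#
# old: 0 1 * * * /usr/local/sbin/backup-hg-filesystems.sh
0 22 * * * /usr/local/sbin/backup-hg-filesystems.sh
```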
Duping forward, since all the action is on bug 661828.
Closed: 13 years ago
Resolution: --- → DUPLICATE
Product: → Graveyard