Closed Bug 601025 Opened 14 years ago Closed 13 years ago

improve sync time for release builds between stage.m.o and pvt-mirror1/2/n

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
All
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: zandr)

References

Details

(Whiteboard: [q1] [tracking])

During the FF3.6.9 release, we noticed that it took a long time to get the release builds visible on pvt-mirror1/2/n. As best as I can tell, the timeline was:

13:10:wed: RelEng does "push to mirrors" (aka on stage.m.o, run rsync from 3.6.9-candidates to releases/3.6.9 directories)
14:36:wed: pvt-mirror1 has release bits; mirroring to external sites starts.


pvt-mirror1 is the machine which all external mirror nodes sync from, so this delay caused delays in mirror absorption during a release.

From a quick postmortem with mrz, it might be possible to speed this up by more carefully targeting exactly what to rsync over to pvt-mirror1. Currently everything under "/" is being sync'd over. One proposal was to have a different (additional?) sync running which only brought over files under "firefox/releases". This smaller set of files should sync more quickly, which means having releases visible on pvt-mirror1 faster.
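A rough sketch of what the narrower sync proposed above could look like, written as a small Python helper that only builds the rsync argument list. The source root, destination, and the function name build_rsync_command are hypothetical placeholders, not the real paths or tooling on stage.m.o or pvt-mirror1; the only rsync option used is -a (archive).

```python
import shlex


def build_rsync_command(src_root, dest_root, subtree=None):
    """Build an rsync argument list, optionally limited to one subtree.

    src_root and dest_root are hypothetical placeholders; the real
    locations on stage.m.o and pvt-mirror1 are not given in this bug.
    """
    cmd = ["rsync", "-a"]  # -a: archive mode (recursive, keeps times/perms)
    src = src_root.rstrip("/")
    dest = dest_root.rstrip("/")
    if subtree is None:
        # Whole-tree sync, as described for the current setup.
        cmd += [src + "/", dest + "/"]
    else:
        # Narrow sync: only the named subtree, e.g. "firefox/releases".
        # The parent directory is assumed to exist on the destination.
        sub = subtree.strip("/")
        cmd += [f"{src}/{sub}/", f"{dest}/{sub}/"]
    return cmd


if __name__ == "__main__":
    narrow = build_rsync_command(
        "/pub/mozilla.org", "pvt-mirror1:/pub/mozilla.org", "firefox/releases"
    )
    print(shlex.join(narrow))
```

The helper only prints the command; whether the narrow sync would run in addition to, or instead of, the full sync is left open, as in the proposal.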
Summary: improve sync time between stage.m.o and pvt-mirror1 → improve sync time for release builds between stage.m.o and pvt-mirror1/2/n
Another idea that came up while I was chatting with justdave about this earlier:
He said that the slowest part of the span from "start of rsync from candidates -> release" to "mirrors getting files" was the rsync that RelEng runs, because it copies from one remote NFS partition to another. We came to the conclusion that this could be sped up if the rsync were run on the ftp.mozilla.org machine rather than stage.mozilla.org, though I don't recall exactly why. This is a bit tricky because (rightly) nobody from RelEng has access to this machine, so it would have to be done by IT or triggered through a system that doesn't currently exist.
Assignee: server-ops → justdave
Yeah, what Ben said.  Here's the graphic overview:

https://people.mozilla.com/~justdave/MirrorNetwork.pdf

The files being copied are on the EqualLogic SAN box.  When running the rsync from surf, the files go through dm-ftp01's network interfaces 4 times: once from eql to ftp, once from ftp to stage where the copy is being run, then from stage back to ftp, and then back to the eql.  Running the rsync directly on dm-ftp01 would halve the bandwidth used by the rsync process, as well as eliminating NFS from the equation (since the eql SAN is iSCSI at that point).
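To make the "4 times" versus "half" accounting concrete, here is a back-of-the-envelope sketch. The traversal counts come from the description above (4 when the copy runs on stage, 2 when it runs on dm-ftp01 itself); the function name and the 10 GiB payload are illustrative assumptions only.

```python
# Number of times each copied byte crosses dm-ftp01's network interfaces,
# depending on where the rsync runs (counts taken from the comment above).
TRAVERSALS = {
    "stage": 4,     # eql -> ftp, ftp -> stage, stage -> ftp, ftp -> eql
    "dm-ftp01": 2,  # eql -> ftp, ftp -> eql
}


def bytes_through_ftp_host(payload_bytes, run_on):
    """Total bytes crossing dm-ftp01's interfaces for one copy."""
    return payload_bytes * TRAVERSALS[run_on]


if __name__ == "__main__":
    payload = 10 * 1024**3  # hypothetical 10 GiB release
    on_stage = bytes_through_ftp_host(payload, "stage")
    on_ftp = bytes_through_ftp_host(payload, "dm-ftp01")
    print(on_stage // 1024**3, "GiB vs", on_ftp // 1024**3, "GiB")
```

This only counts interface traffic; it does not model the separate gain from taking NFS out of the path.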
Could we rsync ahead of time to pub-test, and from there symlink from pub to pub-test when we get the go from drivers?
The idea is to have the wanted bits on stage-rsync.mozilla.org ahead of time and only require a symlink change. This would cut out all the network transfers.

Would this be a viable option?
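For illustration, the symlink flip suggested here could be sketched as below. This is only a sketch under assumptions: the directory layout (pub-test/3.6.9, pub/3.6.9) and the helper name publish_by_symlink are hypothetical, and it assumes a POSIX filesystem where renaming over an existing symlink is atomic.

```python
import os
import tempfile


def publish_by_symlink(staged_dir, public_link):
    """Point public_link at staged_dir without copying any file data."""
    tmp_link = public_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted run
    os.symlink(staged_dir, tmp_link)
    # Renaming the temporary link over the public name replaces it in a
    # single step, so readers never see a missing path.
    os.replace(tmp_link, public_link)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    staged = os.path.join(root, "pub-test", "3.6.9")  # synced ahead of time
    os.makedirs(staged)
    os.makedirs(os.path.join(root, "pub"))
    link = os.path.join(root, "pub", "3.6.9")
    publish_by_symlink(staged, link)  # the "go" from drivers
    print(os.readlink(link))
```

Whether downstream mirrors follow or copy such a symlink depends on their rsync options, which this bug does not specify.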
justdave: 

There are a few options here so far - which way do you want to proceed?
Whiteboard: [q4] [tracking]
Bug 614786 will likely help with this.
Assignee: justdave → zandr
Whiteboard: [q4] [tracking] → [q1] [tracking]
Component: Server Operations → Server Operations: RelEng
Depends on: 614786
QA Contact: mrz → zandr
(In reply to comment #5)
> Bug 614786 will likely help with this.

Did it? OK to close?
It took 20 minutes to copy 10G into firefox/releases/5.0b2 on May 20, and 17.5 minutes to copy 11.8G into firefox/releases/4.0rc2 on Mar 18. That's about 8 and 12 MB/s respectively, on a path which is netapp-b -> surf -> netapp-b. While not awesome, it's better than when this bug was first filed.
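As a quick check, the rates can be recomputed from the sizes and durations quoted in this comment. The sketch assumes "G" means GiB (1024 MiB); with decimal gigabytes the results are slightly lower but round to similar values. The function name is illustrative.

```python
def throughput_mib_per_s(size_gib, minutes):
    """Average copy rate in MiB/s for size_gib GiB copied in `minutes`."""
    return size_gib * 1024 / (minutes * 60)


# 5.0b2: 10G in 20 minutes
print(round(throughput_mib_per_s(10, 20), 1))      # 8.5
# 4.0rc2: 11.8G in 17.5 minutes
print(round(throughput_mib_per_s(11.8, 17.5), 1))  # 11.5
```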

I can't find any data for how long it took to get to pv-mirror01 due to assorted automation failures/edge cases for 4.0rc2, 4.0.1 and 5.0b2, but my experience is that it's not dissimilar.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations