Closed Bug 659436 Opened 13 years ago Closed 13 years ago

migrate build vms to tintri storage

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: phong, Assigned: phong)

Details

Attachments

(2 files, 3 obsolete files)

Can you get me a list of win32 and linux vms to test out the performance of the new ESX storage (TinTri)?

I just need a list of about 5 to 10 VMs of each flavour.  Once we have them migrated over, we can test how long it takes to build.
Here's our current set of preproduction systems.  Let's migrate this batch and make sure things don't fall apart.  Then we can do some production systems?

moz2-linux64-slave07
win32-slave04
win32-slave60
moz2-linux-slave51
moz2-linux-slave03
moz2-linux64-slave10
win32-slave10
win32-slave21
Assignee: server-ops → phong
These are all migrated to the TinTri storage.  Can you run some test builds and see if they are any faster?
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
I don't think we have "before" measurements against which to compare the "after" results, sadly.

From comment #0 it sounds like you want more hosts, so I'm re-opening this.  Also, we have more data about production hosts, so let's get some "before" numbers for builds on production VMs, and then move those over to TinTri and see if we see any difference.

Relengers: thoughts on which VMs to choose to move, or anything special I should do to measure build times?  I'm thinking: take the mean build time for all of the jobs run on VMs over the last, say, week (grouped by silo), drawn from statusdb.  Then we can make a similar measurement of TinTri VMs in another week.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #3)
> I'm thinking: take the mean build time
> for all of the jobs run on VMs over the last, say, week (grouped by silo),
> drawn from statusdb.  Then we can make a similar measurement of TinTri VMs
> in another week.

Sounds reasonable.

Anyone else?
jlebar is also going to get some build times on moz2-linux-slave17 before and after the migration.
Clobber PGO build, default mozconfig, all ccache hits: 205m (before migration).
Do you have some numbers on the win32 vms as well?
The only reason I'm getting Linux numbers is because I have access to a Linux VM for the purposes of bug 653961.  I don't have access to a Windows VM.

Can't you get this information from the build logs?
(Actually, that's very few ccache hits; ccache apparently disables itself when it sees that you're doing anything with PGO, even during the first pass.)
I'll see if I can get some win32 times before we do the migration.

Justin: Thanks!  Should I migrate your VM to the new storage now?
Assignee: phong → dustin
Status: REOPENED → ASSIGNED
This db is really slow, so I did the analysis with some temporary tables taking the place of materialized views:

create temporary table slaves_of_interest as select id from slaves where name like '%win32%';
 (116 slaves)

create temporary table builders_of_interest as  select id from builders where (name like '%win32' or name like '%win32-debug') and (name like 'mozilla-central%' or name like 'try-%' or name like 'mozilla-aurora%');
 (28 builders)

create temporary table builds_of_interest as select builds.id from builds join builders_of_interest on builders_of_interest.id = builder_id join slaves_of_interest on slaves_of_interest.id = slave_id and starttime between '2011-05-10' and '2011-05-17';
 (118 builds)

create temporary table steps_of_interest as select steps.id from steps join builds_of_interest on(build_id=builds_of_interest.id) where name='compile' and description not like '%failed%' and endtime is not NULL;
 (98 steps)

create temporary table results as select steps.id as step_id, builder_id, slave_id, timediff(steps.endtime,steps.starttime) as dur from steps join steps_of_interest using(id) join builds on builds.id = build_id;

I'll analyze the results with R and post the findings here.  When we repeat this, we'll just need to select a time period after the storage migration has taken place (and get a much smaller N).
Attached file results.csv
The data -- dur is in seconds.

I tried a 2-way ANOVA here, and it basically told me to take a hike - the huge time variance here is due neither to the slaves nor the builders.  So basically, we don't have any good numbers that can show an improvement.

Once we've migrated jlebar's system, re-running the build should give us a ballpark % improvement, but of course that will be with N=1.
Back to phong to migrate the rest of the VMs.  Ping in #build as you're doing so, in case the migration causes any problem (although the smoothness of the staging migration has me quite optimistic on that front).

Also, before migrating moz2-linux-slave17, please check with jlebar in #build.
Assignee: dustin → phong
Do you have a list of production VMs for me to move?
Why not do all of 'em?
I'm not sure we have enough space for all of them, and I also want to confirm that there's a performance gain before we migrate everything.  This is still just a demo unit.
Oh, hm

Well, we don't have a great way to do performance measurements, since times are all over the place.  Let's migrate moz2-linux-slave17 (still checking with jlebar) and re-run his build to see if that shows a difference.
I'd like to finish this build I'm currently running, since there's a chance that changing the disk subsystem will affect the profiles generated by PGO.  I can ping back in this bug when it finishes.  ETA is 2hr.
Build is done.  Fire at will.

(Also, this build took 3h:39m, fwiw.)
OK, let's see what happens when moz2-linux-slave17 moves over.
migration in progress.
This VM now lives on the TinTri storage.  I also took the liberty of upgrading VMware Tools.  Let's kick off some builds and see how it goes.
Build finished last night.  I think it's comparable to the build from comment 19.

Comment 19 time: 3:39
TinTri time:     3:20

So...a little faster, maybe.

That's not such a surprising result, considering that we're building with just one CPU but -j4.  You'd hope the process is CPU-bound.
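
One rough way to confirm whether the build really is CPU-bound rather than I/O-bound would be to sample vmstat on the Linux slave while a compile is running -- a minimal sketch, assuming the standard procps vmstat is present in the guest:

  # Sample system stats every 5 seconds for 5 minutes during the build.
  # A high 'wa' (iowait) column with otherwise idle CPU points at storage as
  # the bottleneck; a pegged 'us' (user) column means the compile is CPU-bound.
  vmstat 5 60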
zandr, you've been looking at performance quite a bit (albeit on Macs).  Do you have any additional suggestions for comparisons we should do?
I think catlee was looking at some of the timings here?
I'm responsible for customer engineering at Tintri. I thought I'd post some information here that we were able to glean from the June 7 autosupport report from the system storing some of your build slaves. I haven't attached any of the referenced documents -- feel free to email me if anyone on this bug would like a copy.

While we're happy that a single (mostly compute-bound) build completed ~9% faster, the real value, of course, is completing more builds (in parallel) in a given amount of time. To that end, we wanted to estimate how many similar builds we could sustain in parallel on a single Tintri VMstore appliance (assuming sufficient host resources, of course).

Here's the summary of our analysis:

1. The load against the storage remains fairly constant all day, with no big peaks or valleys. See hourlyperf0606.png. The "TRU" column stands for Tintri Resource Units: a coarse gauge (from 0 to 10,000) of how busy the array is each hour. As you can see, the system is never more than about 2% to 3% loaded -- there's plenty of headroom.

2. Unsurprisingly, most of the IO seen by the storage is coming from the build slaves (see vmstat0606.png). Note that no single VM consumes more than 1-2% of the TRU (the individual peaks don't always line up, of course, so the hourly totals and the per-VM maxima can't be directly compared).

3. The most data generated yesterday was by moz2-linux64-slave07, moz2-win32-slave60, and ganglia1.dmz.

4. The build in comment 23 was run on moz2-linux-slave17 on 6/1. On that day, its peak rates and total amount of data written was considerably less than any of the three VMs above (see vmstat0602.png).

5. Currently there is absolutely no pressure on flash -- everything fits comfortably in flash and 100% of IO is satisfied from flash (see statsummary0607.txt). Because of this, all IO is consistently seeing sub-millisecond latencies across the board, read and write.

6. Drilling down further and looking at one ten-minute interval where a fair amount of writing was going on (detailed-stats0607.txt), we see that while overall latencies are excellent, the 1.22 ms of write latency is mostly consumed in the "recv" state -- this means we are spending about half of the time waiting for the ESX host to send us more data. It could be that the host CPU was simply busy or there was a bit of network congestion at this time, but it's definitely possible to lower the write latency significantly (it should be possible to cut it in half).

7. The format of the histograms in the detailed stats is complicated to explain, but they clearly show that the majority of the IO is not aligned to a 4K boundary. You should be able to get another 10-20% performance boost by modifying your VM templates to use only 4K-aligned vdisk partitions (stay tuned -- we may make this trivial to accomplish in a future release). There may be other host-side optimizations that could help as well (as alluded to in comment 23, you may benefit from more than one virtual CPU).
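
For reference, here is a minimal sketch of checking whether existing partitions start on a 4K boundary from inside a Linux guest; it assumes sysfs exposes each partition's start sector and that the vdisk uses 512-byte logical sectors (Windows guests would need a different tool, e.g. wmic or diskpart):

  #!/bin/sh
  # Report each partition's starting byte offset and whether it is 4K-aligned.
  for start in /sys/class/block/*/start; do
      part=$(basename "$(dirname "$start")")
      offset=$(( $(cat "$start") * 512 ))   # assumes 512-byte logical sectors
      if [ $(( offset % 4096 )) -eq 0 ]; then
          echo "$part: offset $offset bytes (4K-aligned)"
      else
          echo "$part: offset $offset bytes (NOT 4K-aligned)"
      fi
  done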

Summary and conclusions:
-----------------------

Even taking into account that only 3 of the 8 build slaves appeared to do any work yesterday, the datapoints above allow us to conclude that a single Tintri VMstore appliance will run out of capacity before it runs out of performance in your environment. Your 8 build slaves are currently consuming ~600 GiB of capacity (of the ~8192 GiB available). 

We are confident that we could handle approximately 10X as many build slaves (80 slaves total) all running against the single appliance, while still providing plenty of headroom both in terms of capacity and performance. Even with 80 build slaves you would have approximately 2 TiB free and would only consume < 50% of the available performance resources.

Please let me know if this analysis was useful.
Here's some data from our existing build farm:

'hg_update' step for win32 nightly builds since jan 1, 2011
ix machines: min 504s, avg 1024s, stddev 409s (n=201)
VMs:         min 814s, avg 1771s, stddev 502s (n=10)
VMs on new storage: min 1150s, avg 1374s (n=3)

hg_update is pretty I/O intensive, but also depends on the network. Let's see how the 'compile' step fares:

ix machines: min 7778s, avg 9190s, stddev 1065s (n=192)
VMs:         min 19412s, avg 22596s, stddev 2724s (n=8)

waiting for compiles to finish to see how they do.
compile numbers from the VMs on new storage:
min 21926s, avg 23456s, stddev 1331s (n=3)

so compiles on the new-storage Windows VMs are a little slower than on our other VMs, and still quite a bit slower than on our dedicated hardware boxes.
From catlee, the script should repeat:

16:04 < catlee> hg clone mozilla-central
16:05 < catlee> put our mozconfig in .mozconfig
16:05 < catlee> make -f client.mk profilebuild
16:05 < catlee> make -f client.mk clean
 (and presumably deleting mozilla-central)

I'd like to also log timestamps for each of these, so we can correlate with other logging to see how the load changes for the different steps.
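
Something like the following is what I have in mind for the loop -- a rough sketch only, with the repo URL, mozconfig source, and log path as placeholders (the real script is attached below):

  #!/bin/sh
  # Sketch of the build loop with timestamped CSV markers (placeholder paths).
  LOG=buildtimes.log
  mark() { echo "$(date +%s),$1" >> "$LOG"; }

  while true; do
      mark start-loop
      mark start-rm
      rm -rf mozilla-central
      mark start-clone
      hg clone https://hg.mozilla.org/mozilla-central
      mark end-clone
      cp /path/to/our/mozconfig mozilla-central/.mozconfig   # placeholder path
      mark start-profiledbuild
      (cd mozilla-central && make -f client.mk profiledbuild)   # target name as used in the logged markers
      mark end-profiledbuild
      mark start-clean
      (cd mozilla-central && make -f client.mk clean)
      mark end-clean
      mark end-loop
  done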

Phong, let me know which systems I should run this on.
Attached file test.sh (obsolete) —
I'm running this script on win32-slave04.  It's still cloning, so no word on whether it works - I'll upload a new version if I need to tweak it.

It will output a CSV containing UNIX timestamps and symbols like start-clone or end-profilebuild, so we can both determine how long these operations take, and correlate them with other data from the VM system.

I can fetch that data on request (or anyone in #build, really).  I'm also happy to run this on other VMs, if you'd like, once I'm sure it works.
Attachment #540613 - Attachment mime type: application/x-sh → text/plain
Attached file test.sh (obsolete) —
Here's a better version.

I got "no rule to make target profilebuild", so I replaced profilebuild with just build.

This is running now.
Attachment #540613 - Attachment is obsolete: true
Ergh, and failed with "Cannot open include file: 'd3dx9.h'".  I forgot what the fix for that is - does it mean this VM needs to be re-OPSI'd?
bhearsum helped me out - indeed, OPSI had failed to install directx on this system.  Shocking!
OPSI continues to fail - I'll install this by hand tomorrow.
Installed by hand.  It still didn't work.
Attached file test.sh (obsolete) —
This is building now.  If I'm not online when it finishes, I'll pick it up tomorrow and get it running again.  I'd like to clone this VM to the new VMs that phong set up yesterday.
Attachment #540629 - Attachment is obsolete: true
And it failed again, with a different error:

LINK: fatal error LNK1266: error reading instrumentation file `crashinject.pgd`: File not found

I'll start it over ("clobber") and see what happens.
This is still building, but has been building for longer, so I'm hopeful.
OK, this script is working fine now, and win32-slave04 has been building steadily since my last comment.
I've started running the same script on win32-slave02 and win32-slave03.  I'll check again tomorrow to see if they're still going.
OK, win32-slave{02,03,04} are all running a build loop.  I think my work is done here for the moment.

Build times (with UNIX timestamps) are in e:\buildtimes.log on all three machines, if you want to correlate particular behavior with particular times.
So the last times I see in the log files are from last Thursday (the 23rd).  Here's the data:

win32-slave02:

1308803482,start-loop
1308803482,start-clone
1308804368,end-clone
1308804368,start-clone-cfg
1308804392,end-clone-cfg
1308804392,start-profiledbuild
1308820350,end-profiledbuild
1308820350,start-clean
1308820531,end-clean
1308820531,end-loop
1308820531,start-loop
1308820531,start-rm
1308820798,start-rm
1308820798,start-clone
1308821572,end-clone
1308821572,start-clone-cfg
1308821597,end-clone-cfg
1308821597,start-profiledbuild
1308837496,end-profiledbuild
1308837496,start-clean
1308837681,end-clean
1308837681,end-loop
1308837681,start-loop
1308837681,start-rm
1308868293,start-loop
1308868293,start-rm
1308868304,start-rm
1308868304,start-clone
1308869230,end-clone
1308869230,start-clone-cfg
1308869259,end-clone-cfg
1308869259,start-profiledbuild

win32-slave03:

1308803579,start-loop
1308803580,start-clone
1308804395,end-clone
1308804396,start-clone-cfg
1308804423,end-clone-cfg
1308804423,start-profiledbuild
1308818883,end-profiledbuild
1308818883,start-clean
1308819080,end-clean
1308819080,end-loop
1308819080,start-loop
1308819080,start-rm
1308819384,start-rm
1308819384,start-clone
1308820092,end-clone
1308820092,start-clone-cfg
1308820118,end-clone-cfg
1308820118,start-profiledbuild
1308835228,end-profiledbuild
1308835228,start-clean
1308835395,end-clean
1308835395,end-loop
1308835395,start-loop
1308835395,start-rm
1308866459,start-loop
1308866459,start-rm
1308866466,start-rm
1308866466,start-clone
1308867369,end-clone
1308867369,start-clone-cfg
1308867398,end-clone-cfg
1308867398,start-profiledbuild

win32-slave04:

1308686426,start-loop
1308686426,start-rm
1308686736,start-rm
1308686736,start-clone
1308687932,end-clone
1308687932,start-clone-cfg
1308687974,end-clone-cfg
1308687974,start-profiledbuild
1308708862,end-profiledbuild
1308708862,start-clean
1308709108,end-clean
1308709108,end-loop
1308709108,start-loop
1308709108,start-rm
1308709397,start-rm
1308709398,start-clone
1308710573,end-clone
1308710573,start-clone-cfg
1308710613,end-clone-cfg
1308710614,start-profiledbuild
1308729740,end-profiledbuild
1308729740,start-clean
1308729968,end-clean
1308729968,end-loop
1308729969,start-rm
1308730228,start-rm
1308730228,start-clone
1308731742,end-clone
1308731742,start-clone-cfg
1308731781,end-clone-cfg
1308731782,start-profiledbuild
1308751282,end-profiledbuild
1308751282,start-clean
1308751518,end-clean
1308751518,end-loop
1308751518,start-loop
1308751518,start-rm
1308751795,start-rm
1308751795,start-clone
1308752974,end-clone
1308752974,start-clone-cfg
1308753017,end-clone-cfg
1308753017,start-profiledbuild
1308772149,end-profiledbuild
1308772149,start-clean
1308772400,end-clean
1308772401,end-loop
1308772401,start-loop
1308772401,start-rm
1308772697,start-rm
1308772697,start-clone
1308774159,end-clone
1308774159,start-clone-cfg
1308774199,end-clone-cfg
1308774199,start-profiledbuild

(those are UNIX timestamps; the last - 1308774199 - is Wed Jun 22 13:23:19 2011)
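
(For reference, converting these by hand is a GNU date one-liner; the reported time depends on the local timezone:)

  date -d @1308774199
  # Wed Jun 22 13:23:19 PDT 2011 on a Pacific-time host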

What's weird is that all three of these builds are still running - make.exe is using lots of CPU, etc.  I'm not sure what to make of that - it suggests either a bug in test.sh (but why would it *stop* logging?) or a *really* long compile process -- which is certainly not impossible, since for most builds we time out and kill the build, then re-run it from scratch on real hardware.
I rebooted all three of these VMs, and I'm running test.sh again on them.

I also started this up on w32-ix-slave01, which has been taken out of the dev/pp pool temporarily.

Nick is looking into why these might have hung like this.
I don't have any great ideas, other than keeping the log from profiledbuild to try to figure out if it's doing anything or just spinning its wheels. If you're lucky there's some verbose logging mode for the OS storage driver, or vmware tools.
The logs are onscreen, and from day to day when I checked them they had not changed at all.
w32-ix-slave01:
 building strong - completed three builds overnight
 active logging
 CPU usage 25% (of four CPUs)
 Physical memory: 3.4GB (of 4GB) available
 Paging File: 400 MB used, graph is fairly flat

win32-slave02:
 no runs completed overnight
 has been running 'make' for ~10.5h
 nothing active in onscreen logging
 CPU usage 50% (one pegged, one idle)
 Physical memory: 1.6GB (of 2GB) available
 Paging File: 340 MB used, graph is flat

win32-slave03:
 no runs completed overnight
 has been running 'make' for ~10.5h
 nothing active in onscreen logging
 CPU usage 50% (one pegged, one idle)
 Physical memory: 3.4GB (of 4GB) available
 Paging File: 291 MB used, graph is flat

win32-slave04:
 has "Windows - Virtual Memory Minimum Too Low" popup
 one run completed overnight (21032s / 5.84h)
 has been running 'make' for ~4.7h
 nothing active in onscreen logging
 CPU usage 99% (pegged)
 Physical memory: 100MB (of 2GB) available
 Paging File: 2.18GB used, graph is flat

In consultation with some of the devs, here's what I've determined: we use 'make -j4' to do parallel builds on VMs, but not on physical hardware.  Parallel builds were turned off on hardware because make.exe has a concurrency bug and hangs when doing parallel builds (<rant> Really?  Does NOTHING work on Windows? </rant>).  We left -j4 on for VMs because it was faster than -j1, although I'm not sure that a broken-but-faster build was really the best choice.  Rex, this is a little embarrassing on our part, but I'll re-jigger the test.sh script to run with -j1 even on VMs, and see what happens.
Attached file test.sh
New version of test.sh, this time with an explicit -j1.  Running now on all four hosts.
Attachment #540798 - Attachment is obsolete: true
For anyone following along - our win32 VMs are single CPU, except for the test ones win32-slave02/03 which were made dual CPU. So the make bug gets tickled in the tests with -j4.
Here's the details from the runs so far:

w32-ix-slave01-buildtimes.log
number of builds: 62
build durations: [11514, 12120, 11849, 12319, 11232, 11815, 11866, 11913, 11712, 11960, 11704, 11733, 11562, 11604, 11808, 11609, 11743, 11918, 11281, 11604, 11339, 11433, 11235, 11403, 11535, 11825, 11718, 11813, 11749, 11655, 11321, 11838, 12137, 11864, 11857, 11431, 12107, 11572, 11507, 11329, 11896, 12019, 11736, 11701, 13041, 11439, 11739, 13322, 12166, 11645, 11780, 11253, 12242, 11571, 12146, 11996, 12827, 11844, 11780, 11694, 12300, 12451]
average duration: 11808

win32-slave02-buildtimes.log
number of builds: 33
build durations: [21631, 22327, 21972, 21186, 20658, 20478, 20204, 20957, 20607, 20325, 20363, 20611, 20557, 20449, 20427, 20608, 20948, 20618, 20397, 20770, 21129, 21816, 21128, 20688, 21114, 20804, 20730, 20955, 21658, 22126, 21630, 21396, 20903]
average duration: 20974

win32-slave03-buildtimes.log
number of builds: 1
build durations: [20880]
average duration: 20880

win32-slave04-buildtimes.log
number of builds: 33
build durations: [21032, 22415, 23306, 21206, 20555, 20885, 21260, 20576, 21332, 21129, 19887, 20072, 20906, 20228, 20918, 20618, 21536, 21507, 20838, 21619, 20577, 22357, 22932, 22290, 20840, 23179, 21942, 20823, 20791, 22077, 22393, 23228, 21443]
average duration: 21415
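
(For reference, durations like these can be pulled out of buildtimes.log by pairing markers -- a minimal awk sketch, here pairing the profiledbuild markers, though the numbers above may be derived from a different pair:)

  # Compute per-build profiledbuild durations and their average from the
  # "timestamp,marker" CSV log; the choice of marker pair is an assumption.
  awk -F, '
      $2 == "start-profiledbuild" { start = $1 }
      $2 == "end-profiledbuild" && start { d = $1 - start; print d; sum += d; n++; start = 0 }
      END { if (n) printf "number of builds: %d  average duration: %d\n", n, sum/n }
  ' buildtimes.log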

win32-slave03 crashed, apparently in basedir.exe (!!), so I expect it's not a useful datapoint here.  I'll restart it.  The other three are still building strong.

We'll need to wrap this up to a conclusion soon - we will be switching to VS2010, which doesn't fit on these relatively small VMs, so within the next week or so we need to decide whether to do a lot of work to set up a new, faster vmware image, or just stop building on Windows VMs at all.
These numbers are still pretty bad - apparently multiple CPUs and more RAM don't help.  Let's try isolating one of these test VMs (preferably 02 or 04) on its own ESX host to see how that works.

Phong, that's beyond my VMware-fu - can you do so?
ftr I'm halting w32-ix-slave01 right now to do some hardware upgrades.
In light of bug 673174, I'm discontinuing the work here.
I am going to close this out.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → INCOMPLETE
Product: mozilla.org → mozilla.org Graveyard