Closed Bug 386074 Opened 17 years ago Closed 17 years ago

Build cycle times are up across the board after moving to NetApp-backed storage

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: justin)

References

Details

(Whiteboard: Waiting on NetApp shelf)

We've seen increases -- sometimes quite large -- in build cycle times since the VMs have been migrated to ESX3 hosts backed by the NetApp. Things were meant to be getting _faster_. :/ 

Here are some sample cycle times:

Machine: before migration vs. after
patrocles-vm:  1 hr vs. 6 hr
fx-linux-tbox: 10 min vs. 30 min
fx-win32-tbox: 30 min vs. 90 min

Is there anything in the new host configuration that would be slowing things down like this? Any tweakable params on the NetApps we can work with?
Blocks: 385972
This shouldn't have anything to do with the netapp as it's running at about 15% cpu load.  Preed said he thought this had to do with poor distribution on the ESX hosts - thoughts preed?
Severity: critical → major
It's not clear to me that the move was supposed to improve speed as much as it was to grow the storage and the ability to move VMs around.  

We are using software iscsi and, if that's the constraint, we could experiment with a host using a QLogic iSCSI HBA.  I was on the phone with vmware yesterday going over the settings and there wasn't anything to tweak on that end.  

In any event, fx-win32-tbox is moving to bm-vmware06.
Severity: major → critical
Severity: critical → major
The move is done - can you report back what the new cycle times are and verify that builds are working as expected?
I'm not sure what this tells us, but since I looked them up, here are the times for fx-win32-tbox to do the nightly clobber, and to do the no-checkin depend build closest to the nightly:

      clobber  depend
6/26     1:41     :57
6/25     1:42     :57
6/24     1:40     :56
6/23     1:39     :58
6/22     1:29     :46
6/21     1:22     :36
6/20     1:21     :37
6/19     1:20     :36

The question, though, is whether that means that last week's migrations added a flat 20 minutes to the build time, or that it increased clobbers by 25% and depends by almost 100% (see bug 381247 for an example of how changing storage could make a much bigger difference for depend builds).
For comparison, these are "no-change depend"/clobber times (in minutes). "Before" is June 18th, "After" is June 27th.

App-Branch           Before/After    Windows    Linux     Mac (control)

Firefox-Trunk        Before           35/80     11/44     31/55
                     After           100/?      25/68     31/55

Thunderbird-Trunk    Before           38/130    10/43     20/51
                     After            90/160    25/64     21/51
No longer blocks: 385972
We moved fx-win32-tbox onto an empty VM host today, and the cycle time is still crappy (100 minutes for a depend build, 150 minutes for a clobber build). Host loading doesn't seem to be a factor in this case.
I talked to kev who's worked with NetApps before:

[8:39pm] kev: you running iscsi or nfs?
[8:54pm] coop: using software iscsi right now, according to mrz
[8:56pm] kev: we had a problem with network controllers in a cluster config
[8:57pm] kev: basically one of them would synflood the network
[8:57pm] kev: ended up replacing the controller on three boxes
[8:57pm] kev: very annoying
[9:05pm] coop: sadly, i don't know enough about how it's setup to be very helpful in debugging the problem
[9:06pm] kev: yeah, it took a box that from all appearances was fine, and completely hosed performance

Not sure whether that's pertinent in this case.

I'm unsure as to next steps here. Is there any provision for going *back* to local storage, even if only for one host? How about changing the iscsi as mrz suggests in comment #2?
I talked to Kev - he'll give me more info on the netapp issue in the morning.

Comment #2 involves some money and downtime on an ESX host, so if we go that route, we should keep one free.

None of the ESX hosts have a lot of local storage (mostly 72GB), and moving back to local storage would eliminate VMotion.  For comparison testing I could manually move one.
Assignee: server-ops → mrz
Just because I can't resist ringing the patrocles bell one more time, cf's table for 1.8:

App-Branch        Before/After         Windows    Linux     Mac (control)

Firefox-1.8         Before             24/55      13/59     19/88
                     After             38/72      34/81     18/87

Thunderbird-1.8     Before            58/134       9/23     13/90
                     After           377/424      34/54     12/90

If you're looking for one to move where you can't possibly do any harm, patrocles with the 550% increase in depend build time seems like a good choice.
I'd be very surprised if this is a netapp issue for a few reasons:

1)  We tested local storage vs netapp iscsi for sequential, non-sequential, and random reads/writes, and all compared within 5% of local storage (a rough sequential sanity check is sketched at the end of this comment).
2)  We have many other apps that are as fast or faster on iscsi
3)  The netapp cpu is 20% or less
4)  Preed tested builds on the netapp before the migration and purchase and didn't mention any issue about huge slowdowns

If we were having synflood issues, we'd see these issues on many other apps.

I would look more to vmware - I'll work tomorrow with mrz/vmware/netapp to try to see what the issue is.
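For the record, a rough sequential sanity check of the kind described in 1) can be run from inside a guest like this (path and size are illustrative, a just-written file may be read back from cache, and random-I/O comparisons need a different tool):

# sequential write of 1 GB to the disk under test, then flush to disk
dd if=/dev/zero of=/builds/ddtest bs=1M count=1024 && sync

# sequential read of the same file (may be partly served from page cache)
dd if=/builds/ddtest of=/dev/null bs=1M

# clean up
rm /builds/ddtest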
As of now, we're moving fx-win32-tbox from the Netapp to local storage, so there will be ~1 hour of downtime.

We'll see if build times improve for this VM. We put this on a machine by itself and it didn't get any faster, so we're trying to isolate if it's iSCSI or not.
Just had another IRC convo with kev about the NetApp:

[11:59am] kev: so yeah, there's two data stores defined for all the VM disks
[12:00pm] kev: how many VM disks are on them?
[12:17pm] coop: i.e. how many VMs do we have, total? or individual disks within those VMs?
[12:17pm] kev: individual disks
[12:18pm] kev: mrz mentioned 32 machines
[12:18pm] kev: but that can still be high, especially if they're paging
[12:18pm] kev: the datastore is essentially on LUN
[12:18pm] kev: one, even
[12:18pm] coop: we have more disks than machines, rest assured
[12:18pm] coop: although i don't know precisely how many
[12:18pm] kev: so, if you have 32 virtual disks, all of those requests go through the one device on the filer
[12:19pm] coop: ugh
[12:19pm] kev: so you get i/o request queuing, and bad things happen
[12:19pm] coop: older VMs will only have one disk because they were cloned from original desktop machines
[12:19pm] coop: new VMs based on ref images could have up to 3 disks each
[12:20pm] kev: a re-allocation may be required 
[12:20pm] coop: let me grab an approximate number from the email i sent out a few days ago
[12:20pm] kev: apparently there are two datastores using all the disk
[12:20pm] kev: which will make things entertaining
[12:21pm] kev: the other thing is to get a good idea of how much swap the vms are using
[12:21pm] kev: because swap/paging file usage will just compound the problem
[12:23pm] coop: i'm estimating 40-45 disks used by active VMs
[12:24pm] kev: so 20 per datastore potentially. which isn't that high, but I'm guessing I/O utilization is relatively high when they're building
[12:24pm] coop: yeah, crazy amounts of file access, especially for clobber builds
[12:24pm] coop: which interestingly almost all occur at the same time every night 
[12:25pm] kev: that'll do it
[12:25pm] kev: other thing is to verify that all the network links are gbit and are set to autonegotiate
[12:26pm] kev: there's also some fun settings you have to use to make sure interfaces serving linux boxes are tweaked or bad things can also happen
[12:27pm] coop: k
[12:28pm] kev: I sent mrz the guide
[12:28pm] coop: k, i'm also gonna post most of this transcript in the bug for IT, if that's alright by you
[12:28pm] kev: http://www.netapp.com/news/techontap/3248.pdf
[12:28pm] kev: soitenly
[12:30pm] coop: when you say re-allocation, what are you referring to? consolidating disks on the newer VMs?
[12:32pm] kev: assuming they are actually set up in one VMFS Datastore, by reallocation I mean creating new stores and spreading the virtual disks across them
[12:32pm] coop: ah, k
I'm pretty sure I've said this before in the bug, but i/o ops on the heads are not the bottleneck here.  I've run far more intensive stats than the above, and the issue is not there.

Also, setting network interfaces to autoneg is in fact a *bad* thing, especially on the netapp.  You should force 1gb full duplex, and you need correct flow control settings - something autoneg usually doesn't get right.
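For the Linux boxes, a minimal sketch of what forcing 1gb full duplex plus explicit flow control looks like with ethtool (eth0 is a stand-in for whichever interface carries storage traffic; the filer and switch ends need matching settings applied with their own tools):

# disable autonegotiation and force gigabit full duplex (hypothetical interface name)
ethtool -s eth0 speed 1000 duplex full autoneg off

# set rx/tx flow control (pause frames) explicitly instead of leaving it to autoneg
ethtool -A eth0 autoneg off rx on tx on

# verify what actually took effect
ethtool eth0
ethtool -a eth0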
I agree with justin - at the end of the day, regardless of how many LUNs, traffic is going out the same interface to the same netapp to the same set of disks (which the netapp has virtualized a LUN out of).  

In comparison, our install is small so I'm confident that 32 VMs on two datastores (19/13) isn't over capacity.
Here's some more data. I just ran iostat on tb-linux-tbox during the 'cvs co' portion of its latest run. The high %util is troubling:

[cltbld@tb-linux-tbox ~]$ iostat -d -x 2 10 | grep -v sda
Linux 2.6.9-42.ELsmp (tb-linux-tbox.build.mozilla.org)  06/29/2007

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdb          0.35 377.64 61.12 25.57 3058.54 3227.32  1529.27  1613.66    72.51     5.32   61.37   8.44  73.14
sdb          0.00   0.00 41.29  0.00  334.33    0.00   167.16     0.00     8.10     0.98   23.47  23.75  98.06
sdb          0.00   0.00 53.96  0.00  431.68    0.00   215.84     0.00     8.00     0.98   18.35  18.17  98.07
sdb          0.00  12.63 60.10  1.01  480.81  109.09   240.40    54.55     9.65     1.00   16.02  16.31  99.70
sdb          0.00   0.00  9.45  0.00   75.62    0.00    37.81     0.00     8.00     1.00  107.32 105.32  99.55
sdb          0.00   5.97 51.74  1.00  413.93   55.72   206.97    27.86     8.91     1.02   18.83  18.73  98.76
sdb          0.00  38.69 62.81  8.54  502.51  377.89   251.26   188.94    12.34     1.02   14.70  13.97  99.70
sdb          0.00   0.00 73.76  0.00  594.06    0.00   297.03     0.00     8.05     0.98   13.32  13.28  97.92
sdb          0.00  19.00 58.00  1.50  464.00  164.00   232.00    82.00    10.55     1.04   17.31  16.71  99.45
sdb          0.00   0.00 67.34  0.00  538.69    0.00   269.35     0.00     8.00     1.00   13.86  14.81  99.75

Also posted here: http://pastebin.mozilla.org/111373
I should perhaps also note that the lowest I've seen %util go during other portions of the build is 58%.
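If it's useful to watch for saturation over a whole build, a quick way to flag the bad intervals (assuming sdb is still the NetApp-backed disk, as in the output above):

# stream extended stats every 2 seconds and print only samples where %util (last column) exceeds 90
iostat -d -x 2 | awk '/^sdb/ && $NF+0 > 90'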
Good info - how do those numbers compare to the build running on local disk?
I have fx-win32-tbox-vmware02 online at 10.2.71.254.

vmware02 is using a separate, untagged ethernet interface for iSCSI traffic.

Can someone start off the build process and time it on this machine?
(In reply to comment #19)

Build is running now, reporting to 
 http://tinderbox.mozilla.org/showbuilds.cgi?tree=MozillaExperimental
as 
 WINNT 5.2 fx-win32-tbox Dep VM testing

First build is a clobber, then depends like its sibling. Note that bsmedberg turned objdirs on today, but it hasn't impacted build time - still 1 hour for a clobber.

[build: config is to not release builds, symbols, or update info from this instance. Nor is it updating its configs from CVS - ie manual clobbers only]
Build times look better - I'll let it run till tomorrow. Build team to discuss at the build meeting and let me know if the times are more acceptable.
I wonder if the problem is with the tagging itself, or at least VMWare's implementation of 802.1q (we have none of these issues with RHEL).

bm-vmware02 has four physical interfaces, two of which are on the storage network (untagged, no 802.1q).  I want to move back to the other two interfaces but without any Vlan tagging.  If the numbers are on par then great, if they're longer than it appears separate physical interfaces is better.
Times look normal, no?  I'm still waiting on vmware but if I can move vms off a host for 15 minutes, I can change the storage interfaces.  When I did so on bm-vmware02, it needed a reboot before vmkping worked.
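For anyone repeating this on another host, a rough sketch of the ESX 3 service-console checks involved (the filer address is a placeholder):

# list vswitches/portgroups to see which VLAN tag, if any, the storage portgroup carries
esxcfg-vswitch -l

# list the vmkernel NICs used for iSCSI
esxcfg-vmknic -l

# confirm the vmkernel interface can reach the filer after the change
vmkping <filer-iscsi-ip>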
To summarize the data (fx-win32-tbox):
                                                     Depend        Clobber     

before migration (local storage, shared host)           35            140
 
after migration (netapp, shared host)                  100              ?

make devs happier (local storage, host to self)         16             60

clone, mrz tests round 1 (netapp, host to self)         42              ?  

clone, mrz tests round 2 (netapp, host to self)         39             83
In order to debug this further I need to move the original fx-win32-tbox off local disk and back to iSCSI, and re-verify that dropping the VLAN tag on the storage ports (both service console and vmkernel) makes the numbers more acceptable.

When can I do this?
I have some things back from netapp too - I'll talk with mrz and we'll formulate a plan for the tests/migrations we need to do.
Adding bug 378440 to this bug; it tracks file system corruption on cerberus-vm; cf also points out that using the newer SCSI driver may improve build times.
Depends on: 378440
For tracking - netapp says this could be a perf issue.  We'll need to migrate the LUNs to type netapp and we'll need the build team to move the guest partitions as outlined below...

http://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb24492
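For reference, the usual quick check for the guest-partition half of this can be done from inside a Linux guest (sdb as a stand-in for the iSCSI-backed disk; the KB article above is the authoritative procedure):

# show partition start sectors; the old DOS default of 63 is misaligned against the
# filer's 4 KB blocks, while starts divisible by 8 (64, 128, ...) are aligned
fdisk -lu /dev/sdb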
To clarify, these are the current issues we're tracking here:

1. Problem with Netapp LUN settings; I don't have all the details, but there's something with block alignment that's not turned on or something, we'll be fixing that with tomorrow's 12-hour downtime.

2. Storage network connections: mrz did some testing and found out that the VLAN tagging perf in ESX isn't great, I guess. So he's solving it at the switch level. He also found better performance on the machines with multiple network ports and an external four-port network card. It's unclear to me if we're going to outfit all the machines with these new cards.

3. The SCSI driver issue that Nick detailed. Nick was going to clone cerberus-vm and give this a try. In some cases, we may *have* to upgrade the SCSI drivers, because the drivers on the current new ref platform seem to have the corruption problems we've seen before. mrz found an updated driver from VMware that we'll have to use.
> 2. Storage network connections: mrz did some testing and found out that the
> VLAN tagging perf in ESX isn't great, I guess. So he's solving it at the switch
> level. He also found better performance on the machines with multiple network
> ports and an external four-port network card. It's unclear to me if we're going
> to outfit all the machines with these new cards.

I haven't been able to test a machine with the two onboard NICs to see if non-tagged storage interfaces are better or not.  The test I did doesn't really match how the other boxes are configured.  

I want to move fx-win32-tbox back to iSCSI on vmware06 and try with non-tagged interfaces.  
(In reply to comment #30)

> I haven't been able to test a machine with the two onboard NICs to see if
> non-tagged storage interfaces are better or not.  The test I did doesn't really
> match how the other boxes are configured.  
> 
> I want to move fx-win32-tbox back to iSCSI on vmware06 and try with non-tagged
> interfaces.  

Can we do this with a clone of fx-win32-tbox (with different configs) or do we have to use that VM?
I suppose I could clone it, but the test will only be similar since I don't think there are any empty ESX hosts.  We can talk about it in person Tuesday.
Whiteboard: Waiting on NetApp shelf
Demo shelf is up and I posted numbers for fx-win32-tbox-fcal.  This was on an unloaded ESX host and on an unloaded NAS shelf.

I want to simulate a loaded shelf and have cloned fx-win32-tbox-fcal 5 times.  I talked to preed yesterday about needing someone to show me what needs to change on each clone such that each reports to the MozillaExperimental page under a different name, or doesn't report at all (I'm only tracking fx-win32-tbox-fcal's times).

Info sent - waiting on resolution.
Assignee: mrz → justin
Closing as we have a new architecture and will be deploying in the next few weeks.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations