Slow NFS performance on scl3-na1a:aggr1

Status

RESOLVED FIXED
Opened: 6 years ago
Last updated: 4 years ago

People

(Reporter: bkero, Assigned: dparsons)

(Reporter)

Description

6 years ago
on hgssh1.dmz.scl3.mozilla.com:

/repo/hg/mozilla/integration/mozilla-inbound is taking 16 minutes to copy.  Previous clones of this repository have taken less than 2 minutes to copy.

This is my method for reproducing:

[root@hgssh1.dmz.scl3 ~]# cd /repo/hg/mozilla/releases
[root@hgssh1.dmz.scl3 releases]# time hg clone -U mozilla-beta mozilla-beta2
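For reference, a repeatable version of that timing test might look like the sketch below (hypothetical; the loop count and the scratch directory name mozilla-beta2 are arbitrary, and the scratch copy is removed between runs so each clone copies the full repository again):

# Hedged sketch: repeat the timed clone a few times, cleaning up the
# scratch copy between runs so the individual timings are comparable.
cd /repo/hg/mozilla/releases
for run in 1 2 3; do
    rm -rf mozilla-beta2
    time hg clone -U mozilla-beta mozilla-beta2
done
rm -rf mozilla-beta2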
(Assignee)

Updated

6 years ago
Assignee: server-ops → dparsons
Status: NEW → ASSIGNED
(Assignee)

Comment 1

6 years ago
This is hosted by scl3-na1a, and overall load on that system is low; however, I did notice that reads from aggr1 (the aggr hg* is on) are unusually slow, down to around 72 Mbit/s. Reads from other aggrs run at normal speeds.

nfsstat -l shows this:

10.22.74.34    	hgweb3.dmz.scl3.mozilla.com   	NFSOPS =    2710921 (15%)
10.22.74.35    	hgweb4.dmz.scl3.mozilla.com   	NFSOPS =    1921007 (11%)
10.22.74.33    	hgweb2.dmz.scl3.mozilla.com   	NFSOPS =    1844426 (10%)
10.22.74.32    	hgweb1.dmz.scl3.mozilla.com   	NFSOPS =    1480556 ( 8%)
10.22.74.36    	hgweb5.dmz.scl3.mozilla.com   	NFSOPS =     924019 ( 5%)

I reset the counters about 30 minutes ago, which means that in the last 30 minutes hgweb* has accounted for 49% of all NFS ops on this controller, even though hundreds of VMs and other hosts also use it. This is unusually high and could be the cause of the slowness.

I will continue investigating.
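For reference, the per-client sampling above amounts to roughly the following workflow on the filer console (a sketch assuming 7-mode ONTAP syntax; behavior can differ between ONTAP versions):

# On the scl3-na1a console (7-mode syntax assumed):
nfsstat -z        # zero the NFS statistics counters
# ...wait for the sample window, e.g. 30 minutes...
nfsstat -l        # list per-client NFS op counts and percentages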
(Assignee)

Comment 2

6 years ago
Here's a look at just the last 6 minutes: 

10.22.74.35    	hgweb4.dmz.scl3.mozilla.com   	NFSOPS =    1019162 (30%)
10.22.74.33    	hgweb2.dmz.scl3.mozilla.com   	NFSOPS =     281143 ( 8%)
10.22.74.34    	hgweb3.dmz.scl3.mozilla.com   	NFSOPS =     279652 ( 8%)
10.22.74.36    	hgweb5.dmz.scl3.mozilla.com   	NFSOPS =     257759 ( 8%)
10.22.74.32    	hgweb1.dmz.scl3.mozilla.com   	NFSOPS =     178320 ( 5%)
(Assignee)

Updated

6 years ago
Summary: NFS operations on hgssh1.dmz.scl3.mozilla.com taking longer than normal → Slow NFS performance on hgweb*.dmz.scl3, hgssh*
(Assignee)

Comment 3

6 years ago
Working with :bkero, I added two additional mount options for the NFS mounts on the hgweb* boxes:

nocto: disables close-to-open cache coherence checking
actimeo=60: raises the attribute cache timeouts from the 3-second default minimum to 60 seconds

This change has drastically reduced the number of getattr calls to scl3-na1a. It has the potential downside of changes to files or directories taking up to 60 seconds to be noticed by a given hgweb node. We have to wait for the boxes to start caching more data before we can draw conclusions on the other performance issues.
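For illustration, the resulting entry would look roughly like the fstab line below on an hgweb* host. This is a sketch: the export path, mount point, and the remaining options are placeholders rather than the actual production values; only nocto and actimeo=60 are the options discussed above.

# Illustrative /etc/fstab entry (paths and the other options are placeholders):
scl3-na1a:/vol/hg  /repo/hg  nfs  rw,hard,nocto,actimeo=60  0  0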
(Assignee)

Comment 4

6 years ago
The mount changes have reduced the nfsops count over a 6-minute test window from 2 million to 338k.
(Assignee)

Comment 5

6 years ago
Despite this reduction in nfsops, scl3-na1a is still quite slow, and only aggr1 is affected. After looking more closely, releng's VMs appear to be the culprit: those VMs alone can consume 95% of a 48TB SATA shelf's IOPS, and on scl3-na1a they share the same aggr with a ton of other hosts.

After talking to :nthomas, releng is going to temporarily stop their IO usage on those VMs to speed up migrating them to a separate aggr.
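While the migration is in progress, disk saturation on the filer can be watched with something like the following (a sketch assuming 7-mode ONTAP; the exact columns vary by version):

# Sample filer-wide stats once per second; the disk utilization column
# shows how busy the busiest disks are, which is where SATA-shelf IOPS
# saturation tends to show up. (7-mode syntax assumed.)
sysstat -x 1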
Summary: Slow NFS performance on hgweb*.dmz.scl3, hgssh* → Slow NFS performance on scl3-na1a:aggr1
Comment 6

6 years ago
bld-centos5-32-vmw-001 through 022 have been disabled in slavealloc and, if idle, disconnected from buildbot, so they will take no new work. bld-centos5-32-vmw-022 was already disabled.
Done:
bld-centos5-32-vmw-005
bld-centos5-32-vmw-010
bld-centos5-32-vmw-011
bld-centos5-32-vmw-012
bld-centos5-32-vmw-013
bld-centos5-32-vmw-014
bld-centos5-32-vmw-015
bld-centos5-32-vmw-016
bld-centos5-32-vmw-017
bld-centos5-32-vmw-018
bld-centos5-32-vmw-019
bld-centos5-32-vmw-020
bld-centos5-32-vmw-021
bld-centos5-32-vmw-022

Comment 7

6 years ago
Going by #c6, I didn't migrate any more VMs.
(Assignee)

Comment 8

6 years ago
All the build VMs can safely be used again.
These should be coming back online shortly.
(Assignee)

Comment 10

6 years ago
This is no longer an issue.
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations