on hgssh1.dmz.scl3.mozilla.com: /repo/hg/mozilla/integration/mozilla-inbound is taking 16 minutes to copy. Previous clones of this repository have taken less than 2 minutes. This is my method for reproducing:

    [firstname.lastname@example.org ~]# cd /repo/hg/mozilla/releases
    [email@example.com releases]# time hg clone -U mozilla-beta mozilla-beta2
Assignee: server-ops → dparsons
Status: NEW → ASSIGNED
This is hosted by scl3-na1a, and overall load on that system is low. However, I did notice that reads from aggr1 (the aggr hg* is on) are unusually slow, down to around 72 Mbit/s; reads from other aggrs run at normal speeds. nfsstat -l shows this:

    10.22.74.34  hgweb3.dmz.scl3.mozilla.com  NFSOPS = 2710921 (15%)
    10.22.74.35  hgweb4.dmz.scl3.mozilla.com  NFSOPS = 1921007 (11%)
    10.22.74.33  hgweb2.dmz.scl3.mozilla.com  NFSOPS = 1844426 (10%)
    10.22.74.32  hgweb1.dmz.scl3.mozilla.com  NFSOPS = 1480556 ( 8%)
    10.22.74.36  hgweb5.dmz.scl3.mozilla.com  NFSOPS =  924019 ( 5%)

I reset the counters about 30 minutes ago, which means that in the last 30 minutes hgweb* has accounted for 49% of all nfsops on this controller, even though hundreds of VMs and other hosts use it. This is unusually high and could be the cause of the slowness. I will continue investigating.
Here's a look at just the last 6 minutes:

    10.22.74.35  hgweb4.dmz.scl3.mozilla.com  NFSOPS = 1019162 (30%)
    10.22.74.33  hgweb2.dmz.scl3.mozilla.com  NFSOPS =  281143 ( 8%)
    10.22.74.34  hgweb3.dmz.scl3.mozilla.com  NFSOPS =  279652 ( 8%)
    10.22.74.36  hgweb5.dmz.scl3.mozilla.com  NFSOPS =  257759 ( 8%)
    10.22.74.32  hgweb1.dmz.scl3.mozilla.com  NFSOPS =  178320 ( 5%)
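(For reference, a minimal sketch of the sampling loop behind these numbers, assuming Data ONTAP nfsstat semantics where -z zeroes the per-client counters and -l lists them; the exact flags on this filer version are an assumption:

    nfsstat -z      # zero the per-client counters (assumed ONTAP flag)
    sleep 360       # let 6 minutes of traffic accumulate
    nfsstat -l      # list per-client NFSOPS since the reset
)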
Summary: NFS operations on hgssh1.dmz.scl3.mozilla.com taking longer than normal → Slow NFS performance on hgweb*.dmz.scl3, hgssh*
Working with :bkero, I added two additional mount options for the NFS mounts on the hgweb* boxes:

    nocto: disables close-to-open cache coherence
    actimeo=60: raises the minimum time attributes are cached from 3s to 60s

This change has drastically reduced the number of getattr calls to scl3-na1a. The potential downside is that changes to files or directories can take up to 60 seconds to be noticed by a given hgweb node. We have to wait for the boxes to start caching more data before we can draw conclusions on the other performance issues.
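(To make the change concrete, here is a minimal sketch of the resulting mount; the export path and mount point are illustrative assumptions, not the actual production values:

    # Hypothetical /etc/fstab entry for an hgweb node -- the export and
    # mount point are assumptions. nocto skips close-to-open cache
    # coherence checks; actimeo=60 caches file/directory attributes
    # for 60 seconds instead of the kernel defaults.
    scl3-na1a:/vol/hg_repos  /repo/hg  nfs  defaults,nocto,actimeo=60  0 0

    # Equivalent one-off mount command:
    mount -t nfs -o nocto,actimeo=60 scl3-na1a:/vol/hg_repos /repo/hg
)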
The mount changes have reduced the nfsops count over a 6-minute test from 2 million to 338k (roughly a 6x reduction).
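(The server-side counters can also be cross-checked from a client, assuming the hgweb nodes run Linux; nfsstat -c prints cumulative client-side NFS call counts, getattr among them:

    # On an hgweb node: snapshot client-side NFS op counters before and
    # after a test window, then compare the getattr column.
    nfsstat -c > /tmp/nfs-before
    sleep 360
    nfsstat -c > /tmp/nfs-after
    diff /tmp/nfs-before /tmp/nfs-after
)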
Despite this reduction in nfsops, scl3-na1a is still quite slow, and it is only aggr1 that is slow. After looking more closely, it looks like releng's VMs are the culprit: the VMs alone can consume 95% of a 48TB SATA shelf's IOPS, and on scl3-na1a those VMs share an aggr with a ton of other stuff. After talking to :nthomas, releng is going to temporarily stop their IO usage on those VMs to speed up migrating them to a separate aggr.
Summary: Slow NFS performance on hgweb*.dmz.scl3, hgssh* → Slow NFS performance on scl3-na1a:aggr1
bld-centos5-32-vmw-001 through 022 have been disabled in slavealloc and, where idle, disconnected from buildbot, so none of them will take new work. bld-centos5-32-vmw-022 was already disabled.
Done:

    bld-centos5-32-vmw-005
    bld-centos5-32-vmw-010
    bld-centos5-32-vmw-011
    bld-centos5-32-vmw-012
    bld-centos5-32-vmw-013
    bld-centos5-32-vmw-014
    bld-centos5-32-vmw-015
    bld-centos5-32-vmw-016
    bld-centos5-32-vmw-017
    bld-centos5-32-vmw-018
    bld-centos5-32-vmw-019
    bld-centos5-32-vmw-020
    bld-centos5-32-vmw-021
    bld-centos5-32-vmw-022

Going by #c6, I didn't migrate any more VMs.
All the build VMs can safely be used again.
These should be coming back online shortly
This is no longer an issue.
Status: ASSIGNED → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations