investigate hg push performance improvements from disabling sync option on NFS mount
Categories
(Developer Services :: Mercurial: hg.mozilla.org, task)
Tracking
(Not tracked)
People
(Reporter: sheehan, Assigned: sheehan)
On the hgssh push server we use an NFS mount to store repo data. Slow read/write performance on this mount has been identified as a contributing factor in the generally slow try push times. gcox and I discussed this in a meeting, and after some digging he found a change in the IT puppet logs that enabled the "sync" option on the NFS mount. The option was enabled at a time when NFS was used on both hgssh and hgweb, and consistent syncing between all the nodes via NFS was required. Nowadays hgweb uses fast SSDs and the NFS mount is effectively only used on one machine, with our Kafka-based replication system handling data consistency guarantees across all of our read-only nodes.
Quoting from man 5 nfs: "If the sync option is specified on a mount point, any system call that writes data to files on that mount point causes that data to be flushed to the server before the system call returns control to user space. This provides greater data cache coherence among clients, but at a significant performance cost."
gcox and I plan to disable this option on the NFS mount and do some testing to see what kind of impact this will have on try push performance.
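Before and after a change like this, it can help to confirm which options the mount actually carries. The following is a minimal sketch, assuming a Linux host where /proc/mounts lists the mount options in its fourth field; the function name has_sync_option is made up for illustration, and /mnt/scratch is just the scratch mount from the test in this bug, not the production path.

```shell
# Sketch: succeed (exit 0) if MOUNTPOINT is mounted with the "sync" option.
# Usage: has_sync_option MOUNTPOINT [MOUNTS_FILE]
# MOUNTS_FILE defaults to /proc/mounts and is expected in its format:
#   device mountpoint fstype options dump pass
has_sync_option() {
    awk -v mp="$1" '
        $2 == mp {
            # Options are comma-separated; match "sync" exactly so that
            # "async" is not counted.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "sync") found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "${2:-/proc/mounts}"
}
```

For example, `has_sync_option /mnt/scratch && echo "sync is still set"` would print the message only while the option is present.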
Comment 1•2 years ago
Forensics Notes from long ago:
In old-SVN 82078 / infra-puppet 3a2a8e045ef23fa4406420292920a9de5ce7da17 (2014-02-08 18:26:31 +0000), sync was added to the mount options for all hgssh nodes in scl3. That has been carried forward since then, across the scl3->mdc1 move. While there's no bug listed in the commit, there is a reference in bug 974094 comment 2 about trying to avoid stale filehandles. This is adjacent to bug 937732 comment 9 (the rebuilds of hgweb* onto local disk), which is itself consistent with the theory of comment 0 here.
Comment 2•2 years ago (Assignee)
gcox ran a minimal test of the performance with and without the sync option enabled on a non-production workload. It isn't like-for-like but does have some promising results:
$ sudo mount -o remount,sync /mnt/scratch/
$ date ; time dd if=/dev/zero of=/mnt/scratch/testfile bs=16k count=128k
Thu Feb 1 06:25:44 UTC 2024
131072+0 records in
131072+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 34.2018 s, 62.8 MB/s
real 0m34.299s
user 0m0.082s
sys 0m2.016s
$ sudo mount -o remount /mnt/scratch/
$ date ; time dd if=/dev/zero of=/mnt/scratch/testfile bs=16k count=128k
Thu Feb 1 06:26:45 UTC 2024
131072+0 records in
131072+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 2.4795 s, 866 MB/s
real 0m2.575s
user 0m0.017s
sys 0m1.091s
The main thing to note is the difference in speed: 62.8 MB/s with sync vs 866 MB/s without, roughly a 14x difference. We'll be looking at disabling the sync option next week.
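The size of that gap can be worked out directly from the two dd runs. This is a small sketch of the arithmetic; the helper name speedup is made up for illustration, and the inputs are the throughput and elapsed-time figures from the dd output above.

```shell
# speedup SLOW FAST: print FAST / SLOW rounded to one decimal place.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", fast / slow }'
}

speedup 62.8 866        # throughput in MB/s: with sync, without sync
speedup 2.4795 34.2018  # elapsed seconds: without sync, with sync
```

Both calls print 13.8, so throughput and elapsed time agree on the same ratio for this non-production test.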
Comment 3•2 years ago (Assignee)
We disabled sync on the NFS mount today. A test with a basic push before/after the change showed a difference of about 10s, which is hard to call an improvement given the wide variance in push times. I'll be looking at our telemetry over the next few days as VCS pushes come in to see if there is any trend in the right direction.
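Because individual push times vary so widely, comparing a robust summary such as the median over many pushes is likely more informative than a single before/after pair. A minimal sketch follows, assuming push durations in seconds have been exported one per line; the helper name median and the file names in the usage note are hypothetical, and this is not how our telemetry is actually queried.

```shell
# median: read one number per line on stdin and print the median.
# Exits non-zero if there is no input.
median() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            if (NR % 2) print v[(NR + 1) / 2]
            else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }
    '
}
```

With durations saved to two hypothetical files, `median < pushes_before.txt` and `median < pushes_after.txt` would give one number each to compare.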
Comment 4•2 years ago
Many thanks for working on this!