Closed Bug 1269855 Opened 8 years ago Closed 7 years ago

Disk checks on CentOS7 that traverse NFS go stale

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gcox, Assigned: gcox)

References

Details

<@nagios-scl3>	Tue 13:07:02 PDT [5334] hgssh1.dmz.scl3.mozilla.com:Disk - All is CRITICAL: DISK CRITICAL - /repo/hg/.snapshot/4hour8to20.2016-05-02_1200 is not accessible: Stale file handle

:digi hit something similar on the people1 rebuild.
Something in the combination of (CentOS7) and (an NFS vol that doesn't have -snapdir-access false): when a snapshot rotates out while the disk check is traversing .snapshot, things go stale and need manual cleanup.

Not sure whose fault it is, but, logging it for looking into.
In C7 I can make mounts appear in /proc/mounts by doing `ls /repo/hg/.snapshot/*`.
I can get rid of them with `umount /repo/hg/.snapshot/*`.
The alert comes from the check expecting every mount in /proc/mounts to work, which breaks once the snap goes away.
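A minimal repro, using the paths from this bug (behaviour as observed here, not guaranteed elsewhere):

    ls /repo/hg/.snapshot/ > /dev/null   # crossing into .snapshot automounts each snapshot volume
    grep '\.snapshot' /proc/mounts       # the snapshot subvolumes now show up as NFS mounts
    umount /repo/hg/.snapshot/*          # manual cleanup: unmount them again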

In RH6 (say, web1.bugs.scl3 with /mnt/bugzilla_prod/.snapshot/*), I can make them appear/disappear in /proc/mounts the same way.

Difference?  /etc/mtab in RH6 is sane, and /proc/mounts is noisy.  In C7, /etc/mtab is a frickin' symlink to /proc/self/mounts.  Really starting to hate the entire direction Linux is going in.

Bonus: if you sit around for like 10-15 minutes, they disappear out of /proc/mounts on their own in RH6 (because nobody's using the mount).  For C7, the nagios check_disk_all comes along every 5 minutes and keeps them alive, because they're visible in `df`.
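One way to see the difference (illustrative; timings taken from the comments above):

    grep '\.snapshot' /proc/mounts   # submounts present after someone poked .snapshot
    sleep 900                        # leave them alone for ~15 minutes
    grep '\.snapshot' /proc/mounts   # RH6: gone (idle automounts expired); C7: still there,
                                     # because check_disk_all's df-equivalent touches them every 5 minutes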
(In reply to Greg Cox [:gcox] from comment #1)
> Difference?  /etc/mtab in RH6 is sane, and /proc/mounts is noisy.  In C7,
> /etc/mtab is a frickin' symlink to /proc/self/mounts.  Really starting to
> hate the entire direction Linux is going in.

Context: processes can have their own mount namespace (see CLONE_NEWNS from clone(2)). So instead of a global mount table shared across every process on the system, there is potentially a unique mount table per process. /etc/mtab is a global file. Hence the need for /proc/self/mounts to resolve the mount table on a per-process basis.

It's a bit more complicated. But it does provide some nice security wins.
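A quick way to see per-process mount tables in action (needs root and util-linux's unshare; /mnt is just an example path):

    # unshare -m puts the child in its own mount namespace, so its mounts stay private
    unshare -m sh -c 'mount -t tmpfs none /mnt; grep " /mnt " /proc/self/mounts'
    grep " /mnt " /proc/mounts || echo "tmpfs not visible in the parent namespace"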
So, 3 ideas here.
1) Disable .snapshot.  Kinda don't want to do that as a first move, not because we particularly NEED it in hg-land (the tactical problem), but because it would set a weak precedent for the strategic problem (i.e. this will come back).
2) modules/nrpe/templates/nrpe.d/common.cfg.erb, add "-X nfs" to check_disk_all.
3) modules/nrpe/templates/nrpe.d/common.cfg.erb, add "-i .snapshot" to check_disk_all.

For 2 and 3, there is already a '-I mnt', meaning we've skipped anything under /mnt, where most services have NFS items mounted.  2 seems a little heavy-handed, but not that far removed from what we're already doing by ignoring /mnt.  3 would need testing (that is, if the nagios check stops looking at .snapshot 'mounts', will they time out and umount on their own like in RH6?), and seems the most elegant, but there's a chance that it's not the only reason the mount exists.  I mean, SOMETHING went into that path, who's to say how often it happens.
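For concreteness, a rough sketch of what option 3 could look like in the rendered nrpe config (plugin path and thresholds are placeholders, not the real values from common.cfg.erb):

    # nrpe.d/common.cfg.erb (rendered), hypothetical thresholds
    command[check_disk_all]=/usr/lib64/nagios/plugins/check_disk -w 10% -c 5% -I mnt -i .snapshot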
(In reply to Greg Cox [:gcox] from comment #3)
> 2) modules/nrpe/templates/nrpe.d/common.cfg.erb, add "-X nfs" to
> check_disk_all.

If we do this will there be any warning/alerts when NFS volumes do fill up?
(In reply to Peter Radcliffe [:pir] from comment #4)
> > 2) modules/nrpe/templates/nrpe.d/common.cfg.erb, add "-X nfs" to
> > check_disk_all.
> If we do this will there be any warning/alerts when NFS volumes do fill up?

Arguably, you're already in that boat for most of the environment, since the nagios check skips /mnt (where most NFS mounts are).  hg has it under /repo/hg, though.  That's why I say "2 seems a little heavy-handed."  Nagios would stop looking completely, and it'd just be alerts that the filer sends us, outside of nagios.
Then it sounds like monitoring NFS storage volumes is a larger problem to be looked at.
(IMO nothing long term should be mounted under /mnt, that's for temporary mounts)
Bug reboot.

> 3) modules/nrpe/templates/nrpe.d/common.cfg.erb, add "-i .snapshot" to
> check_disk_all.

Seems like the most attractive option here. Any disagreement?
+1 to Just Do It. Can't be worse than a timing-out (and therefore useless) check; it was suggested by the storage team, and no drawbacks have been pointed out since.
Came across another system that had these hangs, and then remembered this bug was out here.
Checked in modules/nrpe/templates/nrpe.d/common.cfg.erb at change 121623.
TL;DR: there is something client-side that trolls /etc/mtab and tickles the mounts it finds.  Stop that thing.


This requires a deeper dive than usual, mostly because it gets misunderstood and leads to Really Bad Search Results.  But because of that, I want to write this up so it can be referred to in the future.

NetApp ships a technology for their filers called 'junction mounts'.  Simply put, I can have 2 volumes on the filer, /a1 and /b1, and the filer can join them by name and present them as /c/d (/a1, known as c, is a parent of /b1, known as d), such that you (the NFS client) only need to mount /c as /mnt/x, and when you cd /mnt/x/d, you will see volume /b1.  It makes it so the filer admin only needs to tell clients about one export, and the client only needs to manage one.  The filer admin gains the ability to manage the space on the filer without needing a complete rearchitecture of the exports on many clients.
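From the client's point of view, that looks roughly like this (hostname and mountpoints are hypothetical):

    mount filer1:/c /mnt/x           # only the parent export is mounted explicitly
    cd /mnt/x/d                      # crossing the junction makes the client automount volume /b1
    grep ' /mnt/x/d ' /proc/mounts   # ...which now appears as its own NFS mount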

There is also a technology for 'snapshots'.  Each volume can have N-many snapshots (usually 0-255).  A snapshot is a read-only freeze of the volume's blocks RIGHT THEN, and if you delete a file it unlinks in the live filesystem, but NOT the snapshot.  Every snapshot is effectively a whole volume, just at a different point in time.  Filers create a .snapshot directory as a magic read-only/filer-managed directory that holds links to the snapshots by name.  Nice thing there is, if you go into .snapshot, you can pull "that critical file I just deleted" back by a simple `cp .snapshot/daily.0/foo ./foo.restore`.  A NetApp admin sees 1 volume with (example) 10 snapshots; an NFS client sees 1 volume they mounted using some userspace tool (fstab, autofs, automount) and then, if someone/something pokes around, 10 subvolumes under .snapshot.  Snapshots can be manual, but are most often time-based, and age out/are deleted automatically.

I list these separately as they demonstrate very similar concepts of volumes within a NetApp filer.  The chief thing to note (and this is crucial) is that, in both cases, these are NFS volumes being FOUND AND AUTOMATICALLY MOUNTED by the NFS client, underneath other NFS volumes that you EXPLICITLY MOUNTED.  Both the junction mount and the snapshot volumes are ready to be found, but are not mounted on/by the client until you cd/ls into that subpart of the exported main volume.


The Linux kernel is actually smart enough about this.  While you can quibble about what the kernel shows in later releases (the code changes names/locations in 3.x and beyond), the kernel has LONG had the concept of cleaning up these junctions that it automounts: http://lxr.free-electrons.com/source/fs/nfs/namespace.c?v=2.6.32#L177  Also, /proc/sys/fs/nfs/nfs_mountpoint_timeout exists; after a default of 500 seconds, the kernel attempts to umount its automounted volumes if they've been idle for that long.

We didn't have the automatic unmounts failing us in RHEL6, but did start seeing it as we rolled out CentOS7.  As noted in comment 1, RHEL6-era releases don't pollute /etc/mtab with these temporary/auto mounts; CentOS7-era ones do.  `df` (called by NRPE checks from nagios) tickles the mounts it finds in /etc/mtab just often enough to keep them alive.  This prevents subvolumes from ever becoming idle once they've been mounted, and eventually the filer does a snapshot delete... which leads to stale filehandles and an unhappy NFS client.  We eliminated the nagios case in comment 10 but that didn't solve everything (see bug 1344729).

After a long time poring over the kernel code, and thinking I'd hit some weird corner case, I decided to assume the kernel wasn't wrong.  Notably, when /proc/sys/fs/nfs/nfs_mountpoint_timeout was dropped to 15 seconds, CentOS7 unmounted the subvolumes correctly.  After a long process of disabling/enabling userspace services, I got a use case pinned down: `collectd` was keeping subvolumes mounted.
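Roughly how that test went (a reconstruction, not the exact commands):

    cat /proc/sys/fs/nfs/nfs_mountpoint_timeout          # default is 500 (seconds)
    echo 15 > /proc/sys/fs/nfs/nfs_mountpoint_timeout    # shorten the idle window for the test
    ls /repo/hg/.snapshot/ > /dev/null                   # trigger the automounts
    sleep 30; grep '\.snapshot' /proc/mounts             # with nothing tickling them, they expire and unmount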

https://github.com/collectd/collectd/blob/collectd-5.7/src/utils_mount.c#L78-L79 indicates the glibc case uses _PATH_MOUNTED.
https://www.gnu.org/software/libc/manual/html_node/Mount-Information.html indicates this is LIKELY going to get you to /etc/mtab.
https://github.com/collectd/collectd/blob/collectd-5.7/src/utils_mount.c#L548 opens the file for reading.
https://github.com/collectd/collectd/blob/collectd-5.7/src/utils_mount.c#L644 generates the list.

https://github.com/collectd/collectd/blob/collectd-5.7/src/df.c#L204 is a statfs call against a directory, inside a loop over all mounts (https://github.com/collectd/collectd/blob/collectd-5.7/src/df.c#L164-L165), which comes from the list generated by utils_mount.c.
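A shell analogue of that pattern (my paraphrase of the collectd code above, not its actual implementation):

    # read the mount table and statfs every mountpoint in it
    while read -r dev mnt fstype rest; do
        stat -f "$mnt" > /dev/null 2>&1
    done < /etc/mtab
    # on C7, /etc/mtab is /proc/self/mounts, so every automounted .snapshot subvolume
    # gets touched each interval and its idle timer reset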

So a pseudo-df call from within collectd, which either (doesn't ignore NFS volumes) or (has a collection interval shorter than the unmount idle interval), is a contributing factor.  I've opened bug 1348806 for this.
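For what it's worth, collectd's df plugin can be told to skip filesystem types; something along these lines is one possible shape for the fix (whether bug 1348806 ends up doing exactly this is an assumption on my part):

    <Plugin df>
      FSType "nfs"
      IgnoreSelected true
    </Plugin>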

However, this is fundamentally a client-side problem.  From the filer side, I have 2 tools: don't make snapshots, or don't expose the .snapshot directory to any client.  Not great options.
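(For reference, the 'don't expose .snapshot' knob is the -snapdir-access option mentioned in comment 0; on the filer that's roughly the following, with a hypothetical vserver/volume, and clustered-ONTAP syntax assumed:

    volume modify -vserver vs1 -volume hg -snapdir-access false

but as noted, neither filer-side option is attractive.)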
If someone runs a `find` into a snapshot 1 minute before it is deleted, I can't stop them, and they'll get a stale NFS filehandle.
nagios was a troublemaker before.  collectd is a troublemaker now.  There COULD be others that I just haven't found yet, or that will crop up in the future, either everywhere or on particular boxes.  But since /etc/mtab is probably going to be a mess going forward, this is going to be a game of whack-a-mole, with problems in unrelated client software getting solved client-side.



Red Herrings:
* There are userspace daemons for autofs and automount that perform a similar function for ANY filesystem, not just nested NFS junctions.  This makes most searches awful.
* It is somewhat uniquely an NFS-on-Linux issue.  But the overloading of the words 'mount' and 'export' leads to a lot of hits about userspace issues and basic NFS problems.
* https://kb.netapp.com/support/s/article/ka11A0000000swLQAQ/a-linux-system-running-kernel-2-6x-creates-a-mount-point-when-accessing-the-snapshot-directory-via-network-file-system?language=en_US
This link describes a pre-RHEL6 issue where a kernel could NOT detect that a .snapshot volume was a separate volume, because it wasn't looking at the FSID.
We (mozilla) have, on the filer, been doing The Right Thing by presenting separate FSIDs for separate volumes, and we won't be inclined to disable that.  Likewise, the kernel is doing the right thing in expecting a separate FSID.

A few keywords that helped me:
vfs_kern_mount  nfs_follow_mountpoint  nfs_do_submount
See Also: → 1344729
I think we're just about as out-of-the-woods on this as we can get, for now.

Skimmed the filer, and I'm not seeing cases of (CentOS7 client & stale mounts) against places where the filer has (exposed .snapshot directory & snapshots).  We'll have to triage new cases as they come along, but I feel like we've covered the two systemic client-side problems for now.
Assignee: infra → gcox
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED