Closed Bug 1010327 Opened 11 years ago Closed 10 years ago

Permission to delete from /mnt/socorro/symbols_upload/ on stage

Categories

(Socorro :: Infra, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: peterbe, Assigned: dmaher)

References

()

Details

When the SymbolsUnpackApp crontabber app runs on stage, it reads from /mnt/socorro/symbols_upload/, and for every .zip (or .tar, .tgz, etc.) it finds, it unpacks it into /mnt/socorro/symbols and then DELETES the archive file. The reason for the deletion is so it knows what has been unpacked already and doesn't need to unpack it again. It appears it doesn't have permission to delete here. See attached URL.
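For anyone skimming, the unpack-then-delete behaviour described above can be sketched roughly like this. It is a minimal illustration only, not the actual crontabber code: the function name `unpack_and_delete` and its arguments are made up for this example, and it handles .zip files only.

```python
import os
import tempfile
import zipfile


def unpack_and_delete(upload_dir, symbols_dir):
    """Unpack every .zip under upload_dir into symbols_dir, then delete it.

    Hypothetical helper for illustration; not the real SymbolsUnpackApp.
    Returns the list of archive paths that were processed.
    """
    processed = []
    for root, _dirs, files in os.walk(upload_dir):
        for name in sorted(files):
            if not name.endswith(".zip"):
                continue
            archive = os.path.join(root, name)
            with zipfile.ZipFile(archive) as zf:
                zf.extractall(symbols_dir)
            # Removing a file needs write permission on the directory that
            # contains it; this is the step that fails in this bug.
            os.remove(archive)
            processed.append(archive)
    return processed
```

Because the archive is removed after a successful unpack, a second run over the same directory finds nothing to do, which is the "knows what has been unpacked already" property.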
Ha! Just what I've always suspected. People miss the URL attribute on the bug data. I'll go back to including the URL in the bug description instead.
The problem is that the symbols_upload directory needs to be writable by both the apache and socorro users. I've set the ownership of that directory to "apache.socorro", and ensured mode 2775, which would normally enforce the group ownership (setgid), thus allowing both users to r/w as expected. Unfortunately, the setgid appears to be ignored when the directory is used as an NFS mountpoint. I added 'suid' to the list of mount options (in Puppet), but that didn't appear to have any effect. WIP.
Status: NEW → ASSIGNED
:gcox, the tl;dr is that I'd like both the "apache" and "socorro" users to be able to r/w, without resorting to something horrible like adding socorro to the apache group (or vice versa). In a non-NFS-mount situation the solution would be to simply set the ownership as apache.socorro and to activate setgid to enforce the ownership and perms throughout. Is there a way to do this once the mountpoint is active? Put another way, is it possible for socorroadm.stage.private.phx1:/mnt/socorro/symbols_upload/ (mountpoint for 10.8.75.14:/symbols_upload/stage) to be owned by apache.socorro with perms 2775?
Still a problem. The new file I uploaded still errored https://errormill.mozilla.org/webtools/socorro-prod/group/168852/
As a general rule, the set[ug]id is squashed off on the filer exports. I've enabled it back for 10.8.75.48 on stage, and tested that it did what I'd expect:

[root@socorroadm.stage.private.phx1 ~]# mkdir /mnt/socorro/symbols_upload/foo
[root@socorroadm.stage.private.phx1 ~]# chown apache.socorro /mnt/socorro/symbols_upload/foo
[root@socorroadm.stage.private.phx1 ~]# chmod 2775 /mnt/socorro/symbols_upload/foo
[root@socorroadm.stage.private.phx1 ~]# sudo -u apache touch /mnt/socorro/symbols_upload/foo/apa
[root@socorroadm.stage.private.phx1 ~]# sudo -u socorro touch /mnt/socorro/symbols_upload/foo/soc
[root@socorroadm.stage.private.phx1 ~]# ls -al /mnt/socorro/symbols_upload/foo
total 8
drwxrwsr-x 2 apache socorro 4096 May 15 11:34 .
drwxrwsr-x 4 socorro apache 4096 May 15 11:33 ..
-rw-r--r-- 1 apache socorro 0 May 15 11:34 apa
-rw-r--r-- 1 socorro socorro 0 May 15 11:34 soc

* If you want me to turn that up for the webapp nodes in stage, just say so.
* I assume you'll want it in the prod version of this volume, for parallelism. There are more nodes there (socorroadm.private and sp-admin01, as well as the webapp nodes), so, again, just ping me with what you'd like not-squashed.
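As a reading aid for the listing above, here is a small sketch that decodes a directory mode such as 2775 into the `ls`-style string and reports whether the setgid bit is set. The helper name `describe_dir_mode` is made up for this example; it only interprets the numeric mode and does not touch the filesystem.

```python
import stat


def describe_dir_mode(mode):
    """Return (ls-style string, setgid flag) for a directory permission mode."""
    # stat.filemode needs the file type bits, so mark the mode as a directory.
    return stat.filemode(stat.S_IFDIR | mode), bool(mode & stat.S_ISGID)


print(describe_dir_mode(0o2775))  # the mode used for "foo" above
print(describe_dir_mode(0o775))   # same permissions without setgid
```

The "s" in the group position of drwxrwsr-x is the setgid bit: files created inside such a directory take the directory's group, which is why both "apa" and "soc" end up with group socorro. Note that both files are still mode -rw-r--r--, so the group is inherited but group write is not.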
Per IRC, added the suid permissions on the filer for webheads and admin nodes, stage and prod.
Great, thanks :gcox ! The webapp nodes in stage don't need it since nothing runs as user socorro on those nodes (at least, not for now :P ).

I have also now realised that extended attrs are not possible either, which means that I cannot guarantee permission inheritance within the tree via setfacl as previously planned (doh). That said, there is still a solution: since the parent directory now honours the group ID (if not the permissions), if the interacting processes (the webapp on the webheads, and crontabber on admin) set g+rw, we should end up with the desired behaviour. To illustrate:

* On the webhead, Apache writes out files to symbols_upload/, and since the directory is setgid, they are owned by "apache.socorro".
* On the webhead, Apache sets g+rw (recursively) on the files it has written.
* Since symbols_upload/ is a mount, the ownership and permissions are the same on the admin node.
* On the admin node, Crontabber (user socorro) can read and write those files - and ultimately delete them.

This will require a small amount of additional functionality in the code.
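The "set g+rw (recursively)" step could look something like the sketch below. This is an assumption about how the extra functionality might be written, not the code that was eventually deployed; `add_group_rw` is a hypothetical name. It only works when run by the owner of the files (or root), which is why the step has to happen on the webhead as the apache user.

```python
import os
import stat
import tempfile


def add_group_rw(top):
    """Add group read/write to top and everything under it (like chmod -R g+rw).

    Hypothetical helper for illustration. Returns the number of paths changed.
    """
    group_rw = stat.S_IRGRP | stat.S_IWGRP
    paths = [top]
    for root, dirs, files in os.walk(top):
        paths.extend(os.path.join(root, name) for name in dirs + files)
    for path in paths:
        # S_IMODE keeps the setgid bit, so OR-ing in g+rw does not clear it.
        current = stat.S_IMODE(os.stat(path).st_mode)
        os.chmod(path, current | group_rw)
    return len(paths)
```

With group write set on the dated subdirectories as well as the archives, the socorro user on the admin node can remove the files, since deleting depends on write permission for the containing directory.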
Oops - my bad, the webapp nodes in stage *do* need it. Ignore that sentence. The rest of comment 8 is accurate afaik. ;)
I will be doing further NFS ACL tests (see bug 1011497), but if that doesn't work, then we'll just have to ensure that the code pulls a "chmod -R g+rw" every time symbols_upload/ gets touched (jaaaaaank).
21:21:45 <gcox> So, banged my head on the facl thing. If possible, it'd be good if we can get the OS upgraded on the stage box. I don't think that's the solution, but that a few things are behind would at least eliminate question marks. I'll keep on it.. but I think the NFS side may be a filer problem. It's just a REALLY weird problem.

I'd love to have you amplify this statement. :) In particular, what does "OS upgrade" mean exactly (just a yum update, or a dist upgrade, or something else?), and what problem do you expect said upgrade to solve ? (/me resists urge to needinfo ;) )
Fwiw, nfs4_setfacl returns "Failed setxattr operation: Input/output error" every time I try to use it in symbols_upload, so I guess something is, indeed, amiss.
> I'd love to have you amplify this statement. :) In particular, what does
> "OS upgrade" mean exactly (just a yum update, or a dist upgrade, or
> something else?), and what problem do you expect said upgrade to solve ?

Yum update was all I was thinking. There are bits of info that suggest rhel6.5 is better than earlier releases for doing nfsv4. If I need it I'll call for it. This was more "if you were thinking about doing one anyway, go for it."

> Fwiw, nfs4_setfacl returns "Failed setxattr operation: Input/output error"
> every time I try to use it in symbols_upload, so I guess something is,
> indeed, amiss.

Yeah. This was all guinea-pigging and not-ready-to-launch. I have seen this behave in not-your-case places (other clients, other exports on the filer) but have yet to fully bisect to figure out where the issue is.
phrawzty,

How about this for a curve-ball...

When I built it I made the webapp write the zips for a directory and nothing else. Then a cronjob unpacks those zip files and puts them on ted's symbol server.

How about we re-architecture it significantly so that the webapp also deals with unpacking the zip file on-the-fly straight into the final destination.

It's just a suggestion. It'd mean the upload might be more fragile and slightly slower but at least it could potentially simplify the whole NFS story significantly.

What do you think?
(In reply to Peter Bengtsson [:peterbe] from comment #14)
> phrawzty,
>
> How about this for a curve-ball...
>
> When I built it I made the webapp write the zips for a directory and nothing
> else. Then a cronjob unpacks those zip files and puts them on ted's symbol
> server.
>
> How about we re-architecture it significantly so that the webapp also deals
> with unpacking the zip file on-the-fly straight into the final destination.
>
> It's just a suggestion. It'd mean the upload might be more fragile and
> slightly slower but at least it could potentially simplify the whole NFS
> story significantly.
>
> What do you think?

The webheads only have "symbols_upload" mounted on them. If the final destination is "symbols", then we'd need to have that mount set up on the webheads as well. Furthermore, if that mount is already on the webheads, and it's the final destination, then there would be no need for "symbols_upload".

I agree that this is a more simple (and therefore superior) approach, and I would be happy to see it implemented; however, it does not actually solve the underlying permissions problem. That issue still needs to be resolved either way.
Depends on: 1015187
Component: WebOps: Socorro → Infra
Product: Infrastructure & Operations → Socorro
QA Contact: nmaul
Turns out that you can specify a default umask for a WSGI Daemon[1]. Combined with the setgid bit, this has the net effect of forcing newly created directories to inherit the permissions and ownership of the root dir - which, combined with setting newly uploaded files g+w in Django itself, would seem to meet all of our conditions for success. I will add this setting to Stage in Puppet now.

---
[dmaher@socorro2.stage.webapp.phx1 symbols_upload]$ pwd
/data/socorro/webapp-django/media/symbols_upload
[dmaher@socorro2.stage.webapp.phx1 symbols_upload]$ stat .
  File: `.'
  Size: 4096 Blocks: 8 IO Block: 65536 directory
Device: 14h/20d Inode: 75235392 Links: 3
Access: (2775/drwxrwsr-x) Uid: ( 48/ apache) Gid: (10000/ socorro)
Access: 2014-06-05 03:08:31.116646000 -0700
Modify: 2014-06-05 03:08:26.242544000 -0700
Change: 2014-06-05 03:08:26.242544000 -0700
[dmaher@socorro2.stage.webapp.phx1 symbols_upload]$ stat 2014
  File: `2014'
  Size: 4096 Blocks: 8 IO Block: 65536 directory
Device: 14h/20d Inode: 75235424 Links: 3
Access: (2775/drwxrwsr-x) Uid: ( 48/ apache) Gid: (10000/ socorro)
Access: 2014-06-05 03:08:34.844732000 -0700
Modify: 2014-06-05 03:08:26.242558000 -0700
Change: 2014-06-05 03:08:26.242558000 -0700
[dmaher@socorro2.stage.webapp.phx1 symbols_upload]$ stat 2014/06/05/cf9dbabf97487b2123085e94943fb3b0.zip
  File: `2014/06/05/cf9dbabf97487b2123085e94943fb3b0.zip'
  Size: 183 Blocks: 8 IO Block: 65536 regular file
Device: 14h/20d Inode: 75235427 Links: 1
Access: (0664/-rw-rw-r--) Uid: ( 48/ apache) Gid: (10000/ socorro)
Access: 2014-06-05 03:08:26.244552000 -0700
Modify: 2014-06-05 03:08:26.247550000 -0700
Change: 2014-06-05 03:08:26.247555000 -0700
---

[1] https://code.google.com/p/modwsgi/wiki/ConfigurationDirectives#WSGIDaemonProcess
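To show why the umask matters here, the sketch below creates a directory and a file under a given umask and reports the resulting permission bits. It is a local illustration only (the helper name `create_with_umask` and the file names are made up), run against a temporary directory rather than the NFS mount, and it assumes no default ACLs on the parent directory.

```python
import os
import stat
import tempfile


def create_with_umask(parent, umask):
    """Create a directory and a file under parent using the given umask.

    Hypothetical helper for illustration.
    Returns their permission bits as (dir_mode, file_mode).
    """
    old = os.umask(umask)
    try:
        new_dir = os.path.join(parent, "newdir")
        os.mkdir(new_dir)  # requests 0o777, reduced by the umask
        new_file = os.path.join(new_dir, "upload.zip")
        fd = os.open(new_file, os.O_CREAT | os.O_WRONLY, 0o666)  # requests 0o666
        os.close(fd)
    finally:
        os.umask(old)  # always restore the previous umask
    perm = 0o777  # mask off setgid/sticky bits so only rwx bits are compared
    return os.stat(new_dir).st_mode & perm, os.stat(new_file).st_mode & perm


print([oct(m) for m in create_with_umask(tempfile.mkdtemp(), 0o002)])
print([oct(m) for m in create_with_umask(tempfile.mkdtemp(), 0o022)])
```

A umask of 002 leaves group write intact, which matches the 2775 directories in the stat output above once the setgid bit is inherited from the parent; a umask of 022 strips group write, which is the failing case from earlier in this bug.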
Note bug 1020901 which, while it is a separate issue, does ultimately impact this bug.
Crontabber can successfully remove the archive as expected. Looks like we've got a viable fix (woo!).

2014-06-05 07:18:28,924 DEBUG - MainThread - about to run <class 'socorro.cron.jobs.symbolsunpack.SymbolsUnpackCronApp'>
2014-06-05 07:18:28,979 DEBUG - MainThread - successfully ran <class 'socorro.cron.jobs.symbolsunpack.SymbolsUnpackCronApp'> on 2014-06-05 14:18:28.950316+00:00

Will roll out to Prod once Puppet comes back from today's maintenance window.
Woops, forgot to close this out. :)
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED