Closed
Bug 1041871
Opened 10 years ago
Closed 9 years ago
Disk - All on developeradm.private.scl3.mozilla.com is WARNING: DISK WARNING - free space: / 4036 MB (10% inode=53%):
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nagiosapi, Assigned: cliang)
References
()
Details
(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/943] [id=nagios1.private.scl3.mozilla.com:386531])
Attachments
(2 files)
Automated alert report from nagios1.private.scl3.mozilla.com:

Hostname: developeradm.private.scl3.mozilla.com
Service: Disk - All
State: WARNING
Output: DISK WARNING - free space: / 4036 MB (10% inode=53%):
Runbook: http://m.allizom.org/Disk+-+All
Comment 1•10 years ago
Running out of space during generate_tarball.sh run :(
Reporter
Comment 2•10 years ago
Automated alert recovery:

Hostname: developeradm.private.scl3.mozilla.com
Service: Disk - All
State: OK
Output: DISK OK
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 3•10 years ago
Had to kill the generate-tarball.sh cronjob (which had been running for 2 days) after disk usage hit 100%. Since developeradm.private.scl3 is a VM, I suggest increasing the disk to at least 80GB. For reference, after the script cleaned up after itself (I killed wget + the subsequent tar/gzip), about 30GB of space was taken up just by wget (gzip would need some working space as well). And that wasn't even the full archive...
Comment 4•10 years ago
Let's re-open this and use this bug as a starting place for actually fixing (or removing) this job. @groovecoder: any thoughts on this?

The wget job looks like this:

wget -q -m -p -k -E -T 5 -t 3 \
  -R 'mov,ogv,mp4,gz,bz2,zip,exe,download,flag*,login*,*\$history,*\$json' \
  -D developer.mozilla.org -X '*/profiles' \
  -np https://developer.mozilla.org/en-US/

This has always been slow to run, but I'm kinda wondering if the site has changed such that it's pulling in other pages now that it shouldn't be. I think I may have found a couple of problems:

1) developer.mozilla.org/en-US/search?locale=*&kumascript_macros=interface&page=39.html
   developer.mozilla.org/en-US/search?locale=*&kumascript_macros=nsprapiref&page=7.html
   developer.mozilla.org/en-US/search?locale=*&kumascript_macros=DomRef&page=201.html
   developer.mozilla.org/en-US/search?locale=*&kumascript_macros=gecko_minversion_inline&page=6.html
   developer.mozilla.org/en-US/search?locale=*&kumascript_macros=jsapi-requires-request&page=1.html

2) developer.mozilla.org/en-US/Firefox_OS/Releases$revision/541825.html
   developer.mozilla.org/en-US/Firefox_OS/Releases$revision/541827.html
   developer.mozilla.org/en-US/Firefox_OS/Releases$revision/542127.html
   developer.mozilla.org/en-US/Firefox_OS/Releases$revision/540489.html
   developer.mozilla.org/en-US/Firefox_OS/Releases$revision/541837.html

3) developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/627455.html
   developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/360655.html
   developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/520743.html
   developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/360661.html
   developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/533155.html
   developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/533143.html
   developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/360653.html
   developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/360663.html

#2 and #3 seem to be the worst offenders. Perhaps the way page revisions are presented changed and now wget is fetching lots of them?
Assignee: nobody → server-ops-webops
Status: RESOLVED → REOPENED
Component: Server Operations: MOC → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
QA Contact: dmoore → nmaul
Resolution: FIXED → ---
Whiteboard: [id=nagios1.private.scl3.mozilla.com:386531] → [kanban:https://kanbanize.com/ctrl_board/4/569] [id=nagios1.private.scl3.mozilla.com:386531]
Comment 5•10 years ago
We recently changed the way $history pages are presented [1], but I don't remember changing the $revision view.

In any case, we create this tarball for https://bugzilla.mozilla.org/show_bug.cgi?id=757461. Since it's used for reading MDN offline, I think it can exclude all of these URLs, which require the server:

* search?
* $revert
* $edit
* $translate
* $move
* $subscribe

If excluding those isn't enough, we can also try:

* $revision

But that means offline readers of the tarball won't be able to go back through the content history. :/

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1034709#c4
Assignee
Updated•10 years ago
Assignee: server-ops-webops → cliang
Assignee
Comment 6•10 years ago
Short form: A new version of wget has been installed manually. Testing is ongoing.

Long form: The latest version of wget available from RHEL for this server is 1.12, which does not support the --reject-regex flag. This flag is necessary for rejecting URLs with a query string in them. [1] I shaved yaks and compiled a newer version of wget (1.14), which does support it.

I did test to see how much data was saved by skipping all of the $<foo> URLs. However, this did not skip nearly as many entries as rejecting the 'search?' ones would, hence the wget upgrade.

[1] "Note, too, that query strings (strings at the end of a URL beginning with a question mark (‘?’)) are not included as part of the filename for accept/reject rules, even though these will actually contribute to the name chosen for the local file."
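For the record, the distinction is easy to demonstrate without running a crawl: wget's -R suffix list never sees the query string, while --reject-regex is matched against the whole URL. A minimal sketch using grep -E as a stand-in for wget's ERE matching (the sample URLs are invented for illustration):

```shell
# Sample URLs of the kind the mirror job was pulling in (invented).
urls='https://developer.mozilla.org/en-US/search?locale=en-US&page=39
https://developer.mozilla.org/en-US/Firefox_OS/Releases$revert/541825
https://developer.mozilla.org/en-US/docs/Web/API/Window'

# The single regex later passed to --reject-regex (comment 7);
# unlike -R suffix rules, it sees the whole URL, query string included.
reject='(.*)(\$revert|search\?)(.*)'

# Count which sample URLs the regex would reject vs. keep.
rejected=$(printf '%s\n' "$urls" | grep -Ec "$reject")
kept=$(printf '%s\n' "$urls" | grep -Evc "$reject")
echo "rejected=$rejected kept=$kept"   # → rejected=2 kept=1
```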
Assignee
Comment 7•10 years ago
I've put a test tarball on developeradm.private.scl3.mozilla.com: /mnt/netapp/developer.mozilla.org/developer.mozilla.org.test.tar.gz. Is there some way to verify that the tarball is "good" for offline reading?

I've been running a test version of the script with the following wget command:

wget -q -m -p -k -E -T 5 -t 3 \
  -R 'mov,ogv,mp4,gz,bz2,zip,exe,download,flag*,login*,*\$history,*\$edit,*\$translate,*\$move,*\$subscribe,*\$json' \
  -D developer.mozilla.org \
  -X '*/profiles' \
  --reject-regex='(.*)(\$revert|search\?)(.*)' \
  -np https://developer.mozilla.org/en-US/

N.B.
1) In the exclude-directories flag (-X), wildcards do not match directory delimiter characters. This is why URLs containing $revert are excluded via the --reject-regex flag.
2) You cannot pass multiple regexes to the --reject-regex flag.

The temporary directory holding files got up to 9.1GB. This drops the disk free percentage to about 30-35% (depending on the size of other files), which should be enough to not set off the alert. The tarball itself doesn't seem to be much smaller (still about 1.7GB).
Flags: needinfo?(lcrouch)
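Since wget accepts only a single --reject-regex, any additional patterns have to be folded into one ERE alternation rather than passed separately. A small sketch of that folding (the pattern list here is illustrative, not the production set):

```shell
# Reject patterns, one per line; note these are ERE fragments with
# literal '$' and '?' escaped (illustrative list, not the real one).
patterns='\$revert
search\?
\$edit'

# Join the lines with '|' and wrap them in a single group, giving
# one regex suitable for a lone --reject-regex argument.
regex="($(printf '%s' "$patterns" | tr '\n' '|'))"
printf '%s\n' "$regex"   # → (\$revert|search\?|\$edit)

# It would then be passed as: --reject-regex="$regex"
```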
Comment 8•10 years ago
No good way to verify except for spot-checking, and I think the best way to do that is to simply release the new version and wait for any bugs to come in. Is there any way to get a report of which paths are taking up the most space? Did you try excluding $revision urls too? That is probably the biggest ...
Flags: needinfo?(lcrouch)
Assignee
Comment 9•10 years ago
Assignee
Comment 10•10 years ago
Assignee
Comment 11•10 years ago
I've committed the change in wget flags from comment #7 to puppet. As I mentioned, with the changes I made, I don't think that Nagios will alert, so (as long as I haven't left out anything vital) we should be okay. Excluding the $revision URLs would definitely save on space, but I was avoiding doing so as it would mean that offline readers would not be able to go back through content history.

An expanded version of the test tarball is available at /mnt/netapp_dev/developer.mozilla.org. I ran two different commands, looking for large directories (with no subdirectories) and large files:

* find . -type d -links 2 -print0 | xargs -0 du | sort -n | tail -100 | cut -f2 | uniq | xargs -I{} du -sh {}
* find . -type f -print0 | xargs -0 du | sort -n | tail -100 | cut -f2 | uniq | xargs -I{} du -sh {}

I arbitrarily cut off the directory listing at anything 10M or greater. The file listing is cut off at 400K or greater. Both have been added to this bug as attachments.
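The shape of the second pipeline (largest files) can be sanity-checked on a throwaway tree; this reduced sketch drops the trailing human-readable du pass so the output is deterministic, and all paths are invented:

```shell
# Throwaway tree with files of known sizes (paths invented).
tmp=$(mktemp -d)
mkdir -p "$tmp/a" "$tmp/b"
dd if=/dev/zero of="$tmp/a/big"   bs=1024 count=64 2>/dev/null
dd if=/dev/zero of="$tmp/b/small" bs=1024 count=4  2>/dev/null

# Reduced form of the large-files pipeline: du per file, numeric
# sort by size, keep the largest entry, report its path.
biggest=$(cd "$tmp" && find . -type f -print0 | xargs -0 du \
  | sort -n | tail -1 | cut -f2)
printf '%s\n' "$biggest"   # → ./a/big
rm -rf "$tmp"
```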
Assignee
Comment 12•10 years ago
Ugh. It looks like the job generated a really small tarball (589M), which doesn't seem right. Can someone please 1) forward me any relevant messages sent to cron-mdn@mozilla.com about this and/or 2) temporarily add me to that alias so I can see what error messages are generated if I try to kick it off again?
Comment 13•10 years ago
I don't have any cron error emails. I wouldn't be surprised with all those excluded URLs if the tarball is really small. Like I said, the best way to check is probably to upload the new tarball somewhere and let an existing tarball user check it out to see if it has everything they expect.
Assignee
Comment 14•10 years ago
The small tarball is in the "normal" tarball place (/mnt/netapp/developer.mozilla.org/developer.mozilla.org.tar.gz). I guess I need to wait to see if any brickbats get thrown in my direction.
Assignee
Comment 15•10 years ago
The tarball this weekend weighs in at (a more reassuring) 1.4GB.
Assignee
Comment 16•10 years ago
This weekend's tarball comes in at 1.5GB. I've not heard of any complaints yet, so I'm going to mark this bug as closed. If there are issues, a new bug can be opened.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
Comment 17•9 years ago
cyliang, is there a chance that your heavily-stripped-down tarball can be hosted anywhere "outside"? The only one currently accessible from the Internet is jakem's in the /media/ folder. However, that one clocks in at about 2.5 GB instead. Your tarball looks much better optimized, but it is currently accessible only to people who have access to MDN's local intranet server cluster.
Comment 18•9 years ago
Disk went into alarm again.

[dgarvey@developeradm.private.scl3 ~]$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00   38G   33G  4.0G  90% /
tmpfs                            3.9G     0  3.9G   0% /dev/shm
/dev/sda1                         97M   58M   35M  63% /boot
10.22.75.91:/devmo/stage          30G   10G   21G  34% /mnt/netapp_stage
10.22.75.91:/devmo/prod           30G   19G   12G  63% /mnt/netapp
[dgarvey@developeradm.private.scl3 ~]$ sudo su -
[root@developeradm.private.scl3 ~]# cd /var/log/
[root@developeradm.private.scl3 log]# du -sh * | grep G
1.5G    httpd
[root@developeradm.private.scl3 log]# cd httpd/
[root@developeradm.private.scl3 httpd]# ls
access_log              access_log-20140622.gz  access_log-20150329  error_log-20150329
access_log-20140518.gz  access_log-20140629.gz  access_log-20150405  error_log-20150405
access_log-20140525.gz  access_log-20140706.gz  access_log-20150412  error_log-20150412
access_log-20140601.gz  access_log-20140713.gz  error_log
access_log-20140608.gz  access_log-20150322     error_log-20150322
[root@developeradm.private.scl3 httpd]# ls -lrt
total 1559084
-rw-r--r-- 1 root root  14669916 May 18  2014 access_log-20140518.gz
-rw-r--r-- 1 root root  23328448 May 25  2014 access_log-20140525.gz
-rw-r--r-- 1 root root  29450631 Jun  1  2014 access_log-20140601.gz
-rw-r--r-- 1 root root  33163136 Jun  8  2014 access_log-20140608.gz
-rw-r--r-- 1 root root  38834102 Jun 22  2014 access_log-20140622.gz
-rw-r--r-- 1 root root  46096736 Jun 29  2014 access_log-20140629.gz
-rw-r--r-- 1 root root  16451568 Jul  6  2014 access_log-20140706.gz
-rw-r--r-- 1 root root   1537846 Jul 10  2014 access_log-20140713.gz
-rw-r--r-- 1 root root 467089420 Mar 22 03:32 access_log-20150322
-rw-r--r-- 1 root root       635 Mar 22 03:32 error_log-20150322
-rw-r--r-- 1 root root 195542709 Mar 29 03:22 access_log-20150329
-rw-r--r-- 1 root root       321 Mar 29 03:22 error_log-20150329
-rw-r--r-- 1 root root       321 Apr  5 04:33 error_log-20150405
-rw-r--r-- 1 root root 247023813 Apr  5 04:33 access_log-20150405
-rw-r--r-- 1 root root 375465679 Apr 12 03:33 access_log-20150412
-rw-r--r-- 1 root root       531 Apr 12 03:33 error_log-20150412
-rw-r--r-- 1 root root       245 Apr 12 03:33 error_log
-rw-r--r-- 1 root root 106172015 Apr 13 20:17 access_log
[root@developeradm.private.scl3 httpd]# gzip access_log-20150322 access_log-20150329 access_log-20150405
[root@developeradm.private.scl3 httpd]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00   38G   32G  4.8G  88% /
tmpfs                            3.9G     0  3.9G   0% /dev/shm
/dev/sda1                         97M   58M   35M  63% /boot
10.22.75.91:/devmo/stage          30G   10G   21G  34% /mnt/netapp_stage
10.22.75.91:/devmo/prod           30G   19G   12G  63% /mnt/netapp
[root@developeradm.private.scl3 httpd]#
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/569] [id=nagios1.private.scl3.mozilla.com:386531] → [kanban:https://webops.kanbanize.com/ctrl_board/2/943] [id=nagios1.private.scl3.mozilla.com:386531]
Comment 19•9 years ago
That was short-lived. developeradm.private.scl3.mozilla.com:Disk - All is WARNING again.
Comment 20•9 years ago
kill -15'd the pid.

df -h
/dev/mapper/VolGroup00-LogVol00   38G   36G  938M  98% /
Assignee
Comment 21•9 years ago
[For comparison's sake, disk usage right now is: /dev/mapper/VolGroup00-LogVol00 38G 11G 26G 30% /]

The MDN tarball has grown in size since I last worked with it, weighing in at about 6GB. As part of that, our need for "swing space" as the tarball is generated has also grown: the total size of the files in the old tarball was about 9GB; in the current one, it's about 24.7GB.

Before I work with folks to coordinate getting more disk space, I wanted to re-verify that it is worth retaining revision information in the tarball. Back in the fall of 2014, it looked like we had a ratio of roughly 3 to 1 (revision files versus non-revision files). Right now, the ratio is more like 4 to 1. [1]

I ask because including the revision information does make the tarball bigger, and I wasn't sure what the tradeoffs were between having history and keeping the tarball under a certain size (to make it easier to download). If we need to bifurcate and have two tarballs (one with revisions and the other without), that's fine. I can add that into my calculations for more disk space.

[1] Looking at the byte size in the file listing of the tarball:
    revisions (file name contains "$revision"): 19GB current v. 6.61GB old
    non-revision: 4.92GB current v. 2.3GB old
Flags: needinfo?(lcrouch)
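The byte-size split in [1] can be recomputed mechanically from a `tar tzvf` file listing: sum the size column by whether the path contains "$revision". A sketch with an invented three-line listing (in GNU tar's verbose format, size is the third field and the path is the last):

```shell
# Invented fragment of a `tar tzvf` listing; size is field 3,
# path is the final field.
listing='-rw-r--r-- root/root 1000 2015-04-01 00:00 docs/Web$revision/1.html
-rw-r--r-- root/root 3000 2015-04-01 00:00 docs/Web$revision/2.html
-rw-r--r-- root/root 500 2015-04-01 00:00 docs/Web.html'

# Sum bytes into revision vs. non-revision buckets.
split=$(printf '%s\n' "$listing" | awk '
  { if (index($NF, "$revision")) rev += $3; else other += $3 }
  END { print rev, other }')
echo "$split"   # → 4000 500
```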
Comment 22•9 years ago
Well, I had better take back my question from comment 17. This is still so much in an alpha or beta stage that a publicly accessible release won't happen anytime soon...
Comment 23•9 years ago
Dropped it down below warning.

[root@developeradm.private.scl3 log]# du -sh * | grep G
1.7G    httpd
[root@developeradm.private.scl3 log]# cd httpd/
[root@developeradm.private.scl3 httpd]# ls -lrt
total 1755756
-rw-r--r-- 1 root root  14669916 May 18  2014 access_log-20140518.gz
-rw-r--r-- 1 root root  23328448 May 25  2014 access_log-20140525.gz
-rw-r--r-- 1 root root  29450631 Jun  1  2014 access_log-20140601.gz
-rw-r--r-- 1 root root  33163136 Jun  8  2014 access_log-20140608.gz
-rw-r--r-- 1 root root  38834102 Jun 22  2014 access_log-20140622.gz
-rw-r--r-- 1 root root  46096736 Jun 29  2014 access_log-20140629.gz
-rw-r--r-- 1 root root  16451568 Jul  6  2014 access_log-20140706.gz
-rw-r--r-- 1 root root   1537846 Jul 10  2014 access_log-20140713.gz
-rw-r--r-- 1 root root  11843456 Mar 22 03:32 access_log-20150322.gz
-rw-r--r-- 1 root root   4189082 Mar 29 03:22 access_log-20150329.gz
-rw-r--r-- 1 root root       321 Apr  5 04:33 error_log-20150405
-rw-r--r-- 1 root root   5377737 Apr  5 04:33 access_log-20150405.gz
-rw-r--r-- 1 root root   9016323 Apr 12 03:33 access_log-20150412.gz
-rw-r--r-- 1 root root       531 Apr 12 03:33 error_log-20150412
-rw-r--r-- 1 root root 557544514 Apr 19 03:33 access_log-20150419
-rw-r--r-- 1 root root       321 Apr 19 03:34 error_log-20150419
-rw-r--r-- 1 root root 782142116 Apr 26 03:28 access_log-20150426
-rw-r--r-- 1 root root       921 Apr 26 03:28 error_log-20150426
-rw-r--r-- 1 root root       455 Apr 27 12:58 error_log
-rw-r--r-- 1 root root 222361785 Apr 27 22:38 access_log
[root@developeradm.private.scl3 httpd]# gzip access_log-20150419
[root@developeradm.private.scl3 httpd]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00   38G   33G  4.4G  89% /
tmpfs                            3.9G     0  3.9G   0% /dev/shm
/dev/sda1                         97M   58M   35M  63% /boot
10.22.75.91:/devmo/stage          30G   11G   20G  34% /mnt/netapp_stage
10.22.75.91:/devmo/prod           30G   21G  9.7G  68% /mnt/netapp
[root@developeradm.private.scl3 httpd]#
Comment 24•9 years ago
developeradm.private.scl3.mozilla.com:Disk - All is CRITICAL: DISK CRITICAL - free space: / 2202 MB (5% inode=65%)

Followed the same steps as above. It has not come down to warning as of now.
Comment 25•9 years ago
C, let's get more space on the VM. The time spent debugging this every time it blows up (which it has been doing since July 2014) isn't worth it :D Bump the disk up to what we need and let's close this out.
Flags: needinfo?(cliang)
Comment 26•9 years ago
The root cause of this iteration of disk space issues seems to be the tarball script:

[root@developeradm.private.scl3 tmp]# du -hscx *
0       20140623_ocvi
4.0K    hsperfdata_infrasec
4.0K    hsperfdata_root
4.0K    pip-build-root
4.0K    ssh-NCiMa29885
4.0K    ssh-UNrLqw1467
24G     tarball-generate.SpyQKN
280K    vmware-root
24G     total

[root@developeradm.private.scl3 tarball-generate.SpyQKN]# ps aux | grep tarball
root      3703  0.0  0.0 103244   840 pts/2  S+ 04:37  0:00 grep tarball
apache   16026  0.0  0.0   9236  1128 ?      Ss Apr26  0:00 /bin/sh /data/bin/tarball-generate.sh

[root@developeradm.private.scl3 tarball-generate.SpyQKN]# strace -p 16026
Process 16026 attached - interrupt to quit
wait4(-1, ^C <unfinished ...>
Process 16026 detached

wait4 with -1 implies the process is waiting for its children (how sweeeet).

[root@developeradm.private.scl3 tarball-generate.SpyQKN]# pstree 16026
tarball-generat───wget

And that's the hang-up: wget. Which is

apache   16035  0.5  8.3 719220 673968 ?  SN Apr26 18:27 wget -q -m -p -k -E -T 5 -t 3 -R mov,ogv,mp4,gz,bz2,zip,exe,download,flag*,login*,*\$history,*\$edit,*\$translate,*\$move,*\$subscribe,*\$json -D developer.mozilla.org -X */profiles --reject-regex=(.*)(\$revert|search\?)(.*) -np https://developer.mozilla.org/en-US/

and it has been in progress since Apr 26th.... I haven't killed that yet; if we're in dire straits, that's the next target.
Assignee
Comment 27•9 years ago
TL;DR: This bug was discussed in the sprint. Having a tarball that is too big to download is not useful, so just asking for more space would not really fix the problem.

Luke: Have you had time to see if one of the other sites that maintain copies of MDN has a downloadable archive that we could point people to? Otherwise, I'll go ahead and excise the revisions from the tarball creation process, which should help.
Flags: needinfo?(cliang)
Assignee
Comment 28•9 years ago
Brief update:

* I've temporarily disabled the tarball generation cron job (since it *will* cause heartburn).
* Luke has reached out to see if anyone already has a downloadable doc-set that could act as a substitute.
Flags: needinfo?(lcrouch)
Assignee
Comment 29•9 years ago
I'm currently testing a tarball generation job that should exclude revisions. In case of a Nagios alert: if this causes issues WRT disk space, please note how much space is being used and grab a file listing of /mnt/netapp_stage/tarball-generate.w0Tynf before deleting it.
Assignee
Comment 30•9 years ago
Expanding regex and doing another test tarball generation run.
Assignee
Comment 31•9 years ago
The test tarball needed 13 GB to build and takes up approx. 3GB. I *think* this comes in just under the level at which we'll be setting off disk alerts. I updated the tarball generation script and re-enabled the job in cron. We'll see how she runs this weekend.
Comment 32•9 years ago
developeradm alerted for disk again:

> [root@developeradm.private.scl3 /]# du -chx --max-depth=1 /tmp | grep G
> 17G /tmp/tarball-generate.tfHgN8
> 14G /tmp/tarball-generate.tDJPEs

tDJPEs is the older of the two. For some reason, either the script or wget seems to have been killed off at some point. I've started creating a tarball of that into /mnt/netapp/developer.mozilla.org/developer.mozilla.org-new.tar.gz after doing a quick sanity check:

> [root@developeradm.private.scl3 tarball-generate.tDJPEs]# find developer.mozilla.org | wc -l
> 271231
> [root@developeradm.private.scl3 tarball-generate.tDJPEs]# tar tzvf /mnt/netapp/developer.mozilla.org/developer.mozilla.org.tar.gz | wc -l
> 268012

# screen -x tDJPEs to follow the tar

I'll leave it to :cyliang to retain one of the .tar.gz copies.
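The sanity check above (entry count on disk versus in the archive) can be sketched on a scratch tree; the two counts should agree because both find and GNU tar list directories as entries. Paths here are invented:

```shell
# Scratch tree standing in for the mirrored site (invented paths).
tmp=$(mktemp -d)
mkdir -p "$tmp/site/docs"
echo x > "$tmp/site/index.html"
echo y > "$tmp/site/docs/page.html"

# Create the archive, then count entries on disk and in the tarball;
# find lists 4 entries (site, index.html, docs, page.html), as does tar.
(cd "$tmp" && tar czf site.tar.gz site)
on_disk=$(cd "$tmp" && find site | wc -l)
in_tar=$(cd "$tmp" && tar tzf site.tar.gz | wc -l)
echo "$on_disk $in_tar"   # → 4 4
rm -rf "$tmp"
```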
Comment 33•9 years ago
I've removed /tmp/tarball-generate.tDJPEs now that the tarball has finished compressing; we're clear on disk space now...
Assignee
Comment 34•9 years ago
Starting another test run, with files being written to the /tarball directory.
Assignee
Comment 35•9 years ago
Test runs have been good. Updated puppet to reflect the location change and will verify that everything is working as expected next week.
Comment 36•9 years ago
BTW, why don't you guys use bzip2 instead (tar jcvf)? I think it has become pretty standard now. In my experience, this results in a significantly smaller tarball, especially with ASCII data.
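The size claim is easy to test on sample data. A sketch comparing the two compressors on an invented, highly repetitive input (real ratios on the MDN HTML tree will differ, and bzip2 is not guaranteed to win on every input, so the check below only asserts that both compress):

```shell
# Generate a repetitive text sample and compress it both ways.
tmp=$(mktemp -d)
seq 1 20000 > "$tmp/sample.txt"
orig=$(wc -c < "$tmp/sample.txt")
gzip  -c "$tmp/sample.txt" > "$tmp/sample.txt.gz"
bzip2 -c "$tmp/sample.txt" > "$tmp/sample.txt.bz2"

# Byte counts for the original and both compressed outputs.
gz=$(wc -c < "$tmp/sample.txt.gz")
bz=$(wc -c < "$tmp/sample.txt.bz2")
echo "orig=$orig gzip=$gz bzip2=$bz"
rm -rf "$tmp"
```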
Assignee
Comment 37•9 years ago
I generally try not to make decisions about file formats for MDN; my responsibility is to make sure the archive gets created. =) In this case, I don't think it would solve the problem with the amount of disk space being used: there's not really enough room on the disk to store all of the temporary files needed to create the archive (before it is compressed). If you feel strongly that the archive should be created with bzip2 instead, please create a separate bug. =)

I don't believe that the disk alert has been triggered by the last round of archive creation, which successfully finished on August 17th. Closing this bug.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → FIXED
Comment 38•9 years ago
:) No, not that strongly. But I was just trying to think a few steps ahead, like a chess player would. Though the tarball created here lives on the PRIVATE mozilla.com intranet network, it might someday be copied to a location accessible to the public. And a smaller tarball (thanks to bzip2's more effective compression) would greatly reduce the traffic caused by people downloading it.
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard