Closed Bug 1041871 Opened 10 years ago Closed 9 years ago

Disk - All on developeradm.private.scl3.mozilla.com is WARNING: DISK WARNING - free space: / 4036 MB (10% inode=53%):

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nagiosapi, Assigned: cliang)

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/943] [id=nagios1.private.scl3.mozilla.com:386531])

Attachments

(2 files)

Automated alert report from nagios1.private.scl3.mozilla.com:

Hostname: developeradm.private.scl3.mozilla.com
Service:  Disk - All
State:    WARNING
Output:   DISK WARNING - free space: / 4036 MB (10% inode=53%):

Runbook:  http://m.allizom.org/Disk+-+All
Running out of space during generate_tarball.sh run :(
Automated alert recovery:

Hostname: developeradm.private.scl3.mozilla.com
Service:  Disk - All
State:    OK
Output:   DISK OK
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Had to kill the generate-tarball.sh cronjob (which was running for 2 days) after disk usage hit 100%.

Since developeradm.private.scl3 is a VM, suggest increasing the disk to at least 80GB. For reference, after the script cleaned up after itself (I killed wget + the subsequent tar/gzip), about 30GB of space was taken up just by wget (gzip would need some working space as well). And that wasn't even the full archive...
Let's re-open this and use this bug as a starting place for actually fixing (or removing) this job.

@groovecoder: any thoughts on this?


The wget job looks like this:

wget -q -m -p -k -E -T 5 -t 3 -R 'mov,ogv,mp4,gz,bz2,zip,exe,download,flag*,login*,*\$history,*\$json' -D developer.mozilla.org -X '*/profiles' -np https://developer.mozilla.org/en-US/

This has always been slow to run, but I'm kinda wondering if the site has changed such that it's pulling in other pages now that it shouldn't be.
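
For anyone who doesn't have the wget manual handy, a key to the flags above (descriptions per the wget manual; this only annotates the existing job, it doesn't change it):

  # -q    quiet
  # -m    mirror: recursive download with timestamping and unlimited depth
  # -p    also fetch page requisites (CSS, images) for every page
  # -k    convert links in saved pages for offline viewing
  # -E    save text/html responses with an .html suffix
  # -T 5  network timeout of 5 seconds
  # -t 3  retry each URL up to 3 times
  # -R    reject files whose names match these suffixes/patterns
  # -D    restrict recursion to developer.mozilla.org
  # -X    skip the */profiles directory tree
  # -np   never ascend above the /en-US/ starting point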


I think I may have found a couple problems:

1)
developer.mozilla.org/en-US/search?locale=*&kumascript_macros=interface&page=39.html
developer.mozilla.org/en-US/search?locale=*&kumascript_macros=nsprapiref&page=7.html
developer.mozilla.org/en-US/search?locale=*&kumascript_macros=DomRef&page=201.html
developer.mozilla.org/en-US/search?locale=*&kumascript_macros=gecko_minversion_inline&page=6.html
developer.mozilla.org/en-US/search?locale=*&kumascript_macros=jsapi-requires-request&page=1.html

2)
developer.mozilla.org/en-US/Firefox_OS/Releases$revision/541825.html
developer.mozilla.org/en-US/Firefox_OS/Releases$revision/541827.html
developer.mozilla.org/en-US/Firefox_OS/Releases$revision/542127.html
developer.mozilla.org/en-US/Firefox_OS/Releases$revision/540489.html
developer.mozilla.org/en-US/Firefox_OS/Releases$revision/541837.html

3)
developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/627455.html
developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/360655.html
developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/520743.html
developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/360661.html
developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/533155.html
developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/533143.html
developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/360653.html
developer.mozilla.org/en-US/Firefox_OS/Developing_Gaia/Localizing_Firefox_OS$revert/360663.html

#2 and #3 seem to be the worst offenders. Perhaps the way page revisions are presented changed and now wget is fetching lots of them?
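
A quick way to gauge how much of the mirror those account for, run from inside the wget working copy (the directory name here is assumed from the -D/-np options, not taken from the job itself):

  cd developer.mozilla.org
  find . -path '*$revision*' -o -path '*$revert*' | wc -l   # revision/revert pages
  find . -name 'search\?*' | wc -l                          # search result pages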
Assignee: nobody → server-ops-webops
Status: RESOLVED → REOPENED
Component: Server Operations: MOC → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
QA Contact: dmoore → nmaul
Resolution: FIXED → ---
Whiteboard: [id=nagios1.private.scl3.mozilla.com:386531] → [kanban:https://kanbanize.com/ctrl_board/4/569] [id=nagios1.private.scl3.mozilla.com:386531]
We recently changed the way $history pages are presented[1], but I don't remember changing the $revision view.

In any case, we create this tarball for https://bugzilla.mozilla.org/show_bug.cgi?id=757461.

Since it's used for reading MDN offline, I think it can exclude all of these URLs, which require the server:

* search?
* $revert
* $edit
* $translate
* $move
* $subscribe

If excluding those isn't enough, we can also try:

* $revision

But that means offline readers of the tarball won't be able to go back thru the content history. :/

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1034709#c4
Assignee: server-ops-webops → cliang
Short form:

New version of wget installed manually.  Testing is going on.

Long form:

The latest version of wget available from RHEL for this server is 1.12, which does not support the --reject-regex flag.  This flag is necessary for rejecting URLs with a query string in them. [1] I shaved yaks and compiled a newer version of wget (1.14), which does.

I did test how much data would be saved by skipping all of the $<foo> URLs; however, that does not skip nearly as many entries as rejecting the 'search?' ones would, hence the wget upgrade.


[1] "Note, too, that query strings (strings at the end of a URL beginning with a question mark (‘?’) are not included as part of the filename for accept/reject rules, even though these will actually contribute to the name chosen for the local file."
I've a test tarball on developeradm.private.scl3.mozilla.com: /mnt/netapp/developer.mozilla.org/developer.mozilla.org.test.tar.gz.  Is there some way to verify that the tarball is "good" for offline reading?


I've been running a test version of the script with the following wget command:

wget -q -m -p -k -E -T 5 -t 3 \
  -R 'mov,ogv,mp4,gz,bz2,zip,exe,download,flag*,login*,*\$history,*\$edit,*\$translate,*\$move,*\$subscribe,*\$json' \
  -D developer.mozilla.org \
  -X '*/profiles' \
  --reject-regex='(.*)(\$revert|search\?)(.*)' \
  -np https://developer.mozilla.org/en-US/


N.B.
1) In the exclude-directories flag (-X), wildcards do not match directory delimiter characters, which is why URLs containing $revert are rejected via the --reject-regex flag instead.
2) You cannot pass multiple regexes to --reject-regex, so all of the patterns have to be combined into a single alternation.


The temporary directory holding files got up to 9.1GB.  This drops the disk free percentage to about 30-35% (depending on the size of other files), which should be enough to not set off the alert.  The tarball itself doesn't seem to be much smaller (still about 1.7GB).
Flags: needinfo?(lcrouch)
No good way to verify except for spot-checking, and I think the best way to do that is to simply release the new version and wait for any bugs to come in.

Is there any way to get a report of which paths are taking up the most space? Did you try excluding $revision URLs too? That is probably the biggest ...
Flags: needinfo?(lcrouch)
I've committed the change in wget flags from comment #7 to puppet. 


As I mentioned, with the changes I made, I don't think that Nagios will alert, so (as long as I haven't left out anything vital) we should be okay.  Excluding the $revision URLs would definitely save space, but I was avoiding doing so since it would mean that off-line readers would not be able to go back through content history.

An expanded version of the test tarball is available at /mnt/netapp_dev/developer.mozilla.org.  I ran two different commands, looking for large directories (with no subdirectories) and large files:

  * find . -type d -links 2 -print0 |xargs -0 du |sort -n |tail -100 | cut -f2 | uniq | xargs -I{} du -sh {}
  * find . -type f -print0 |xargs -0 du |sort -n |tail -100 | cut -f2 | uniq | xargs -I{} du -sh {}
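
(The -links 2 test restricts the first command to leaf directories, i.e. directories with no subdirectories, presumably so that parent directories don't swamp the listing; the second command does the same walk over individual files.)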

I arbitrarily cut off the directory listing at anything 10M or greater.
The file listing is cut off at 400K or greater.
Both have been added to this bug as attachments.
Ugh.  It looks like the job generated a really small tarball (589M), which doesn't seem right.  Can someone please 1) forward me any relevant messages sent to cron-mdn@mozilla.com about this and/or 2) temporarily add me to that alias so I can see what error messages are generated if I try to kick it off again?
I don't have any cron error emails. I wouldn't be surprised with all those excluded URLs if the tarball is really small.

Like I said, the best way to check is probably to upload the new tarball somewhere and let an existing tarball user check it out to see if it has everything they expect.
The small tarball is in the "normal" tarball place (/mnt/netapp/developer.mozilla.org/developer.mozilla.org.tar.gz).  I guess I need to wait to see if any brickbats get thrown in my direction.
The tarball this weekend weighs in at (a more reassuring) 1.4GB.
This weekend's tarball comes in at 1.5GB.  

I've not heard of any complaints yet, so I'm going to mark this bug as closed.  If there are issues, a new bug can be opened.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Resolution: --- → FIXED
cyliang, is there a chance that your heavily-stripped-down tarball can be hosted anywhere "outside"? 

The only one currently accessible from the Internet is jakem's, in the /media/ folder; however, that one clocks in at about 2.5 GB. Your tarball looks much better optimized, but it is currently accessible only to people with access to MDN's local intranet server cluster.
Disk went into alarm again:

[dgarvey@developeradm.private.scl3 ~]$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00   38G   33G  4.0G  90% /
tmpfs                            3.9G     0  3.9G   0% /dev/shm
/dev/sda1                         97M   58M   35M  63% /boot
10.22.75.91:/devmo/stage          30G   10G   21G  34% /mnt/netapp_stage
10.22.75.91:/devmo/prod           30G   19G   12G  63% /mnt/netapp
[dgarvey@developeradm.private.scl3 ~]$ sudo su -
[root@developeradm.private.scl3 ~]# cd /var/log/
[root@developeradm.private.scl3 log]# du -sh * | grep G
1.5G	httpd
[root@developeradm.private.scl3 log]# cd httpd/
[root@developeradm.private.scl3 httpd]# ls
access_log              access_log-20140622.gz  access_log-20150329  error_log-20150329
access_log-20140518.gz  access_log-20140629.gz  access_log-20150405  error_log-20150405
access_log-20140525.gz  access_log-20140706.gz  access_log-20150412  error_log-20150412
access_log-20140601.gz  access_log-20140713.gz  error_log
access_log-20140608.gz  access_log-20150322     error_log-20150322
[root@developeradm.private.scl3 httpd]# ls -lrt
total 1559084
-rw-r--r-- 1 root root  14669916 May 18  2014 access_log-20140518.gz
-rw-r--r-- 1 root root  23328448 May 25  2014 access_log-20140525.gz
-rw-r--r-- 1 root root  29450631 Jun  1  2014 access_log-20140601.gz
-rw-r--r-- 1 root root  33163136 Jun  8  2014 access_log-20140608.gz
-rw-r--r-- 1 root root  38834102 Jun 22  2014 access_log-20140622.gz
-rw-r--r-- 1 root root  46096736 Jun 29  2014 access_log-20140629.gz
-rw-r--r-- 1 root root  16451568 Jul  6  2014 access_log-20140706.gz
-rw-r--r-- 1 root root   1537846 Jul 10  2014 access_log-20140713.gz
-rw-r--r-- 1 root root 467089420 Mar 22 03:32 access_log-20150322
-rw-r--r-- 1 root root       635 Mar 22 03:32 error_log-20150322
-rw-r--r-- 1 root root 195542709 Mar 29 03:22 access_log-20150329
-rw-r--r-- 1 root root       321 Mar 29 03:22 error_log-20150329
-rw-r--r-- 1 root root       321 Apr  5 04:33 error_log-20150405
-rw-r--r-- 1 root root 247023813 Apr  5 04:33 access_log-20150405
-rw-r--r-- 1 root root 375465679 Apr 12 03:33 access_log-20150412
-rw-r--r-- 1 root root       531 Apr 12 03:33 error_log-20150412
-rw-r--r-- 1 root root       245 Apr 12 03:33 error_log
-rw-r--r-- 1 root root 106172015 Apr 13 20:17 access_log
[root@developeradm.private.scl3 httpd]# gzip access_log-20150322 access_log-20150329 access_log-20150405
[root@developeradm.private.scl3 httpd]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00   38G   32G  4.8G  88% /
tmpfs                            3.9G     0  3.9G   0% /dev/shm
/dev/sda1                         97M   58M   35M  63% /boot
10.22.75.91:/devmo/stage          30G   10G   21G  34% /mnt/netapp_stage
10.22.75.91:/devmo/prod           30G   19G   12G  63% /mnt/netapp
[root@developeradm.private.scl3 httpd]#
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/569] [id=nagios1.private.scl3.mozilla.com:386531] → [kanban:https://webops.kanbanize.com/ctrl_board/2/943] [id=nagios1.private.scl3.mozilla.com:386531]
That was short-lived.  developeradm.private.scl3.mozilla.com:Disk - All is WARNING again.
kill -15'd the pid. 
df -h
/dev/mapper/VolGroup00-LogVol00   38G   36G  938M  98% /
[For comparison's sake, disk usage right now is:
   /dev/mapper/VolGroup00-LogVol00   38G   11G   26G  30% /]



The MDN tarball has grown in size since I last worked with it, weighing in at about 6GB.  As part of that, our need for “swing space” as the tarball is generated has also grown: the total size of the files in the old tarball was about 9GB; in the current one, it’s about 24.7GB.

Before I work with folks to coordinate getting more disk space, I wanted to re-verify that it is worth retaining revision information in the tarball.  Back in the fall of 2014, it looked like we had a ratio of roughly 3 to 1 (revision files versus non-revision files).  Right now, the ratio is more like 4 to 1. [1]  I ask because including the revision information does make the tarball bigger, and I wasn't sure what the tradeoffs were between having history and keeping the tarball under a certain size (to make it easier to download).

If we need to bifurcate and have two tarballs (one with revisions and the other without), that’s fine.  I can add that into my calculations for more disk space. 


[1] Looking at the byte size in the file listing of the tarball:
    revisions (file name contained “$revision”): 19GB current v. 6.61GB old 
    non-revision:  4.92 GB current v. 2.3GB old
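
    For the record, a breakdown like [1] can be pulled straight from the tarball's own listing; a sketch (field numbers assume GNU tar's verbose listing, where the size is the third column and the path the last, and that paths contain no spaces):

      tar tzvf developer.mozilla.org.tar.gz | \
        awk '{ if ($NF ~ /\$revision/) rev += $3; else other += $3 }
             END { printf "revision: %.2f GB, non-revision: %.2f GB\n", rev/2^30, other/2^30 }'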
Flags: needinfo?(lcrouch)
Well, I had better take back my question from comment 17.
This is still very much in an alpha/beta stage, so a publicly accessible release won't happen overnight...
Dropped it down below warning.

[root@developeradm.private.scl3 log]# du -sh * | grep G
1.7G	httpd
[root@developeradm.private.scl3 log]# cd httpd/
[root@developeradm.private.scl3 httpd]# ls -lrt
total 1755756
-rw-r--r-- 1 root root  14669916 May 18  2014 access_log-20140518.gz
-rw-r--r-- 1 root root  23328448 May 25  2014 access_log-20140525.gz
-rw-r--r-- 1 root root  29450631 Jun  1  2014 access_log-20140601.gz
-rw-r--r-- 1 root root  33163136 Jun  8  2014 access_log-20140608.gz
-rw-r--r-- 1 root root  38834102 Jun 22  2014 access_log-20140622.gz
-rw-r--r-- 1 root root  46096736 Jun 29  2014 access_log-20140629.gz
-rw-r--r-- 1 root root  16451568 Jul  6  2014 access_log-20140706.gz
-rw-r--r-- 1 root root   1537846 Jul 10  2014 access_log-20140713.gz
-rw-r--r-- 1 root root  11843456 Mar 22 03:32 access_log-20150322.gz
-rw-r--r-- 1 root root   4189082 Mar 29 03:22 access_log-20150329.gz
-rw-r--r-- 1 root root       321 Apr  5 04:33 error_log-20150405
-rw-r--r-- 1 root root   5377737 Apr  5 04:33 access_log-20150405.gz
-rw-r--r-- 1 root root   9016323 Apr 12 03:33 access_log-20150412.gz
-rw-r--r-- 1 root root       531 Apr 12 03:33 error_log-20150412
-rw-r--r-- 1 root root 557544514 Apr 19 03:33 access_log-20150419
-rw-r--r-- 1 root root       321 Apr 19 03:34 error_log-20150419
-rw-r--r-- 1 root root 782142116 Apr 26 03:28 access_log-20150426
-rw-r--r-- 1 root root       921 Apr 26 03:28 error_log-20150426
-rw-r--r-- 1 root root       455 Apr 27 12:58 error_log
-rw-r--r-- 1 root root 222361785 Apr 27 22:38 access_log
[root@developeradm.private.scl3 httpd]# gzip access_log-20150419
[root@developeradm.private.scl3 httpd]# df -h
Filesystem                       Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup00-LogVol00   38G   33G  4.4G  89% /
tmpfs                            3.9G     0  3.9G   0% /dev/shm
/dev/sda1                         97M   58M   35M  63% /boot
10.22.75.91:/devmo/stage          30G   11G   20G  34% /mnt/netapp_stage
10.22.75.91:/devmo/prod           30G   21G  9.7G  68% /mnt/netapp
[root@developeradm.private.scl3 httpd]#
developeradm.private.scl3.mozilla.com:Disk - All is CRITICAL: DISK CRITICAL - free space: / 2202 MB (5% inode=65%)

Followed the same steps as above. Has not come down to warning as of now.
C,

Let's get more space on the VM. The time to debug this every time it's blown up (since 2014 July) isn't worth it :D Bump up disk to what we need and let's close this out.
Flags: needinfo?(cliang)
The root cause of this iteration of disk space issues seems to be the tarball script:

[root@developeradm.private.scl3 tmp]# du -hscx *
0	20140623_ocvi
4.0K	hsperfdata_infrasec
4.0K	hsperfdata_root
4.0K	pip-build-root
4.0K	ssh-NCiMa29885
4.0K	ssh-UNrLqw1467
24G	tarball-generate.SpyQKN
280K	vmware-root
24G	total

[root@developeradm.private.scl3 tarball-generate.SpyQKN]# ps aux | grep tarball
root      3703  0.0  0.0 103244   840 pts/2    S+   04:37   0:00 grep tarball
apache   16026  0.0  0.0   9236  1128 ?        Ss   Apr26   0:00 /bin/sh /data/bin/tarball-generate.sh
[root@developeradm.private.scl3 tarball-generate.SpyQKN]# strace -p 16026
Process 16026 attached - interrupt to quit
wait4(-1, ^C <unfinished ...>
Process 16026 detached

wait4 with -1 implies the process is waiting for its children (how sweeeet)

[root@developeradm.private.scl3 tarball-generate.SpyQKN]# pstree 16026
tarball-generat───wget

and that's the hang up, wget. 

Which is 

apache   16035  0.5  8.3 719220 673968 ?       SN   Apr26  18:27 wget -q -m -p -k -E -T 5 -t 3 -R mov,ogv,mp4,gz,bz2,zip,exe,download,flag*,login*,*\$history,*\$edit,*\$translate,*\$move,*\$subscribe,*\$json -D developer.mozilla.org -X */profiles --reject-regex=(.*)(\$revert|search\?)(.*) -np https://developer.mozilla.org/en-US/

And it has still been in progress since Apr 26th....

I haven't killed that yet, if we're in dire straits, that's the next target.
TL;DR: This bug was discussed in the sprint.  Having a tarball that is too big to download is not useful, so just asking for more space would not really fix the problem.

Luke: Have you had time to see if one of the other sites that maintain copies of MDN has a downloadable archive that we could point people to?  Otherwise, I'll go ahead and excise the revisions from the tarball creation process, which should help.
Flags: needinfo?(cliang)
Brief update: 
  * I've temporarily disabled the tarball generation cron job (since it *will* cause heartburn).
  * Luke has reached out to see if anyone already has a downloadable doc-set that could act as a substitute.
Flags: needinfo?(lcrouch)
I'm currently testing a tarball generation job that should exclude revisions.  

In case of Nagios alert: if this causes issues WRT disk space, please note how much space is being used and grab a file listing of /mnt/netapp_stage/tarball-generate.w0Tynf before deleting it.
Expanding regex and doing another test tarball generation run.
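
Since --reject-regex can only be given once (see comment 7), the expansion presumably just folds $revision into the existing alternation, along the lines of the following (illustrative only; the exact pattern committed to puppet may differ):

  --reject-regex='(.*)(\$revert|\$revision|search\?)(.*)'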
Test tarball needed 13 GB to build and takes up approx. 3GB.    I *think* this comes in just under the level at which we'll be setting off disk alerts.

Updated tarball generation script and re-enabled job in cron.  We'll see how she runs this weekend.
See Also: → 1166087
developeradm alerted for disk again:

> [root@developeradm.private.scl3 /]# du -chx --max-depth=1 /tmp | grep G
> 17G	/tmp/tarball-generate.tfHgN8
> 14G	/tmp/tarball-generate.tDJPEs

tDJPEs is the older of the two. For some reason, either the script or wget seems to have been killed off at some point. I've started creating a tarball of that into /mnt/netapp/developer.mozilla.org/developer.mozilla.org-new.tar.gz after doing a quick sanity check:

> [root@developeradm.private.scl3 tarball-generate.tDJPEs]# find developer.mozilla.org | wc -l
> 271231
> [root@developeradm.private.scl3 tarball-generate.tDJPEs]# tar tzvf /mnt/netapp/developer.mozilla.org/developer.mozilla.org.tar.gz | wc -l
> 268012

# screen -x tDJPEs to follow the tar

I'll leave it to :cyliang to retain one of the .tar.gz copies.
I've removed /tmp/tarball-generate.tDJPEs after the tarball finished compressing, clear for disk space now...
Starting another test run, with files being written to the /tarball directory.
Test runs have been good.  Updated puppet to reflect the location change and will verify that everything is working as expected next week.
BTW, why don't you guys use bzip2 instead? (tar jcvf) I think it has become pretty standard now. In my experience, this results in a significantly smaller tarball, especially with ASCII data.
I generally try not to make decisions about file formats for MDN; my responsibility is to make sure the archive gets created. =) In this case, I don't think it would solve the problem with the amount of disk space being used: there's not really enough room on the disk to store all of the temporary files needed to create the archive (before it is compressed).  If you feel strongly that the archive should be created with bzip2 instead, please create a separate bug. =)

I don't believe that the disk alert has been triggered by the last round of archive creation, which successfully finished on August 17th.  Closing this bug.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → FIXED
:) No, not that strongly. But I was just trying to think a few steps ahead like a chess player would.

Though the tarball created here lives on the PRIVATE mozilla.com intranet network, it might be copied at some point to a location accessible to the public.
With a smaller tarball at hand (thanks to bzip2's more effective compression), that would greatly reduce the traffic caused by people downloading it.
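
For what it's worth, a bzip2 copy could be produced from the existing archive without re-crawling anything; a sketch with assumed paths:

  gunzip -c developer.mozilla.org.tar.gz | bzip2 -9 > developer.mozilla.org.tar.bz2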
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard