Closed Bug 715840 Opened 12 years ago Closed 12 years ago

Disk space issues on surf (stage)

Categories

(Release Engineering :: General, defect, P2)

x86
All
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: nthomas)

Attachments

(1 file)

eg 
[72] surf:disk - /mnt/netapp/stage/archive.mozilla.org/pub/firefox is WARNING: DISK WARNING - free space: /mnt/netapp/stage/archive.mozilla.org/pub/firefox 180264 MB (4% inode=97%):
Attached file Analysis
tl;dr
* we've seen more than usual data arrive on the disks for the last 3 x 24 hours, this is presumably a post-holiday rush
* the rapid release process (particularly doing a beta-a-week) means we consume disk space more rapidly than we did
* there are some short-term mitigations we can do, but we need to be re-examining our retention policies and asking about more space until the scl3 solution arrives
Mitigation 1: push bits in mobile/tinderbox-builds/ older than 4 days (ie android) onto cm-ixstore. This matches what we do for Firefox. Bug 715706 is setting up the mount points, and I have an initial transfer going on, with a cron job to follow. Strangely, the free space on /mnt/netapp/stage isn't going up, eg 
                 netapp free     cm-ixstore01 used
    before        164G                69G
    now           162G               112G
Depends on: 715706
/mnt/netapp/stage:
------------------
aka the non-firefox partition, aka everything else

I don't understand the lack of improving netapp free-space, but it's late so perhaps I'm getting confused (certainly can't spell, and the mounts are tangled with symlinks). Populating mobile/tinderbox-builds/old/ should save us about 380G on this partition, once the transfer finishes (running as ffxbld@stage). It will need bug 715706 to be finished by dustin/IT, specifically setting up the mounts on ftp.m.o, so that the files in old/ are visible everywhere.

The emergency response to very low disk is to remove 
  /mnt/netapp/stage/releases.mozilla.org
which will free about 155G. Prior to that you'd need to ask IT to disable the line
  */15 * * * * root /root/bin/sync-stage1-releases
in
  stage:/etc/crond/ftp-staging-rw-server
so that it doesn't get put back again. AFAIK that is only used for populating /pub/mozilla.org/zz/rsyncd-motd and the nagios checks on surf for the module sizes, but check with justdave if you can raise him.

/mnt/netapp/stage/archive.mozilla.org/pub/firefox:
--------------------------------------------------
aka the firefox partition
I've got a move of firefox/nightly/{8.0,8.0.1}-candidates to ~cltbld/old-candidate-dirs/firefox running, which is 95G more space. That might cause problems if we need to do an 8.0 -> 9.0.x partial. 

The emergency response to low disk is to modify
   stage:~ffxbld/bin/cleanup_tinderbox-builds.sh
to change the |-mtime +3| to |-mtime +2| or +1 (3 or 2 days). This will shift files to cm-ixstore01 (non-HA!) earlier than we do now. Note: there is no locking around the cron job, so it would be worth disabling the cron (ffxbld@stage) and running the command manually for the first call after that change.
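
For reference, the effect of changing that threshold can be sketched in Python. This only illustrates how find's `-mtime +N` test selects files (age counted in whole 24-hour periods, strictly greater than N). It is not the contents of cleanup_tinderbox-builds.sh, which aren't shown in this bug, and the function names are made up for the example.

```python
import os
import time

SECONDS_PER_DAY = 86400


def matches_mtime_plus(mtime, n, now):
    """Mimic find's `-mtime +n`: age in whole days must be strictly greater than n."""
    age_days = int((now - mtime) // SECONDS_PER_DAY)
    return age_days > n


def files_older_than(root, n, now=None):
    """Yield paths of files under root that `find root -type f -mtime +n` would select."""
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if matches_mtime_plus(os.lstat(path).st_mtime, n, now):
                yield path
```

Because the fractional part of the age is dropped, +3 only matches files at least 4 days old, +2 at least 3 days, and +1 at least 2 days, which is why the thresholds above are described as "3 or 2 days".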

Trending:
---------
See stage:~nthomas/stage_data.log. The format is 
  <date> <free MB on /mnt/netapp/stage> <free MB on firefox partition>
Graph at http://people.mozilla.com/~nthomas/trend-recent.png
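
A rough sketch of how a log in that three-column format could be turned into a consumption rate. The date format isn't specified above, so this assumes ISO dates (YYYY-MM-DD) and integer MB values; the function names are hypothetical.

```python
from datetime import date


def parse_line(line):
    """Split '<date> <free MB on stage> <free MB on firefox>' into typed fields."""
    day, stage_free_mb, firefox_free_mb = line.split()
    return date.fromisoformat(day), int(stage_free_mb), int(firefox_free_mb)


def mb_per_day(first, last):
    """Average change in free MB per day between two (date, free_mb) samples."""
    (d0, free0), (d1, free1) = first, last
    days = (d1 - d0).days
    if days <= 0:
        raise ValueError("samples must be at least one day apart, oldest first")
    return (free1 - free0) / days


def days_until_full(free_mb, rate_mb_per_day):
    """Days left at the current rate, or None if free space is not shrinking."""
    if rate_mb_per_day >= 0:
        return None
    return free_mb / -rate_mb_per_day
```

Feeding it the first and last samples of a window gives a crude linear estimate; it ignores the weekly cycle of try and tinderbox builds, so treat the result as a rough guide only.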
Here's a du -hsx of the subdirs of /mnt/netapp/stage/archive.mozilla.org/pub:
15G     addons
166G    calendar
97G     camino
648K    cck
54M     chimera
2.8G    data
93M     directory
3.3G    diskimages
2.5M    extensions
5.6G    firebird
3.5M    grendel
9.4G    js
33M     l10n-kits
455M    labs
2.8G    minimo
1011G   mobile
128G    mozilla
848K    msgsdk
303M    nspr
620K    OJI
382M    phoenix
20M     profiles
1.1T    seamonkey
8.3G    security
49M     static-analysis
1.4T    thunderbird
251M    utilities
762M    webtools
1.4T    xulrunner
44K     zz

The highlights (over 1TB) are xulrunner, thunderbird, seamonkey, and mobile.  Are any of those sizes surprising (I'm new to this)?
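
For picking the biggest entries out of a listing like this without eyeballing it, something along these lines would do. It is a small sketch that assumes `du -h` style suffixes with 1024-based units; the helper names are made up.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text):
    """Convert a du -h size such as '1011G' or '1.4T' to bytes."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def largest(du_output, count):
    """Return the names of the `count` largest entries in du -h output."""
    rows = []
    for line in du_output.splitlines():
        if not line.strip():
            continue
        size, name = line.split(None, 1)
        rows.append((parse_size(size), name.strip()))
    rows.sort(reverse=True)
    return [name for _size, name in rows[:count]]
```

Since the human-readable sizes are rounded, entries with the same printed size (eg the two at 1.4T) can't be ordered reliably against each other.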
This probably isn't news to anyone but me, but nightlies appear to be the unbounded-growth item here:

30M     mobile/1.0b1/
29M     mobile/1.0b2/
31M     mobile/1.0b3/
38M     mobile/1.0b4/
37M     mobile/1.0b5/
38M     mobile/1.0rc1/
38M     mobile/1.0rc3/
19M     mobile/b1rc3/
96G     mobile/candidates/
62M     mobile/dists/
701G    mobile/nightly/
80G     mobile/releases/
2.7G    mobile/repos/
534M    mobile/source/
132G    mobile/tinderbox-builds/
4.0K    seamonkey/bundles/
3.7G    seamonkey/experimental/
791G    seamonkey/nightly/
219G    seamonkey/releases/
46G     seamonkey/tinderbox-builds/
102M    thunderbird/bundles/
9.4M    thunderbird/extensions/
61M     thunderbird/m-builds/
796G    thunderbird/nightly/
527G    thunderbird/releases/
72M     thunderbird/test/
13G     thunderbird/tinderbox-builds/
29G     thunderbird/try-builds/
3.3M    xulrunner/eclipse/
96K     xulrunner/mar-generation-tools/
1.4T    xulrunner/nightly/
45G     xulrunner/releases/
7.8G    xulrunner/tinderbox-builds/
Depends on: 716441
(In reply to Dustin J. Mitchell [:dustin] from comment #5)
> 701G    mobile/nightly/

Possibly some cleanup to be done here, but this might be accurate.

> 791G    seamonkey/nightly/

cc-ing Callek to see whether there's anything to be done for SeaMonkey here.

> 1.4T    xulrunner/nightly/

Ugh, all the release candidates dirs are still in there, most of which can probably go away since they're surprisingly large with source+bundles+sdks.

Needs similar crontabs setup to ffxbld too, I reckon.
Thanks for looking Dustin. I wasn't aware of size of xulrunner/ or mobile/, and there may be things that needn't be kept in thunderbird/ or seamonkey/ too. Eg for Firefox we only keep a zip for windows instead of windows+zip, delete out old nightly mar and test files. I'll follow up to analyse those some more.

Our current state is pretty good:
> Filesystem            Size  Used Avail Use% Mounted on
> 10.253.0.10:/vol/ftp_stage
>                       6.1T  5.5T  595G  91% /mnt/netapp/stage
> 10.253.0.11:/vol/stage
>                       4.2T  3.7T  471G  89% /mnt/netapp/stage/archive.mozilla.org/pub/firefox

On the first partition that is mostly the mobile/tinderbox-builds change, shuffling the older builds to cm-ixstore. It took some time for that to show up in df output, which everyone professes to be weird but somehow related to when the netapp actually frees the space (eg during the daily deduping).

On the second (firefox) that seems to be a combination of starting to move a big chunk of a very busy last week off to cm-ixstore01, and lerxst allocating another 100G.
xulrunner:
* tinderbox-builds/: static, 2.5G. Copies of the xulrunner nightlies on central, aurora, and 1.9.2

* nightlies/:
 * for each day with code changes, 810M / branch for central, aurora & beta; 220M for 
   1.9.2
 * unnecessary beta nightlies will be disabled in bug 716775
 * bug 661244 should get central and aurora back under 300M (ie save 500M/day)

* release related: 
 * releases/: 1150MB growth per beta or release, permanent storage
 * nightly/: 1250MB temporary per build cycle for nightly/<version>-candidates. 
             Deletable after we release a version
 * bug 661244 would also help here
 * about 60G in old candidates directories, which I propose we just nuke since 
   there's nothing in there that's not in releases/ already. Oh, actually 9.0b5, 
   9.0b6, and 10.0b1 releases never got copied into releases/, so should double check 
   that assertion and do fixes.
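
To put the nightly figures above together, here is a back-of-the-envelope calculation. It assumes every branch has code changes every day (so it is an upper bound) and takes the "under 300M" target for bug 661244 as exactly 300M; the names are just for illustration.

```python
FULL_MB = 810        # per branch per day: central, aurora, beta
BRANCH_192_MB = 220  # 1.9.2 nightlies per day
REDUCED_MB = 300     # assumed per-branch size once bug 661244 lands


def daily_mb(full_branches, reduced_branches=0):
    """Upper bound on xulrunner nightly growth per day, in MB."""
    return full_branches * FULL_MB + reduced_branches * REDUCED_MB + BRANCH_192_MB


current = daily_mb(3)                         # central + aurora + beta + 1.9.2
without_beta = daily_mb(2)                    # after bug 716775
after_661244 = daily_mb(0, reduced_branches=2)

print(current, without_beta, after_661244)    # 2650 1840 820
```

On these assumptions the two bugs together would cut worst-case nightly growth from about 2.6G/day to about 0.8G/day, with roughly 500M/day saved on each of central and aurora.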
(In reply to Nick Thomas [:nthomas] from comment #8)
>  * about 60G in old candidates directories, which I propose we just nuke
> since 
>    there's nothing in there that's not in releases/ already. Oh, actually
> 9.0b5, 
>    9.0b6, and 10.0b1 releases never got copied into releases/, so should
> double check 
>    that assertion and do fixes.

Nick: were you waiting on feedback before proceeding with the cleanup? I say go for it.
After another disk crunch, I've set the cron jobs to keep 3 days of {firefox,mobile}/tinderbox-builds on the netapp disk before moving them to cm-ixstore. The same is happening for firefox/try-builds, but is still running. The normal cron job is disabled while that happens because the first sync is > 1hr.

(In reply to Chris Cooper [:coop] from comment #9)
> Nick: were you waiting on feedback before proceeding with the cleanup? I say go for 
> it.

Aki mostly did this a few days ago.
try-builds is done and the cron re-enabled.
/mnt/netapp/stage/archive.mozilla.org/pub/firefox is CRITICAL: DISK CRITICAL - free space: /mnt/netapp/stage/archive.mozilla.org/pub/firefox 127766 MB (2% inode=96%)
Turns out some locking was added for moving older Firefox try builds over to cm-ixstore, but a stale lock was left after setting that up so it was effectively disabled. The catchup is running.
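
One way to keep a leftover lock from silently disabling a job like this is an advisory flock() lock, which the kernel drops when the holding process exits, so a lock file left on disk doesn't block later runs. A minimal sketch, assuming a Unix host and a lock file on local disk; the helper name is hypothetical and this is not the locking that was actually added for try-builds.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path):
    """Yield True if we obtained the lock, False if another run holds it."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking: fail immediately instead of queueing behind a running job.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A cron job would wrap its work in `with exclusive_lock(...) as got_lock:` and exit early when `got_lock` is False. The file itself can stay on disk between runs, since only a live process can hold the lock.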
Depends on: 725476
To get this into a bug instead of just email, the next step to alleviating this disk space problem is to split off one or more directories onto their own partitions on the other sjc1 netapp.  We have verified that snapshots are off and that there's nothing more we can remove there.  From dparsons:

We have 1TB I can allocate on mpt-netapp-b. It has the same connectivity to the network as mpt-netapp-a, but it has less load on it too. I think splitting the data up is the only way to go, if waiting for scl3 isn't an option.
per releng/it meeting:

1) this netapp-b is also tier1, just like netapp-a. 

2) netapp-b has only ~1TB capacity, so RelEng to figure out which subset of the partition can be moved from netapp-a.

3) nthomas/dparsons: can you confirm that this netapp-b is as performant as netapp-a?
:joduinn, mpt-netapp-b has the same specs as mpt-netapp-a. In terms of active load, it's a little less loaded than mpt-netapp-a, so it might be a little faster.
nthomas: bear just sent mail about sync-ing up with you. 

I think the easiest win here would be to move the xulrunner/nightly/20[05,06,07,08,09,10,11] to cm-ixstore01, or whatever subset will fit. That could buy us up to 1.3TB by my calculations. 

Should we tackle that here or in bug 725811?
I'm running a heavily nice'd version of

du -hsx of the subdirs of /mnt/netapp/stage/archive.mozilla.org/pub

to confirm the current levels of usage and will put the output info into

https://etherpad.mozilla.org/iDYivvH9c4

From there we can work out what scripts will be needed to perform the moves as moving this much data will probably have to happen overnight to minimize the impact (load and dir changes.)
cm-ixstore01 is going to be decommissioned as part of the move to scl3, and moving data from it to the scl3 netapp is a difficult process. Is there not enough space on mpt-netapp-[b-d]? How much do you need in total?
(In reply to Dan Parsons [:lerxst] from comment #19)
> cm-ixstore01 is going to be decommissioned as part of the move to scl3, and
> moving data from it to the scl3 netapp is a difficult process. 
ok, good to know. Also noting from the RelEng/IT meeting this morning that IT does not recommend cm-ixstore for "tier1" files. We really don't want to use cm-ixstore for tier1 files unless we can't find any other way out of this recurring, production-impacting corner while we wait for the magic of scl3. 



> Is there not enough space on mpt-netapp-[b-d]? 
Oh, didn't know about netapp-c and netapp-d in this morning's meeting. How much space can we have across netapp-b/c/d? 
netapp-b has ~1TB free?
netapp-c has ?? free?
netapp-d has ?? free? (from irc, some think there's 500gb free, but not certain).


> How much do you need in total?
per comment#18, Bear is calculating, and we expect to have this data tmrw.
From #infra:

[5:24pm] lerxst: there's a possible alternative here
[5:24pm] lerxst: zimbra_backups is on the same aggr as ftp_stage, and it's taking 1.85TB
[5:25pm] lerxst: if we can move zimbra_backups to scl3, I can give the current ftp_stage a lot more space
[5:25pm] lerxst: let me look into that
[5:32pm] coop: lerxst: would that be an immediate fix?
[5:32pm] lerxst: coop: very possibly yes. looking into it now
[5:32pm] • coop didn't know anything was up in scl3 yet. good to hear
[5:32pm] lerxst: i just built the netapp there yesterday. doing the esx servers there right now
I'm moving zimbra_backups from mpt-netapp-a to scl3-na1a right now. It should be done within a few hours. So, some time later tonight, I should be able to give another 1TB to ftp_stage.
(In reply to Dan Parsons [:lerxst] from comment #22)
> I'm moving zimbra_backups from mpt-netapp-a to scl3-na1a right now. It
> should be done within a few hours. So, some time later tonight, I should be
> able to give another 1TB to ftp_stage.

Is that extra 1TB going to be added to the existing partition, or will this be a new partition, i.e. we'll still need to figure out what to move onto that freed space?
If we *do* still need to move stuff around, I've tried to collate our thoughts in a single spot here, since we're also talking about related issues in bug 708865. 

There are two directory structures we're looking at moving, namely the yearly archives under xulrunner/nightly and firefox/nightly. The contents of both dirs are static, i.e. they are dated archives of nightly builds from previous years and will never be modified or added to. 

AIUI, the firefox nightly archives are already backed up to tape (https://bugzilla.mozilla.org/show_bug.cgi?id=708865#c1). I don't know whether the xulrunner nightlies are backed up or not.

If tape backup of the firefox nightlies is tested and reliable, I would be fine moving those firefox nightlies to the slower storage of cm-ixstore01, with the understanding that we would need to either transfer those builds over from cm-ixstore01 to scl3, or simply restore from tape again into scl3.

The xulrunner nightly archives could be redistributed around on any of the current netapp[b-d] devices that have sufficient space. They don't all have to be on the same device, provided the dir structure looks like it does now to someone trying to find the builds.

Here are the sizes of the various yearly archives I'm talking about shuffling around:

/pub/mozilla.org/xulrunner/nightly/20??
6.1G	2005
17G	2006
29G	2007
49G	2008
186G	2009
319G	2010
684G	2011

/pub/mozilla/firefox/nightly/20??
13G	2004
25G	2005
104G	2006
80G	2007
112G	2008
168G	2009
181G	2010
???G    2011 (Still running)
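
As a quick check on the xulrunner figures above, the yearly sizes can be summed and compared against a space budget. This is only a sketch: the 1024G budget stands in for the ~1TB offered on mpt-netapp-b, and moving the oldest years first is an assumption, not a decided plan.

```python
XULRUNNER_NIGHTLY_GB = {
    "2005": 6.1, "2006": 17, "2007": 29, "2008": 49,
    "2009": 186, "2010": 319, "2011": 684,
}


def total_gb(sizes):
    """Total size of all yearly archives, in GB."""
    return sum(sizes.values())


def oldest_years_that_fit(sizes, budget_gb):
    """Take years oldest-first until the next one would exceed the budget."""
    chosen, used = [], 0.0
    for year in sorted(sizes):
        if used + sizes[year] > budget_gb:
            break
        chosen.append(year)
        used += sizes[year]
    return chosen, used


years, used_gb = oldest_years_that_fit(XULRUNNER_NIGHTLY_GB, 1024)
```

The total comes to about 1290G (roughly 1.26T in du's units), in line with the "up to 1.3TB" estimate earlier in the bug. With a 1024G budget, 2005 through 2010 fit in about 606G, and 2011 would have to go somewhere else.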
I know cshields is keen to revisit the decision about whether we actually *need* to keep all these nightlies or not. While I don't disagree in principle, we've already had that public discussion in the past two years: http://coop.deadsquid.com/2010/07/reclaiming-space-on-stage-mozilla-org-space-reclaimed 

Developers let us know the archives are a useful tool for regression hunting, and yes, some devs have gone back multiple years when bisecting builds. For now, we need to treat the nightly archives as a necessary, legacy artifact and not delete them in the interim because they are inconvenient in the face of a colo move.

The newsgroup thread on the issue is here: https://groups.google.com/group/mozilla.dev.planning/browse_thread/thread/67c282b346b3f968/3f118a7411a21712?#3f118a7411a21712
(In reply to Chris Cooper [:coop] from comment #24)
> There are two directory structures we're looking at moving, namely the
> yearly archives under xulrunner/nightly and firefox/nightly. The contents of
> both dirs are static, i.e. they are dated archives of nightly builds from
> previous years and will never be modified or added to. 

There are some cleanups still pending on firefox/nightly/2011/, eg installers and xpi's older than 6 months, but otherwise this is true. They can easily be redone if we are restoring from tape.
:coop, when this copy is done, I'll be able to immediately expand ftp_stage by 1TB. No one will have to move anything to make this space happen. Unfortunately it's going to be a few more hours (so, possibly tomorrow once I wake up again) but when that happens, hey, 1TB for everybody.
From the du run of last night:

/mnt/netapp/stage/archive.mozilla.org/pub/firefox/nightly/20??

13G     2004
25G     2005
100G    2006
77G     2007
108G    2008
161G    2009
174G    2010
556G    2011
419G    2012

and /mnt/netapp/stage/archive.mozilla.org/pub/

5.1T total

16G     addons
14M     artwork
4.0K    bouncer
177G    calendar
98G     camino
648K    cck
54M     chimera
2.8G    data
93M     directory
3.3G    diskimages
2.5M    extensions
5.6G    firebird
3.7T    firefox
3.5M    grendel
6.8G    js
33M     l10n-kits
524M    labs
2.8G    minimo
760G    mobile
128G    mozilla
848K    msgsdk
304M    nspr
620K    OJI
382M    phoenix
20M     profiles
952G    seamonkey
8.3G    security
49M     static-analysis
1.5T    thunderbird
251M    utilities
778M    webtools
1.5T    xulrunner
48K     zz 

The largest are 

3.7T    firefox
1.5T    xulrunner
1.5T    thunderbird
(In reply to Dan Parsons [:lerxst] from comment #27)
> :coop, when this copy is done, I'll be able to immediately expand ftp_stage
> by 1TB. No one will have to move anything to make this space happen.
> Unfortunately it's going to be a few more hours (so, possibly tomorrow once
> I wake up again) but when that happens, hey, 1TB for everybody.

per irc, this copy is still going, revised ETA "a few hours".
The move is done, and I am now expanding the capacity of ftp_stage.

before:
Filesystem            Size  Used Avail Use% Mounted on
10.253.0.10:/vol/ftp_stage
                      6.5T  6.3T  172G  98% /mnt/netapp/stage


after:
Filesystem            Size  Used Avail Use% Mounted on
10.253.0.10:/vol/ftp_stage
                      7.5T  6.6T  955G  88% /mnt/netapp/stage

It will probably keep growing even bigger for a while.
(In reply to Dan Parsons [:lerxst] from comment #30)
> The move is done, and I am now expanding the capacity of ftp_stage.
> 
> before:
> Filesystem            Size  Used Avail Use% Mounted on
> 10.253.0.10:/vol/ftp_stage
>                       6.5T  6.3T  172G  98% /mnt/netapp/stage
> after:
> Filesystem            Size  Used Avail Use% Mounted on
> 10.253.0.10:/vol/ftp_stage
>                       7.5T  6.6T  955G  88% /mnt/netapp/stage
> 
> It will probably keep growing even bigger for a while.

w00t! Thanks Dan.


(In reply to John O'Duinn [:joduinn] from comment #20)
> (In reply to Dan Parsons [:lerxst] from comment #19)
...
> > Is there not enough space on mpt-netapp-[b-d]? 
> Oh, didnt know about netapp-c,netapp-d in this morning's meeting. How much
> space can we have across netapp-b/c/d ? 
> netapp-b has ~1TB free?
> netapp-c has ?? free?
> netapp-d has ?? free? (from irc, some think there's 500gb free, but not
> certain).

per irc w/lerxst:
16:33:13 < lerxst> mpt-netapp-a: aggr0 = 500gb free; aggr1 = 1.1tb free
16:33:38 < lerxst> mpt-netapp-b: aggr0 = 200gb free; aggr1 = 1tb free
16:34:03 < lerxst> mpt-netapp-c: aggr0 = 300gb free; aggr1 = full
16:34:18 < lerxst> mpt-netapp-d: aggr0 = full; aggr1 = full



nthomas over to you now to start rejuggling as best as possible!
Final state:
10.253.0.10:/vol/ftp_stage
                      7.5T  5.5T  2.1T  73% /mnt/netapp/stage

So 1T more total, 0.8T less used and 1.9T more free. If there was 0.8T to delete then that project picked a funny day to clean up, or the netapp pulled a rabbit out of the hat. Perhaps it was able to be more efficient with some free space to work with? Anyway, it seems unlikely we'll have any issues with that partition before we get to scl3.
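
The deltas quoted here can be reproduced from the "before" df output earlier in the bug and the final state above. A small sketch, assuming df -h's 1024-based units; the helper name is made up.

```python
def to_tib(text):
    """Convert a df -h value ending in G or T to TiB."""
    factor = {"G": 1 / 1024, "T": 1.0}[text[-1]]
    return float(text[:-1]) * factor


before = {"size": "6.5T", "used": "6.3T", "avail": "172G"}
final = {"size": "7.5T", "used": "5.5T", "avail": "2.1T"}

delta = {key: round(to_tib(final[key]) - to_tib(before[key]), 1) for key in before}
print(delta)  # {'size': 1.0, 'used': -0.8, 'avail': 1.9}
```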

Next step is to figure out where to offload some of firefox/nightly/20??/ onto, so that we can get some breathing room on 10.253.0.11:/vol/stage. I'll work up an estimate of how much new usage we can expect in the next few weeks, and then we can figure out if we should carve some of the new space off the existing partition on mpt-netapp-a, or use some of the other free space.
I just gave mpt-netapp-b:/vol/stage another 100GB, hope that helps until you can move stuff.
I've moved /pub/mozilla.org/firefox/nightly/{2004..2008} to /pub/mozilla.org/firefox-old-builds, which means from mpt-netapp-b:/vol/stage ('firefox') to mpt-netapp-a:/vol/ftp_stage ('everything else'). Right now we have:

Filesystem            
     Size  Used Avail Use% Mounted on
mpt-netapp-a:/vol/ftp_stage
     7.5T  5.8T  1.7T  78% /mnt/netapp/stage
mpt-netapp-b:/vol/stage
     4.3T  3.5T  764G  83% /mnt/netapp/stage/archive.mozilla.org/pub/firefox

The latter is mid-cycle for the weekly load of firefox/{try,tinderbox}-builds, and we should bottom out at around 450G free. If we need more then we can move another 161G of firefox/nightly/2009.

Thanks to everyone who contributed to providing more space or setting it up.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering