Closed Bug 342972 Opened 18 years ago Closed 14 years ago

Implement stage cleaning policy

Categories

(Release Engineering :: General, defect, P2)

All
Other

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: shaver, Assigned: coop)

Attachments

(5 files, 10 obsolete files)

3.28 KB, text/plain
13.59 KB, text/plain
895 bytes, text/plain
4.62 KB, image/png
515 bytes, patch (nthomas: review+)

Filesystem            Size  Used Avail Use% Mounted on
/dev/cciss/c0d0p1     1.9T  1.8T     0 100% /data

So full!
Build log snippet (from URL):

./upload-packages.sh
ssh -2 -l cltbld stage.mozilla.org mkdir -p /home/ftp/pub/thunderbird/nightly/2006-06-28-03-trunk /home/ftp/pub/thunderbird/nightly/latest-trunk
mkdir: cannot create directory `/home/ftp/pub/thunderbird/nightly/2006-06-28-03-trunk': No space left on device
command failed!
Looks like about 200G in ~cltbld, might be some easy pruning to be had there.
Not much we can do here - build, can you clean some things up?
Assignee: server-ops → preed
Component: Server Operations → Build & Release
I think this is keeping extensions from being propagated, too -- I didn't realize that one could affect the other.

Can I get added to the nagios notification list for disk-space warnings/errors on stage?  This is pretty bad times for me, not sure what's happened to extensions that were approved during the full state. :(
Perhaps you could remove some of the really old builds.
There are still nightlies from August 2005 hanging about for Firefox and Thunderbird.

Surely we don't need that many still saved; I doubt anyone downloads them.
Thanks, the people who actually have access to the machine know what needs to be kept and what doesn't, and what's taking up appreciable amounts of space.  Please help us help you by keeping the noise level in this bug low.

justdave has freed up some space by moving some backup data to dracula, but I think we still need to look at clearing out more, and figuring out how to keep cltbld, for example, from filling up with crap again. :)  Reducing severity, but please do look at this ASAP.
Severity: critical → major
I've moved ancient staging areas for FFx 1.0.x, Tbird 1.0.x, and Mozilla 1.x to the Netapp for now.

There's now ~120 gigs on stage, which should hold us for... a while.

But we need to tackle this with a build expiration policy and some good ol' stage.m.o scrubbing.
Status: NEW → ASSIGNED
Summary: staging disk full, nightlies not being uploaded → Implement stage cleaning policy
QA Contact: justin → preed
Over to TR.
Assignee: preed → tfullhart
Status: ASSIGNED → NEW
tfullhart: Let's see if I understand the requirements for that script you wanted: 1) based on directory name, delete all nightlies older than a month except for 1 representative nightly per week. 2) configurable
preed: I can shoot you an email, but basically:
preed: given a directory, delete all builds except a representative set of builds for periods of time
preed: as an example: keep all builds < 6 months; for 6-9 months, keep a weekly build; for 9+ months, keep 2 builds
preed: as configurable as possible is good, since those requirements may become something like
preed: keep all builds < 6 months, keep every other day 6-9 months, keep a weekly build 9-18 months, keep 2 builds/month > 18 months
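A minimal sketch of how the second rule set above could be written as a keep-or-delete test. This is not TR's script. It assumes GNU date and 30-day months, and the way representative builds are picked (even-numbered days, Mondays, the 1st and 15th of a month) is a hypothetical choice; keep_build is an illustrative name.

```shell
#!/usr/bin/env bash
# keep_build BUILD_DATE TODAY   (both YYYY-MM-DD)
# Returns 0 if the build should be kept, non-zero if it may be deleted.
# Assumptions (not from the bug): GNU date, 30-day months, and the
# "representative build" picks in each tier below.
keep_build() {
    local build_date=$1 today=$2
    local build_s today_s age_days
    build_s=$(date -u -d "$build_date" +%s) || return 0   # unparsable date: keep
    today_s=$(date -u -d "$today" +%s) || return 0
    age_days=$(( (today_s - build_s) / 86400 ))

    if (( age_days < 180 )); then
        return 0                                    # < 6 months: keep everything
    elif (( age_days < 270 )); then
        (( (build_s / 86400) % 2 == 0 ))            # 6-9 months: every other day
    elif (( age_days < 540 )); then
        [ "$(date -u -d "$build_date" +%u)" = 1 ]   # 9-18 months: Monday builds only
    else
        case $(date -u -d "$build_date" +%d) in     # > 18 months: 2 builds per month
            01|15) return 0 ;;
            *)     return 1 ;;
        esac
    fi
}
```

A caller would loop over the dated build directories, pass the YYYY-MM-DD prefix of each name to keep_build, and delete only when it returns non-zero. The thresholds sit in one place because the requirements were expected to change.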
Status: NEW → ASSIGNED
Attached file cleaner script (obsolete)
This version isn't completed but I promised I would post it today. I need to go home to run an errand. I will finish it tomorrow.
Attached file Configuration file for cleaner script (obsolete)
This configuration file includes rules to make the cleaner script do the jobs of ftp-trim-archive.sh, ftp-trim-contrib.sh, trim-downloads, and trim-downloads.conf. It also includes some example rules that express requirements that preed requested.
Attached file cleaner script (obsolete)
Attachment #237251 - Attachment is obsolete: true
Attached file Configuration file for cleaner script (obsolete)
This has a bunch of example rules. The actual rules need to be defined.
Attachment #237253 - Attachment is obsolete: true
Attachment #238336 - Attachment is obsolete: true
Attached file perltidy profile (obsolete)
For reformatting perl scripts using the "perltidy" script.
Please get reasonably wide review on any cleaning policy before implementing it -- having old nightlies is extremely important for finding regressions quickly.
I'd also note that I think the rules given here as examples are probably deleting too much -- especially too much since the last branch.  As we approach the 1.9 release we'll hear about regressions early in the development cycle (i.e., starting August 12, 2005), and we'll want to isolate what caused them.  And there are probably still regressions shipped in 1.8 that we've gotten reports of but haven't yet isolated.
Attachment #238337 - Attachment is obsolete: true
Attachment #246988 - Flags: review?(preed)
Attachment #246989 - Flags: review?(preed)
Attachment #238519 - Attachment is obsolete: true
Attached file perltidy profile
Attachment #238520 - Attachment is obsolete: true
Taking TR's bug; I'll retriage these shortly.
Assignee: tfullhart → preed
Status: ASSIGNED → NEW
Blocks: 291167
Reassigning bugs I'm not actively working on back into the triage pool.
Assignee: preed → build
I may consider taking this, I need to write symbol cleanup for Breakpad anyway.
Nick, can you look at this as part of your stage migration work?
Assignee: build → nrthomas
Status: NEW → ASSIGNED
Priority: -- → P3
Comment on attachment 246988 [details]
Configuration file for cleaner script

Canceling old reviews.
Attachment #246988 - Flags: review?(preed)
Comment on attachment 246989 [details]
reformatted cleaner script

Canceling old reviews.
Attachment #246989 - Flags: review?(preed)
We have plenty of room on the netapp at the moment (>400GB), so this is not an urgent problem. Back to the pool for now.
Assignee: nrthomas → build
Status: ASSIGNED → NEW
Assignee: build → nobody
QA Contact: mozpreed → build
Does this actually block the stage migration?  It's listed as such...if not, can it be removed from blocking list?
(In reply to comment #28)
> Does this actually block the stage migration?  It's listed as such...if not,
> can it be removed from blocking list?

Can we continue to grow the FTP archive indefinitely? Not sure if it needs to block, but it seems important if the answer is "no" :)

(In reply to comment #29)
> Can we continue to grow the FTP archive indefinitely?

But it would be really nice if the answer is yes.  (Isn't disk cheap these days?)
(In reply to comment #30)
> (In reply to comment #29)
> > Can we continue to grow the FTP archive indefinitely?
> 
> But it would be really nice if the answer is yes.  (Isn't disk cheap these
> days?)

I'd love for the answer to be yes, for what it's worth :)

Attached image Consumption graph
Here's some info on how we're doing currently. Of the 2.9TB of space allocated, we have about 150GB still available, and nightly builds are consuming it at a rate of 1.75GB/day. When a release comes along we take an additional bite by keeping multiple copies around, which is what causes the big jumps on the gentle slope.

KaiRo recently pointed out that there isn't much point keeping the nightly update (mar) files around, because they are superseded by the next build. That could save us quite a bit of space, eg 72GB for the Firefox nightlies from 2007. Provided we can't think of a reason to keep them, that is.
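As a rough worked example using only the figures quoted in this comment (150GB free, 1.75GB/day of nightlies, about 72GB of 2007 Firefox mar files), the remaining runway can be estimated as below. It ignores the step increases caused by releases, so it is an upper bound at best; days_left is an illustrative helper name.

```shell
#!/usr/bin/env bash
free_gb=150       # space still available, from the comment above
rate_gb_day=1.75  # nightly consumption rate, from the comment above
mar_gb=72         # estimated size of the 2007 Firefox nightly mar files

# days_left FREE_GB RATE_GB_PER_DAY: whole days until the space runs out
days_left() {
    awk -v free="$1" -v rate="$2" 'BEGIN { printf "%d\n", free / rate }'
}

days_left "$free_gb" "$rate_gb_day"                  # prints 85
days_left "$(( free_gb + mar_gb ))" "$rate_gb_day"   # prints 126
```

On those numbers, removing the 2007 mar files alone would buy roughly six extra weeks.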
Nope - disk space does cost, both in maintenance and initial cost, and is not an infinite resource.  I think the right thing to do is implement a policy that keeps the most used files on disk, such as all releases, but archives nightlies off to tape.  We could always restore if needed relatively quickly and we won't waste the electricity to spin disks constantly when they are almost never accessed.  As for what should be kept on/off line, I'll leave that to build - thoughts?
Blocks: 419978
No longer blocks: 394069
Assignee: nobody → nrthomas
Priority: P3 → P2
Priority: P2 → P3
Not working on this right now.
Assignee: nthomas → nobody
Component: Release Engineering → Release Engineering: Future
QA Contact: build → release
Mass move of bugs from Release Engineering:Future -> Release Engineering. See
http://coop.deadsquid.com/2010/02/kiss-the-future-goodbye/ for more details.
Component: Release Engineering: Future → Release Engineering
(In reply to comment #32)
> KaiRo recently pointed out that there isn't much point keeping the nightly
> update (mar) files around, because they are superseded by the next build. That
> could save us quite a bit of space, eg 72GB for the Firefox nightlies from
> 2007. Provided we can't think of a reason to keep them, that is.

I was going to close this bug out, but this ^^ still seems like an easy win.
(In reply to comment #36)
> (In reply to comment #32)
> > KaiRo recently pointed out that there isn't much point keeping the nightly
> > update (mar) files around, because they are superseded by the next build. That
> > could save us quite a bit of space, eg 72GB for the Firefox nightlies from
> > 2007. Provided we can't think of a reason to keep them, that is.
> 
> I was going to close this bug out, but this ^^ still seems like an easy win.

Can someone from IT please comment on whether the above fix is already in place on stage? I can only see the crontabs for cltbld and ffxbld and we're only cleaning up breakpad symbols and experimental builds in those.
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
IT doesn't manage this, RelEng does.  nthomas is the one with access to it last time I looked.
We don't have an automated solution for this. When we've hit a disk usage crunch in the past I've gone through with the help of some shell commands to clean up. It's getting towards that time again - we're down to 200G free on the non-firefox partition and 300G for firefox.
Assignee: server-ops → nobody
Component: Server Operations → Release Engineering
QA Contact: mrz → release
IMHO, we really should have some automated way of clearing out MARs for older nightlies for all products.
Esp. as we are producing complete _and_ partial MARs for most of our products and branches now, this is a lot of space that is used up by those even though they are not hit by any current update (which only points to the latest of those).
Alright then, let's figure out a plan and put it in place. That this has been open for almost 4 years is ridiculous.
Assignee: nobody → ccooper
Assignee: ccooper → ccooper
Status: NEW → ASSIGNED
Priority: P3 → P2
nthomas: do you think we can resurrect TR's attached cleanup scripts (assuming we revisit his/preed's cleanup criteria first to make sure they're current)?

I'm compiling some current disk usage numbers to aid in our assessment.
TR's scripts may be overkill if we just want to remove older nightly mar files. Bonus points if we don't scan over all of firefox/nightly/YYYY every run.

FYI, we've left the likes of YYYY-MM-DD-firefox3.0.19{,-l10n} alone since they are release files. Bug 562261 for the cleanup I did recently.
(In reply to comment #43)
> TR's scripts may be overkill if we just want to remove older nightly mar files.
> Bonus points if we don't scan over all of firefox/nightly/YYYY every run.

Not sure how we'd avoid a complete scan without relying on some sort of script-based solution. Eliminating nightly mars would be simple enough to do via cron, but I think we want to do more. Maybe more elaborate cronjobs that only run weekly?
 
> FYI, we've left the likes of YYYY-MM-DD-firefox3.0.19{,-l10n} alone since they
> are release files. Bug 562261 for the cleanup I did recently.

Can you be more specific here? Which subdir oddballs do we have floating around that we shouldn't be purging?
Could you elaborate on what else you'd like to clean up ? I took your comment #41 as a response to #40.

(In reply to comment #44)
> Can you be more specific here? Which subdir oddballs do we have floating around
> that we shouldn't be purging?

eg
/pub/mozilla.org/firefox/nightly/2010/03/2010-03-12-18-firefox3.0.19/
/pub/mozilla.org/firefox/nightly/2010/03/2010-03-12-19-firefox3.0.19-l10n/

Tinderbox pushes the files to those locations, then we copy to the candidates directory. They're only really relevant if that copy fails, so perhaps we don't actually care if the mars are deleted a week later.
(In reply to comment #45)
> Could you elaborate on what else you'd like to clean up ? I took your comment
> #41 as a response to #40.

I'd like to automatically prune nightlies and tinderbox-builds as TR's script was trying to do, but perhaps in a more straightforward manner. 

We'll clean up the MARs too, of course, but having a policy for how we expire/archive everything is a good idea.
I've made a complete blog post on the subject here (http://coop.deadsquid.com/2010/05/reclaiming-space-on-stage-mozilla-org/), but here are my specific proposals:

1) Move Firefox releases that are no longer supported (< 3.0, including firebird) to separate storage.

2) Remove Firefox nightlies prior to 2007, freeing 260G. These can be deleted if they're not going to be used, or archived if we think they might.

3) Remove nightlies for products other than Firefox prior to 2007, freeing 174G. Again, "remove" can mean either deletion or archiving.

4) Automate the deletion of nightly MAR files older than one month. Only the most recent MAR files are required. This would be done across all products.

5) Delete builds from older candidates directories after official release. This will reclaim up to 13G per build attempt per release. This will be a manual process.

6) For every new year going forward, remove the oldest remaining year of nightlies, e.g. for a 3-year history of nightly builds, remove nightly builds from 2007 in January 2011. This will be a manual process.
(In reply to comment #47)
> I've made a complete blog post on the subject here

...and I've also posted to dev-builds, dev-planning, and dev-tree-management now.
I only started bisecting about a year ago and I have one build from 2004, 10 from 2005, 17 from 2006, 45 from 2007 in my bisecting archive. I would appreciate the ability to continue bisecting things that may have changed long ago.
In the past year, I've gone back as far as January 2006 (Firefox 1.6a1) trying to track down a regression of some sort (I can tell by the builds in my Firefox folder).  

Philippe Wittenbergh just last month went back to March 2007 trunk builds tracking down a Core regression that only manifests in Camino (for which we really needed both Camino and Firefox nightlies in order to verify it never appeared in Firefox), but we had to start checking from when 1.8 branched from the trunk on 2005-08-12.

Speaking from a Camino bugs perspective, I know there are still several Gecko/Gecko-tickled regressions that require older builds (including 1.7-was-trunk/1.8-was-trunk-era) to track down (because I've started, but never finished, tracking down the regression ranges.)

I'm certainly OK with moving older nightly builds to a separate http://archive.mozilla.org/ again (like they used to be in the old days, before ftp.m.o and archive.m.o were merged) to relieve pressure on stage/ftp.m.o, but I'm opposed to taking them offline entirely and very strongly opposed to deleting them.

(I remember this being mentioned before in another bug, but disk *is* cheap.  I just bought a nice small, portable bus-powered, quad-interface 7200 RPM 300 GB HD for ~$150.  I know it's not "server-grade" (but it has those other features that make it more costly than your average disk, which is even cheaper), but that would cover 75% of the non-Firefox nightlies since 2001, or a recent year of Firefox nightlies, or several older years of Firefox nightlies.)
(In reply to comment #50)
> (I remember this being mentioned before in another bug, but disk *is* cheap.  I

IIRC, disk was specifically "cheap" in comparison to employing a layout hacker to recreate those ancient builds needed for doing his bisection (or to come up with an alternate way of tracking down the regression).  Sorry, I was not trying to be completely cavalier with that part of the previous comment.
In my company, we start deleting nightly builds after 6 months, but we leave 1 build per week (there's 1 weekly build which has a semi-alpha status) for regression lookups.

But we can always recreate from CVS sources, as we only have automated builds and regression tests. To make you guys jealous : even the checkins in release builds (as opposed to developer builds) are more-or-less automated, except for merging conflicts of course.
(In reply to comment #51)
> IIRC, disk was specifically "cheap" in comparison to employing a layout hacker
> to recreate those ancient builds needed for doing his bisection (or to come up
> with an alternate way of tracking down the regression).

I'd also note that modern toolchains might not be able to even build from source that's four, five, or six years old. e.g., with most developers on 10.5 and 10.6 and on Intel, it might be impossible to build the 1.8 branch or the 1.7 branch. I doubt releng wants to stand by for creating old builds on outdated hardware whenever required by developers, which would be one alternative.
Based on the feedback I've received, I have revised my policy proposals:

1) Keep all releases for all products online and available. There's no need to remove them.

2) Keep all en-US nightly builds for all products online and available. There's no need to remove them.

3) Delete nightly artifacts for all products that are not useful in regression hunting. Specifically, this means deleting installer files (linux and windows) and xpis from 2009 and earlier. This could represent a one-time space recovery of almost 900GB. Individual products can opt out of this cleanup with sufficient cause.

4) Automate the deletion of nightly MAR files older than one month. Only
the most recent MAR files are required. This would be done across all
products. (unchanged)

5) Delete builds from older candidates directories after official
release. This will reclaim up to 13G per build attempt per release. This
will be a manual process. (unchanged)

6) Automate the removal of nightly artifacts older than 6 months for all products that are not useful in regression hunting.
(In reply to comment #54)
> 5) Delete builds from older candidates directories after official
> release. This will reclaim up to 13G per build attempt per release. This
> will be a manual process. (unchanged)

Note that I cleaned up a good number of SeaMonkey candidates this week, which also should have reclaimed some space.
(In reply to comment #55)
> Note that I cleaned up a good number of SeaMonkey candidates this week, which
> also should have reclaimed some space.

Thanks, KaiRo. Very helpful. Simon has also cleaned up a bunch of old Calendar stuff which is also appreciated.

Project leads have been contacted. I will likely start pruning the Firefox and old Mozilla installers/xpis later this week, but will wait to hear back from the other project leads before I do cleanup for their projects or put any cron jobs in place.
(In reply to comment #56)
> Project leads have been contacted.

coop, who did you contact for Camino?  None of pink, smorgan, and mento report receiving an email.
3 entries for the ffxbld crontab to be run on a weekly basis:

* delete installer files older than 6 months
* delete xpi files older than 6 months
* delete empty directories older than 6 months

nthomas: can you sanity check this for me before I put it in place? Do we want to run the rm in verbose mode?
Attachment #448418 - Flags: review?(nrthomas)
I would modify the following:
- quote the file patterns with *, shell may expand it if there is a file with the same pattern in the current directory (*.installer.* should be '*.installer.*')
- quote {} to properly handle spaces ({} should be '{}')
- don't use -r for files, add "-type f" to be more specific.
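The first point can be reproduced in a throwaway directory. This sketch only lists files and deletes nothing; the directory layout and file names are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Throwaway tree: one matching file in the working directory, one below it.
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
mkdir "$demo/nightly"
touch "$demo/a.installer.exe" "$demo/nightly/b.installer.exe"

# Unquoted: the shell expands the glob to "a.installer.exe" before find runs,
# so find searches for that one literal name and misses the file in nightly/.
unquoted=$(cd "$demo" && find . -type f -name *.installer.* | sort)

# Quoted: find receives the pattern itself and applies it at every depth.
quoted=$(cd "$demo" && find . -type f -name '*.installer.*' | sort)

printf 'unquoted:\n%s\nquoted:\n%s\n' "$unquoted" "$quoted"
```

If two or more files in the working directory matched the unquoted pattern, find would receive extra arguments and fail outright, which is at least easier to notice than a silent miss.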
(In reply to comment #59)
> I would modify the following:
> - quote the file patterns with *, shell may expand it if there is a file with
> the same pattern in the current directory (*.installer.* should be
> '*.installer.*')
> - quote {} to properly handle spaces ({} should be '{}')
> - don't use -r for files, add "-type f" to be more specific.

New version contains these fixes.
Attachment #448418 - Attachment is obsolete: true
Attachment #448594 - Flags: review?(nrthomas)
Attachment #448418 - Flags: review?(nrthomas)
Depends on: 569461
Comment on attachment 448594 [details] [diff] [review]
Proposed crontab for ffxbld user on stage, v2

Might as well save a crawl through the disk tree and combine the two file searches, I think this does the operator precedence correctly:

@weekly nice -n 19 find /home/ftp/pub/firefox/nightly -mtime +180 -type f \( -name '*.installer.*' -o -name '*xpi' \) -exec rm -f '{}' \;

or use a regex.

What's the -mtime +180 on the empty directory search for ? Without it I found some *-candidates/contrib dirs, but we could exclude the candidates or just search 20?? originally. With it I think we'll empty some dirs and then wait 180 days to remove the dir.
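To illustrate the operator precedence point, here is a small sketch that prints matches instead of deleting them. The temporary tree and file names are invented for the demonstration; only the shape of the find expression mirrors the command above.

```shell
#!/usr/bin/env bash
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
root="$demo/nightly"
mkdir -p "$root/windows-xpi"
touch "$root/firefox.installer.exe" "$root/windows-xpi/xforms.xpi"

# Grouped: "-type f" applies to both name tests, so exactly the two files match.
grouped=$(find "$root" -type f \( -name '*.installer.*' -o -name '*xpi' \) -print | sort)

# Ungrouped: find parses this as
#   ( -type f -a -name '*.installer.*' ) -o ( -name '*xpi' -a -print )
# so the installer is never printed, and the windows-xpi directory is.
ungrouped=$(find "$root" -type f -name '*.installer.*' -o -name '*xpi' -print | sort)

printf 'grouped:\n%s\nungrouped:\n%s\n' "$grouped" "$ungrouped"
```

With -exec rm in place of -print, the ungrouped form would skip installers entirely, which is why the escaped parentheses matter.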
Attachment #448594 - Flags: review?(nrthomas) → review-
Merges two file deletions into a single command and removes the mtime check for empty directory removal.
Attachment #448594 - Attachment is obsolete: true
Attachment #449077 - Flags: review?(nrthomas)
Adds the MAR file expiry, but only for non-candidates dirs.
Attachment #449102 - Flags: review?(nrthomas)
Same as previous patch, but removes the 2010/ directory I was using as a limiter for MAR file find testing.
Attachment #449077 - Attachment is obsolete: true
Attachment #449102 - Attachment is obsolete: true
Attachment #449104 - Flags: review?(nrthomas)
Attachment #449077 - Flags: review?(nrthomas)
Attachment #449102 - Flags: review?(nrthomas)
Comment on attachment 449104 [details] [diff] [review]
Proposed crontab for ffxbld user on stage, v5

># Delete all installer and xpi files/dirs older than 6 months
>@weekly nice -n 19 find /home/ftp/pub/firefox/nightly -mtime +180 \( -name '*.installer.*' -o -name '*xpi' \) -exec rm -rf '{}' \;

I took out the -type f limiter, changed the xpi search to *xpi, and returned to doing a rm -rf here. This allows us to delete entire directories like "windows-xpi" which contain both .xpi files and uninstaller .zips, and wouldn't get cleaned up otherwise.
I've created https://wiki.mozilla.org/ReleaseEngineering:StageCleanupPolicy as a lasting artifact of this bug. That wiki page includes crontabs and such, but note that none of these have been run or enabled yet. Still waiting on the ownership changes in bug 569461.
Please don't automatically delete the last remaining build of a given type. My fuzzing scripts broke due to finding https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx-debug/ empty after a day-long tree closure.
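One possible way to honor this request is to exempt the newest entry in a directory from age-based deletion. The following is a sketch under stated assumptions, not what runs on stage: prune_keep_newest is a hypothetical helper, it handles a single flat directory of files, and it assumes file names contain no newlines or glob characters.

```shell
#!/usr/bin/env bash
# prune_keep_newest DIR DAYS
# Delete regular files in DIR older than DAYS days, but always leave the
# most recently modified entry in place, however old it is.
prune_keep_newest() {
    local dir=$1 days=$2 newest
    # Newest entry by modification time (ls -t sorts newest first).
    newest=$(ls -1t "$dir" | head -n 1)
    [ -n "$newest" ] || return 0    # empty directory: nothing to do
    find "$dir" -maxdepth 1 -type f -mtime +"$days" ! -name "$newest" \
        -exec rm -f '{}' \;
}
```

After a long tree closure every build in the directory may be past the age limit, and the exemption is what keeps the directory from ending up empty.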
Comment on attachment 449104 [details] [diff] [review]
Proposed crontab for ffxbld user on stage, v5

Sorry for the delay coop, I'll get this early next week.
Comment on attachment 449104 [details] [diff] [review]
Proposed crontab for ffxbld user on stage, v5

I did some spot checks on a few YYYY/MM dirs and the matches seemed OK.

Please set a MAILTO so we can catch errors. Worth doing verbose rm for the first week or two, or doing it manually the first time to a log ?

># Delete all installer and xpi files/dirs older than 6 months
>@weekly nice -n 19 find /home/ftp/pub/firefox/nightly -mtime +180 \( -name '*.installer.*' -o -name '*xpi' \) -exec rm -rf '{}' \;

You'll need to add a '-depth' to prevent errors on matches like this:
 .../2009-01-01-04-mozilla1.8/windows-xpi
 .../2009-01-01-04-mozilla1.8/windows-xpi/xforms.xpi

r+ with those changes.
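A small sketch of why -depth helps here, using a made-up tree shaped like the example paths above. The -mtime test is left out so the demonstration works on freshly created files.

```shell
#!/usr/bin/env bash
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
build="$demo/2009-01-01-04-mozilla1.8"
mkdir -p "$build/windows-xpi"
touch "$build/windows-xpi/xforms.xpi" "$build/firefox.zip"

# -depth makes find process xforms.xpi before its parent windows-xpi, so find
# never tries to descend into a directory that rm -rf has already removed.
find "$demo" -depth -name '*xpi' -exec rm -rf '{}' \;
status=$?

echo "find exit status: $status"
ls "$build"
```

Without -depth, find matches windows-xpi first, removes it, and then reports an error when it tries to read the directory it just deleted.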

># Delete all MAR files older than 1 month that aren't in a candidates dir.
>@weekly nice -n 19 find /home/ftp/pub/firefox/nightly -wholename '*-candidates' -prune -o -mtime +30 -name '*.mar' -exec rm -f '{}' \;

We could probably remove the tests files we started publishing on 2009/09/24 on the same age criterion. The extension changed from tar.bz2 to zip on some branches so matching on '*.tests.*' probably makes sense.
Attachment #449104 - Flags: review?(nrthomas) → review+
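The MAR expiry command in the review above relies on -prune to skip candidates directories. The sketch below exercises that pattern on a made-up tree: the directory names are invented, and touch -d is the GNU form. One caveat from the GNU find documentation is that -prune has no effect when -depth is given, so the two should not be combined in a single command.

```shell
#!/usr/bin/env bash
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
nightly_dir="$demo/nightly/2010/04/2010-04-01-03-mozilla-central"
candidate_dir="$demo/nightly/3.6.4-candidates/build1"
mkdir -p "$nightly_dir" "$candidate_dir"
old_nightly="$nightly_dir/firefox.complete.mar"
old_candidate="$candidate_dir/firefox.complete.mar"
touch -d '45 days ago' "$old_nightly" "$old_candidate"

# Do not descend into any *-candidates directory; everywhere else, delete
# MAR files last modified more than 30 days ago.
find "$demo/nightly" -wholename '*-candidates' -prune -o \
    -mtime +30 -name '*.mar' -exec rm -f '{}' \;

find "$demo/nightly" -name '*.mar'   # only the candidates copy should remain
```

Swapping -exec rm for -print first gives a dry run that can be checked against a log before anything is deleted.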
(In reply to comment #69)
> Please set a MAILTO so we can catch errors. Worth doing verbose rm for the
> first week or two, or doing it manually the first time to a log ?

I'm running a first-pass now by hand and logging the output. I'll make the log available if there's anything out of the ordinary in it. 

I'll add a MAILTO to the crontabs when I put them in place.

> You'll need to add a '-depth' to prevent errors on matches like this:
>  .../2009-01-01-04-mozilla1.8/windows-xpi
>  .../2009-01-01-04-mozilla1.8/windows-xpi/xforms.xpi
> 
> r+ with those changes.

Added.

> We could probably remove the tests files we started publishing on 2009/09/24 on
> the same age criterion. The extension changed from tar.bz2 to zip on some
> branches so matching on '*.tests.*' probably makes sense.

Added.
The crontabs for all products are in place now. I've run all the individual commands by hand once and collected logs, but haven't seen anything I didn't expect.

We didn't reclaim as much space as originally estimated because we preserved the contrib dirs largely intact. We did still manage to recover almost 350G of space across all products, bringing our %used number on stage down to 79%. It was at 95% yesterday.

I've documented the new policy in the wiki:
https://wiki.mozilla.org/ReleaseEngineering:StageCleanupPolicy

We'll still have to grow the disk eventually, or shuffle stuff around, but at least now we have *something* in place.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Ouch, it looks like those scripts killed the *.mar files in the seamonkey/nightly/2.1a1-candidates/ directory, which made my 2.1a2 update builds fail. I've copied them back from the 2.1a1 release directory, but it would be good if we took care that this doesn't happen again.
Depends on: 575966
Product: mozilla.org → Release Engineering