Closed Bug 496876 Opened 15 years ago Closed 15 years ago

please archive and remove some directories from stage

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: mrz)

References

Details

Attachments

(2 files, 1 obsolete file)

In order to help the disk space situation on stage we should archive and remove a bunch of things. Here's what I suggest:
/home/ftp/pub/firefox/nightly/200[456]/ - 260G
/home/ftp/pub/thunderbird/nightly/200[3456]/ - 251G
/home/ftp/pub/seamonkey/nightly/200[56]/ - 113G
/data/cltbld/deprecated-candidate-dirs/ - 548G

These are all of pretty considerable size, totaling just over 1 TB.

We'll need sign-off from at least gozer & KaiRo before we can go ahead with this. I'd like nthomas' and joduinn's thoughts too.
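For illustration, a minimal sketch of the kind of archive-then-remove step being proposed; the destination /mnt/archive is a hypothetical placeholder (the real plan would archive to tape or other offline storage before anything is deleted):

import os
import shutil
import tarfile

ARCHIVE_DEST = "/mnt/archive"  # hypothetical destination; not an actual path on stage

def archive_and_remove(src_dir):
    # Tar up the directory, then remove the original only if the archive was written.
    name = src_dir.strip("/").replace("/", "_") + ".tar.gz"
    dest = os.path.join(ARCHIVE_DEST, name)
    with tarfile.open(dest, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))
    if os.path.getsize(dest) > 0:
        shutil.rmtree(src_dir)

for year_dir in ["/home/ftp/pub/firefox/nightly/2004",
                 "/home/ftp/pub/firefox/nightly/2005",
                 "/home/ftp/pub/firefox/nightly/2006"]:
    archive_and_remove(year_dir)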
Sounds good to me, esp. with keeping all the apps in sync in terms of date. Actually, I wonder if we should include /home/ftp/pub/calendar/{sunbird,lightning}/nightly/2006/ as well.
(In reply to comment #1)
> Sounds good to me, esp. with keeping all the apps in sync in terms of date.
> Actually, I wonder if we should include
> /home/ftp/pub/calendar/{sunbird,lightning}/nightly/2006/ as well.

That's a good idea. Ause, what do you think?
(In reply to comment #0)
> In order to help the disk space situation on stage we should archive and remove
> a bunch of things. Here's what I suggest:
> /home/ftp/pub/firefox/nightly/200[456]/ - 260G
> /home/ftp/pub/thunderbird/nightly/200[3456]/ - 251G

Sounds good to me.
I don't know if I agree with this... I do know people use those old nightlies to track down regression ranges. Maybe we really need to look into getting more space for stage.
OS: Mac OS X → All
Hardware: x86 → All
(In reply to comment #4)
> I don't know if I agree with this... I do know people use those old nightlies
> to track down regression ranges. Maybe we really need to look into getting more
> space for stage.

I would be happy with this too. We've started pushing so much more data per day to it that it's going to be necessary at some point.

Do people really use nightlies from 2006, though?
I visit this archive only for branch regressions in Firefox releases; since Firefox 2 is discontinued, that means only the Firefox 3 branch (and it is extremely rare that there is a regression between two branch versions).

I wonder whether a company really should delete parts of its archive; when it's gone, it's gone. Maybe a selection of only trunk zip builds for Windows should be kept, but I assume deleting selectively would cost a lot of work.
I absolutely do not want to delete these builds permanently. But they are using up quite a bit of space, and if we can't get a bunch of space added to stage we should consider moving them to an offline backup (tapes or whatever).
We've been down this path before (bug 342972) and Justin was pretty adamant that 3TB is plenty of space and we should just use what we have better. I agree with that, and with Ben that it's time to remove some old nightlies.

Note that /home/ftp/pub/ is actually a mount from network storage and where most of the files live. It's down to 75GB and falling fast as we churn through 3.5 release candidates, so we need to clean up quickly. I know where some tens of GB of savings can be found and will look for anything gone rogue.

/data/cltbld is a local disk array that is getting full, but not particularly fast, and not in a way that affects the operation of the Mozilla FTP servers (AFAIK anyway).

Aravind, could you remind us what the backup situation is for stage:/mnt/netapp/stage? We already have all of that backed up, right? Would you want to do anything extra to meet the earlier promise (in bug 342972) of being able to restore from tape if people made requests?
Removed 44GB of Firefox nightly updates older than 2009/05, and 11G of the same for Thunderbird 1.8.0 and 1.8, to give 136GB free. Could save a further 14G in Thunderbird comm-central updates, and 13G in SeaMonkey, if Gozer and KaiRo are happy with that (both the partial and complete .mar for any given nightly are obsolete after the next nightly build, so there's no point in keeping them after you give users a few days to complete any update they started).

Still small fry compared to the old year dirs.
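To illustrate the .mar cleanup described above, here is a minimal sketch of pruning nightly update MARs older than a grace period; the root path and the 14-day window are assumptions, not the actual policy used:

import os
import time

UPDATE_ROOT = "/home/ftp/pub/firefox/nightly"  # illustrative; the real cleanup covered several apps
GRACE_SECONDS = 14 * 24 * 3600                 # assumed window for users to finish in-flight updates

cutoff = time.time() - GRACE_SECONDS
for dirpath, dirnames, filenames in os.walk(UPDATE_ROOT):
    for name in filenames:
        # Both partial and complete MARs are superseded once the next nightly exists.
        if name.endswith(".mar"):
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)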
I'm raising the severity because we're down to 135G free.
Severity: normal → critical
Aravind, could you address comment #8?
Assignee: server-ops → aravind
We are backing stuff up; however, they are backed up on a 1 yr retention policy. That is, you lose those from tape after a year. The only things currently on a longer retention policy are weblogs (and those are on a 10 yr policy).
(In reply to comment #9)
> Could save a further 14G in
> Thunderbird comm-central updates, and 13G in SeaMonkey, if Gozer and KaiRo are
> happy with that

Sounds good to me for SeaMonkey.
If we're not going to keep these online, we need to make sure that we have backups that won't ever expire, as these builds are not easy (if even possible) to reproduce.
(In reply to comment #14)
> If we're not going to keep these online, we need to make sure that we have
> backups that won't ever expire, as these builds are not easy (if even
> possible) to reproduce.

Agreed.

Aravind, can we do this?
Yes we can, but it would have to be a new backup set.  I don't want to disturb the existing backup sets and futz with their retention cycles.  Also, we are in the middle of switching backup systems, so if you can get me a list of directories that need to be backed up forever, I will work on it.
(In reply to comment #16)
> Yes we can, but it would have to be a new backup set.  I don't want to disturb
> the existing backup sets and futz with their retention cycles.  Also, we are in
> the middle of switching backup systems, so if you can get me a list of
> directories that need to be backed up forever, I will work on it.

Let's start with the ones in comment #0 - but please wait a couple days:
/home/ftp/pub/firefox/nightly/200[456]/
/home/ftp/pub/thunderbird/nightly/200[3456]/
/home/ftp/pub/seamonkey/nightly/200[56]/
/data/cltbld/deprecated-candidate-dirs/

I've just sent a note to a bunch of newsgroups about this plan, and I'd like to give people a couple days to respond first.
(In reply to comment #17)
> I've just sent a note to a bunch of newsgroups about this plan, and I'd like to
> give people a couple days to respond first.

Still hashing this out. I'm not going to have a complete list for you until at least Monday. There's 120GB free right now, which should be more than enough to last the weekend. I'll try to free up some more, too.
Depends on: 499425
I'm still trying to hammer things out when it comes to Firefox and Thunderbird builds, but there hasn't been resistance to archiving the following:
/pub/mozilla.org/calendar/sunbird/nightly/2006
/pub/mozilla.org/calendar/lightning/nightly/2006
/pub/mozilla.org/mozilla/nightly
/pub/mozilla.org/mozilla/l10n
/pub/mozilla.org/firebird/nightly

Let's go ahead and archive these to recover 152G. We'll be doing something with Firefox and Thunderbird at some point, but I don't know what yet.
No longer depends on: 499425
Keywords: 64bit
Blocks: 499425
(removing unwanted keyword)
Keywords: 64bit
Down to 64GB (we spiked because of the 3.5 release).
What's the current rate of increase?
It's inconsistent; here's a report I've been running (one way to produce such a report is sketched after the listing):
Sat Jun 20 00:00:01 PDT 2009: 120G
Sun Jun 21 00:00:01 PDT 2009: 130G
Mon Jun 22 00:00:01 PDT 2009: 127G
Tue Jun 23 00:00:01 PDT 2009: 103G
Wed Jun 24 00:00:01 PDT 2009: 84G
Thu Jun 25 00:00:01 PDT 2009: 96G
Fri Jun 26 00:00:02 PDT 2009: 92G
Sat Jun 27 00:00:01 PDT 2009: 93G
Sun Jun 28 00:00:01 PDT 2009: 98G
Mon Jun 29 00:00:01 PDT 2009: 91G
Tue Jun 30 00:00:02 PDT 2009: 69G
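One way a report like this could be produced (a sketch only; the mount point and log path are assumptions, and the real report may simply be cron plus df):

import shutil
import time

MOUNT = "/mnt/netapp/stage"            # assumed mount point being tracked
LOG = "/var/tmp/stage-free-space.log"  # assumed output file, appended to once a day by cron

free_gb = shutil.disk_usage(MOUNT).free // (1024 ** 3)
with open(LOG, "a") as fh:
    fh.write("%s: %dG\n" % (time.strftime("%a %b %d %H:%M:%S %Z %Y"), free_gb))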
When I ran some stats on this, I got much higher per-day download counts than what it appears you were counting. Would it be possible for someone to go through this dataset and offer any ideas on whether it might be right or wrong?

http://mozilla.dabbledb.com/publish/mozilla/0ad74f66-6319-45f8-b64c-1e3bfcedf836/maintable.html
Are you excluding the multitude of spiders that hit ftp.m.o ?
We're down to 36G partly because of a spike in usage due to 3.0.12 builds. We'll spike a bit more tomorrow again once signing is done.
(In reply to comment #25)
> Are you excluding the multitude of spiders that hit ftp.m.o ?

I don't think so. 4000 hits on SeaMonkey builds in one day is definitely wrong. Same goes for a lot of the others.

Daniel, you need to exclude hits from user agents matching at least any of the following (a filtering sketch follows the list):
Googlebot
msnbot
Slurp
palamida
Exabot
Twiceler
Yandex
DotBot
Sosospider
SengSpider
slurp
Vodafone
Daumoa
MJ12bot
VoilaBot
Teoma
MLBot

I think there's other spiders too, but I can't find my complete list.
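A minimal sketch of that filtering, assuming combined-format access logs on stdin with the user agent as the last quoted field (the match here is a simple case-insensitive substring check):

import re
import sys

BOTS = ["googlebot", "msnbot", "slurp", "palamida", "exabot", "twiceler", "yandex",
        "dotbot", "sosospider", "sengspider", "vodafone", "daumoa", "mj12bot",
        "voilabot", "teoma", "mlbot"]

ua_re = re.compile(r'"([^"]*)"\s*$')  # last quoted field of a combined-format log line

for line in sys.stdin:
    m = ua_re.search(line)
    ua = m.group(1).lower() if m else ""
    if not any(bot in ua for bot in BOTS):
        sys.stdout.write(line)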
If things are getting desperate, don't forget about bug 499919, both for the couple of weeks of room by deleting the existing *-mobile-trunk-l10n dirs and for getting rid of about a quarter of the daily usage.
(In reply to comment #28)
> If things are getting desperate, don't forget about bug 499919, both for the
> couple of weeks of room by deleting the existing *-mobile-trunk-l10n dirs and
> for getting rid of about a quarter of the daily usage.

Cleaned up the old builds to get back about 15G.

(In reply to comment #13 - re seamonkey mar files)
> Sounds good to me for SeaMonkey.

Another 19G of nightly complete mars deleted in 2008/07 to 2009/06 dirs. Plus 5G in Firefox 2009/05 and 06.

Up to 66G free on /mnt/netapp/stage.
Among the least useful (and thus best candidates for no-backup removal) things I mentioned in .builds:

firefox/nightly/*-fs (we can do regression hunting fine without free software builds from a random period of time)
firefox/nightly/*firefox1.5.0.*
firefox/nightly/*firefox2.0.*
thunderbird/nightly/*thunderbird1.5.0* (apparently stable branch RCs, from dead branches, often including l10n and thus often huge)
* Cleaned up *crashreporter-symbols.zip for Thunderbird & Sunbird nightlies, 11G, bug 502774 for the perma-fix
* Bug 499919 is fixed (mobile-trunk-l10n dirs) and files cleaned up.
* 86G free currently

(In reply to comment #30)
To be had (when I get to it later today):
~ 5G   for  firefox/nightly/*-fs 
115G   for  firefox/nightly/*firefox1.5.0.*
137G   for  firefox/nightly/*firefox2.0.*
 43G   for  thunderbird/nightly/*thunderbird1.5.0*
Daniel, any updates?
Assignee: aravind → mrz
I can't re-run this until I get AMO caught up.  need the same hardware. :/
(In reply to comment #33)
> I can't re-run this until I get AMO caught up.  need the same hardware. :/

Any ETA? 

Sorry to push, but we're worried about running out of space if we can't start the archive/remove process soon. Right now, we are manually removing some files every few days just to keep some breathing room.
And then what?

Suppose I give up (after all, I got *my* product saved off, in bug 499425), and you archive off the piddly 411GB of firefox/2004-2006 and thunderbird/2004-2006. Then what? That's barely more than the "breathing room" from the last few days. Comment 23 makes it sound like a more accurate average usage number might be 5GB/day. If so, that buys you 82 days. What are you going to do in 82 days? Lop off 2007? If that's 200GB, what are you going to do 40 days after that? Lop off 2008, at the end of 2009? Will you then be able to hold onto all of 2009 through 2010?

If you are going to stop archiving and start adding disk at some point before we get down to just three or six months of nightlies, when are you going to stop, and are the things you so badly want to get rid of now going to make any difference at that point? 5GB/day is 1825GB/year, so just to hang onto three years you'll need to get up to more than 5TB, ignoring that we will *always* be adding more builds. At that point, a point we'll hit very soon, when you are having to nearly double the disk we now have, is it going to actually matter whether or not we have a few hundred GB more or less?
And, I remain curious about who the ultimate decision maker about dollars versus builds is: the tail end of the mozilla.dev.builds thread had shaver saying "I would happily incur a cost of $6K/year in extra storage" both to save developer time and to save the build team time spent scrambling to free up 5GB of space for tomorrow.

If it's a reasonable assumption that we'll stay between 5GB/day and 6GB/day for the next year, and thus that going from the current 3TB to 5TB would give us a year to decide without having to delete anything else (though continuing to delete the things we never really wanted at all is still a great thing to do), who can say to whom "please buy another 2TB of storage from this budget code and give it to ftp.m.o"?
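For reference, the back-of-the-envelope arithmetic behind those numbers (the rates and sizes are the ones quoted in the two comments above; nothing else is measured):

def runway_days(free_gb, growth_gb_per_day):
    # Days until the given amount of free space is consumed at a steady growth rate.
    return free_gb / growth_gb_per_day

print(runway_days(411, 5.0))   # ~82 days from archiving the 2004-2006 Firefox/Thunderbird dirs
print(runway_days(2048, 5.0))  # ~410 days from adding 2TB, at 5GB/day
print(runway_days(2048, 6.0))  # ~341 days at 6GB/day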
Phil - I wouldn't worry too much about this.  We're actively looking at storage solutions and what the right long term strategy is.  

The 82 days, btw, gives us non-panic-mode time to figure out what we need to do.

Right now it looks like we might add a 1TB iscsi volume to buy even more time.
Assignee: mrz → aravind
FYI, the space issue has been resolved. The long-term retention policy is still missing and gated on Metrics.
Assignee: aravind → deinspanjer
Status: NEW → ASSIGNED
Whiteboard: Daniel to re-run ftp log analysis and deliver results by 2009-07-17
I re-ran the FTP processing after adding a filter for a dozen or so likely bot/spider user agent strings. There were a few that I left in, like wget and download accelerators such as GetRight and FlashGot, because I suspect that many people trying to pull an old build might use something like that.

We really do get hammered by bots. They are just constantly spidering our FTP site and pulling down these binaries. I'd strongly suggest looking into implementing a user-agent filter in .htaccess or something.

There are still some large spikes here and there.  Here is a visualization that makes it fairly easy to see the trend in traffic for particular product/build_years:
http://manyeyes.alphaworks.ibm.com/wikified/mozilla/FTP+old+nightlies+download+stacked+category

The attached sqlite file has a main table containing the counted downloads, and three ancillary tables that monitor things like:
* common rejected requests (from heartbeat, pentests, etc.)
* sampled rejected requests (a sampling of other requests that were not counted)
* sampled user agent strings (For each day, the top ten UA according to number of requests and a sampling of 100 other UAs)
# Generate the CSV crosstabs from the attached database (gencrosstab.sql is assumed to be in the same directory):
sqlite3 ftp_old_nightlies_download.sqlite
.read gencrosstab.sql
.exit
head *.csv
I'm all done with this for now.
Assignee: deinspanjer → mrz
Whiteboard: Daniel to re-run ftp log analysis and deliver results by 2009-07-17
Can you help interpret this? Looks like there are always > 0 accesses to Firefox files from 2004/2005?
That is correct.  Even when omitting obvious spider/bot activity, there is always more than 10 downloads of a 2004 and 2005 Firefox nightly build every day.

We sometimes have extreme spikes in downloads.  Obviously, when these spikes climb up into the thousands, it is very likely that this is the result of a spider or of a bug in someone's automated retrieval script.

We could go several directions from here:

1. We could ask Eric to take a look at the data and generate a more statistically sound model of the data. That would tell us things like the typical number of downloads along with the standard deviation, but I don't think that data is actually useful to you, is it?

2. We could re-run the analysis again, this time recording the user agent for every download.  This might help us determine better if there are any defining characteristics to the constant downloads that would indicate it would be okay to cut them off.

3. We could simply say we always have traffic from "people" downloading even old builds and we shouldn't take those builds off of the FTP site without providing an alternate download mechanism.  Here are some alternatives I can think of:

* Burn some DVDs and make them available to anyone who requests

* Seed them out on BitTorrent and try to get enough of a community that we can take our seeds offline

* Use a cloud storage such as Amazon S3, rsync.net, JungleDisk, etc.
* Try to build or borrow a distributed file store for this data similar to a mirror network using something like CFS, Ivy FS, FreeNet, Infinit, Pastis, OceanStore, or OpenAFS
(In reply to comment #44)
> That is correct.  Even when omitting obvious spider/bot activity, there is
> always more than 10 downloads of a 2004 and 2005 Firefox nightly build every
> day.

Wow, that seems like a lot of interest in 5+ year old nightly builds. I'd be impressed with that many downloads of a 5 year old release, but a 5 year old nightly baffles me. 

Is there any recurring pattern of IPs showing where those requests are coming from?
I can do another analysis run that does some calculations on IP address frequency.
The first thing that I can think of is to count the number of days for each year and month that an IP address was seen downloading an old nightly build.  If it is the same IPs constantly downloading then that would be good to know.
It would also probably be good to know if the same IP is downloading the same nightly build multiple times.  That would be wasted bandwidth.

I'm CCing Eric and Ken to see if they can provide any other useful statistics that can be calculated.
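A minimal sketch of that per-IP analysis, assuming combined-format log lines for the old-nightly requests are piped in on stdin (field positions are an assumption about the log format):

import sys
from collections import defaultdict

days_per_ip = defaultdict(set)          # distinct days on which each IP downloaded something
repeat_downloads = defaultdict(int)     # (ip, path) -> times the same file was fetched

for line in sys.stdin:
    parts = line.split()
    if len(parts) < 7:
        continue
    ip = parts[0]
    day = parts[3].lstrip("[").split(":")[0]  # e.g. 30/Jun/2009
    path = parts[6]
    days_per_ip[ip].add(day)
    repeat_downloads[(ip, path)] += 1

for ip, days in sorted(days_per_ip.items(), key=lambda kv: -len(kv[1]))[:20]:
    print("%-15s seen on %d distinct days" % (ip, len(days)))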
(In reply to comment #45)
> (In reply to comment #44)
> > That is correct.  Even when omitting obvious spider/bot activity, there is
> > always more than 10 downloads of a 2004 and 2005 Firefox nightly build every
> > day.
> 
> Wow, that seems like a lot of interest in 5+ year old nightly builds. I'd be
> impressed with that many downloads of a 5 year old release, but a 5 year old
> nightly baffles me. 
> 
> Is there any recurring pattern of IPs showing where those requests are coming
> from?

Also, is there any pattern to *which* 5 year old nightlies? With 365 nightlies a year, I'd be curious if there's a pattern of which 10 are being downloaded?
(In reply to comment #44)
> That is correct.  Even when omitting obvious spider/bot activity, there is
> always more than 10 downloads of a 2004 and 2005 Firefox nightly build every
> day.
> 

This sounds really high compared to my findings. Is there any way you can post a list of all the user agents somewhere? I suspect there's still some crawlers in there.

Also, were you counting directory accesses, or just downloads of files? The latter is all we care about IMHO.
In the attached sqlite database, there is a sampling of the top 10 user agents + a random sampling of 100 more for each day.

There is also a sampling of the rejected requests.
(In reply to comment #49)
> In the attached sqlite database, there is a sampling of the top 10 user agents
> + a random sampling of 100 more for each day.
> 
> There is also a sampling of the rejected requests.

Whoops, should have looked there first. In the sample_user_agent_requests I found some more things that are almost certainly crawlers:
Yandex*
Xenu Link Sleuth*
*nutch*
IDA
baidu*
Qryos*
CamelStampede*
Microsoft URL Control*
Diamond*
DoCoMo*
MediaPartners-Google
EmailSiphon
Space Bison*
TinEye*
Missigua Locator*

I dunno if excluding them would make any considerable difference, but they're certainly not "real" hits.
Many of those don't account for any significant amount of traffic. Below is a query that shows the overall top 10 non-MSIE, non-Gecko user agents counted.
I can take out most of these and the ones you list above.  My biggest concern is eliminating ones like Wget that I would imagine are quite likely legitimate downloads.  I frequently use Wget to download a package off of an FTP site like this.  The Java one could be the same, I don't know.  Maybe someone has built a tool to make doing regression windows easier? (although, knowing the Mozilla crowd, that tool would be built in Python, not Java. ;)

Also, I'm quite wary of the large number of MSIE downloads we have. I just can't imagine that many legitimate downloads of Firefox nightlies by people using MSIE. Unless maybe that is the user agent that is given if the user types an FTP address into a Windows Explorer address bar... I think we need the IP analysis too before we can say anything for sure.

sqlite> select user_agent, sum(num_requests) count from sample_user_agent_requests where user_agent not like '%MSIE%' and user_agent not like '%Gecko%' group by user_agent having count > 10 order by count desc;
Java/1.6.0_03|1010
Wget/1.11.4|560
Yandex/1.01.001 (compatible; Win16; I)|409
DoCoMo/2.0 P900i(c100;TB;W24H11)|123
YandexSomething/1.0|61
Wget/1.10.2 (Red Hat modified)|59
CamelStampede/0.0.7(beta)|27
Mozilla/4.0|27
CamelStampede/0.0.8|20
Opera/9.10 (Windows NT 5.0; U; de) -|14
EmailSiphon|12
Lynx/2.8.6rel.4 libwww-FM/2.14|11
Wget/1.10.2|11
That doesn't seem very astonishing to me. I guess people who see a bug in Firefox just want to know whether it once worked correctly.
(In reply to comment #51)
> Many of those don't account for any significant amount of traffic.  Below is a
> query that shows the overall top 10 non-MSIE, non-Gecko user agents counted.
> I can take out most of these and the ones you list above.  My biggest concern
> is eliminating ones like Wget that I would imagine are quite likely legitimate
> downloads. I frequently use Wget to download a package off of an FTP site like
> this.  The Java one could be the same, I don't know.  Maybe someone has built a
> tool to make doing regression windows easier? (although, knowing the Mozilla
> crowd, that tool would be built in Python, not Java. ;)
>

I totally agree about wget, Java, etc. - I believe those are entirely legitimate. Yandex is a search engine though, and I imagine YandexSomething is their crawler, and Google seems to think that DoCoMo, CamelStampede and EmailSiphon are crawlers.
I agree about the others.  I re-ran last night adding another dozen agents to the filter.  I also changed the process to dump all accepted user agents instead of sampling them.  This makes the db a bit bigger but it will help us figure things out better.

I'm going to try to run this again soon with one more big change, I'm going to import my user agent parser module to classify and categorize the user agents.  I'll then add a column to the main table that will indicate what type of user agent is responsible for the request.  That way we'll be able to look at the stats by MSIE/Gecko/Opera/Webkit/Mobile/Download Agent/Spider/Bot/Other.
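The actual parser module isn't attached here, so as a rough sketch of the kind of classification being described (the categories and substring rules below are illustrative only, not the real module's logic):

def classify_user_agent(ua):
    # Very rough bucketing; a real parser would be far more careful than substring checks.
    ua_l = ua.lower()
    if any(s in ua_l for s in ("bot", "spider", "crawler", "slurp", "yandex")):
        return "Spider/Bot"
    if any(s in ua_l for s in ("wget", "curl", "python", "java/", "libwww-perl")):
        return "Download Agent"
    if "opera" in ua_l:
        return "Opera"
    if "applewebkit" in ua_l:
        return "WebKit"
    if "gecko" in ua_l and "like gecko" not in ua_l:
        return "Gecko"
    if "msie" in ua_l:
        return "MSIE"
    return "Other"

print(classify_user_agent("Wget/1.11.4"))                             # Download Agent
print(classify_user_agent("Yandex/1.01.001 (compatible; Win16; I)"))  # Spider/Bot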
Attachment #389271 - Attachment is obsolete: true
I forgot to say, now that I am not sampling the user agents, we can see that there really are a huge number of hits from these libraries we were wondering about:
Python 53085
Wget    6964
Perl     146
(In reply to comment #55)
> I forgot to say, now that I am not sampling the user agents, we can see that
> there really are a huge number of hits from these libraries we were wondering
> about:
> Python 53085
> Wget    6964
> Perl     146

I would bet that Wget is legitimate. I'd like to think that Python is someone using a regression hunting script to download builds with...but it would be great to know for sure.
I only know of the following two scripts, but neither is implemented in Python:

http://db48x.net/hg/regression-search/ 
https://bugzilla.mozilla.org/show_bug.cgi?id=482536
This is no longer an issue after we added disk space.  We'll explore archiving to tape later.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard