Currently regression hunting is very tedious. It also gets far more time-consuming when we have to build changesets that require a different compiler/build setting. A few of us in #developers think it would be feasible to archive every changeset build for M-C/M-I for much longer than they are currently kept, perhaps for 5 years or so. We did a quick cost analysis and it seems to make sense, so I'm proposing the idea. I think it's worth researching this proposal.

Here's our very rough cost analysis:

5000 pushes/mo
1500 pushes/mo to M-C+M-I
150 MB per push to keep builds + symbols for win32/64, mac, linux32/64, android
150 MB/push * 1500 pushes/mo = 225 GB/mo, call it 300 GB/mo to account for growth
300 GB/mo * 12 mo/year = 3.6 TB/year
18 TB in 5 years
$100/mo to store 1 TB in S3 low redundancy = $1200/year per TB
$21600/year to store once we've accumulated 5 years of pushes

Note that we do NOT need high availability, and data loss is very acceptable (but a bit of a pain). I don't have accurate figures for the cost of development time so I'll leave that calculation open.

Implementation-wise this could be very simple: we could have a cron script that periodically pulls from the FTP and moves builds to S3 for long-term storage.
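For anyone who wants to re-check the cost estimate above, here is a minimal Python sketch of the same arithmetic. The inputs are the figures quoted in this comment; the function and constant names are made up for illustration, and decimal units (1 TB = 1000 GB) are assumed.

```python
# Rough storage-cost estimate. All inputs are the figures quoted above;
# names are illustrative and 1 TB = 1000 GB = 1,000,000 MB is assumed.

MB_PER_PUSH = 150           # builds + symbols for all platforms
PUSHES_PER_MONTH = 1500     # M-C + M-I only
PADDED_GB_PER_MONTH = 300   # raw figure rounded up to allow for growth
USD_PER_TB_MONTH = 100      # quoted S3 low-redundancy price
YEARS_RETAINED = 5


def raw_gb_per_month(mb_per_push=MB_PER_PUSH, pushes=PUSHES_PER_MONTH):
    """Unpadded monthly volume in GB."""
    return mb_per_push * pushes / 1000


def tb_retained(gb_per_month=PADDED_GB_PER_MONTH, years=YEARS_RETAINED):
    """Total TB held once the archive spans `years` years."""
    return gb_per_month * 12 * years / 1000


def usd_per_year(tb, usd_per_tb_month=USD_PER_TB_MONTH):
    """Yearly bill for keeping `tb` terabytes stored."""
    return tb * usd_per_tb_month * 12


if __name__ == "__main__":
    print(f"raw volume:    {raw_gb_per_month():.0f} GB/mo")
    print(f"after 5 years: {tb_retained():.1f} TB")
    print(f"steady state:  ${usd_per_year(tb_retained()):,.0f}/year")
```

The raw product is 225 GB/mo, so padding it to 300 GB/mo leaves about a third of headroom. The yearly bill only reaches the full figure once five years of builds have accumulated; before that it grows with the size of the archive.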
How does this relate to bug 669696? Specifically, would we rather store all the builds or regenerate the ones we need to bisect? I don't know how often developers need to do regression hunting that falls outside the timerange of builds that we already keep. I'm looking to be educated here.
Outside the range of tinderbox builds we already keep? Pretty often, I think, since we only keep those for a few weeks, right?
> Specifically, would we rather store all the builds or regenerate the ones we need to bisect? Much rather store all builds, personally. Our build system changes over time; it's hard to build m-c from one or two years ago. It would take a lot of work to make that builder work consistently. Moreover, there's no guarantee that the old build is the same as the new build. Toolchains change, PGO is non-deterministic, etc. Plus, rebuilding an old build takes much longer than downloading it. > I don't know how often developers need to do regression hunting that falls outside the timerange > of builds that we already keep. This is difficult to answer quantitatively; perhaps you can look at bugs which had regression-window-wanted set at some point. I suspect we don't have hourly builds for a large fraction of the regressions we hunt.
(In reply to Justin Lebar [:jlebar] from comment #3) > This is difficult to answer quantitatively; perhaps you can look at bugs > which had regression-window-wanted set at some point. I suspect we don't > have hourly builds for a large fraction of the regressions we hunt. Well, if we knew where the regressions were going to be... ;) Storage is one factor here, network transfer is another. I still haven't heard a good estimation of how often and how many developers are needing older builds. I want to make sure that either downloads are light enough that they won't matter as a fraction of the storage cost or that organizationally we don't care and are willing to spend that money. Again, having an estimate of usage over some time period would be helpful here. bz, jlebar, BenWa: even if the three of you just averaged your own bisection efforts over the past year, that would be useful data. I think there's a legitimate business case here, and I'll discuss it with releng/finance at our teamweek at the end of July. Any extra data will help.
I haven't been writing this down, but from memory my impression is that 50% or so of my bisections involve finding a regression range for a bug that's being reported because we shipped it in beta or final. Which means the actual regression happened 6-18 weeks earlier (for beta) or 12+ weeks earlier (for final). Of the remainder, about half are 0-12 weeks old (bug reported on Aurora). The rest are older; in some cases much older. But note that I have a slightly biased sample because for cases when we have hourlies available our wonderful volunteer QA already use them to find a range, which means I don't have to think about those bisections at all. And that's a _huge_ win, because even 5 minutes of thinking, multiplied over many bugs, really adds up. Volunteer QA do this for nightlies too, but after that's done either someone has to bisect over one day by building or someone who knows the entire codebase has to take the time to read carefully through the changelog to see what in there might be relevant. It looks like right now we store hourlies for a month or so, which means that by the time a build goes to beta we no longer have hourlies from its development.
Oh, to put more numbers to this, I estimate that I personally would save about 2 hours a week on reading changelogs and whatnot if volunteer QA could always just bisect on hourlies. That's a few thousand dollars a year, right? I suspect if we actually added this up over a few other people who end up dealing with regression ranges, we pretty easily end up in the ballpark of comment 0's numbers. That said, I don't know how much network transfer costs are...
Case in point: bug 778128 missed our current monthly cutoff by a week or so.
I think this is a question for IT to start with. Do we have capacity to keep dep builds for 18 or 24 weeks (a full release cycle) instead of the current 4 weeks?
I'm going to move this over to the storage queue, since dparsons and gcox are the folks who can best give you answers to these questions.
How much disk space would you need for the 24 weeks?
This would also be excellent for performance regression hunting, which we don't do much of right now, but should do more in the future. :)
In case no one here has heard of it yet, Amazon's new Glacier system might be a good fit for this: http://aws.amazon.com/glacier/ $0.01 per GB per month
As long as I was reminded of this bug... More anecdotal data: I've dealt with 4 regression ranges so far today that were about 6-8 weeks in the past (as in, they're being reported on aurora). For two of these, reading the longish changelog for 10 mins or so found the causes. For the other two, I'm going to need to do manual bisections or find someone else to do them....
(In reply to Dan Parsons [:lerxst] from comment #10)
> How much disk space would you need for the 24 weeks?

A conservative estimate is about 300GB/mo, so 1.8 TB every 6 months.

(In reply to Dan Parsons [:lerxst] from comment #12)
> In case no one here has heard of it yet, Amazon's new Glacier system might
> be a good fit for this:
>
> http://aws.amazon.com/glacier/
>
> $0.01 per GB per month

That's very interesting and would bring my cost estimate to ~$2,200 a year, which would make this a no-brainer. The only problem is that with retrieval times of 3-5 hours, regression hunting (a binary search problem) would take very long. Perhaps we could modify how we do it by using a k-way search (e.g. ternary search) instead of a binary one. Say we did a 5-way search and we have 5000 pushes a month over 5 years: we'd need ceil(log(5000*12*5) / log(5)) = 8 Glacier batch retrievals * 5 hours = 40 hours of waiting. I guess this depends on the budgeting.
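The retrieval-count estimate above can be checked with a short Python sketch. It assumes each round of a k-way search is one Glacier batch retrieval that narrows the candidate range by a factor of k, and that every batch takes the worst-case 5 hours. The function names are hypothetical; the 18 TB archive size is the five-year figure from comment 0.

```python
def batch_retrievals(num_builds, k):
    """Rounds needed for a k-way search over num_builds candidates.

    Each round tests the k - 1 interior split points (one batch retrieval)
    and keeps a single segment, so the range shrinks by a factor of k.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    rounds = 0
    remaining = num_builds
    while remaining > 1:
        remaining = (remaining + k - 1) // k  # integer ceil(remaining / k)
        rounds += 1
    return rounds


def wait_hours(num_builds, k, hours_per_batch=5):
    """Total time spent waiting on batch retrievals."""
    return batch_retrievals(num_builds, k) * hours_per_batch


def glacier_usd_per_year(stored_gb, usd_per_gb_month=0.01):
    """Yearly storage bill at the quoted per-GB monthly price."""
    return stored_gb * usd_per_gb_month * 12


if __name__ == "__main__":
    pushes = 5000 * 12 * 5  # 5000 pushes a month over 5 years
    print(batch_retrievals(pushes, 5), "rounds,", wait_hours(pushes, 5), "hours")
    print(f"${glacier_usd_per_year(18_000):,.0f}/year for 18 TB")
```

With k = 5 this gives 8 rounds (40 hours); plain binary search (k = 2) would need 19. The trade-off is that every round of a k-way search has to download k - 1 builds instead of one.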
Bug 787947 is an example where I have to manually bisect an intraday range.
(In reply to Benoit Girard (:BenWa) from comment #14) Actually my analysis here is wrong. We retain all nightlies and can use them to get to an intraday range. From there we can pull all builds for that day. So we're looking at a 5 hour wait to get an intraday range, which is great for the cost.
While it's on my mind: You'd want to use S3 as a cache in front of glacier, so that if I happen to request the same set of builds twice within a few days, we don't have to go to glacier twice.
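A minimal sketch of that read-through cache idea, in Python. Nothing here talks to AWS: `fetch_from_archive` is a hypothetical stand-in for a slow Glacier retrieval, the in-memory dict stands in for the S3 layer, and the few-day lifetime is an arbitrary choice.

```python
import time


class BuildCache:
    """Read-through cache: serve recent requests locally, else hit the archive."""

    def __init__(self, fetch_from_archive, ttl_seconds, clock=time.time):
        self._fetch = fetch_from_archive   # slow path (hypothetical Glacier pull)
        self._ttl = ttl_seconds            # how long a cached build stays valid
        self._clock = clock                # injectable so the TTL can be tested
        self._entries = {}                 # key -> (time stored, payload)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]                # cache hit, no archive round trip
        payload = self._fetch(key)         # miss or expired: go to the archive
        self._entries[key] = (now, payload)
        return payload


if __name__ == "__main__":
    def slow_fetch(key):
        print("archive retrieval for", key)
        return b"build for " + key.encode()

    cache = BuildCache(slow_fetch, ttl_seconds=3 * 24 * 3600)
    cache.get("mozilla-inbound-linux/1370000000")  # goes to the archive
    cache.get("mozilla-inbound-linux/1370000000")  # served from the cache
```

A real version would also need to evict expired entries to keep the cache storage bounded; this sketch only avoids the repeat retrieval.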
Yes, that's a great idea, but it also greatly increases the complexity of the initial implementation. I'd love to get a budget approved for a Glacier-only front end, and then I could write a simple cronjob to upload to Glacier once a day and get something running in a few working days. From there we could add the S3 front end as a follow-up.
Where do we stand on this? Who needs what? Should this bug be closed?
Where we stand on this is that nothing has happened and the fact that nothing has happened means that developers and volunteer QA are spending tons of extra time bisecting via local builds. What _needs_ to happen is to store hourly builds for at least 24 weeks (basically one release cycle), and then we can discuss the rest in a followup. I am entirely indifferent as to how this at-least-24-week-storage result is achieved as long as there is reasonable (a few hours to get them at most, given a fast network connection) access to the builds for any given day. > Should this bug be closed? You tell me. Are there any plans to ever fix it and stop people having to waste time hunting down regression ranges?
Oh, and to answer the question from comment 10: Based on the numbers in comment 0, 24 weeks should require approximately 1.8TB of storage, of which about 300GB we're already using, so we'd need 1.5TB additional storage. Let's play it safe and call it 2TB because we're doing more pushes and whatnot. I'm sorry no one answered that question earlier. Does having that answer make this bug more actionable?
A suggestion which I remember someone making (I forget who) was that we could check all our binaries into a git repository, which we could then locally bisect. The claim was that OOo does this, and that git is relatively good at compressing binaries, so the resultant repository isn't ginormous. This is the sort of thing that a developer could hack up in a week. Maybe we should try it, instead of continuing to hold our breath.
I agree with Boris, I scope creeped this bug. Perhaps we can begin by making the FTP retain 24 weeks. Then as a follow up we can extend mozregression to use the 24 week retention to find specific changesets.
If we have someplace we could put the git repo that doesn't involve trying to wring storage space out of server ops, let's do it. It can't be worse than the current situation...
(In reply to Justin Lebar [:jlebar] from comment #22) > A suggestion which I remember someone making (I forget who) was that we > could check all our binaries into a git repository, which we could then > locally bisect. The claim was that OOo does this, and that git is > relatively good at compressing binaries, so the resultant repository isn't > ginormous. > > This is the sort of thing that a developer could hack up in a week. Maybe > we should try it, instead of continuing to hold our breath. That's an interesting idea. I did read that about OOo but didn't find much info on how it was implemented. Do we have any numbers on what the size of this git repo would be? Can you run an experiment? If we can achieve something like OOo, where we can cram a ton of binaries into one repo, that would be amazing.
Created attachment 716348 script.sh I ran some numbers using this script: https://docs.google.com/a/fantasytalesonline.net/spreadsheet/ccc?key=0AqhCk0oQJImvdGRmTEE5dVphRWtWN3FUVVdiOTlGRlE#gid=0 Using a git repo we're probably looking at 3MB/push for a single platform's package if we store it in a repacked git repo. I think for now the best thing to do is to modify the FTP retention policy to a full cycle (24 or 32 weeks) and make sure we have enough capacity.
bz, sorry this has sat for so long and sorry you feel like you have to wring space out of us :) This is totally not the case and we'd like to see what the best possible, workable solution is. As you already noted in comment #21, that was important for us to know and it kind of got lost in the noise of the rest of the comments. bz, coop, BenWa: Please let me know if I'm summarizing this correctly: We're looking to increase current ftp storage by about 2TB to allow you folks to store more builds to save time when you have to hunt regressions. Correct? Once I have that down correctly, I'll work with people and make sure we get this resolved. Also in the future, bz, feel free to send bugs like this to me over email if you're waiting for a response. Although the IT bug queue isn't as big as the Firefox ones, things do tend to get lost due to the sheer number of requests we get. I'm more than happy to help here and get this sorted out, one way or another.
> We're looking to increase current ftp storage by about 2TB to allow you folks to store > more builds to save time when you have to hunt regressions. Correct? Assuming we didn't screw up our math anywhere, correct. Sorry for the frustration here on both sides; I understand how bug backlogs go, and I certainly know the feeling of a bug not having the info it needs to be actionable...
FWIW I tried using bsdiff and didn't get great results; a fully-compacted repository with 100 m-i debug builds takes up 250MB (about 2.5MB per build). (The repository has one set of full binaries and then the rest are bsdiff'ed against this one.) This isn't much better than the 3MB per build that BenWa saw without bsdiff. However, it looks like the two biggest offenders with respect to space are omni.ja and libxul.so. I think if we extracted omni.ja and used courgette to diff the libraries, there's a chance we could make the git repository much smaller.
I filed a clone bug to investigate using git which is more of an experiment at this point. Let's continue discussing increasing the FTP retention here.
(In reply to Boris Zbarsky (:bz) from comment #28) > Assuming we didn't screw up our math anywhere, correct. Okay. Cool. > Sorry for the frustration here on both sides; I understand how bug backlogs > go, and I certainly know the feeling of a bug not having the info it needs > to be actionable... No problem. The shortest way to do this is to identify where on the ftp setup the builds that you need are stored and see if we can move that to another netapp volume and give it about 3TB of space. I'm going to have to work with both Webops and Storage folks here as well as releng, so give me a bit of time and I'll have this sorted out for you.
I assume we'd want some subset of http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/ Perhaps just mozilla-inbound-*?
Ted, would be nice to have confirmation of that. bz, here is the plan : The storage team will create an additional slice as needed to fit in this extra space requirement. The webops team will work on that and make sure the data is migrated cleanly and that you have more space. Once all this is done, we need to make sure any cleaning scripts are told to push back on their timeline and don't wipe out the builds ahead of time. Going to assign this to Jake, who'll work with storage on what's needed.
Additional data I don't see mentioned here, FYI: we recently enabled ASAN builds (bug#753148) and have more coming. Each of these builds is >300MB.
We don't need long-term archives of ASAN builds, imo.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #32) > I assume we'd want some subset of > http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/ > > Perhaps just mozilla-inbound-*? BenWa, bz, jlebar: Do you think archiving just mozilla-inbound tinderbox builds (possibly some subset, like just *-opt and *-debug) would be sufficient?
That's probably fine by me, given the low traffic on m-c nowadays....
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #36) > (In reply to Ted Mielczarek [:ted.mielczarek] from comment #32) > BenWa, bz, jlebar: Do you think archiving just mozilla-inbound tinderbox > builds (possibly some subset, like just *-opt and *-debug) would be > sufficient? m-i is certainly sufficient. I would even claim *-opt is sufficient. Typically, hunting a long-term regression is the result of observing a failure in a nightly build. Opt builds mimic nightlies the closest. I can see retaining debug builds being handy, but if we can't I'd still be happy.
(In reply to comment #35) > We don't need long-term archives of ASAN builds, imo. We do if we want to bisect security bugs found on those builds.
And I would like a pony. I think we should focus on an easily attainable goal that has a large value-add. Storing just mozilla-inbound*opt (and optionally debug) builds should provide maximum bang-for-the-buck.
1) Once diskspace is available on ftp.m.o, bug#765258 is corresponding RelEng work to adjust cron scripts. 2) morphing summary, based on decisions reached in this (long) thread, and work in bug#765258. The curious should also follow bug#707843, about integrating ftp.m.o in with s3/glacier.
(In reply to John O'Duinn [:joduinn] from comment #41) > 1) Once diskspace is available on ftp.m.o Is there a request open for this task since this is the bulk of the work?
(In reply to Benoit Girard (:BenWa) from comment #42) > Is there a request open for this task since this is the bulk of the work? My understanding is that :jakem is doing that as part/most of this bug.
(In reply to Chris Cooper [:coop] from comment #43) > (In reply to Benoit Girard (:BenWa) from comment #42) > > Is there a request open for this task since this is the bulk of the work? > > My understanding is that :jakem is doing that as part/most of this bug. Correct.
> bug#765258 is corresponding RelEng work to adjust cron scripts. I'm not sure I follow. This bug is now about increasing the disk space. Is it also tracking the cron scripts, or should that be a different bug?
(In reply to Boris Zbarsky (:bz) from comment #45) > > bug#765258 is corresponding RelEng work to adjust cron scripts. > > I'm not sure I follow. This bug is now about increasing the disk space. Is > it also tracking the cron scripts, or should that be a different bug? The goal posts have moved a few times here, so let me try to sum up the current state as I understand it. IT is going to increase the capacity of the existing ftp partition up to 3TB to accommodate a full release cycle (i.e. 4x6weeks) of nightly builds. This will not include all nightlies, but only the subset that are considered diagnostically relevant for regression hunting. From the discussion above, let's proceed with archiving *all* of mozilla-inbound this way until we run out of space. If we do run out of space, we can always remove debug builds, and then possibly Ted's pony. This will require some new cronjobs to move and expire m-i builds properly. I will file a separate releng bug for that. At some point in the future, we'll move all these regression-hunting builds to S3/glacier. That work will happen in bug 707843 and children. We'll need to update the cronjobs at that point, although we may choose to do a one-time sync from ftp->S3 and upload new to both places in the future.
Shyam: do you need any further information here for IT to proceed?
(In reply to Chris Cooper [:coop] from comment #46) > This will require some new cronjobs to move and expire m-i builds properly. > I will file a separate releng bug for that. Filed bug 850202.
Chris, perfect, thanks. That completely clears things up.
(In reply to Chris Cooper [:coop] from comment #46) > IT is going to increase the capacity of the existing ftp partition up to 3TB > to accommodate a full release cycle (i.e. 4x6weeks) of nightly builds. This > will not include all nightlies, but only the subset that are considered > diagnostically relevant for regression hunting. nit: s/nightly/tinderbox onchange builds/.
(In reply to Chris Cooper [:coop] from comment #47) > Shyam: do you need any further information here for IT to proceed? Bug is assigned to Jake, he'll do what's needed.
FYI, we have tinderbox-builds in a couple of locations (for space reasons) and use symlinks to make it look like one space. Bug 765115 is on file to combine that into one, but if we don't get that prior to granting more space please be careful where you add it.
bz, jlebar, ehsan, and other interested developers: Which set of files are useful out of this list:

firefox-24.0a1.en-US.langpack.xpi
firefox-24.0a1.en-US.linux-i686.checksums
firefox-24.0a1.en-US.linux-i686.checksums.asc
firefox-24.0a1.en-US.linux-i686.crashreporter-symbols.zip
firefox-24.0a1.en-US.linux-i686.json
firefox-24.0a1.en-US.linux-i686.tar.bz2
firefox-24.0a1.en-US.linux-i686.tests.zip
firefox-24.0a1.en-US.linux-i686.txt
jsshell-linux-i686.zip
mar
mbsdiff
mozilla-inbound-linux-bm62-build1-build100.txt.gz + many more logs

I'm guessing the firefox executable, the .txt file for the revision, maybe the symbols and tests.
The executable is the only thing I really end up using personally, but I would expect that some tools might use the .txt (which is small in any case).
Treeherder (the replacement for TBPL) will need the logs. Devs+mozregression will need the binaries (both those shown below and for other platforms). The revision txt is also useful for those bisecting by hand + tooling. I think we might not need the xpi or tests zip longer term.
(In reply to Ed Morley [:edmorley UTC+1] from comment #55) > I think we might not need the xpi or tests zip longer term. The tests zip will be quite useful to have for Bisect In The Cloud, as the system can download them and then use them at that stage if no new tests have been passed in. This might change, but I would rather have them for now and decide later whether we can get rid of them.
(In reply to Nick Thomas [:nthomas] from comment #53)
> Which set of files are useful out of this list:

If I had to order them in terms of priority, I would say:

> firefox-24.0a1.en-US.linux-i686.tar.bz2

The build itself, most important.

> firefox-24.0a1.en-US.linux-i686.json
> firefox-24.0a1.en-US.linux-i686.txt

Info about changeset and build ID, pretty important (also tiny). These + the build are the only things I would designate as "must have". Everything below this would be "nice to have".

> firefox-24.0a1.en-US.linux-i686.crashreporter-symbols.zip
> firefox-24.0a1.en-US.linux-i686.tests.zip

Symbols + tests, somewhat important if we want to try to bisect test failures (as David points out) or get a useful stack out of crashes.

> mozilla-inbound-linux-bm62-build1-build100.txt.gz + many more logs

Logs, somewhat important since TBPL can use them, although for the use cases outlined in this bug I'm not sure how important they really are.

> jsshell-linux-i686.zip

JS shell binaries, I know that various people using fuzzers etc like to have these handy, but I'm not sure if they need them long-term. Bisecting for a regression range is probably easier with these available.

> mar
> mbsdiff
> firefox-24.0a1.en-US.langpack.xpi
> firefox-24.0a1.en-US.linux-i686.checksums
> firefox-24.0a1.en-US.linux-i686.checksums.asc

None of these are important.
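If an archiving script had to apply the priorities above, the filter could look something like this Python sketch. The matching rules are only inferred from the Linux file names quoted in this comment; other platforms use different package extensions, which this sketch does not try to cover, and all names in the code are hypothetical.

```python
MUST_HAVE = "must have"
NICE_TO_HAVE = "nice to have"
NOT_IMPORTANT = "not important"
UNKNOWN = "unknown"


def classify(filename):
    """Map one file name from a tinderbox build directory to a priority tier."""
    # Update tooling, langpacks and checksums: not needed for regression hunting.
    if filename in ("mar", "mbsdiff"):
        return NOT_IMPORTANT
    if filename.endswith((".langpack.xpi", ".checksums", ".checksums.asc")):
        return NOT_IMPORTANT
    # Build logs are gzipped .txt files; check before the plain .txt rule below.
    if filename.endswith(".txt.gz"):
        return NICE_TO_HAVE
    if filename.startswith("jsshell-"):
        return NICE_TO_HAVE
    if filename.endswith((".crashreporter-symbols.zip", ".tests.zip")):
        return NICE_TO_HAVE
    # The build itself plus the small changeset / build ID metadata files.
    if filename.endswith((".tar.bz2", ".json", ".txt")):
        return MUST_HAVE
    return UNKNOWN


if __name__ == "__main__":
    for name in (
        "firefox-24.0a1.en-US.linux-i686.tar.bz2",
        "firefox-24.0a1.en-US.linux-i686.tests.zip",
        "firefox-24.0a1.en-US.linux-i686.checksums.asc",
    ):
        print(f"{classify(name):14} {name}")
```

Anything the rules do not recognise is reported as unknown rather than silently dropped, so new file types would show up for a human to triage.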
Note that if you compress multiple logs together, I expect them to be rather small. (All the discussion I've seen about log size talks about compressing them separately.) But I don't want us to get hung up on the logs if that means we can't save the binaries.
Ok, change of plan since comment #46 - back to an Amazon S3 solution. I've got a bucket set up which I need your help to test. There are a few different ways to access this:

* an Apache lookalike at http://ec2-50-112-87-128.us-west-2.compute.amazonaws.com/ provided for compatibility with existing scripts that scrape ftp.mozilla.org, and for humans to use. If this is useful it'll move to a more permanent domain
* using the S3 REST API with https://firefox-mozilla-inbound-usw2.s3.amazonaws.com/
* using language-specific S3 interface libraries (eg boto for python), connecting to the bucket firefox-mozilla-inbound-usw2. This requires you have an AWS account and you'll pay a small amount for HEAD and GET requests that go on behind the scenes, and possibly download traffic, but gives you the most flexibility

The bucket has all the mozilla-inbound depend builds from tinderbox-builds/ since Firefox 24 development started on 2013-05-13. I've included most of the files we create, including test archives + jsshell, and added the desktop pgo and mobile builds too. It adds up to 3.5TB & 215000 items for the 6 weeks of Firefox 24, so we'll see how the costs go and revisit if there's a problem.

This is not production quality yet, feedback welcome.
Nick, I wonder if it would be possible to mimic the folder structure we have on ftp.m.o. That way it would be easier for tools like mozregression or mozdownload to retrieve those builds.
You mean with the leading /pub/mozilla.org/firefox/... ?
> You mean with the leading /pub/mozilla.org/firefox/... ? Yep.
This bug has gone back and forth several times. See comment #59 and dep bugs for how we're going to get it done by hosting the bits in Amazon.
We've got this up and running at the following URL. If there are any issues, please file bugs blocking bug 707843 http://inbound-archive.pub.build.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/
Why is no last modified date visible on listings like: http://inbound-archive.pub.build.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux/ As of now it will be very hard for someone to pick the right build when manually doing a regression test.
(In reply to Henrik Skupin (:whimboo) from comment #65) > Why is no last modified date visible on listings like: > http://inbound-archive.pub.build.mozilla.org/pub/mozilla.org/firefox/ > tinderbox-builds/mozilla-inbound-linux/ > > As of now it will be very hard for someone to pick the right build when > manually doing a regression test. Those dates are *terrible* for regression windows. They are in the order the builds finish. When you get down to the last ~20 changesets, the builds are all scrambled if you look at last modified vs. push order. I'm hoping we can come up with a solution in bug 789112.
The names of the directories correspond to the buildid of the build, which is normally in the order the builds start. There's a proposal floating around somewhere about including the revision in the directory name, and to stop using unix timestamps as the name. That's explicitly out of scope for this project. If we want to change how we lay out files on FTP, awesome, let's do that, and then this archive will follow suit.
As one of the QA Volunteers who will benefit from this, I just honestly wanted to thank you for attacking Bug 463034 Comment 0 (finally) :-)
(In reply to Chris AtLee [:catlee] from comment #67) > The names of the directories correspond to the buildid of the build, which > is normally in order the builds start. For regression windows the build start time is not accurate, and this leads to *incorrect* regression window results. I've done inbound regression windows myself manually and got bad results. I had to get the changeset ID from the .txt/.json file. It is feasible to get a fix for this in the short-ish term because we're almost there, but this is effectively blocking bug 789112. If we don't fix it then bug 789112 will need to open all nearby .json files (with a very generous range), parse them and get the push data for those changes, because the buildids can't be trusted. This is really far from ideal. Why can't we just have virtual directory structures? We could keep the current structure and have another structure based on push date, for example.
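To illustrate the ordering problem described above, here is a small Python sketch that sorts archived builds by push order instead of by build ID. The data is invented, and it assumes the changeset for each build has already been read out of its .txt/.json file and that the push order is available as a list of changesets, oldest first (for example from the repository's push log). Both structures are hypothetical.

```python
def order_by_push(builds, push_order):
    """Return builds sorted by the position of their changeset in push_order.

    builds:     list of (build_id, changeset) pairs, in any order
    push_order: list of changesets, oldest push first
    Builds whose changeset is not in push_order are left out.
    """
    position = {changeset: index for index, changeset in enumerate(push_order)}
    known = [build for build in builds if build[1] in position]
    return sorted(known, key=lambda build: position[build[1]])


if __name__ == "__main__":
    # Invented example: the second push started building before the first.
    push_order = ["aaa111", "bbb222", "ccc333"]
    builds = [
        ("20130601120500", "bbb222"),
        ("20130601120700", "aaa111"),
        ("20130601121000", "ccc333"),
    ]
    print("by build ID:", [c for _, c in sorted(builds)])
    print("by push:    ", [c for _, c in order_by_push(builds, push_order)])
```

Sorting by build ID puts bbb222 ahead of aaa111 here, which is exactly the kind of scrambling that produces a wrong regression window; sorting by push position does not depend on when a build started or finished.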
inbound-archive has served its time and is now deprecated. The content will be moved into archive.m.o in bug 1233623.