Closed Bug 843624 Opened 11 years ago Closed 11 years ago

Create a compact git repo of M-C/M-I binaries for Regression Hunting

Categories

(Core :: General, defect)

defect
Not set
normal

RESOLVED WONTFIX

People

(Reporter: BenWa, Unassigned)

References

Details

(Whiteboard: [storage][capacity][reit])

+++ This bug was initially created as a clone of Bug #765258 +++

jlebar suggests that we use a git repo to store our binaries and use some efficient interdiff techniques:

(In reply to Justin Lebar [:jlebar] from comment #22)
> A suggestion which I remember someone making (I forget who) was that we
> could check all our binaries into a git repository, which we could then
> locally bisect.  The claim was that OOo does this, and that git is
> relatively good at compressing binaries, so the resultant repository isn't
> ginormous.
> 
> This is the sort of thing that a developer could hack up in a week.  Maybe
> we should try it, instead of continuing to hold our breath.


See bug 765258 comment 26 for a quick test I did showing that git alone gives us about 3MB per push. Great, but not enough.
Assignee: server-ops-storage → nobody
Component: Server Operations: Storage → General
Product: mozilla.org → Core
QA Contact: dparsons
Version: other → unspecified
Do you want this under git.mozilla.org ?
(In reply to Shyam Mani [:fox2mike] from comment #1)
> Do you want this under git.mozilla.org ?

I was hoping we could get it small enough that we could put it on github, so that we didn't have to involve IT.
It looks like courgette is x86-32 only.  Also, it appears to have code for ELF and Win32, but not Mac.  Sad day.  :(
Moving to correct product/component for repo request
Component: General → Release Engineering: Developer Tools
Product: Core → mozilla.org
QA Contact: hwine
Version: unspecified → other
(In reply to Justin Lebar [:jlebar] from comment #3)
> It looks like courgette is x86-32 only.  Also, it appears to have code for
> ELF and Win32, but not Mac.  Sad day.  :(

Can you clarify if that means this is still an open request for a git.mozilla.org repo? (github would be better for this sort of repo, aiui)
I never intended for this to be a request against git.m.o; sorry that wasn't clear.
(In reply to Justin Lebar [:jlebar] from comment #3)
> It looks like courgette is x86-32 only.  Also, it appears to have code for
> ELF and Win32, but not Mac.  Sad day.  :(

Does this mean we should WONTFIX?
Well... 3MB per build is still ~8x better than what we're currently getting on ftp.

What would we have to get this down to in order to declare it feasible, I wonder?
Oops - sorry for the drive by then - moving back to where I found it.
Component: Release Engineering: Developer Tools → General
Product: mozilla.org → Core
QA Contact: hwine
Version: other → unspecified
I think it would be nice even if we could use this for just Linux x86-32 builds.

Unfortunately Courgette dies on libxul ELF x86-32.  It's not clear if this is a bug or if it's due to the hackery we apply to libxul (e.g. elfhack).  Courgette is playing with the relocations in the library, so it's entirely possible that we're breaking it with elfhack.

I'm not totally ready to give up here; at the very least, I want to try building Courgette on Linux; perhaps it doesn't work right on Mac (it had some trivial compile errors, and it doesn't work with mac binaries).
(In reply to Justin Lebar [:jlebar] from comment #10)
> I think it would be nice even if we could use this for just Linux x86-32
> builds.
> 
> Unfortunately Courgette dies on libxul ELF x86-32.  It's not clear if this
> is a bug or if it's due to the hackery we apply to libxul (e.g. elfhack). 
> Courgette is playing with the relocations in the library, so it's entirely
> possible that we're breaking it with elfhack.

We could disable elfhack before running courgette
> We could disable elfhack before running courgette

Well, the issue is that we want to archive m-c and m-i builds, and I don't think we want to disable elfhack there.
bug 816494 would help when it's done.
That being said, would courgette actually help here? Wouldn't it make the repo bigger by making git's interdiffing less efficient? How helpful would it actually be? IIRC courgette does efficient interdiffing of binaries. If you store such interdiffs, then what you get out of the git repo are interdiffs; you need to apply them, and git is not going to do it for you. So you need scripts to get the interdiffs from git and reconstruct the file, and another script to store a full file instead of an interdiff once in a while, so that you don't have to apply thousands of interdiffs just to get one file. At this point, you've almost reinvented git-like storage and are storing it in git. That last part seems useless.
I think you'd have a better chance at making git happy by using "simple" filters on the files you put in git like the ones LZMA uses.
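To make the bookkeeping described above concrete, here is a minimal sketch of a delta chain with periodic full snapshots. It is not what courgette or bsdiff do: `make_delta` is a stand-in that deflates the new version with the previous one as a preset dictionary (which only helps within zlib's 32KB window), and `SNAPSHOT_EVERY`, `store_chain` and `reconstruct` are hypothetical names used only for illustration.

```python
import zlib

SNAPSHOT_EVERY = 4  # hypothetical interval: every 4th version is stored in full


def make_delta(base: bytes, new: bytes) -> bytes:
    """Stand-in for a real binary differ: deflate `new` using `base` as dictionary."""
    comp = zlib.compressobj(level=9, zdict=base)
    return comp.compress(new) + comp.flush()


def apply_delta(base: bytes, delta: bytes) -> bytes:
    """Inverse of make_delta; needs the exact same `base`."""
    decomp = zlib.decompressobj(zdict=base)
    return decomp.decompress(delta) + decomp.flush()


def store_chain(versions):
    """Return one ("full" | "delta", payload) entry per version."""
    entries = []
    for i, data in enumerate(versions):
        if i % SNAPSHOT_EVERY == 0:
            entries.append(("full", zlib.compress(data, 9)))
        else:
            entries.append(("delta", make_delta(versions[i - 1], data)))
    return entries


def reconstruct(entries, index):
    """Walk back to the nearest full snapshot, then replay deltas forward."""
    start = index - (index % SNAPSHOT_EVERY)
    data = zlib.decompress(entries[start][1])
    for i in range(start + 1, index + 1):
        data = apply_delta(data, entries[i][1])
    return data
```

With this layout, fetching any one version costs at most `SNAPSHOT_EVERY - 1` delta applications, which is the trade-off between storage and reconstruction time that the comment points at.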
(In reply to comment #14)
> That being said, would courgette actually help here? wouldn't it make the repo
> bigger by making git's interdiffing less efficient? how helpful would it
> actually be? iirc courgette does efficient interdifing of binaries. If you
> store such interdiffs, then what you get off the git repo are interdiffs ; you
> need to apply them, and git is not going to do it for you. So you need scripts
> to get the interdiffs from git and reconstruct the file. And another script to
> not insert an interdiff once in a while so that you don't have to use thousands
> of interdiffs just to get one file. At this point, you've almost reinvented a
> git-like storage and are storing it in git. That last part seems useless.
> I think you'd have a better chance at making git happy by using "simple"
> filters on the files you put in git like the ones LZMA uses.

Agreed.  This, and the fact that courgette doesn't support all the binary formats and architectures that we need, seem to indicate that we should just focus on putting things in a normal git repo without any such trickery.
> That being said, would courgette actually help here? wouldn't it make the repo bigger by 
> making git's interdiffing less efficient?

Git is adding a few MB of diffs per changed libxul.  bsdiff was giving ~500KB diffs.  Courgette promises a factor of 10 reduction over bsdiff.  So while you're of course right that it would make git's interdiffing less efficient, it could still be a win overall.

> we should just focus on putting things in a normal git repo without any such trickery.

If we do this, we may end up not doing the git repo at all.

joudinn says that (.29 + .023) * 6,247 = ~2000 checkins in January were to m-i or m-c.  At 3MB a pop, this is 6GB per month.  The point is to archive these repositories long-term, so we're looking at 72GB per platform per year.

That's of course doable, but the thing that was holding us back was getting Mozilla IT's help, so I was hoping we could do this without involving IT.  I'm not sure where we're going to be able to host ~10 x ~70GB git repositories.
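The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes January is representative of every month, that 1 GB = 1000 MB, and it takes the "~10" platforms literally. The constant and function names are made up for illustration.

```python
# Figures quoted in the comment above.
TOTAL_CHECKINS_JANUARY = 6247
SHARE_M_I_OR_M_C = 0.29 + 0.023
MB_PER_PUSH = 3
PLATFORMS = 10  # the "~10" in the comment; an approximation


def estimate_gb(pushes_per_month: int) -> tuple[float, float, float]:
    """Return (GB per month, GB per platform per year, GB per year for all platforms)."""
    monthly = pushes_per_month * MB_PER_PUSH / 1000  # assumes 1 GB = 1000 MB
    yearly = monthly * 12
    return monthly, yearly, yearly * PLATFORMS


exact_pushes = round(TOTAL_CHECKINS_JANUARY * SHARE_M_I_OR_M_C)
print(exact_pushes)       # 1955, which the comment rounds to ~2000
print(estimate_gb(2000))  # (6.0, 72.0, 720.0)
```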

I suppose we could stick a 4TB drive in Albus, which sits under jst's desk and generates the awsy data.  I was hoping to find someone else to maintain the servers for us, though.

> I think you'd have a better chance at making git happy by using "simple" filters on the 
> files you put in git like the ones LZMA uses.

Could you elaborate on this?  I'm not sure what you mean.

AIUI the essential problem Courgette is addressing is that when you change libxul most of the relocations change, even if most of the code doesn't change.  If we had some way to expand libxul so that the relocations were easier for git to interdiff, and then reassemble it later, that might help.

This is essentially what Courgette is doing (it eventually uses bsdiff), except I think it also applies some transformations to the expanded new library to make it compress better against the original.
(In reply to Justin Lebar [:jlebar] from comment #16)
> Could you elaborate on this?  I'm not sure what you mean.

LZMA filters binaries before compressing them, with a different filter per platform:
http://git.tukaani.org/?p=xz.git;a=tree;f=src/liblzma/simple;h=a8dfd5f5ce3113da67bbcf2dc46ad218b52fb2fe;hb=e7b424d267a34803db8d92a3515528be2ed45abd

These filters modify branches, calls and jumps.
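For reference, these filters are exposed by Python's standard `lzma` module (which wraps liblzma), so their effect is easy to try. The sketch below compresses a buffer with the x86 filter in front of LZMA2; the helper name `compress_x86` and the preset are arbitrary choices, and whether the filter helps on a given libxul is something to measure, not something this snippet shows.

```python
import lzma


def compress_x86(data: bytes, preset: int = 6) -> bytes:
    """Compress `data` as .xz with the x86 branch/call/jump filter before LZMA2."""
    filters = [
        {"id": lzma.FILTER_X86},  # rewrites x86 call/jump targets before compression
        {"id": lzma.FILTER_LZMA2, "preset": preset},
    ]
    return lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)


def decompress(blob: bytes) -> bytes:
    """The filter chain is recorded in the .xz stream, so no options are needed."""
    return lzma.decompress(blob)
```

Comparing `len(compress_x86(data))` against `len(lzma.compress(data))` on a real binary would give a first idea of how much the filter buys on that input.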
Note that by using bsdiff we might be looking at closer to 1.5MB a push, at the cost of the complexity and runtime of applying a chain of bsdiff patches.
The github folks got back to me and said that

a) They can't support a repository this large, and
b) They don't recommend trying it: "Git's not really optimized to handle that volume of data -- it's geared primarily toward small text files and works best with repositories measured in KB or MB rather than GB."

I don't immediately believe (b) -- all of the concrete complaints I've seen about git perf have to do mainly with repositories with many files in a given cset or with extremely deep histories, neither of which would be the case here -- but (a) is certainly an answer.
FWIW courgette on Linux fails the same way as courgette on Mac.
I'm becoming increasingly convinced git isn't going to work for our needs here.  git for example didn't do a good job compressing files when I added each build as a leaf commit off an empty initial commit, even though I think that's the optimal layout from the perspective of using the repo.

I tried bup [1], which seems saner and actually built for the sort of thing we're doing.  It has a FUSE plugin, so we might be able to use that relatively transparently.  Builds take up ~5MB each with bup.

But anyway, we should close this bug.  I tried!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
(In reply to Justin Lebar [:jlebar] from comment #21)
> I'm becoming increasingly convinced git isn't going to work for our needs
> here.  git for example didn't do a good job compressing files when I added
> each build as a leaf commit off an empty initial commit, even though I think
> that's the optimal layout from the perspective of using the repo.

How would that be optimal? it would make bisect useless.
> How would that be optimal? it would make bisect useless.

Well, I realized a few things.

 * When looking at the builds on the FTP server, I don't have the topological ordering of those builds.  I don't want to make the mistake that graphserver makes and assume that the timestamps correspond to topological order.

 * You sometimes don't want to bisect -- sometimes you just want to check out a particular old build.

 * I haven't tested this, but aiui, git is good at letting you check out all of a particular branch, but not so good at letting you check out part of a branch and then bring in the rest later.

 * We might want to shard the git repository across multiple machines.

So my thinking was that if each build was a leaf, the git repository would be exactly as full-featured as the FTP server.  You'd then need an external tool to bisect.
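The external bisection tool mentioned above could be as small as a binary search over a list of build identifiers. This is a sketch only: it assumes the caller supplies the builds already in topological (push) order, which is exactly the information the FTP server does not provide, and that the first build is good and the last is bad. `first_bad_build` and `is_bad` are hypothetical names.

```python
from typing import Callable, Sequence


def first_bad_build(builds: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first bad build.

    Assumes `builds` is in topological order, builds[0] is good,
    builds[-1] is bad, and every build after the first bad one is bad.
    """
    lo, hi = 0, len(builds) - 1  # invariant: builds[lo] good, builds[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):  # e.g. check out that leaf and run a test
            hi = mid
        else:
            lo = mid
    return builds[hi]
```

Each call to `is_bad` would check out one leaf commit and test it, so a range of N builds needs about log2(N) checkouts.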
Coincidentally, we looked at doing almost exactly the same thing for storing the tests.zip archives. These are produced by every build and contain nearly identical information. We came to many of the same conclusions. Some other observations:

- if you can treat the commits as linear rather than a bunch of individual leaf nodes, then you get much better space savings

- in the case of independent leaf nodes, git still has to transfer objects even if the remote repository already has them. It seems like the only way you can avoid transferring objects is if your parent's tree already has them. We were hoping to avoid transfer costs as well

- garbage collecting repositories like this is really painful