Closed Bug 1050915 Opened 10 years ago Closed 8 years ago

Find better tools for repo management

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: dustin, Unassigned)

References

Details

(Whiteboard: [relsec])

Attachments

(1 file)

We've variously discussed:

* git-annex -- to allow multiple orgs to collaborate more easily, plus rollbacks

* aptly -- new and sounds really handy, especially for finding dependencies (example workflow below)

* dpkg-scanpackages -- may be a replacement for apt-ftparchive that can easily be installed on CentOS
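If we went the aptly route, the workflow would presumably look something like this (untested here; the repo name is made up for illustration, and unsigned publishing is just to keep the sketch short):

    ## import the existing .debs into an aptly-managed repo and publish it
    $ aptly repo create -distribution=precise-updates -component=all mig-agent
    $ aptly repo add mig-agent /data/repos/apt/custom/mig-agent/
    $ aptly publish repo -skip-signing mig-agent
    $ aptly serve    ## or point Apache at the published tree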
I tested dpkg-scanpackages on relabs and it seems to work as expected.

    $ sudo yum install dpkg dpkg-devel
    $ sudo -s
    # cd /data/repos/apt/custom/mig-agent/
    ## start fresh: throw away any previously generated metadata
    # rm -rf /data/repos/apt/custom/mig-agent/dists
    ## build a Packages index per dist/arch from the .debs in the repo
    # for arch in i686 amd64; do
          for dist in precise trusty; do
              mkdir -p dists/${dist}-updates/all/binary-$arch
              /usr/bin/dpkg-scanpackages --arch $arch . > dists/${dist}-updates/all/binary-$arch/Packages
              bzip2 < dists/${dist}-updates/all/binary-$arch/Packages > dists/${dist}-updates/all/binary-$arch/Packages.bz2
          done
      done

The resulting Packages files look legit, but I haven't tested the repositories yet.
    # find /data/repos/apt/custom/mig-agent/ -name Packages
    /data/repos/apt/custom/mig-agent/dists/trusty-updates/all/binary-i686/Packages
    /data/repos/apt/custom/mig-agent/dists/trusty-updates/all/binary-amd64/Packages
    /data/repos/apt/custom/mig-agent/dists/precise-updates/all/binary-i686/Packages
    /data/repos/apt/custom/mig-agent/dists/precise-updates/all/binary-amd64/Packages
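
Untested, but checking one of these from a precise or trusty client should just be a matter of pointing apt at it (the URL is a placeholder; note that dpkg-scanpackages doesn't generate a Release file, so newer apt versions may warn about that):

    $ echo 'deb http://repos.example.com/apt/custom/mig-agent precise-updates all' \
          | sudo tee /etc/apt/sources.list.d/mig-agent.list
    $ sudo apt-get update
    $ apt-cache policy mig-agent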

It looks like this wouldn't be too hard to wrap into a script and automate.
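Something like this, perhaps -- a sketch only; the script name and the repo-directory argument are my invention:

    #!/bin/bash
    # update-apt-repo.sh -- regenerate dists/ for one custom apt repo
    # usage: update-apt-repo.sh /data/repos/apt/custom/mig-agent
    set -e
    cd "${1:?usage: $0 <repo-dir>}"
    rm -rf dists
    for arch in i686 amd64; do
        for dist in precise trusty; do
            d=dists/${dist}-updates/all/binary-$arch
            mkdir -p "$d"
            dpkg-scanpackages --arch "$arch" . > "$d/Packages"
            bzip2 < "$d/Packages" > "$d/Packages.bz2"
        done
    done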
From a read of the walkthrough, git-annex seems perfect for our uses.  A few random thoughts:

* use of symlinks might be a problem, but if Apache can serve them we're OK.  Otherwise, "direct mode" might help (see the sketch after this list).

* factor out some common functionality between the current SSL sync scripts and the annex syncs

* sync each data unit separately (bmm, python, repos), and also sync each one's private/ on its own

* find some way to allow read-only clones and updates for external users, ideally via static HTTP, or at least allowing file access over static HTTP.  Those users could still push changes upstream by submitting patches for review.

* regular fscks would be good (also sketched below)
On the "con" side, git-annex is written in Haskell and has a bajillion dependencies that aren't in CentOS 6.5.
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/316]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/316] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/707] [kanban:engops:https://kanbanize.com/ctrl_board/6/316]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/707] [kanban:engops:https://kanbanize.com/ctrl_board/6/316] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/707]
Attached patch bug1050915.patch (Splinter Review)
This just installs git-annex, but doesn't use it yet.  The process for updating the repo is in /data/repos/yum/custom/git-annex/update.sh.
Attachment #8533212 - Flags: review?(bugspam.Callek)
Comment on attachment 8533212 [details] [diff] [review]
bug1050915.patch

Hm, that version seems really broken, though.  And I can't get the version in EPEL 7 to install without installing CentOS 7.  So hold off on the review.
Attachment #8533212 - Flags: review?(bugspam.Callek)
Here's the error:

[root@relabs-puppet2.relabs.releng.scl3.mozilla.com custom]# git init
Initialized empty Git repository in /tmp/git-annex-stuff/custom/.git/

## use fqdn for the name
[root@relabs-puppet2.relabs.releng.scl3.mozilla.com custom]# git annex init $(facter fqdn)
init relabs-puppet2.relabs.releng.scl3.mozilla.com ok
(Recording state in git...)

## then added and committed existing files
git annex add .
git commit -am "old files"

## NOTE: using 'git add' is BAD -- a 'git annex add' afterward will not tell you what happened.

[root@relabs-puppet2.relabs.releng.scl3.mozilla.com git-annex-stuff]# git clone custom custom2
Cloning into 'custom2'...
done.
[root@relabs-puppet2.relabs.releng.scl3.mozilla.com git-annex-stuff]# cd custom2/
[root@relabs-puppet2.relabs.releng.scl3.mozilla.com custom2]# git annex init custom2
init custom2 ok
(Recording state in git...)
[root@relabs-puppet2.relabs.releng.scl3.mozilla.com custom2]# git remote add custom ../custom
[root@relabs-puppet2.relabs.releng.scl3.mozilla.com custom2]# cd ../custom
[root@relabs-puppet2.relabs.releng.scl3.mozilla.com custom]# git remote add custom2 ../custom2/

[root@relabs-puppet2.relabs.releng.scl3.mozilla.com custom]# cp -r /data/repos/yum/releng/public/CentOS/ .
[root@relabs-puppet2.relabs.releng.scl3.mozilla.com custom]# git annex add CentOS/
[root@relabs-puppet2.relabs.releng.scl3.mozilla.com custom]# git commit -am "added centos"

[root@relabs-puppet2.relabs.releng.scl3.mozilla.com custom2]# git annex sync custom
## seems to do a 'git pull', but doesn't fill in the missing files (-> broken repo)
## and...

[root@relabs-puppet2.relabs.releng.scl3.mozilla.com custom2]# git annex get .
fatal: Could not switch to '../../../.git/annex/objects/qQ/wV/SHA256-s49163962--3c2221750d5551d9a4fa1596ff44beb61bc586a23d0013b39cf71d625ef85750': No such file or directory

git-annex: <file descriptor: 4>: hGetLine: end of file
failed
fatal: Could not switch to '../../../.git/annex/objects/p0/8J/SHA256-s94724465--c04218fb1ac8c291583fed821292e39f97c1740d213cf50f41d73475f868e577': No such file or directory

## I see the same thing trying to follow the git-annex walkthrough


I can't seem to work around it, even using actual remotes on different machines.  I'm sure an upgrade would help (from 3.mumble to 5.mumble), but would require upgrading all masters to CentOS 7.  That's a bit too much yak-fur for this project.  I'll revisit.
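For the record, my understanding is that even on 5.x a plain 'git annex sync' only moves git metadata, and the content transfer has to be explicit, something like (unverified on these hosts):

    $ git annex sync --content    ## push/pull git branches *and* file contents
    ## or, after a metadata-only sync:
    $ git annex get .             ## fetch contents from any reachable remote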
Depends on: 1109720
Assignee: dustin → relops
I have a GitHub LFS trial.  I think that'd be a good fit here, although we'd need to think about how to host private vs. public files differently -- we want to be able to collaborate on the private stuff, but with less distribution than the public stuff.
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/707] → [relsec]
I played with my git-lfs trial a little today.  It seems very smooth!  It basically just operates as an alternative, pull-only-when-needed storage for some files.

The downside is, it seems to store files twice, rather than hard-linking them:
dustin@hopper ~/tmp/lfstest [master] $ ls -ali ./.git/lfs/objects/e1/bb/e1bb9f97d99f1a9156b0facd61b260c3b7efe7245881852e6452e430bfbe7d5b Faksness\ et\ al\ 2015\ toxicity\ of\ Macondo\ WAF.pdf 
3145769 -rw-rw-r--. 1 dustin dustin 1122243 Jul 17 09:52 Faksness et al 2015 toxicity of Macondo WAF.pdf
3145784 -rw-------. 1 dustin dustin 1122243 Jul 17 09:53 ./.git/lfs/objects/e1/bb/e1bb9f97d99f1a9156b0facd61b260c3b7efe7245881852e6452e430bfbe7d5b

(a random PDF from my test repo).  This would double our required storage capacity!

GitHub has a "reference server" which they don't recommend using in production; they also offer the hosted service for something around $50/mo at our usage levels.

Rob, I know you were interested in something like this for nuget.  What do you think about doing the same for all of our package repositories?  It would require moving from hg to git (sounds fine to me) and solving issues around space usage, dev checkouts, cost, etc.
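
For anyone who wants to reproduce the test, the client side is just the standard git-lfs setup (the tracked patterns below are only examples):

    $ git lfs install                            ## one-time, per machine
    $ git init repos && cd repos
    $ git lfs track '*.rpm' '*.deb' '*.nupkg'    ## patterns land in .gitattributes
    $ git add .gitattributes
    $ git add . && git commit -m "import package pool"
    $ git push origin master                     ## blobs go to the LFS endpoint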
Flags: needinfo?(rthijssen)
I like the idea. Is this what you meant by reference server: https://github.com/github/lfs-test-server ? Looks promising. If so, I'd be up for extending the Go implementation to support NuGet & other types of repository feeds. Could be awesome...
Flags: needinfo?(rthijssen)
Yes, that's the ref server I was referring to.

I don't think the LFS client and server would need to know anything about nuget any more than regular git understands Python -- they just schlep files around.
You're right that they don't need to know. I certainly wouldn't do anything to the client; it's just that it occurred to me while testing out the server that if it's hosting both Chocolatey packages (.nupkg files) and the MSIs they reference, it makes sense to extend the interface to also serve up package metadata, so that it can operate as a full-fledged NuGet repository.

The git client doesn't need to know or care, but when the server detects a .nupkg upload, it could be modified to update its NuGet Packages manifest. If that worked well, other repository types could also be implemented. It's just generated JSON or XML manifests on top of the file hosting, and the server suddenly becomes much more powerful by adding repository-server to its existing file-server powers.
The server is a *git* (well, *git-LFS*) server, not an HTTP server.  This tool would only get the files distributed to all of the puppetmasters, where Apache would serve them to clients.  It would replace the rsyncs we have right now.
i guess i'm not articulating my thoughts properly. the git lfs server go implementation is just an https server. afaics the git lfs client just pushes to (and pulls from) the git lfs https endpoints. the large files are all available from https urls directly. you don't need the lfs git client to access them. you can access them directly. this kind of negates the need to also serve them from apache.
i envisaged a scenario where you used the git lfs client to get files into git lfs, then used the git lfs https endpoints to access the large files directly, without apache.
if you then extended the git lfs server implementation to also serve manifest files in the format (xml, json, etc) dictated by the various repositories (apt, yum, nuget, etc), you could turn the git-lfs-server implementation into a much more powerful repository-server, that happens to also be a git-lfs endpoint.
Hm, interesting idea, but I see a few issues:

1. We'd need to set up a git-lfs server that is redundant across regions with automatic failover, and can handle a pretty substantial load.  The 'mock' builds, in particular, pound the crap out of our yum repos.

2. Does the git-lfs server provide access to files by the filename within the corresponding git repository (which it doesn't even host)?  And automatically provide those from the latest commit at a particular ref ("master" or "production" or something)?

If we could somehow solve issue 1 by using S3 to store the artifacts, referencing them by hash, and then just hosting the indexes on the puppetmasters, we could probably get away without using git-lfs at all (the indexes could be checked in).
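A rough sketch of that, with a made-up bucket name (assumes aws-cli on the masters):

    ## upload each artifact under its sha256, skipping any already present
    $ for f in *.rpm; do
          h=$(sha256sum "$f" | cut -d' ' -f1)
          aws s3 ls "s3://releng-package-store/sha256/$h" >/dev/null \
              || aws s3 cp "$f" "s3://releng-package-store/sha256/$h"
      done
    ## the name->hash index is tiny, so it could be checked into the repo
    $ sha256sum *.rpm > INDEX && git add INDEX && git commit -m "update index"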
Unfortunately we don't have the cycles to work on this.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX