evaluate "cloud" storage systems as a replacement for releng's filestore needs

RESOLVED FIXED

Status

Infrastructure & Operations
RelOps
6 years ago
5 years ago

People

(Reporter: bear, Assigned: dustin)

Description

6 years ago
We have more questions than info right now about how to make a proper request to IT to "fix" our file storage system.

One of the issues is that we have a couple different buckets of file content that have their own usage patterns - some of which are NAS unfriendly from what I understand.

One thought that has come up a couple of times now in our IRC conversations is creating an S3-like backend to store files.

I propose that we set up a multi-server environment using something like OpenStack's Swift (http://openstack.org/projects/storage/) and then send it a load that is at least similar to what we currently have.

This will let us get a feel for whether we can consider it for real.
Adding gozer as the resident openstack guru (relatively speaking).

The use case here is that build slaves upload firefox packages (and some other stuff), and then test slaves download them to test them.  The read:write ratio is somewhere around 20:1.  The build slaves often aren't in the same colo as the test slaves.
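
As a rough illustration of the "similar load" idea, the sketch below builds a synthetic operation schedule with the 20:1 read:write ratio described above. The package names and the PUT/GET labels are hypothetical; a real test would replay these operations against whichever storage backend is under evaluation.

```python
READS_PER_WRITE = 20  # approximate read:write ratio from the description


def build_schedule(package_names, reads_per_write=READS_PER_WRITE):
    """Return a list of (operation, key) tuples.

    Each package is uploaded once (by a build machine) and then
    downloaded reads_per_write times (by test machines).
    """
    schedule = []
    for name in package_names:
        schedule.append(("PUT", name))
        schedule.extend(("GET", name) for _ in range(reads_per_write))
    return schedule


if __name__ == "__main__":
    # Hypothetical package names, for illustration only.
    packages = ["firefox-build-1.tar.bz2", "firefox-build-2.tar.bz2"]
    ops = build_schedule(packages)
    puts = sum(1 for op, _ in ops if op == "PUT")
    gets = sum(1 for op, _ in ops if op == "GET")
    print(f"{puts} uploads, {gets} downloads")  # 2 uploads, 40 downloads
```

A more realistic generator would interleave uploads and downloads and issue them from different colos, since the build and test machines are often not co-located.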

A few things we're looking for:

 - "infinite" scalability by adding nodes
 - resiliency to failure (I don't want to buy RAID for these)
 - HTTP or FTP access (and ssl for auth)
 - access control (read will be open, write must be securely gated to protect releases)
 - WAN compatibility and data locality to a WAN
 - public access, at least via a gateway, so users can download packages

nice-to-have:
 - ability to dynamically adjust redundancy per file
(devs like to have older builds around, but it's OK to lose those if a drive dies; more recent builds need to survive failures)

We have spare HP hardware in scl1 on which we can play with this stuff -- in fact, gozer, if you want to team up for experimentation on that hardware, that's fine.
Assignee: server-ops-releng → dustin
There is also:
 Eucalyptus's Walrus
 Danga's MogileFS
Nick, you're the designated storage guy :)
We're crossing the streams a bit and gozer is testing some things that he will use that are most likely not going to be useful for releng.  But we'll track it all here for simplicity.

Action:
  - put 6 boxes in temporary VLAN in scl1, with admin host access, for nimbula, hand off to gozer (dustin)
    - can add gateway later with netops help

  - kickstart 3 CentOS 6 (or RHEL 5.x?) hosts for Eucalyptus (dustin, maybe matt)

  - set up a non-HA mogilefs cluster on 2-3 nodes (dustin to move hosts to vlan75; gozer to set up)

  - CD installs of swift on some boxes via iLO (dustin to move to vlan75; gozer to set up)
Update from gozer:

all 6 hosts (3 eucalyptus / 3 mogile) are kickstarted and waiting for my love

nimbula is building itself
Turns out nimbula only includes AMI storage (i.e. machine images), so it's not suitable for this project's storage needs.
I now have a 3-node mogilefs cluster up and running, and I'm injecting all of TB's try builds into it for testing. Does anybody else want to test/play with this?

Checking devices...
  host device         size(G)    used(G)    free(G)   use%   ob state   I/O%
  ---- ------------ ---------- ---------- ---------- ------ ---------- -----
  [ 1] dev1           225.196     15.751    198.006  12.07%  writeable  51.7
  [ 2] dev2           225.196      2.891    210.866   6.36%  writeable   0.0
  [ 3] dev3           225.196      2.783    210.974   6.32%  writeable   0.0
  ---- ------------ ---------- ---------- ---------- ------
             total:   675.588     21.424    619.846   8.25%
Eucalyptus's Walrus is S3-compatible, but it runs on a single node with no redundancy, which unfortunately makes it unsuitable for this project.

"The features that you mentioned are not currently available in the open source version of Eucalyptus. HA and replication will very likely be enterprise add-ons."
Ceph (http://ceph.newdream.net/) might fit the bill, I'll probably try and set that up as well.
In case anybody wants to play with the MogileFS setup, the necessary info is:

 trackers=10.12.52.3:6001,10.12.52.5:6001,10.12.52.6:6001
 domain = try

As in

$>  mogtool --domain=try --trackers=10.12.52.3:6001,10.12.52.5:6001,10.12.52.6:6001 listkey
/thunderibrd/try/bienvenu@nventure.com-1e319712b6f7/try-comm-central-linux-debug/jsshell-linux-i686.zip
/thunderibrd/try/bienvenu@nventure.com-1e319712b6f7/try-comm-central-linux-debug/thunderbird-11.0a1.en-US.langpack.xpi
/thunderibrd/try/bienvenu@nventure.com-1e319712b6f7/try-comm-central-linux-debug/thunderbird-11.0a1.en-US.linux-i686.checksums
[...]
#258 files found
thoughts on mogilefs:

 - "infinite" scalability by adding nodes
   check

 - resiliency to failure (I don't want to buy RAID for these)
   check

 - HTTP or FTP access (and ssl for auth)
   HTTP access, yes, but requires a client to talk to trackers

 - access control (read will be open, write must be securely gated to protect releases)
   This can be configured using mod_ssl and whatever auth we want (e.g., client certs)

 - WAN compatibility and data locality to a WAN
   possible:
   http://lists.danga.com/pipermail/mogilefs/2007-August/001182.html

 - public access, at least via a gateway, so users can download packages
   Sort of - we would have to build a frontend of some sort to proxy this, but that shouldn't be too hard -- it would be vaguely similar to bouncer, along with a proxy.

 - ability to dynamically adjust redundancy per file
   Sort of - we can set redundancy per class, but AFAIK you can't change an object's class after it's created.  So we can set redundancy differently for e.g., releases and try builds, but can't adjust "old" tinderbox builds to have less redundancy than new.
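
A minimal sketch of what "redundancy per class" could look like from the uploader's side, assuming the class is picked from the key's path when the file is stored. The class names, replica counts, and path segments below are hypothetical; MogileFS expresses the replica count as a class's minimum device count (mindevcount).

```python
# Hypothetical classes and their replica counts (mindevcount in MogileFS terms).
CLASS_REPLICAS = {"release": 3, "try": 1, "default": 2}

# Hypothetical mapping from a path segment in the key to a class name.
SEGMENT_TO_CLASS = {"releases": "release", "try": "try"}


def class_for_key(key):
    """Pick a storage class from the first recognised path segment of the key."""
    for segment in key.strip("/").split("/"):
        if segment in SEGMENT_TO_CLASS:
            return SEGMENT_TO_CLASS[segment]
    return "default"


def replicas_for_key(key):
    """Return how many copies a new object stored under this key would get."""
    return CLASS_REPLICAS[class_for_key(key)]


if __name__ == "__main__":
    for key in ("/thunderbird/try/someone-abc123/build.zip",
                "/firefox/releases/10.0/firefox-10.0.tar.bz2"):
        print(key, class_for_key(key), replicas_for_key(key))
```

Because the class is chosen at upload time in this sketch, it covers the releases-versus-try split but not the "reduce redundancy as builds age" case.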


Summary: This meets the requirements, so I think we should consider it.  However, deploying it will require more work than just changing the access method used in the build and test scripts: we would need a client, an auth setup, a WAN-aware replication strategy, and a custom frontend.
As a general status for this bug, specifically regarding releng's needs:

Nimbula - rejected (only AMI storage)
Eucalyptus Walrus - rejected (not redundant)
MogileFS - acceptable
Ceph - not tried yet
OpenStack - not desirable
glusterfs - suggested in #infra yesterday

(OpenStack is quite new and not mature, seems designed-by-committee, and doesn't have an S3-compatible interface.  Using Swift would probably require pulling in a number of other OpenStack components)
Ceph cluster up and running; POSIX semantics are a really nice feature.

An Amazon S3-compatible HTTP front-end is up and running as well.

Poke me for mount information or s3 credentials.
As a note about glusterfs, this is currently used on cn-web*.cn, and occasionally fouls up and requires a reboot.  I'll log in and look at how it's set up now, but this has me worried.
OpenStack up and running, including Swift (storage). It's a very different kind of storage with its own protocol, so testing it requires the swift client.
Turns out there is an s3 compatible middleware for Swift, and it seems to be working just fine.
I promised I'd decide on Monday whether we'd try to do this de novo in scl3.  The answer is "no" - the build artifact storage migration is already complex, and critical, so let's stick to the "replicate with fewer bugs" plan, rather than doing something new.

That doesn't mean this bug should stop, and it's already assigned to me, as it's on my TODO to evaluate the most recent options and write them up a la comment 11.
OK, I've delayed this *way* too long.  I'd like to have a look at ceph's and openstack's S3 interfaces a bit.  Gozer, can we meet up to look at that tomorrow morning?
thoughts on Ceph:

I think that we would set this up so that access by build machines was done via the S3 interface, with the filesystem module only used (read-only) to serve files via HTTP/FTP/rsync, do virus scanning, etc.

I played around with this with a hacked version of s3cmd.  I had to use a bucket with a '_' in it to stop the client from trying to use DNS subdomains, but that's a client issue and could easily be fixed.
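
For context on the '_' workaround: S3 clients commonly address a bucket either as a DNS subdomain of the endpoint or as the first path component, and fall back to the path form when the bucket name is not a valid hostname label. The sketch below illustrates that decision with a simplified DNS check; the endpoint name is hypothetical and this is not s3cmd's actual code.

```python
import re

# Simplified DNS-compatibility check (an assumption, not the full S3 rules):
# 3-63 characters, lowercase letters, digits and hyphens, starting and
# ending with a letter or digit.
_DNS_SAFE = re.compile(r"^[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$")


def object_url(endpoint, bucket, key, scheme="http"):
    """Build a subdomain-style URL when the bucket name is DNS-safe,
    otherwise a path-style URL."""
    if _DNS_SAFE.match(bucket):
        return f"{scheme}://{bucket}.{endpoint}/{key}"
    return f"{scheme}://{endpoint}/{bucket}/{key}"


if __name__ == "__main__":
    endpoint = "s3.example.com"  # hypothetical gateway hostname
    print(object_url(endpoint, "dustin", "ns.pdf"))   # subdomain style
    print(object_url(endpoint, "dustin_", "ns.pdf"))  # path style
```

An underscore is not allowed in a hostname label, so a bucket such as dustin_ forces the path form and avoids the need for wildcard DNS on the test gateway.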

I was not able to upload a "large" file to the cluster from my machine at home -- it failed after ~30 seconds each time:

> dustin@cerf ~/Downloads/s3cmd-1.1.0-beta3 $ ./s3cmd put ~/Downloads/NS_ICG_V2_8.1.pdf s3://dustin_/ns.pdf
> WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
> /Users/dustin/Downloads/NS_ICG_V2_8.1.pdf -> s3://dustin_/ns.pdf  [1 of 1]
>  1736704 of 4498971    38% in   32s    51.64 kB/s  failed
> WARNING: Upload failed: /ns.pdf ([Errno 32] Broken pipe)
> WARNING: Retrying on lower speed (throttle=0.00)
> WARNING: Waiting 3 sec...
> /Users/dustin/Downloads/NS_ICG_V2_8.1.pdf -> s3://dustin_/ns.pdf  [1 of 1]
>  1687552 of 4498971    37% in   33s    49.91 kB/s  failed
> WARNING: Upload failed: /ns.pdf ([Errno 32] Broken pipe)
> WARNING: Retrying on lower speed (throttle=0.01)
> WARNING: Waiting 6 sec...
> /Users/dustin/Downloads/NS_ICG_V2_8.1.pdf -> s3://dustin_/ns.pdf  [1 of 1]
>  1728512 of 4498971    38% in   32s    51.40 kB/s  failed
> WARNING: Upload failed: /ns.pdf ([Errno 32] Broken pipe)
> WARNING: Retrying on lower speed (throttle=0.05)
> WARNING: Waiting 9 sec...
> /Users/dustin/Downloads/NS_ICG_V2_8.1.pdf -> s3://dustin_/ns.pdf  [1 of 1]
>  1699840 of 4498971    37% in   32s    50.51 kB/s  failed
> WARNING: Upload failed: /ns.pdf ([Errno 32] Broken pipe)
> WARNING: Retrying on lower speed (throttle=0.25)
> WARNING: Waiting 12 sec...
> /Users/dustin/Downloads/NS_ICG_V2_8.1.pdf -> s3://dustin_/ns.pdf  [1 of 1]
>   626688 of 4498971    13% in   38s    15.97 kB/s  failed
> WARNING: Upload failed: /ns.pdf ([Errno 32] Broken pipe)
> WARNING: Retrying on lower speed (throttle=1.25)
> WARNING: Waiting 15 sec...
> /Users/dustin/Downloads/NS_ICG_V2_8.1.pdf -> s3://dustin_/ns.pdf  [1 of 1]
>   233472 of 4498971     5% in   70s     3.25 kB/s  failed
> ERROR: Upload of '/Users/dustin/Downloads/NS_ICG_V2_8.1.pdf' failed too many times. Skipping that file.

I tried doing the same on admin1a, in case this was a network session timeout.  Again, small files were fine, but big files:

> [root@admin1a.infra.scl1 s3cmd-1.1.0-beta3]# ./s3cmd put ../NS_ICG_V2_8.1.pdf s3://dustin_/ns2.pdf
> WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
> ../NS_ICG_V2_8.1.pdf -> s3://dustin_/ns2.pdf  [1 of 1]
>  4498971 of 4498971   100% in    0s     6.76 MB/s  done
> ERROR: syntax error: line 1, column 49
> ERROR: Parameter problem: Bucket contains invalid filenames. Please run: s3cmd fixbucket s3://your-bucket/

I couldn't find a way around that.  Gozer's investigating.

 - "infinite" scalability by adding nodes
   check

 - resiliency to failure (I don't want to buy RAID for these)
   check

 - HTTP or FTP access (and ssl for auth)
   By mounting the ceph fs on http/ftp servers, yes

 - access control (read will be open, write must be securely gated to protect releases)
   Yes, using IP access controls like the netapps for filesystems, and S3 permissions for writes

 - WAN compatibility and data locality to a WAN
   This is less important now that we're envisioning moving all of releng to scl3.  I don't know if Ceph supports it.

 - public access, at least via a gateway, so users can download packages
   Yep (HTTP/FTP above)

 - ability to dynamically adjust redundancy per file
   Not per file; can do this per-pool, where there's really only one pool by default, with N=2.  CRUSH can be used to make topologically sensitive allocation decisions, but doesn't change that N.  I can't find any docs about how to change those settings, which is sad -- we would want N=1 for tinderbox builds, and probably N=3 for nightlies.  I think "PG" is related, but I don't see any definition for that abbreviation.
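
To make the per-pool N trade-off concrete, here is a small arithmetic sketch comparing raw disk consumption under a single N=2 pool with separate pools at N=1 and N=3. The pool names and data sizes are hypothetical; only the N values come from the comment above.

```python
def raw_gb_needed(logical_gb_by_pool, replicas_by_pool):
    """Raw disk consumed when each pool keeps N full copies of its data."""
    return sum(size * replicas_by_pool[pool]
               for pool, size in logical_gb_by_pool.items())


if __name__ == "__main__":
    # Hypothetical logical data sizes in GB.
    data = {"tinderbox": 400, "nightly": 100}

    # Everything at the default replication count, N=2.
    single_n = raw_gb_needed(data, {"tinderbox": 2, "nightly": 2})

    # Separate pools: N=1 for tinderbox builds, N=3 for nightlies.
    split_n = raw_gb_needed(data, {"tinderbox": 1, "nightly": 3})

    print(single_n, split_n)  # 1000 700
```

With these made-up sizes, the split layout uses less raw disk while giving the nightlies an extra copy, at the cost of losing tinderbox builds whenever a drive dies.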

Summary: I think this is a better option than mogilefs, as it would require less custom development.  I assume that the upload issues are a misconfig on my end or in Ceph, and can be fixed.  The lack of any real documentation is disturbing, though.
Ah, these are 500 ISE's:

[Thu Feb 23 10:50:54 2012] [warn] [client 10.12.48.10] mod_fcgid: HTTP request length 135168 (so far) exceeds MaxRequestLen (131072)

Fixing MaxRequestLen (to 5GB) straightened this right out.
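
The numbers in that warning also explain why small files uploaded fine while the PDF did not. The sketch below compares the sizes from the transcripts above with the old and new limits; reading "5GB" as 5 GiB is an assumption.

```python
OLD_MAX_REQUEST_LEN = 131072         # bytes, from the mod_fcgid warning (128 KiB)
NEW_MAX_REQUEST_LEN = 5 * 1024 ** 3  # "5GB" read as 5 GiB (an assumption)

PDF_SIZE = 4498971                   # bytes, from the s3cmd transcript


def upload_allowed(size_bytes, max_request_len):
    """True if a request body of this size fits within the configured limit."""
    return size_bytes <= max_request_len


if __name__ == "__main__":
    print(upload_allowed(PDF_SIZE, OLD_MAX_REQUEST_LEN))  # False
    print(upload_allowed(PDF_SIZE, NEW_MAX_REQUEST_LEN))  # True
    print(NEW_MAX_REQUEST_LEN)                            # 5368709120
```

Any single-request upload larger than the new limit would hit the same error, so very large artifacts would need either a higher limit or a multipart upload.
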
From what Gozer's said, OpenStack/Swift isn't quite ready for prime time yet -- in fact, the test install had failed when he went to look at it.  So I think this evaluation comes down to Ceph being a good option.

None of this will happen until we're out of sjc1, so there's nothing much more to do in this bug.  I'm happy to toss around ideas in IRC until then.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
GlusterFS was also a candidate and didn't get looked at. It might be worth a look as an alternative to Ceph.
I rejected it, perhaps unfairly, based on comment 14.  I'm leery of stuff that has kernel modules *and* crashes a lot, since that tends to tear down a lot of associated things with it.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations