Closed Bug 715415 Opened 13 years ago Closed 12 years ago

evaluate "cloud" storage systems as a replacement for releng's filestore needs

Categories

(Infrastructure & Operations :: RelOps: General, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bear, Assigned: dustin)

Details

We have more questions than info right now about how to make a proper request to IT to "fix" our file storage system.

One of the issues is that we have a couple of different buckets of file content, each with its own usage pattern - some of which are NAS-unfriendly, from what I understand.

One thought that has come up a couple of times now in our IRC conversations is creating an S3-like backend to store files.

I propose that we set up a multi-server environment using something like OpenStack's Swift (http://openstack.org/projects/storage/) and then, at a minimum, send it a load similar to what we currently handle.

This will let us get a feel for whether we can seriously consider it.
Adding gozer as the resident openstack guru (relatively speaking).

The use case here is that build slaves upload firefox packages (and some other stuff), and then test slaves download them to test them.  The read:write ratio is somewhere around 20:1.  The build slaves often aren't in the same colo as the test slaves.
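
To make the access pattern concrete, here's a rough sketch of what it could look like against an S3-style store (endpoint, bucket, and filenames below are hypothetical, just for illustration):

 # build slave: upload the package once
 s3cmd put firefox-11.0a1.en-US.linux-i686.tar.bz2 s3://builds/mozilla-central/firefox-11.0a1.en-US.linux-i686.tar.bz2

 # each test slave: pull it down (this side dominates, roughly 20 reads per write)
 s3cmd get s3://builds/mozilla-central/firefox-11.0a1.en-US.linux-i686.tar.bz2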

A few things we're looking for:

 - "infinite" scalability by adding nodes
 - resiliency to failure (I don't want to buy RAID for these)
 - HTTP or FTP access (and ssl for auth)
 - access control (read will be open, write must be securely gated to protect releases)
 - WAN compatibility and data locality to a WAN
 - public access, at least via a gateway, so users can download packages

nice-to-have:
 - ability to dynamically adjust redundancy per file
(devs like to have older builds around, but it's OK to lose those if a drive dies; more recent builds need to survive failures)

We have spare HP hardware in scl1 on which we can play with this stuff -- in fact, gozer, if you want to team up for experimentation on that hardware, that's fine.
Assignee: server-ops-releng → dustin
There is also:
 Eucalyptus's Walrus
 Danga's MogileFS
Nick, you're the designated storage guy :)
We're crossing the streams a bit: gozer is also testing some things for his own use that most likely won't be useful for releng.  But we'll track it all here for simplicity.

Action:
  - put 6 boxes in temporary VLAN in scl1, with admin host access, for nimbula, hand off to gozer (dustin)
    - can add gateway later with netops help

  - kickstart 3 centos 6 (or rhel5.x?) hosts for eucalyptus (dustin, maybe matt)

  - set up a non-HA mogilefs cluster on 2-3 nodes (dustin to move hosts to vlan75; gozer to set up)

  - CD installs of swift on some boxes via iLO (dustin to move to vlan75; gozer to set up)
Update from gozer:

all 6 hosts (3 eucalyptus / 3 mogile) are kickstarted and waiting for my love

nimbula is building itself
Turns out nimbula only includes AMI storage (i.e. machine images), so it's not suitable for this project's storage needs.
I now have a 3-node mogilefs cluster up and running and am injecting all of TB's try builds into it for testing. Anybody else want to test/play with this?

Checking devices...
  host device         size(G)    used(G)    free(G)   use%   ob state   I/O%
  ---- ------------ ---------- ---------- ---------- ------ ---------- -----
  [ 1] dev1           225.196     15.751    198.006  12.07%  writeable  51.7
  [ 2] dev2           225.196      2.891    210.866   6.36%  writeable   0.0
  [ 3] dev3           225.196      2.783    210.974   6.32%  writeable   0.0
  ---- ------------ ---------- ---------- ---------- ------
             total:   675.588     21.424    619.846   8.25%
Eucalyptus's Walrus is S3-compatible, but it has no redundancy/reliability and operates on a single node, which unfortunately makes it unsuitable for this project.

"The features that you mentioned are not currently available in the open source version of Eucalyptus. HA and replication will very likely be enterprise add-ons."
Ceph (http://ceph.newdream.net/) might fit the bill, I'll probably try and set that up as well.
In case anybody wants to play with the MogileFS setup, the necessary info is:

 trackers=10.12.52.3:6001,10.12.52.5:6001,10.12.52.6:6001
 domain = try

As in

$>  mogtool --domain=try --trackers=10.12.52.3:6001,10.12.52.5:6001,10.12.52.6:6001 listkey
/thunderibrd/try/bienvenu@nventure.com-1e319712b6f7/try-comm-central-linux-debug/jsshell-linux-i686.zip
/thunderibrd/try/bienvenu@nventure.com-1e319712b6f7/try-comm-central-linux-debug/thunderbird-11.0a1.en-US.langpack.xpi
/thunderibrd/try/bienvenu@nventure.com-1e319712b6f7/try-comm-central-linux-debug/thunderbird-11.0a1.en-US.linux-i686.checksums
[...]
#258 files found
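
Pulling a file back out works the same way, assuming mogtool's extract subcommand (key taken from the listing above; the output path is just an example):

 $>  mogtool --domain=try --trackers=10.12.52.3:6001,10.12.52.5:6001,10.12.52.6:6001 extract /thunderibrd/try/bienvenu@nventure.com-1e319712b6f7/try-comm-central-linux-debug/jsshell-linux-i686.zip ./jsshell-linux-i686.zip
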
thoughts on mogilefs:

 - "infinite" scalability by adding nodes
   check

 - resiliency to failure (I don't want to buy RAID for these)
   check

 - HTTP or FTP access (and ssl for auth)
   HTTP access, yes, but requires a client to talk to trackers

 - access control (read will be open, write must be securely gated to protect releases)
   This can be configured using mod_ssl and whatever auth we want (e.g., client certs); a rough config sketch follows after this list.

 - WAN compatibility and data locality to a WAN
   possible:
   http://lists.danga.com/pipermail/mogilefs/2007-August/001182.html

 - public access, at least via a gateway, so users can download packages
   Sort of - we would have to build a frontend of some sort to proxy this, but that shouldn't be too hard -- it would be vaguely similar to bouncer, along with a proxy.

 - ability to dynamically adjust redundancy per file
   Sort of - we can set redundancy per class, but AFAIK you can't change an object's class after it's created.  So we can set redundancy differently for e.g., releases and try builds, but can't adjust "old" tinderbox builds to have less redundancy than new.
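
A rough sketch of the write gating mentioned above, using standard mod_ssl directives (hostnames, paths, and the CA file are placeholders, not anything that exists today):

 # upload vhost only -- reads would be served from a separate, open vhost
 SSLEngine on
 SSLCertificateFile    /etc/pki/tls/certs/upload-host.crt
 SSLCertificateKeyFile /etc/pki/tls/private/upload-host.key
 SSLCACertificateFile  /etc/pki/tls/certs/releng-upload-ca.crt

 <Location /upload>
     # writes only get through with a client cert signed by our CA
     SSLVerifyClient require
     SSLVerifyDepth  2
 </Location>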


Summary: This meets the requirements, so I think we should consider it.  However, deploying it will require some more work than just changing the access method used in the build and test scripts (need a client; set up auth; WAN-aware replication strategy; and a custom frontend).
As a general status for this bug, specifically regarding releng's needs:

Nimbula - rejected (only AMI storage)
Eucalyptus Walrus - rejected (not redundant)
MogileFS - acceptable
Ceph - not tried yet
OpenStack - not desirable
glusterfs - suggested in #infra yesterday

(OpenStack is quite new and not mature, seems designed-by-committee, and doesn't have an S3-compatible interface.  Using Swift would probably require pulling in a number of other OpenStack components)
Ceph cluster up and running; the POSIX semantics are a really nice feature.

An Amazon S3-compatible HTTP front-end is up and running as well.

Poke me for mount information or s3 credentials.
As a note about glusterfs, this is currently used on cn-web*.cn, and occasionally fouls up and requires a reboot.  I'll login and look at how it's set up now, but this has me worried.
OpenStack is up and running, including Swift (storage). It's a very different beast as storage goes, with its own protocol, so the swift client is needed for testing.
Turns out there is an S3-compatible middleware for Swift, and it seems to be working just fine.
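
For anyone who wants to poke at it, native swift CLI usage looks roughly like this (the auth URL, account, and key are placeholders for whatever the test cluster is configured with):

 swift -A http://<proxy-host>:8080/auth/v1.0 -U test:tester -K testing upload try firefox-11.0a1.en-US.linux-i686.tar.bz2
 swift -A http://<proxy-host>:8080/auth/v1.0 -U test:tester -K testing list try

With the s3 middleware in front, the same s3cmd invocations we've been using against Ceph should work here too.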
I promised I'd decide on Monday whether we'd try to do this from scratch in scl3.  The answer is "no" - the build artifact storage migration is already complex and critical, so let's stick to the "replicate with fewer bugs" plan rather than doing something new.

That doesn't mean this bug should stop, and it's already assigned to me, as it's on my TODO to evaluate the most recent options and write them up a la comment 11.
OK, I've delayed this *way* too long.  I'd like to have a look at ceph's and openstack's S3 interfaces a bit.  Gozer, can we meet up to look at that tomorrow morning?
thoughts on Ceph:

I think that we would set this up so that access by build machines was done via the S3 interface, with the filesystem module only used (read-only) to serve files via HTTP/FTP/rsync, do virus scanning, etc.

I played around with this with a hacked version of s3cmd.  I had to use a bucket with a '_' in it to stop the client from trying to use DNS subdomains, but that's a client issue and could easily be fixed.

I was not able to upload a "large" file to the cluster from my machine at home -- it failed after ~30 seconds each time:

> dustin@cerf ~/Downloads/s3cmd-1.1.0-beta3 $ ./s3cmd put ~/Downloads/NS_ICG_V2_8.1.pdf s3://dustin_/ns.pdf
> WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
> /Users/dustin/Downloads/NS_ICG_V2_8.1.pdf -> s3://dustin_/ns.pdf  [1 of 1]
>  1736704 of 4498971    38% in   32s    51.64 kB/s  failed
> WARNING: Upload failed: /ns.pdf ([Errno 32] Broken pipe)
> WARNING: Retrying on lower speed (throttle=0.00)
> WARNING: Waiting 3 sec...
> /Users/dustin/Downloads/NS_ICG_V2_8.1.pdf -> s3://dustin_/ns.pdf  [1 of 1]
>  1687552 of 4498971    37% in   33s    49.91 kB/s  failed
> WARNING: Upload failed: /ns.pdf ([Errno 32] Broken pipe)
> WARNING: Retrying on lower speed (throttle=0.01)
> WARNING: Waiting 6 sec...
> /Users/dustin/Downloads/NS_ICG_V2_8.1.pdf -> s3://dustin_/ns.pdf  [1 of 1]
>  1728512 of 4498971    38% in   32s    51.40 kB/s  failed
> WARNING: Upload failed: /ns.pdf ([Errno 32] Broken pipe)
> WARNING: Retrying on lower speed (throttle=0.05)
> WARNING: Waiting 9 sec...
> /Users/dustin/Downloads/NS_ICG_V2_8.1.pdf -> s3://dustin_/ns.pdf  [1 of 1]
>  1699840 of 4498971    37% in   32s    50.51 kB/s  failed
> WARNING: Upload failed: /ns.pdf ([Errno 32] Broken pipe)
> WARNING: Retrying on lower speed (throttle=0.25)
> WARNING: Waiting 12 sec...
> /Users/dustin/Downloads/NS_ICG_V2_8.1.pdf -> s3://dustin_/ns.pdf  [1 of 1]
>   626688 of 4498971    13% in   38s    15.97 kB/s  failed
> WARNING: Upload failed: /ns.pdf ([Errno 32] Broken pipe)
> WARNING: Retrying on lower speed (throttle=1.25)
> WARNING: Waiting 15 sec...
> /Users/dustin/Downloads/NS_ICG_V2_8.1.pdf -> s3://dustin_/ns.pdf  [1 of 1]
>   233472 of 4498971     5% in   70s     3.25 kB/s  failed
> ERROR: Upload of '/Users/dustin/Downloads/NS_ICG_V2_8.1.pdf' failed too many times. Skipping that file.

I tried doing the same on admin1a, in case this was a network session timeout.  Again, small files were fine, but big files:

> [root@admin1a.infra.scl1 s3cmd-1.1.0-beta3]# ./s3cmd put ../NS_ICG_V2_8.1.pdf s3://dustin_/ns2.pdf
> WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
> ../NS_ICG_V2_8.1.pdf -> s3://dustin_/ns2.pdf  [1 of 1]
>  4498971 of 4498971   100% in    0s     6.76 MB/s  done
> ERROR: syntax error: line 1, column 49
> ERROR: Parameter problem: Bucket contains invalid filenames. Please run: s3cmd fixbucket s3://your-bucket/

I couldn't find a way around that.  Gozer's investigating.

 - "infinite" scalability by adding nodes
   check

 - resiliency to failure (I don't want to buy RAID for these)
   check

 - HTTP or FTP access (and ssl for auth)
   By mounting the Ceph FS on HTTP/FTP servers, yes

 - access control (read will be open, write must be securely gated to protect releases)
   Yes, using IP access controls (like the netapps) for the filesystem, and S3 permissions for writes

 - WAN compatibility and data locality to a WAN
   This is less important now that we're envisioning moving all of releng to scl3.  I don't know if Ceph supports it.

 - public access, at least via a gateway, so users can download packages
   Yep (HTTP/FTP above)

 - ability to dynamically adjust redundancy per file
   Not per file; we can do this per-pool, though there's really only one pool by default, with N=2.  CRUSH can be used to make topologically sensitive allocation decisions, but doesn't change that N.  I can't find any docs about how to change those settings, which is sad -- we would want N=1 for tinderbox builds, and probably N=3 for nightlies.  I think "PG" is related, but I don't see any definition for that abbreviation.
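
   (PG seems to stand for "placement group", for what it's worth.)  If the per-pool knob works the way I think it does, adjusting the replica count would look something like the following -- untested here, pool names are made up, and I'm not sure the version we're running exposes it:

    ceph osd pool create tinderbox 128       # 128 placement groups
    ceph osd pool set tinderbox size 1       # disposable builds: single copy
    ceph osd pool set nightlies size 3       # nightlies/releases: three copies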

Summary: I think this is a better option than mogilefs, as it would require less custom development.  I assume that the upload issues are a misconfig on my end or in Ceph, and can be fixed.  The lack of any real documentation is disturbing, though.
Ah, these are 500 ISEs (internal server errors):

[Thu Feb 23 10:50:54 2012] [warn] [client 10.12.48.10] mod_fcgid: HTTP request length 135168 (so far) exceeds MaxRequestLen (131072)

Raising MaxRequestLen (to 5GB) straightened this right out.
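
For reference, the change is in the mod_fcgid config on the radosgw front-end, and ends up as something like this (5GB expressed in bytes; the exact file/vhost layout is whatever gozer set up):

 # mod_fcgid caps request bodies at 128KB by default, which truncated the
 # larger S3 uploads and surfaced as 500s; bump it well past our biggest files
 MaxRequestLen 5368709120
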
From what Gozer's said, OpenStack/Swift isn't quite ready for prime time yet -- in fact, the test install had failed when he went to look at it.  So I think this evaluation comes down to Ceph being a good option.

None of this will happen until we're out of sjc1, so there's nothing much more to do in this bug.  I'm happy to toss around ideas in IRC until then.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
GlusterFS was also a candidate but didn't get looked at. It might be worth a look as an equivalent/replacement for Ceph.
I rejected it, perhaps unfairly, based on comment 14.  I'm leery of stuff that has kernel modules *and* crashes a lot, since that tends to take a lot of associated things down with it.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations