Closed Bug 962298 Opened 9 years ago Closed 8 years ago

Investigate blob storage options

Categories

(Infrastructure & Operations :: Storage, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: dustin, Assigned: dtorre)

References

Details

Related:
 715415 -  evaluate "cloud" storage systems as a replacement for releng's filestore needs
 Ceph is being evaluated in labs (/cc Gozer)
The requirements as I have them right now are:
 * in-house
 * high performance
 * REST interface

This would mainly, but not exclusively, be used for caching - ccache, probably hg caches, caches of built artifacts, etc.

Open questions:
 * Is direct S3 compatibility required, or just REST?
 * Even assuming S3 compatibility, is it OK that it will have a different endpoint than S3?
All I can say so far is that I've been running Ceph both in labs and as a proof of concept for a webops project, and so far I am quite happy and impressed with it. It's solid, stable, and well designed both internally and from an operational point of view.

The only drawback I can think of is that for the best performance you need the Ceph kernel module, which was mainlined a while back but is not part of RHEL.

It does come with great POSIX support for just a good old mount, but it also sports an object gateway that speaks S3 and Swift, albeit not with 100% API coverage. I have no experience with that part myself, but could easily assist in setting one up.

More info on Ceph's Rados Gateway here: http://ceph.com/docs/master/radosgw/
I was impressed with the Ceph internals at the talk I attended back at LISA'12.

I think we'd only ever want an HTTP API to the storage - no filesystem.  S3 compatibility probably doesn't have to be very deep, either - the basic PUT and GET should be adequate.

What do you see as the future of Ceph at Mozilla?  Will it become an IT-supported platform?  It'd be a shame for these two different bits of Mozilla to pick different object storage technologies and bifurcate the available support resources.
Another point in Ceph's favor: CloudStack, at least, can use it for primary and secondary storage.
If that's a good thing, there are also RBDs (RADOS Block Devices), which give you Linux-level raw block devices backed by Ceph's replication.

And IIRC, kvm/qemu/libvirt have direct support for it.
I'm optimistically re-assigning to gozer :)

Gozer, probably the best next step here is to set up some access to the current proof-of-concept.  Taras, is there someone on the TaskCluster project or otherwise who would be well-positioned to give Ceph a trial run and either give the thumbs-up or enumerate some requirements that we can use to search for other options?
Assignee: relops → gozer
Flags: needinfo?(taras.mozilla)
Mike can look into this, probably from the perspective of Windows builds and s3cache. Note that a blob store will be a lot more useful once we have some sort of *stack deployment. For now the main use cases are:
a) caching compilation artifacts. This is Mike's stuff.
b) reporting test data. Joel can probably consider this; not sure whether this would be a win over using S3 over the internet.
Flags: needinfo?(taras.mozilla)
From chatting with Taras, it sounds like he's confident that the Ceph blob-storage API is adequate, so access to the POC system actually isn't necessary.

Instead, let's get a system stood up in the releng network somewhere that we can point scl3 releng builders at, using ccache.

Gozer, what kind of resources would you need to do something like that?  We have a few spare HP DL120G7s that will be in scl3 tomorrow afternoon, and some iX21X4 2U Neutrons there now.  The former have 250G drives; the latter have 500G (I think).  And how soon could we do this?  That last question might be best handled by email or IRC with taras directly - I make a poor middleman.
Flags: needinfo?(gozer)
Depends on: 966268
Gozer -- I just requested some more chunky hardware in bug 966268.  However, if that takes a while to set up and will go to scl3, then we can start off with some of the ix-mn-*.relabs.releng.scl3.mozilla.com hosts, which have 500G disks each.  Presumably we could drop the hosts from bug 966268 into that cluster once they're available.  I can get you ix-mn-* hosts for the asking - just let me know how many you need.
(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #9)
> Gozer -- I just requested some more chunky hardware in bug 966268.  However,
> if that takes a while to set up and will go to scl3, then we can start off
> with some of the ix-mn-*.relabs.releng.scl3.mozilla.com hosts, which have
> 500G disks each.  Presumably we could drop the hosts from bug 966268 into
> that cluster once they're available.  I can get you ix-mn-* hosts for the
> asking - just let me know how many you need.

A machine with one decent-speed HD (e.g. not some failing drive) would be a good start. S3-type stores checksum data to make sure it's still valid, so no need for anything fancy. Once we start relying on this stuff, it would be good to tweak perf/hw carefully, compare Swift perf, etc.

Still waiting for an ETA here.
Assuming I can get my hands on the hardware, I think having something up and running next week is realistic.

However, I would like to schedule a chat (Vidyo?) with :taras and :dustin early next week to understand exactly what's needed/wanted here.

It's one thing to set up a 2-node Ceph cluster sort of for fun, but I don't want to fully commit to this until I have a better understanding of what's being attempted here, and what's really important or needed with this setup.

Things like monitoring, uptime, resiliency, access-control, etc.
Flags: needinfo?(gozer)
(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #9)
> Gozer -- I just requested some more chunky hardware in bug 966268.  However,
> if that takes a while to set up and will go to scl3, then we can start off
> with some of the ix-mn-*.relabs.releng.scl3.mozilla.com hosts, which have
> 500G disks each.  Presumably we could drop the hosts from bug 966268 into
> that cluster once they're available.  I can get you ix-mn-* hosts for the
> asking - just let me know how many you need.

Initially, the minimum for me would be 3 reasonably similar boxes, that's it.

You can do it with a single box, but that gets you *no* fault tolerance, obviously.

One of the nice things about Ceph is its ability to grow every part of itself without downtime, so you can grow it as you need to, or as resources become available.
I see Taras set up a meeting for Monday afternoon, but ping me as soon as you're around on Monday and I can answer at least enough questions to get the ball rolling.
(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #13)
> I see Taras set up a meeting for Monday afternoon, but ping me as soon as
> you're around on Monday and I can answer at least enough questions to get
> the ball rolling.

Feel free to cancel the meeting if you guys agree between yourselves. To start with we will do without fault tolerance; a global ccache use case is OK with a lossy store. There are two requirements:
* The crucial part is to avoid corruption: refuse to serve files if they do not pass some internal consistency check.
* Low latency. AWS S3 can do about 0.005s best-case to retrieve a small file over HTTP. I think we average around 20-40ms at the moment for mozilla-central .o files. We'd want perf in that ballpark (lower latency = faster builds).
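
For reference, a minimal sketch (Python, using the requests library) of how that latency target could be sanity-checked against whichever endpoint gets stood up; the URL below is a placeholder, not a real host:

import time
import requests

# Placeholder URL for a small object on the candidate blob store.
URL = 'http://blob-store.example.scl3.mozilla.com/test-bucket/small-object'

samples_ms = []
for _ in range(100):
    start = time.time()
    resp = requests.get(URL)
    resp.raise_for_status()
    samples_ms.append((time.time() - start) * 1000.0)

samples_ms.sort()
print('median: %.1f ms, p95: %.1f ms' % (samples_ms[len(samples_ms) // 2],
                                         samples_ms[int(len(samples_ms) * 0.95)]))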
:dustin, can you include here the 3 systems you allocated for Ceph, so I can track them down and start spinning them up? Thanks!
(In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment #16)
> https://inventory.mozilla.org/en-US/systems/show/2401/

BMC not reachable on that one either.

> https://inventory.mozilla.org/en-US/systems/show/7354/

Kickstarted and Ceph installed.

> https://inventory.mozilla.org/en-US/systems/show/3605/ (going to ix for
> service, but we might have more like it onsite soon)

Unreachable.
(In reply to Philippe M. Chiasson (:gozer) from comment #17)
> (In reply to Dustin J. Mitchell [:dustin] (I ignore NEEDINFO) from comment
> #16)
> > https://inventory.mozilla.org/en-US/systems/show/2401/
> 
> BMC not reacheable on that one either.

I just fixed the username/password on this - you should be good to go.
There is now a 3-node Ceph cluster running on:
 - ix-mn-6
 - ix1204-1
 - ix1204-2

[root@ix-mn-6 test]# ceph status
    cluster 41309758-5661-4a4c-adba-1a6826bc2be0
     health HEALTH_OK
     monmap e1: 3 mons at {ix-mn-6=10.26.78.28:6789/0,ix1204-1=10.26.78.18:6789/0,ix1204-2=10.26.78.19:6789/0}, election epoch 4, quorum 0,1,2 ix-mn-6,ix1204-1,ix1204-2
     mdsmap e6: 1/1/1 up {0=ix-mn-6=up:active}, 2 up:standby
     osdmap e29: 3 osds: 3 up, 3 in
      pgmap v79: 264 pgs, 12 pools, 3272 kB data, 123 objects
            15479 MB used, 3956 GB / 3972 GB avail
                 264 active+clean
  client io 4094 B/s rd, 8803 B/s wr, 17 op/s

And there is an S3/Swift endpoint (RadosGW) running (in non-HA mode) here:

http://ix-mn-6.relabs.releng.scl3.mozilla.com:80/

To play with it, you'll need an access_key/secret_key combo; just ping me and I'll get you what you need.

NOTE: No SSL at this time, but it could easily be added.
NOTE: No DNS-based bucket names; buckets live in the path, i.e. http://ix-mn-6.relabs.releng.scl3.mozilla.com:80/<bucketname>
      as in <http://10.26.78.28/test-public/>.
      Some S3 clients have problems with bucket names in the path rather than the hostname.
NOTE: No HA/redundancy/load-balancing for the S3 endpoint. That could easily be done, but right now only the
      3-node cluster itself is HA; the S3 gateway is not.
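
For anyone picking this up, here is a minimal sketch of talking to the gateway with boto 2, using path-style bucket addressing (OrdinaryCallingFormat) to work around the lack of DNS-based bucket names; the keys and bucket name are placeholders:

import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id='ACCESS_KEY_FROM_GOZER',       # placeholder
    aws_secret_access_key='SECRET_KEY_FROM_GOZER',   # placeholder
    host='ix-mn-6.relabs.releng.scl3.mozilla.com',
    port=80,
    is_secure=False,                                 # no SSL on the PoC gateway
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),  # path-style buckets
)

bucket = conn.create_bucket('test-public')           # placeholder bucket name
key = bucket.new_key('hello.txt')
key.set_contents_from_string('hello from ceph')
print(key.get_contents_as_string())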
gozer: please follow up with taras and glandium to get them access.
(In reply to Amy Rich [:arich] [:arr] from comment #21)
> gozer: please follow up with taras and glandium to get them access.

Both taras and glandium should have their access keys and secret keys to access this service via the S3 API.
gozer, can you give ceph/s3 credentials to mshal too? He's glandium's peer on this build faster work
(In reply to Taras Glek (:taras) from comment #23)
> gozer, can you give ceph/s3 credentials to mshal too? He's glandium's peer
> on this build faster work

Credentials sent via e-mail to mshal@mozilla.com
Looks like this meets our needs for a global compilation cache. We will move to using this to prototype speeding up our Windows and Mac builds asap. This should free up a good deal of Windows and Mac compute capacity for tests, etc.

Amy, what are the next steps for getting a production ceph setup? 

Btw, http://www.scality.com/how-object-storage-tackles-thorny-exascale-problems/ is an interesting paper. It tries hard to sell the proprietary solution, but Ceph comes out looking pretty good for our use case.
A production environment will require long-term support from IT.  After I broached this with the IT leadership folks on IRC, it was suggested that WOS might be a better, more cost-effective alternative.  I believe that cshields was going to discuss that with you.

If you discuss this and the decision is made to go with Ceph, then security needs to do a review, we need to identify who will support this long term, and a project kickoff form needs to be filled out (that includes writing up a deal summary, obtaining funding from platform engineering, etc.).
David is taking over this project and will be looking at multiple options that meet the business' needs, so I'm setting him as the owner.
Assignee: gozer → dtorre
Component: RelOps → Server Operations: Storage
Product: Infrastructure & Operations → mozilla.org
QA Contact: arich → dparsons
Next steps:

I put together a list of options for object storage [object storage as a service] at Mozilla. 

We in IT want to ensure the solution is supportable, and that it has a decent chance of meeting WebEng's object storage needs as well. To that end I need to incorporate Inktank support and training, and ensure we have proper resources. We don't necessarily need this ASAP, but it needs to be planned. I also need to show some other options, the ROI related to cutting build time at AWS, and other IT management housekeeping you guys may not care too much about.


I'd like to have this final design and decision from IT managers by mid next week. From there, assuming the rest of the Ops support staff in IT can support this option, we move ahead with a purchase. 



On a parenthetical note, there is a bug (bug 962679) regarding Amazon Direct Connect. If ADC provides sub 20ms access time to S3 from SCL3, I need to understand if that solution would be used in tandem with this one, or would instead replace OnPrem object storage altogether. Since Laura's team needs 500TB of object storage, I don't see the need for OnPrem object storage going away. However I wanted to add the ADC variable here for reference.
(In reply to dtorre from comment #28)
> On a parenthetical note, there is a bug (bug 962679) regarding Amazon Direct
> Connect. If ADC provides sub 20ms access time to S3 from SCL3, I need to
> understand if that solution would be used in tandem with this one, or would
> instead replace OnPrem object storage altogether. Since Laura's team needs
> 500TB of object storage, I don't see the need for OnPrem object storage
> going away. However I wanted to add the ADC variable here for reference.


It doesn't seem like encouraging S3 via Direct Connect is a good idea if we'll be starting from about $450/month of traffic for our smallest-capacity use case.

Pasting from a prior email on this:
we pay $0.03/GB for bandwidth.

For the existing workload that we are trying to replicate in-house: we do about 21GB/hour to a single S3 bucket in AWS at the moment (roughly 15TB/month, which at $0.03/GB is where the ~$450/month comes from). We store about 400GB there.

Our onsite usage is likely to be a few times that in a couple of months.
gozer, can you give Selena s3 creds to the test cluster?
gozer, could you create a mozilla-releng-ceph-cache-scl3-try bucket and restrict my access key to only access that bucket (or create a new access key with only access to that bucket). Ideally, the access would be limited to PutObjectAcl (in S3 terms).
(In reply to Mike Hommey [:glandium] from comment #31)
> gozer, could you create a mozilla-releng-ceph-cache-scl3-try bucket and
> restrict my access key to only access that bucket (or create a new access
> key with only access to that bucket). Ideally, the access would be limited
> to PutObjectAcl (in S3 terms).

You should be able to create a bucket with such ACLs yourself, no? Or has that failed for you?

Otherwise, let me know and I'll try and do it for you.
(In reply to Philippe M. Chiasson (:gozer) from comment #32)
> (In reply to Mike Hommey [:glandium] from comment #31)
> > gozer, could you create a mozilla-releng-ceph-cache-scl3-try bucket and
> > restrict my access key to only access that bucket (or create a new access
> > key with only access to that bucket). Ideally, the access would be limited
> > to PutObjectAcl (in S3 terms).
> 
> You should be able to create a bucket with such ACLs yourself, no? Or has
> that failed for you?
> 
> Otherwise, let me know and I'll try and do it for you.

I don't have credentials, and would love to get some while you're in there! :)
(In reply to Philippe M. Chiasson (:gozer) from comment #32)
> You should be able to create a bucket with such ACLs yourself, no? Or has
> that failed for you?
> 
> Otherwise, let me know and I'll try and do it for you.

How do I restrict my access key from doing anything else? I want an access key that has put access on that bucket, and *nothing* else. No bucket creation, no access to other buckets, etc.
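
For context, the S3 ACL machinery the gateway exposes looks roughly like this from boto 2 (user id, keys, and endpoint below are placeholders). As far as I know this RadosGW setup has no IAM-style policies, so a separate restricted user/key, as gozer ends up creating below, is the usual way to get a put-only credential:

import boto
import boto.s3.connection

# Connect as the bucket owner (placeholder credentials).
conn = boto.connect_s3(
    aws_access_key_id='OWNER_ACCESS_KEY',
    aws_secret_access_key='OWNER_SECRET_KEY',
    host='ix-mn-6.relabs.releng.scl3.mozilla.com',
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

bucket = conn.create_bucket('mozilla-releng-ceph-cache-scl3-try')

# Grant WRITE (object PUT/DELETE within this bucket) to a second, restricted
# radosgw user; that user gets no bucket creation and no access to other buckets.
bucket.add_user_grant('WRITE', 'restricted-user-id')   # placeholder user id

print(bucket.get_acl())   # inspect the resulting grants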
(In reply to Selena Deckelmann :selenamarie :selena from comment #33)
> (In reply to Philippe M. Chiasson (:gozer) from comment #32)
>> [...]
> > 
> > Otherwise, let me know and I'll try and do it for you.
> 
> I don't have credentials, and would love to get some while you're in there!
> :)

Credentials sent via e-mail
(In reply to Mike Hommey [:glandium] from comment #34)
> (In reply to Philippe M. Chiasson (:gozer) from comment #32)
> > You should be able to create a bucket with such ACLs yourself, no? Or has
> > that failed for you?
> > 
> > Otherwise, let me know and I'll try and do it for you.
> 
> How do I restrict my access key from doing anything else? I want an access
> key that has put access on that bucket, and *nothing* else. No bucket
> creation, no access to other buckets, etc.

my bad, I didn't fully understand the permission needed, I blame it on my paternity leave.

Created a new access key for you and e-mailed it.
(In reply to Philippe M. Chiasson (:gozer) from comment #36)
> my bad, I didn't fully understand the permission needed, I blame it on my
> paternity leave.
> 
> Created a new access key for you and e-mailed it.

Got things working appropriately. One weird thing is that if I give WRITE_ACP permission to that user on the new bucket, putting data fails. It works, however, if I give FULL_CONTROL.
After a frustrating experience debugging Ceph perf, I think we might've overengineered this.

The benefit of an object store is infinite scalability, but object caches might have a fairly finite size (waiting on gozer to tell us what our space usage is at the moment). I think all we need in this case is an nginx web cache a la http://codemonkey.ravelry.com/2011/05/16/saving-money-by-putting-a-web-cache-in-front-of-amazon-s3 running on a box with good IO + network, local to the datacenter.
We'd then upload to S3 as normal and have clients pull from the cache.

https://gist.github.com/sansmischevia/5617402
e.g., S3 can take care of scaling, object expiry, etc., and the nginx cache can take care of providing the low read latencies we are after for 'hot' objects.
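
To make that split concrete, a hedged client-side sketch (hostnames, bucket, and key names are made up; the actual nginx proxy_cache config would live on the cache box, along the lines of the gist above): writes go straight to S3, reads go through the in-datacenter cache, which proxies misses back to S3.

import boto
import requests

BUCKET = 'releng-ccache'                               # hypothetical bucket
CACHE_URL = 'http://ccache-proxy.scl3.example.com'     # hypothetical nginx cache host

def put_object(key_name, data):
    # Writes go directly to S3; the cache picks the object up on first read.
    conn = boto.connect_s3()                           # normal AWS credentials
    bucket = conn.get_bucket(BUCKET)
    bucket.new_key(key_name).set_contents_from_string(data)

def get_object(key_name):
    # Reads hit the local nginx cache, which proxies (and caches) S3 misses.
    resp = requests.get('%s/%s/%s' % (CACHE_URL, BUCKET, key_name))
    resp.raise_for_status()
    return resp.content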
Why not just memcached?
(In reply to Dustin J. Mitchell [:dustin] from comment #40)
> Why not just memcached?

You are right, memcached has been one of the options since the beginning. I think we should look at three options:
1) Ceph (done)
2) nginx caching on a single SSD machine (new idea)
3) memcached cluster (old idea)

The key variable here is the needed capacity (thanks, gozer). Looks like our actual usage for two weeks' worth of compiles is 250GB per arch per try pool (this matches what we saw on EC2). That means Windows + Mac is 500-750GB (e.g. Mac might be double Windows because it does 64-bit and 32-bit). We should have at least 1.5x capacity.

Good thing we had 4TB to start with in Ceph :) This is also why Ceph is overkill: we won't need to scale beyond a single machine's worth of SSDs for the foreseeable future.

So 2) seems like the path of least resistance to me. However, if you think 3) is faster to spin up, I'm fine with that.

Regarding hardware: if you have spare hardware for this elsewhere, that'd be best. If you need hardware, for the next few weeks we can use spare SeaMicro nodes for either 2) or 3).

Dustin, it's up to you which would be faster to spin up 500GB of cache storage with (not having spare capacity is probably OK for a PoC). Which one do you want to go with next, and when can we start testing on it?

glandium has the backend abstracted in sccache; he can adjust it to anything you set up pretty quickly.
We have 14 32G SeaMicro nodes, for 448GB of RAM.  We have a bunch of 16G nodes as well.  I'd have to check on the stable-hashing algorithm that clients use, but I think we could mix the two by just listing each of the 32G nodes twice so that they get 2x the keys of the 16G nodes.  That would probably be easiest to set up - memcached is dead simple, and we already have SeaMicro support in PuppetAgain.  I doubt I'll be able to get to it today, though, and next week is the releng workweek.
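
A rough sketch of that weighting idea with the python-memcached client, which accepts (server, weight) tuples; weighting the 32G nodes at 2 means they appear twice in the client's bucket list and so receive roughly twice as many keys (hostnames are placeholders):

import memcache

# Placeholder hostnames: weight 2 for 32G SeaMicro nodes, 1 for 16G nodes.
servers = [
    ('sm-32g-1.relabs.example.com:11211', 2),
    ('sm-32g-2.relabs.example.com:11211', 2),
    ('sm-16g-1.relabs.example.com:11211', 1),
    ('sm-16g-2.relabs.example.com:11211', 1),
]

mc = memcache.Client(servers)
mc.set('ccache/0123abcd', 'compiled object bytes')
print(mc.get('ccache/0123abcd'))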
448GB would be sufficient. By "you" I meant you and the people you usually get help from for these sorts of things.

We'll make some hack time during the workweek.
also fyi:
10:07 <gozer> All you need is to PUT/GET/DELETE and be done with it
10:07 <gozer> S3 only adds latency and cost, no?
10:08 <taras> we use s3 already for this on aws
10:08 <taras> but ya
10:08 <taras> PUTing into nginx is an option too
10:08 <taras> it's just slightly more complex cos one needs to setup a cronjob to delete files :)
10:08 <taras> s3 is nice
10:08 <taras> cos it takes care of auth
10:08 <taras> expiry
10:08 <taras> and infinite storage
10:09 <taras> nginx requires more engineering to get to feature parity
Spoke to gozer. He can stand up nginx by Monday EOD. Let's close this bug once that's stood up in bug 1001517.

We should do memcached too, but having nginx to test lowers the urgency of that. I think we'll have enough SeaMicro hardware for both if we do it within the next two weeks :)
Perhaps for future use, once Redis Cluster is out and about, it might be a better solution than memcached.
We just finished moving *off* of Redis because it's not a supported service (and not least because clustering has been "RSN" for a long time now).
Depends on: 1002630
Is this still active? If so is there anything needed from the IT department?
(In reply to dtorre from comment #48)
> Is this still active? If so is there anything needed from the IT department?

Thanks for reminding me to close this
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Infrastructure & Operations