Closed Bug 1244874 Opened 8 years ago Closed 8 years ago

Purchase new hardware for hg.mo

Categories

(Infrastructure & Operations :: DCOps, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: gps, Assigned: van)

References

(Depends on 1 open bug)

Details

Attachments

(3 files)

(This is my first bug to request a new hardware purchase. Please be gentle on me.)

We have budget to purchase replacement hardware for hg.mozilla.org. I'm filing this bug to track that purchase.

In bug 1221108, we had tentatively settled on a G9 with 2x E5-2643 v3 processors and 32 GB RAM. We had yet to decide on drive options.

van: Have hardware options changed in the past 3 months or are the info/quotes in bug 1221108 still accurate?
Flags: needinfo?(vle)
:gps, are you looking to purchase a G9 as well? I can email our vendor and find out for you ASAP. Sorry for the late response; I've been out of the country for the past 2 weeks.
Flags: needinfo?(vle)
:gps, per our vendor:

Van
 
Yes, the drive prices are the same. Once we have the final quantities and items, I can send you a final quote and see what we can come up with.

Rich
Assignee: server-ops-dcops → vle
QA Contact: cshields
(In reply to Van Le [:van] from comment #1)
> :gps, are you looking to purchase a g9 as well? i can email our vendor and
> find that out for you asap.

Yes, assuming the G9 is the "best" chassis available to us. I don't want us ordering something that is already out of date.

However, looking at the available Xeon processors from https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microprocessors, it appears the E5-2643 v3 is still the best available for our needs (we're optimizing for MHz). The E7-8893 v3, being a Haswell, is likely faster despite running 200 MHz slower, but it is much more expensive :/ The Skylake DT line is almost certainly the fastest of the bunch; the E3-1280 v5 would be rad. However, those are only available in uniprocessor configurations and only have 4 cores. I'm guessing those are non-starters?
Flags: needinfo?(vle)
We can go Configure to Order and choose the processor that the developers need. The Gen9 is still using v3 processors. Here is a full list of processors available for the BL460 Gen9:
E5-2600 v3 series Processors
 
HP BL460c Gen9 Intel® Xeon® E5-2690v3 (2.6GHz/12-core/30MB/135W) FIO Processor Kit
726987-L21
HP BL460c Gen9 Intel® Xeon® E5-2680v3 (2.5GHz/12-core/30MB/120W) FIO Processor Kit
726988-L21
HP BL460c Gen9 Intel® Xeon® E5-2670v3 (2.3GHz/12-core/30MB/120W) FIO Processor Kit
726989-L21
HP BL460c Gen9 Intel® Xeon® E5-2660v3 (2.6GHz/10-core/25MB/105W) FIO Processor Kit
726990-L21
HP BL460c Gen9 Intel® Xeon® E5-2650v3 (2.3GHz/10-core/25MB/105W) FIO Processor Kit
726991-L21
HP BL460c Gen9 Intel® Xeon® E5-2640v3 (2.6GHz/8-core/20MB/90W) FIO Processor Kit
726992-L21
HP BL460c Gen9 Intel® Xeon® E5-2683v3 (2GHz/14-core/35MB/120W) FIO Processor Kit
726993-L21
HP BL460c Gen9 Intel® Xeon® E5-2630v3 (2.4GHz/8-core/20MB/85W) FIO Processor Kit
726994-L21
HP BL460c Gen9 Intel® Xeon® E5-2620v3 (2.4GHz/6-core/15MB/85W) FIO Processor Kit
726995-L21
HP BL460c Gen9 Intel® Xeon® E5-2623v3 (3GHz/4-core/10MB/105W) FIO Processor Kit
726996-L21
HP BL460c Gen9 Intel® Xeon® E5-2609v3 (1.9GHz/6-core/15MB/85W) FIO Processor Kit
726997-L21
HP BL460c Gen9 Intel® Xeon® E5-2603v3 (1.6GHz/6-core/15MB/85W) FIO Processor Kit
726999-L21
HP BL460c Gen9 Intel® Xeon® E5-2650Lv3 (1.8GHz/12-core/30MB/65W) FIO Processor Kit
727000-L21
HP BL460c Gen9 Intel® Xeon® E5-2698v3 (2.3GHz/16-core/40MB/135W) FIO Processor Kit
727001-L21
HP BL460c Gen9 Intel® Xeon® E5-2630Lv3 (1.8GHz/8-core/20MB/55W) FIO Processor Kit
727002-L21
HP BL460c Gen9 Intel® Xeon® E5-2695v3 (2.3GHz/14-core/35MB/120W) FIO Processor Kit
727003-L21
HP BL460c Gen9 Intel® Xeon® E5-2637v3 (3.5GHz/4-core/15MB/135W) FIO Processor Kit
765268-L21
HP BL460c Gen9 Intel® Xeon® E5-2697v3 (2.6GHz/14-core/35MB/145W) FIO Processor Kit
767049-L21
HP BL460c Gen9 Intel® Xeon® E5-2667v3 (3.2GHz/8-core/20MB/135W) FIO Processor Kit
773123-L21
HP BL460c Gen9 Intel® Xeon® E5-2643v3 (3.4GHz/6-core/20MB/135W) FIO Processor Kit
773124-L21
HP BL460c Gen9 Intel® Xeon® E5-2699v3 (2.3GHz/18-core/45MB/145W) FIO Processor Kit
779795-L21
 
The 2643 is a CTO option. Please note that CTOs tend to be more expensive than BTO "off the shelf" models.
Flags: needinfo?(vle)
:gps, did the above info help? The v5s you're looking at aren't available for the G9s, and the G10s aren't available yet.
(please needinfo me so I don't miss questions)

Comment #3 helped. I don't understand "CTO" vs "BTO." We had budgeted for the 2643 v3, so we should have the money for CTO. But if BTO is much cheaper, I'm willing to entertain the idea.

I'm really trying to optimize for CPU speed here. The only other processors that interest me are the 2637 v3 (3.5 GHz but 4 cores) and the 2667 v3 (3.2 GHz and 8 cores). How much are those?

I still need to look at the drive options. A lot of this depends on how many blades we're buying for the hgweb machines. I had figured in bug 1221108 that we need ~36 CPU cores. Assuming 2 sockets per blade, that means we need one of the following:

* 2 or 3 blades with 2x8 cores on each (I'd *really* like to run more than 2 servers for reasons)
* 3 blades with 2x6 cores on each
* 4-5 blades with 2x4 cores on each

We also have the hgssh machine(s) to worry about. Although again, the number of cores we need depends on how many blades we're running since mirroring/replication eats a CPU core per mirror/blade. This is all a bit confusing, I know :/
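As a rough illustration of the core math above (a sketch only; the 36-core target and the per-socket options come from this comment and the vendor list, everything else is hypothetical):

import math

TARGET_CORES = 36          # rough hgweb core budget from bug 1221108
SOCKETS_PER_BLADE = 2      # dual-socket BL460c blades

# Candidate CPUs from the vendor list above (cores per socket, base GHz)
candidates = {
    "E5-2637 v3": (4, 3.5),
    "E5-2643 v3": (6, 3.4),
    "E5-2667 v3": (8, 3.2),
}

for name, (cores, ghz) in candidates.items():
    cores_per_blade = SOCKETS_PER_BLADE * cores
    blades = math.ceil(TARGET_CORES / cores_per_blade)
    total_cores = blades * cores_per_blade
    print(f"{name}: {blades} blades -> {total_cores} cores @ {ghz} GHz")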
Forgot to set needinfo for comment 6.
Flags: needinfo?(vle)
BTO stands for Build To Order; these are partly preconfigured servers that come with at least 1 processor and some memory. These servers are usually kept in stock, and you can then select additional server options to bring the server to the specification that you require.


CTO stands for Configure To Order. Solutions that need to be provided using this method can only be configured via an HP specialised tool to which only we have access. Using this tool you can have a bespoke-specification server that is not one of the standard "off the shelf" units. These CTO servers are also built, tested, and configured at HP and then delivered.
Our vendor is on PTO but I'll reach out to his backup.
Flags: needinfo?(vle)
gcox (who isn't accepting needinfo requests), et al: as part of replacing the hgssh machines, I was wondering if I could pick your brain about I/O options.

Today, we mount the Mercurial repositories from the Netapp via NFS on both hgssh1 and hgssh2. If hgssh1 (the master) goes down, hgssh2 has a read-write mount and the load balancer swings over within seconds. I like this setup because a) failover is fast and painless and b) redundancy and backup are built into the Netapp (no single point of failure from the perspective of hgssh).

However, performance profiling is telling me that I/O is killing performance on hgssh. This slows down certain operations and annoys developers. I *assume* this slow I/O is NFS being NFS. After all, some Mercurial operations like cloning mozilla-central touch >100,000 files, all happening from a single thread doing I/O one file at a time. I just don't think there's a way to make NFS fast with this kind of I/O pattern.
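To illustrate why a single-threaded, file-at-a-time workload hurts so much on NFS, here is a back-of-the-envelope model (a sketch only; the per-file latencies are assumed illustrative values, not measurements from our filer):

FILES = 100_000            # order of magnitude for a mozilla-central working copy

# Assumed per-file round-trip cost (open/getattr/read/close), in milliseconds.
# These are illustrative guesses, not measured numbers.
latency_ms = {
    "local SSD": 0.05,
    "NFS, warm cache": 0.5,
    "NFS, cold cache": 2.0,
}

for backend, ms in latency_ms.items():
    total_s = FILES * ms / 1000
    print(f"{backend:>16}: ~{total_s:,.0f} s just in per-file round trips")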

I'd love to see us move to a faster, non-NFS I/O system. But I don't know what the options are.

On paper, I think fibre channel is the ideal solution. But I've heard we don't have FC in SCL3 :(

I heard we can have "storage blades" that are effectively sister blades that hold storage. If your blade fails, you can move the storage blade to another chassis/blade. Of course, there's human involvement needed here and people may find the delay too frustrating. Although, I welcome a discussion about what the SLA of pushes/writability to hg.mozilla.org needs to be. I argue most consumers only care about reads. The only critical time we care about writes to hg.mozilla.org is getting a chemspill out the door. (We close the Firefox repositories all the time for hours on end, and while annoying, it isn't the end of Mozilla.)

What about iSCSI? What about networked filesystems?

My requirements are, first, something that is fast and doesn't lose data. Second, we should be able to recover from a component failure in ideally <10 minutes (the process doesn't have to be completely automated). We can *possibly* loosen the recovery window to 1-4 hours, although a discussion about the hg.mozilla.org SLA will need to take place.
This is a project-sized question. There aren't a lot of options for short-term fixes; most anything will need planning and decisions. Distilling it down to a bug-sized response is iffy, but here goes.

> However, performance profiling is telling me that I/O is killing performance
> on hgssh. This slows down certain operations and annoys developers.

I'm sure you're aware, "it's slow" is vague, and thus the reply here is vague, but, gotta start somewhere...

> On paper, I think fibre channel is the ideal solution. But I've heard we
> don't have FC in SCL3 :(

I'm personally a fan of well-run FC, but we currently have neither fabrics nor an FC license for the filer, so, correct, no FC exists. NFS flexibility was a must, and nobody has hit a use case that demanded FC, or that NFS couldn't serve, strongly enough to justify its costs.

> I heard we can have "storage blades" that are effectively sister blades that
> hold storage. If your blade fails, you can move the storage blade to another
> chassis/blade. Of course, there's human involvement needed here and people
> may find the delay too frustrating.

Pretty effective summary of the scenario.  Assuming you have a ready spare compute blade to revive onto.

> What about iSCSI? What about networked filesystems?

iSCSI is, IMO, just "poor man's FC". It's still a filer license to buy; you just save on not needing to buy a fabric (though, in my experience, you end up paying it back in headaches with your can't-fail LUN over the can-fail IP network).
Networked filesystems aren't something currently on offer beyond the filer as it is now. It'd be an 'invention' (a new implementation from scratch) since it's not been needed.

> I *assume* this slow I/O is NFS being NFS. After all, some Mercurial
> operations like cloning mozilla-central touch >100,000 files, all happening
> from a single thread doing I/O one file at a time. I just don't think
> there's a way to make NFS fast with this kind of I/O pattern.

It's possible that the I/O pattern doesn't work on a file-based protocol when the software is filesystem-abusive, yes. There are many factors at play here, and many potential touchpoints (mostly boiling down to 'server/network/client'), so there's a risk of straw-grasping here. Things I can think of that are suboptimal, based on what I imagine is happening, offhand:

Without a re-architecture:
* hg shares a disk aggregate with other volumes.  Any contention with other volumes will add latency.  We could move hg off onto its own aggregate.  (needs new disks brought in and arranged)
* we don't have flash drives to put this on.  While 15k SAS is good, it's still slower.  (needs budget and procurement time)
* we could add more flashcache to the filer nodes, which would reduce disk reads in general by caching more.  (needs budget and procurement time)
* we currently dedupe the hg volume to reduce space usage and keep the inode sprawl under control.  We could disable that and let it grow, and see if it changes responsiveness. (needs new disks brought in and arranged)

"Minor" re-architecture:
* clean up the hg volume.  Any reduction in use helps in dedupe and inode usage.  (dead_repositories?  nonlive?  years-old tarballs?)
* split out the hg volume - create subvolumes for sub-areas so that less-used areas of the tree can be moved around to other spindles.

"Major" re-architecture:
* NFSv4.  We've had bad luck with permissions setups thus far, but with hg being a fairly self-contained universe that ONLY pulls names from LDAP rather than pushed-out passwd files, we could maybe get a setup working there.  Getting to v4.1 would have some inherent performance boost, but would be a pretty invasive changeover.
* fake-iSCSI:  the filer serves up an NFS volume that has a 'lun' on it, where by 'lun' I mean a sparse blob area that has an (ext4? xfs?) filesystem on it.  Migration is nontrivial and failover becomes touchy for you to manage, but, there you are.  (A rough sketch of this follows after the list.)
* total-rearchitecture (one of the "buy flash, switch to FC, etc" options).
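A minimal sketch of the "fake-iSCSI" idea above, assuming an NFS export is already mounted at /mnt/nfs and using XFS for the inner filesystem (the paths, the 500G size, and the choice of XFS are hypothetical; real failover handling and fstab wiring are out of scope):

import subprocess

NFS_BACKING = "/mnt/nfs/hg-lun.img"   # sparse blob living on the NFS export (hypothetical path)
MOUNTPOINT = "/repo"                  # where hgssh would see the repositories (hypothetical)

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Create a sparse backing file on the NFS volume.
run("truncate", "-s", "500G", NFS_BACKING)

# 2. Attach it to a loop device; --show prints the device that was allocated.
loop_dev = subprocess.run(
    ["losetup", "--find", "--show", NFS_BACKING],
    check=True, capture_output=True, text=True,
).stdout.strip()

# 3. Put a real filesystem inside the blob and mount it.
run("mkfs.xfs", loop_dev)
run("mount", loop_dev, MOUNTPOINT)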

> My requirements are first something that is fast and doesn't lose data.
> Second, we should be able to recover from a component failure in ideally <10
> minutes (the process doesn't have to be completely automated). We can
> *possibly* loosen the recovery window to 1-4 hours, although a discussion
> about hg.mozilla.org SLA will need to take place.

In the short term, the above is what I can think of, though nothing I think we could play with in under a month due to other blockers.

On a longer timeline, there are many options.  Picking one: FC is possible if there's political will and budget.  I will say, you might want to consider Cisco UCS gear for your hardware refresh.  It's what we use for ESX.  It gets you 2x10G built in, you can build for network failover via multipathing, and if you plan appropriately, you can get FC coming out of the blades in a pretty simple format (meaning, your blade will do FCoE natively, and the Fabric Interconnect will break out FCoE into native FC for dropping off into the fabrics).  That saves a lot of per-blade fabric port and HBA costs.
 
2667 Processor is $2700 each
 
2637 Processor is $1450 each
 
Once you have the final desired configuration and options, I can put together a full blade quote.
(In reply to Van Le [:van] from comment #12)
> 2667 Processor is $2700 each
> 2637 Processor is $1450 each

I assume the 2643 is somewhere in the middle then? If so, that's probably our CPU.

Bug 1221108 contained a list of drive options. Our workload is definitely read intensive. We also want to avoid spinning disks. I don't think we need the 12 Gb/s that SAS provides. One of the following should work:

804599-B21	HP 800GB 6G SATA Read Intensive-2 SFF 2.5-in SC 3yr Wty Solid State Drive	$1,275
816909-B21	HP 960GB 6G SATA Read Intensive-3 SFF 2.5-in SC 3yr Wty Solid State Drive	$1,245

From http://www8.hp.com/h20195/v2/GetPDF.aspx%2Fc04154378.pdf, the specs on both these drives are pretty similar, although I couldn't find exactly what the difference between "Read Intensive-2" and "Read Intensive-3" is. The extra 160 GB likely doesn't matter for us, but if it's nearly free and has no major drawbacks, why not?

Also, how many drives do we need to buy? We have redundancy at the blade level. So I'm OK with a single data drive in each machine. Worst case a drive goes bad and we wait a few days for a new drive then re-seed the server. No biggie.

Do we typically buy separate drives for the OS? I don't think we need to unless it is recommended.

(Again, I don't know much about buying hardware for data centers. Please check my assertions.)

Please get quotes for the G9 with 32 GB RAM and the 816909-B21 drive and the following CPU configurations:

* 2x 2643 v3
* 2x 2637 v3

We'll likely buy 3 or 4 of those blades for the hgweb machines.

That still leaves the hgssh machines. But we need to figure out the storage situation for those first. I wouldn't be surprised if we split that into a separate bug / purchase order.
(In reply to Gregory Szorc [:gps] from comment #13)
>
> Also, how many drives do we need to buy? We have redundancy at the blade
> level. So I'm OK with a single data drive in each machine. Worst case a
> drive goes bad and we wait a few days for a new drive then re-seed the
> server. No biggie.
> 
> Do we typically buy separate drives for the OS? I don't think we need to
> unless it is recommended.

One drive/blade is standard, with OS and everything on it. I'd rather have an extra machine or two for redundancy than bother messing about with hardware raid (or lose perf with software raid).

> That still leaves the hgssh machines. But we need to figure out the storage
> situation for those first. I wouldn't be surprised if we split that into a
> separate bug / purchase order.

Mmm, I'd be super surprised if we have sufficient money left over to do much on the storage side. I don't know what sort of discount we get for the filers, but licenses there can easily run into the tens of thousands (back in the day, the NFS license cost as much as a filer head!).

I'd really like to have a better idea of what "slow" means, and look at performance numbers from both the host and filers before we look at throwing money at changing the storage. In fact, given the age of the hosts, I think I'd be much more likely to update those first and then see how things stand (we already know they're practically ancient).
Depends on: 1248072
I've asked for a quote from our vendor. :gps, do you want 2x 816909-B21 drives? We usually RAID 1 these hosts.
(In reply to Van Le [:van] from comment #15)
> i've asked for a quote from our vendor. :gps, do you want 2x 816909-B21
> drives? we usually raid 1 these hosts.

According to comment 14, I think we only want 1 drive.
Attached file developer blade.xls
quote attached.
The v2643 at $8,280.00 looks like our blade! At this point, I think we just need to decide how many to order.

3 - $24,840 - 36 cores - 122.4 GHz
4 - $33,120 - 48 cores - 163.2 GHz

This comes down to what we want to do with the hgssh machines. We currently have 2 hgssh machines, one running as master and a second as a hot spare.

It feels somewhat wasteful to me to throw an entire beefy blade on the spare. I'm tempted to say we could relegate an old machine or possibly even a VM to the spare. If the primary dies, we'll run with degraded performance for a bit. But this should only be temporary. And it shouldn't be worse than performance today.

If we go with a "mixed mode" hgssh setup, I'd want 4 hgweb machines because we need 5 nodes in the ZooKeeper/Kafka cluster to be able to suffer 2 hosts going down without an outage (4 hgweb + 1 hgssh primary). If we have 3 hgweb machines, we'd have to put ZooKeeper/Kafka on a slower machine (not ideal) or reduce the cluster size to 3 (or 4, but this doesn't make much sense), which means we can only lose at most 1 host before losing quorum.
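For reference, the quorum math behind that reasoning (this is just the standard majority-quorum rule for ZooKeeper/Kafka-style clusters, nothing specific to our deployment):

# A majority-quorum cluster of N nodes tolerates floor((N - 1) / 2) failures.
for n in (3, 4, 5):
    tolerated = (n - 1) // 2
    print(f"{n}-node cluster: quorum = {n // 2 + 1}, survives {tolerated} node failure(s)")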

Even if we put the exact same blade config in both hgssh machines, we're still under $50,000 and under our budget with 4 hgweb + 2 hgssh.
fubar, hal: RFC on comment 19
Flags: needinfo?(klibby)
Flags: needinfo?(hwine)
Greg: I can be sold on "sluggish standby", but only if I understand the situation a bit better.  hg.m.o is critical enough to development and release operations that I want to understand the worst case.
 - is IT willing to keep any old machine for such use?
 - is IT willing to support a VM that would go 0-200 in 3 seconds (ludicrous mode)?
 - what's the "max" downtime we need to cover? I.e. how long degraded?
(My guess is chassis failure that also takes down the blade is worst case. We'd have to wait for new hardware from the vendor and then install at the DC. Or does IT have spares of both onsite?)

Also, is it possible to run a dual purpose node? I.e. one that can act as either a webhead or ssh node? (If it has to go active on ssh, it can drop web load.) That would allow beefy backup server on new hardware, hopefully located in a chassis on the other side of the cage from the primary.
Flags: needinfo?(hwine) → needinfo?(gps)
(In reply to Hal Wine [:hwine] (use NI) from comment #21)
> Greg: I can be sold on "sluggish standby", but only if I understand the
> situation a bit better.  hg.m.o is critical enough to development and
> release operations that I don't want to understand worst case.
>  - is IT willing to keep any old machine for such use?
>  - is IT willing to support a VM that would go 0-200 in 3 seconds (ludicrous
> mode)
>  - what's the "max" downtime we need to cover? I.e. how long degraded?
> (My guess is chassis failure that also takes down blade is worst case. Have
> to wait for new hardware from vendor and then install at DC. Or does IT have
> spares of both onsite?

I can only speculate on the answers to these questions. Van?

> Also, is it possible to run a dual purpose node? I.e. one that can act as
> either a webhead or ssh node? (If it has to go active on ssh, it can drop
> web load.) That would allow beefy backup server on new hardware, hopefully
> located in a chassis on the other side of the cage from the primary.

In theory. I'm not keen on doing it because of security. The web servers are hooked up to the public internet (via HTTP via the load balancer) and I'd rather not be an httpd/nginx vulnerability away from someone rooting a server with write access. (There currently shouldn't be a way for a compromised hgweb machine to write/push to the canonical server - they could compromise bits we serve to the world, but the canonical copy would still be clean.) (I also like to think SSH-only internet exposure is safer than HTTP, although I'm not sure if that is true beyond strictly the smaller attack surface from running only 1 service.)
Flags: needinfo?(gps) → needinfo?(vle)
We've shied away from VMs for performance reasons, so I'd need to see numbers before considering the use of one as a backup for hgssh. Why not just keep the existing hgssh2 and re-up the warranty?

>  - is IT willing to keep any old machine for such use?

"any old machine" applies to all extant hg machines, fwiw.

>  - what's the "max" downtime we need to cover? I.e. how long degraded?

I seem to recall warranty replacement is fairly quick but yes, that's on IT to answer.
OTOH, it's not changing from what we have right now. If anything, it would be faster because HP wouldn't have to go looking for hardware from 2012.

> Also, is it possible to run a dual purpose node? I.e. one that can act as
> either a webhead or ssh node?

I think that would be painful, if only because all files on hgweb* are owned by hg:hg, whereas files on hgssh* are owned by whoever pushed last. I'm super against this unless someone has an extremely compelling reason.


I'm ok with 4 new hgweb, 1 new hgssh, and either keeping hgssh2 as-is or finding a newer-but-equiv-or-better blade.
Flags: needinfo?(klibby)
(In reply to Kendall Libby [:fubar] from comment #23)
> I'm ok with 4 new hgweb, 1 new hgssh, and either keeping hgssh2 as-is or
> finding a newer-but-equiv-or-better blade.

I like this plan.

I think we should go ahead with ordering 4 of the v2643 blades quoted in comment #17. Do we need a manager type person to sign off on this? That would be jgriffin or lmandel...
>  - is IT willing to keep any old machine for such use?
>"any old machine" applies to all extant hg machines, fwiw.

yes, we can.

>  - what's the "max" downtime we need to cover? I.e. how long degraded?

I can replace the hardware within 2 business days (usually same day) once the RMA arrives on-site.

>Do we need a manager type person to sign off on this? That would be jgriffin or lmandel

Yeah, we need manager approval in the bug, then I can place the order.
Flags: needinfo?(vle)
jgriffin: can you please use your manager special powers to approve our purchase request per comment #24 and comment #25?
Flags: needinfo?(jgriffin)
(In reply to Gregory Szorc [:gps] from comment #26)
> jgriffin: can you please use your manager special powers to approve our
> purchase request per comment #24 and comment #25?

Approved.
Flags: needinfo?(jgriffin)
To confirm, 4x v2643?
Yes, 4x v2643, quoted at $8,280.00.
Order has been placed with our vendor. Will update with final quote and tracking info once received.

These are CTO units that need to be built, so there is a 2-week lead time
 
Thanks
 
Rich
Attached file quote for 4x E5-2643v3
Hi all,

This is high enough to need a PO, so I'll get that request submitted - what cost center will this come out of?
Flags: needinfo?(jgriffin)
Flags: needinfo?(gps)
(In reply to Corey Shields [:cshields] from comment #32)
> Hi all,
> 
> this is high enough to need a PO so I'll get that request submitted - what
> cost center will this come out of?

This comes out of 8200 - Platform Operations
Flags: needinfo?(jgriffin)
Flags: needinfo?(gps)
colo-trip: --- → scl3
PO approved and forwarded to Terminal Exchange.
Please install CentOS 7. Please make sure the boot partition has enough space for multiple kernels (~256 MB should be fine). Also, we run into inode count issues on these machines if using the default ext4 settings. You may want to double check with fubar on how the partitions should be set up.
Flags: needinfo?(gps)
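On the inode-count issue above, a rough sketch of how the ext4 bytes-per-inode setting changes the available inode count on one of these drives (the 960 GB size is the quoted 816909-B21 SSD; the mkfs.ext4 -i values shown are standard knobs, but the actual partition layout and choice would be fubar's call):

DISK_BYTES = 960 * 10**9   # 816909-B21 is a 960 GB drive

# mkfs.ext4 -i <bytes-per-inode>: one inode is created per this many bytes of space.
# 16384 is the usual default; smaller values yield more inodes.
for bytes_per_inode in (16384, 8192, 4096):
    inodes = DISK_BYTES // bytes_per_inode
    print(f"-i {bytes_per_inode}: ~{inodes:,} inodes")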
Opened bug 1262232 for MOC to reimage the hosts. Closing this as a purchase bug for easier tracking.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Blocks: 1263679
Blocks: 1265557