Closed Bug 910790 Opened 11 years ago Closed 11 years ago

spec & deploy production hardware for vcs-sync systems

Categories

(Developer Services :: General, task)

Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: hwine, Assigned: nmaul)

References

Details

(Whiteboard: [reit-ops])

The current production server is github-sync3.dmz.scl3.mozilla.com, which is a SPOF and undersized (at least on disk) for the current load.

Additional disk load expected as :aki's services also roll out.

(This has been discussed with various folks in IT -- moving to bug so we keep it more visible.)
We're starting to hit space issues:

07:50 < nagios-scl3> Thu 07:50:01 PDT [5093] github-sync3.dmz.scl3.mozilla.com:Disk - All
                     is WARNING: DISK WARNING - free space: / 27618 MB (10% inode=80%):
                     (http://m.allizom.org/Disk+-+All)
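
For context, here's a minimal sketch of the kind of threshold check behind an alert like this -- illustrative only, not the actual Nagios plugin or its configuration, and the 90% threshold is an assumption:

THRESHOLD=90                                   # assumed warning threshold (% used)
USAGE=$(df -P / | awk 'NR==2 {gsub("%","",$5); print $5}')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
    echo "DISK WARNING - / is ${USAGE}% full"
fi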
Per mtg w/aki: 

(gd3 == github-sync3.dmz.scl3.mozilla.com)
$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             260G  208G   39G  85% /
$

1) Having all this production load on gd3 means we have a SPOF, and we are getting ever closer to running out of disk space here. We're actively worried about running out of space when we add new branches (b2g v1.2 later this week) and new l10n locales. What is the ETA on getting two machines (called "gd5", "gd6") that are the same class as gd3 or better, with at least 500GB disks?

2) In terms of rollout, note that gd3 is currently in production, so to avoid tree closure/downtime it would be best if we can stand up gd5 and gd6 and then transition the production jobs from gd3 over to gd5/6. Running with the load shared across both gd5 and gd6 means we will not have a SPOF.

3) We could then power off gd3 for a hardware upgrade without needing a tree closure. At best guess, this transition of jobs to gd5/6 would take a few months, as we stand up automation on the new machines. However, each job we transition would reduce load on the overloaded gd3 and shrink the SPOF a little. It would also let us restore gd3 backups onto one of these machines should gd3 die.


fox2mike, jake: what are the next steps here?
Flags: needinfo?(shyam)
Flags: needinfo?(nmaul)
This has been sitting for 3 weeks; disk usage is now close to 90%, at which point it will start alerting SREs:

[vcs2vcs@github-sync3.dmz.scl3 ~]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3             260G  217G   30G  89% /

Bumping priority.
Severity: normal → major
Assigning to :jakem to stop paging oncall.
Assignee: server-ops-webops → nmaul
Jumping in to help as jakem is on PTO.

What's the growth like on this? Have we gone from nil to 260G in a year and a half? What will the growth be in the years to come? We need this data to avoid having to increase the disk on this box every so often.

A second server is requested in c3 to avoid a tree closure, but I'm not sure this will all be possible in time. I'll check with Terminal on their availability. At best you are asking for 1 new blade and 2 storage blades (one for the new blade and one to backfill the existing blade).
Flags: needinfo?(hwine)
(In reply to Corey Shields [:cshields] from comment #6)
> jumping in to help as jakem is on pto.

I should clarify -- our immediate need is to get more disk space available, and some failover capacity would be nice as well. I started this discussion in this bug in the hope that the correct production configuration could be used as the solution.

It seems unlikely that the production configuration can be defined and stood up in time to address this immediate need. Should we move this to another bug and return to the "real production setup" that IT wishes to operate at a later point?

> What's the growth like on this?  Have we gone from nil to 260G in a year and
> half?  

The growth is driven by b2g business needs:
 - every repository used by the team needs to be mirrored and/or converted
 - this includes both source code & locales

> What will its growth be for the years to come?  Need this data to
> avoid having to increase the disk on this every so often.

I appreciate the desire for a stable answer. However, I believe b2g is still too early in its life cycle to project accurately. That said, I would guess that an extra 250 GB will hold until better projections are possible. (More below.)

> A second server is requested in c3 to avoid a tree closure but I'm not sure
> this will all be possible in time.  I'll check with Terminal on their
> availability.  At best you are asking for 1 new blade and 2 storage blades
> (one for the new blade and one to backfill the existing blade)

There are two reasons a 2nd server is requested:
 1. it provides extra capacity without requiring downtime
 2. as discussed back in Dec 2012, this is a production system; b2g builds stop without it running, so a 2nd machine provides some level of failover.

One plan previously proposed by IT was to see _if_ the new ESX system could handle the load. The previous generation of virtual machines could not (these operations produce an atypical file system load that wasn't handled well by efs-4 under VMware; see bug 739100 comment 17 and onward). However, a sample VM was never provided for testing, so we do need to proceed with hardware at this time.

While I'll be on PTO when this disk is added, all the procedures for restarting this machine are documented. If an additional volume can be added without rebooting, that would be preferred. Disk usage can be balanced without downtime using the instructions for moving a job set between machines [1].

[1]: https://people.mozilla.org/~hwine/tmp/vcs2vcs/cookbook.html#moving-a-job-set-between-servers
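
Roughly, such a move has the shape sketched below; the hostnames, paths, job name, and cron entry here are placeholders (not our actual layout), and the cookbook [1] remains the authoritative procedure:

OLD=github-sync3.dmz.scl3.mozilla.com
NEW=github-sync5.dmz.scl3.mozilla.com        # placeholder new host
JOB=example-job-set                          # placeholder job-set name

# Stop scheduling the job set on the old host.
ssh "$OLD" "crontab -l | grep -v $JOB | crontab -"

# Pull the job set's working directory over to the new host.
ssh "$NEW" "rsync -a $OLD:/home/vcs2vcs/$JOB/ /home/vcs2vcs/$JOB/"

# Schedule it on the new host (example schedule; appended so the existing crontab is kept).
ssh "$NEW" "(crontab -l 2>/dev/null; echo '*/5 * * * * /home/vcs2vcs/$JOB/run.sh') | crontab -"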
Flags: needinfo?(hwine)
(In reply to Hal Wine [:hwine] (use needinfo) [PTO Sep24-Oct6] from comment #7)
Followup from EngOps meeting. 

1) It's taken ~9 months to get to 217GB usage of this 260GB disk. This rate of growth looks like it's slowing down, but it's still too early to know for certain. For now, if we can double the disk to approx. 500-520GB, that should be plenty to keep us going for the rest of the year... and (I hope) some amount of 2014. More realistically, it will also get us better trend data as b2g stabilizes. (Back-of-envelope arithmetic at the end of this comment.)

2) dmoore hopes to have quotes for review by tomorrow. Stay tuned.
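
Back-of-envelope arithmetic for the estimate in point 1 (assumes roughly linear growth, which is probably pessimistic given the rate looks to be slowing):

USED=217; MONTHS=9; NEW_DISK=500             # GB used, months elapsed, proposed disk (GB)
RATE=$(( USED / MONTHS ))                    # ~24 GB/month
echo "~${RATE} GB/month; ~$(( (NEW_DISK - USED) / RATE )) months of headroom on ${NEW_DISK} GB"
# -> ~24 GB/month; ~11 months of headroom on 500 GB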
The RFQ has been sent over to Terminal.

We'll be looking at the new blade, in addition to any option that provides us with (at least) 1TB of storage for each server.
We got other instances running in bug 924716, which are a) handling this additional load just fine and b) helping with the SPOF problem, so let's WONTFIX.
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(shyam)
Flags: needinfo?(nmaul)
Resolution: --- → WONTFIX
Component: WebOps: Source Control → General
Product: Infrastructure & Operations → Developer Services