Closed
Bug 917567
Opened 12 years ago
Closed 11 years ago
Please create HA and DR environments for Tableau
Categories
(Infrastructure & Operations :: Virtualization, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bsullins, Assigned: cknowles)
Details
Please create the following machines to support High Availability and Disaster Recovery for Tableau. These machines can be virtual, assuming the requested resources can be dedicated to them. AWS is also an option, which might make the DR scenario a bit easier to configure.
A diagram of these architectures can be found here: http://prezi.com/2czygd2zhwue/?utm_campaign=share&utm_medium=copy
Details about Tableau's HA/DR support can be found here: http://onlinehelp.tableausoftware.com/v8.0/server/en-us/distrib_ha_intro.htm
**HA Machines**
* All in same data center.
* Existing Tableau server lives in SCL3.
* Please copy netflows from tableau1.metrics.scl3.mozilla.com
Gateway Machine x1:
* 4 cores, 16GB RAM, 100GB Disk
* OS: Windows Server (latest available)
Worker Machines x2:
* 4 Cores, 16GB RAM, 300GB Disk
* OS: Windows Server (latest available)
**DR Machine**
* In a geographically different data center than the above HA machines
* Please copy netflows from tableau1.metrics.scl3.mozilla.com
Primary Tableau Server x1:
* 8 Cores, 32GB RAM, 600GB Disk
* OS: Windows Server (latest available)
Comment 1•12 years ago
Ben,
Before we take any action on this, we need to find hardware that we can monitor OR move you to a VM based infrastructure, so we don't need to deal with issues that arise out of hardware we can't monitor because this needs to run on Windows. We're still having trouble monitoring everything we need to wrt Tableau.
My personal preference would be VMs, since that's easier to manage (from an SRE perspective) and is already pretty redundant (based on how our VM infra is designed) but I'll defer to Dan/Greg since they know how much we can add on to the current infrastructure we have.
What's your timeline on this? So we can plan and move forward accordingly.
Thanks!
Comment 2•12 years ago
We could possibly do the 1x gateway machine and 2x worker machines as VMs in scl3, but with the need for 4 cores and 16GB RAM per VM, the preference would be a physical server. If necessary, we could do VMs though.
For the primary server (with 8 cores and 32GB RAM), we would definitely suggest using a physical server for that.
Reporter
Comment 3•12 years ago
Dan: After looking at this closer I realize we could scale down the gateway machine to 2 cores and 8GB of RAM. Also, I should include that these machines should be 64bit (I inadvertently assumed this as the default). I'll update the diagram as such. I'm open to whichever option your team prefers, physical or VM, as long as the appropriate resources are allocated.
Shyam: Maybe we can chat offline about the monitoring. I was under the impression this is all up and running as noted here: https://bugzilla.mozilla.org/show_bug.cgi?id=874203
I worked closely with rbryce to get Nagios monitoring for the services (found here: https://dataviz.mozilla.org/admin/systeminfo) and I believe he also set up some Windows system-level monitors.
I'm happy to help with anything I can here and ultimately defer to your teams for the appropriate course of action.
Thanks for jumping on this so quickly!
Comment 4•12 years ago
How heavy will the disk I/O be?
Reporter
Comment 5•12 years ago
Dan: Random I/O is what we'll see, which has led me in the past to use solid-state drives, although I don't believe that is required here. It will be significant, but the typical bottleneck is RAM since Tableau caches a ton of data in-memory when people are viewing dashboards.
Shyam: Just revisited that bug, I see the frustrations now. My mistake, I thought it was taken care of. The good news is Tableau is working on a linux server version however it probably won't be released until this time next year at best. Let me know how I can help.
Comment 6•12 years ago
(In reply to bsullins from comment #5)
> Shyam: Just revisited that bug, I see the frustrations now. My mistake, I
> thought it was taken care of. The good news is Tableau is working on a linux
> server version however it probably won't be released until this time next
> year at best. Let me know how I can help.
Ideal world scenario - we virtualize everything and are in a better state with peace of mind :)
Less ideal scenario - we find better hardware (that we can monitor properly) and then move to that.
Comment 7•12 years ago
Given the combination of high random disk I/O, high RAM and high CPU needs, this is a classic example of when physical hardware is a better solution.
Comment 8•12 years ago
(In reply to Dan Parsons [:lerxst] from comment #7)
> Given the combination of high random disk I/O, high RAM and high CPU needs,
> this is a classic example of when physical hardware is a better solution.
Ben,
Knowing that hardware > virtual in this case, what's your timeline for this DR project ? :)
Reporter
Comment 9•12 years ago
(In reply to Dan Parsons [:lerxst] from comment #7)
> Given the combination of high random disk I/O, high RAM and high CPU needs,
> this is a classic example of when physical hardware is a better solution.
Agreed. The gateway node at 2 cores and 8GB of RAM w/o the same I/O needs as the workers could certainly be virtual still in this scenario though.
(In reply to Shyam Mani [:fox2mike] from comment #8)
>Ben,
>Knowing that hardware > virtual in this case, what's your timeline for this DR project ? :)
I'm too new to know how long these type of things take here :) I defer to your judgement as to what is reasonable. Also, I've not heard hard and fast requirements from Sylvie on this. Would you be able to propose a reasonable time estimate for this?
Once the machines are stood up, I'll need to configure and install Tableau, which I estimate will take about 1-2 weeks to fully deploy.
Comment 10•12 years ago
(In reply to bsullins from comment #9)
> I'm too new to know how long these type of things take here :) I defer to
> your judgement as to what is reasonable. Also, I've not heard hard and fast
> requirements from Sylvie on this. Would you be able to propose a reasonable
> time estimate for this?
My biggest time sink is going to be testing Windows monitoring on the machines we want to use _before_ we use them. I can make this a Q4 goal for my team and get stuff done in Q4. Does that sound okay to you?
Reporter
Comment 11•12 years ago
(In reply to Shyam Mani [:fox2mike] from comment #10)
> (In reply to bsullins from comment #9)
>
> > I'm too new to know how long these type of things take here :) I defer to
> > your judgement as to what is reasonable. Also, I've not heard hard and fast
> > requirements from Sylvie on this. Would you be able to propose a reasonable
> > time estimate for this?
>
> My biggest time sink is going to be testing Windows monitoring on the
> machines we want to use _before_ we use them. I can make this a Q4 goal for
> my team and get stuff done in Q4. Does that sound okay to you?
Sounds reasonable to me. I will let Annie and Sylvie know on my end.
Comment 12•12 years ago
(In reply to Shyam Mani [:fox2mike] from comment #1)
> Ben,
>
> Before we take any action on this, we need to find hardware that we can
> monitor OR move you to a VM based infrastructure, so we don't need to deal
> with issues that arise out of hardware we can't monitor because this needs
> to run on Windows. We're still having trouble monitoring everything we need
> to wrt Tableau.
>
> My personal preference would be VMs, since that's easier to manage (from an
> SRE perspective) and is already pretty redundant (based on how our VM infra
> is designed) but I'll defer to Dan/Greg since they know how much we can add
> on to the current infrastructure we have.
>
> What's your timeline on this? So we can plan and move forward accordingly.
>
> Thanks!
Agreed, a VM is best; get basic monitoring in place so we can fail it over as needed, and then look at any other components that are scheduled or need to run for that service to be monitored. But before doing that, we need Dan to look at the workload characteristics of this server and determine whether it is a good candidate for VMs (i.e., if it has very high I/O and memory demands, it may not work well...).
Comment 13•12 years ago
I think the gateway VM @ 2 cores and 8GB RAM with low I/O needs is a perfect candidate for virtualization. Should I set that up now or should I wait until the rest of this project is fleshed out, in case other considerations pop up that change how we build the whole system? I'm opting for the latter, but I can do it either way.
Comment 14•12 years ago
The monitoring issues discussed earlier are not a blocker anymore for going bare-metal. Ashish and I worked last night and found an easy way to query iLO 4 via SNMP to get system health. This works regardless of the OS installed on the box, since the entire health information is now available through iLO.
It can take a few days of Nagios tuning, but IMHO, if a physical server is needed, we can get it, as long as it's an HP Gen8.
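As an illustration of the approach described in this comment, here is a minimal sketch of a Nagios-style check that reads one health value from an iLO over SNMP. It is not the check that was actually deployed. The thread does not say which OID was queried, so the OID is passed in by the caller; the function names are hypothetical, and the value mapping (1 = other, 2 = ok, 3 = degraded, 4 = failed) is an assumption about how HP condition objects report status. It relies on the net-snmp `snmpget` command being installed.

```python
import subprocess

# Nagios plugin exit codes (standard plugin convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# ASSUMPTION: the queried object is an HP "condition" integer where
# 1 = other, 2 = ok, 3 = degraded, 4 = failed.
CONDITION_TO_NAGIOS = {
    1: (UNKNOWN, "UNKNOWN - iLO reports condition 'other'"),
    2: (OK, "OK - iLO reports system health ok"),
    3: (WARNING, "WARNING - iLO reports system health degraded"),
    4: (CRITICAL, "CRITICAL - iLO reports system health failed"),
}


def classify(condition):
    """Map an iLO condition integer to an (exit_code, message) pair."""
    fallback = (UNKNOWN, f"UNKNOWN - unexpected condition value {condition}")
    return CONDITION_TO_NAGIOS.get(condition, fallback)


def query_condition(ilo_host, community, oid, timeout=10):
    """Read one integer from the iLO using net-snmp's snmpget.

    -Oqve prints only the value, with enumerations shown as plain numbers.
    The OID is supplied by the caller (hypothetical; not given in the thread).
    """
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqve", ilo_host, oid],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return int(result.stdout.strip())
```

A wrapper script would call `query_condition(...)`, pass the result to `classify(...)`, print the message, and exit with the returned code so Nagios can interpret it. Because the query goes to the iLO rather than the host OS, the same check would apply whether the box runs Windows or Linux.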
Reporter
Comment 15•12 years ago
(In reply to Dan Parsons [:lerxst] from comment #13)
> I think the gateway VM @ 2 cores and 8GB RAM with low I/O needs is a perfect
> candidate for virtualization. Should I set that up now or should I wait
> until the rest of this project is fleshed out, in case other considerations
> pop up that change how we build the whole system? I'm opting for the latter,
> but I can do it either way.
My preference would be to wait until we have the other servers in place.
Reporter
Comment 16•12 years ago
Final specs:
**HA Machines**
* All in same data center.
* Existing Tableau server lives in SCL3.
* Please copy netflows from tableau1.metrics.scl3.mozilla.com
Gateway Machine x1:
* Virtual Machine
* 2 cores, 8GB RAM, 100GB Disk
* OS: Windows Server 64bit (latest available)
Worker Machines x2:
* Physical Machines
* 4 Cores, 16GB RAM, 512GB Disk
* OS: Windows Server 64bit (latest available)
**DR Machine**
* In a geographically different data center than the above HA machines (PHX1 perhaps?)
* Please copy netflows from tableau1.metrics.scl3.mozilla.com
Primary Tableau Server x1:
* 8 Cores, 32GB RAM, 600GB Disk
* OS: Windows Server 64bit (latest available)
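For capacity planning, the final specs above can be written down as data and totalled. The following is only a sketch: the `MachineSpec` type and the role names are hypothetical, and the DR machine is marked "unspecified" because this comment does not say whether it is physical or virtual.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineSpec:
    role: str      # hypothetical label, not a hostname
    count: int     # number of machines of this type
    cores: int     # cores per machine
    ram_gb: int    # RAM per machine, in GB
    disk_gb: int   # disk per machine, in GB
    kind: str      # "virtual", "physical", or "unspecified"


# Values copied from the final specs in this comment.
FINAL_SPECS = [
    MachineSpec("gateway", 1, 2, 8, 100, "virtual"),
    MachineSpec("worker", 2, 4, 16, 512, "physical"),
    MachineSpec("dr-primary", 1, 8, 32, 600, "unspecified"),
]


def totals(specs):
    """Sum cores, RAM, and disk across all requested machines."""
    return {
        "cores": sum(s.count * s.cores for s in specs),
        "ram_gb": sum(s.count * s.ram_gb for s in specs),
        "disk_gb": sum(s.count * s.disk_gb for s in specs),
    }


print(totals(FINAL_SPECS))
```

With these figures the request adds up to 18 cores, 72 GB of RAM, and 1724 GB of disk across four machines.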
Comment 17•12 years ago
We're trying to get the DR machine in place before Nov 16th.
Assignee: server-ops → shyam
Comment 18•12 years ago
How soon do you need the gateway VM up, and what should its full hostname be?
Reporter
Comment 19•12 years ago
For the HA machines, they all need to be live before we can migrate from our current production box (tableau1.metrics.scl3.mozilla.com) to them. For the full hostname can it be: tableau-portal.metrics.scl3.mozilla.com
Reporter
Comment 20•12 years ago
I should add, can the worker naming convention be the following? Down the road if we need to add more workers we can just continue to increment them and add to the cluster.
* tableau-worker1.metrics.scl3.mozilla.com
* tableau-worker2.metrics.scl3.mozilla.com
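The incrementing naming convention proposed here can be sketched as a small helper. The function names are hypothetical; only the hostname pattern and domain come from this comment.

```python
DOMAIN = "metrics.scl3.mozilla.com"  # domain used by the hostnames in this bug


def worker_hostname(index):
    """Return the FQDN for the Nth worker, counting from 1."""
    if index < 1:
        raise ValueError("worker index starts at 1")
    return f"tableau-worker{index}.{DOMAIN}"


def worker_hostnames(count):
    """Return the FQDNs for workers 1..count."""
    return [worker_hostname(i) for i in range(1, count + 1)]


for name in worker_hostnames(2):
    print(name)
```

Adding a third worker later would simply yield `tableau-worker3.metrics.scl3.mozilla.com`, with no renaming of existing hosts.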
Comment 21•12 years ago
tableau-portal.metrics.scl3.mozilla.com is online and a basic (ping) Nagios check has been added. Windows is currently NOT activated because it needs a license. The license was requested in https://mozilla.service-now.com/nav_to.do?uri=com.glideapp.servicecatalog_checkout_view.do?sysparm_sys_id=94a4da3580a4110015184d74e7a0c86b
Password is in metrics gpg file (please svn up to see it).
Comment 22•12 years ago
License key applied to tableau-portal.metrics.scl3.mozilla.com
I am not sure if the other hardware is ready and/or who will be handling that (VM task was assigned to me).
Comment 24•11 years ago
(In reply to bsullins from comment #23)
> @shyam, any update on the HA boxes?
Nope, we haven't gotten any hardware. Is the DR box working as expected? If so, I can get something of a similar spec for your HA.
Flags: needinfo?(shyam)
Reporter
Comment 25•11 years ago
@shyam
- DR box is working w/ the exception of netflows that are missing. Bug 939298 has been submitted for this but we've not seen any progress.
- Re: HA boxes, we need to keep them spec'd as above so we don't have any licensing issues. The only thing I think we might change is physical versus virtual boxes, but I'll defer to you guys on whichever you'd like to go with.
- I believe the DR box is also currently spec'd too large (more than 8 cores), and we may need to disable some of the cores to stay within what our licensing will allow.
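A quick way to see how far a box is over a core limit is sketched below. The cap of 8 is an assumption taken from the "more than 8 cores" remark above, and the function names are hypothetical. Note that `os.cpu_count()` reports logical CPUs, which may differ from the physical core count a license is based on, so treat the result as a rough indicator only.

```python
import os

# ASSUMPTION: licensed core cap inferred from the "more than 8 cores" remark.
LICENSED_CORE_CAP = 8


def cores_over_cap(visible_cores, cap=LICENSED_CORE_CAP):
    """Return how many cores would need disabling to fit under the cap."""
    return max(0, visible_cores - cap)


def describe(visible_cores=None, cap=LICENSED_CORE_CAP):
    """Build a one-line summary for the given (or locally detected) core count."""
    if visible_cores is None:
        # Logical CPUs seen by the OS; may be None on unusual platforms.
        visible_cores = os.cpu_count() or 0
    excess = cores_over_cap(visible_cores, cap)
    if excess:
        return f"{visible_cores} cores visible, {excess} over the cap of {cap}"
    return f"{visible_cores} cores visible, within the cap of {cap}"


print(describe())
```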
Reporter
Comment 26•11 years ago
Discussion with Dan Parsons and Annie. New specs for HA are as follows:
Worker Machines x2:
* Virtual Machines
* 4 Cores, 16GB RAM, 512GB Disk
* OS: Windows Server 64bit (latest available)
Updated•11 years ago
Assignee: shyam → dparsons
Component: Server Operations → Server Operations: Virtualization
QA Contact: shyam → dparsons
Comment 27•11 years ago
Passing to :cknowles to get these VMs deployed
Assignee: dparsons → cknowles
Assignee
Comment 28•11 years ago
A few questions -
* Would you like the 512 GB all on the C drive, or just left for you to slice up as needed?
* Does this need to be in Nagios? (Probably yes, but I thought I'd be certain)
* Anything else you need from us?
If the answers to these questions are no/nothing, then the devices are up right now, answering to the passwords from the sysadmins/gpg/passwords.txt file for the windows VM templates.
Don't hesitate to let me know if you need anything further.
CJK
Reporter
Comment 29•11 years ago
1 - All on the C drive is good
2 - Yes, and we have service checks already set up as well for dataviz.m.o
3 - Automated backups to a separate location
Comment 30•11 years ago
My group doesn't (yet) provide backup services - so we can handle everything else you've asked, but for backups you'll need to speak to Usul, who will know more about getting a Windows system added to the backup system he maintains.
Reporter
Comment 31•11 years ago
cool, will submit a separate bug once we have host names for these
Comment 32•11 years ago
Correction, ludo is the contact for backups.
Assignee
Comment 33•11 years ago
Alright...
1) The C drive is expanded to the full extent of the virtual 512 GB disk.
2) Initial barebones Nagios is in place.
3) Dan has provided contact info for backups.
Do you need anything else here?
CJK
Reporter
Comment 34•11 years ago
Can you join these machines to our Windows domain and also provide login credentials?
Assignee
Comment 35•11 years ago
Certainly - can you provide documentation on the AD and logins to allow me to do that?
Poke me in IRC if you need to convey passwords.
CJK
Assignee
Comment 36•11 years ago
Filed bug 967131 for the netflows needed to allow these machines to join the AD.
CJK
Assignee
Comment 37•11 years ago
Alright, with the recent work for the netflows, these have been joined to the AD and I've verified that the RDP works and it responds to AD accounts.
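For anyone repeating the reachability half of this verification, a minimal sketch follows. It only tests whether a TCP connection to the RDP port succeeds, which is a useful first check that the netflows are open; it does not log in or validate AD accounts. The function name is hypothetical, and 3389 is the default RDP port, assumed here because the thread does not mention a custom one.

```python
import socket

RDP_PORT = 3389  # default RDP port; assumed, not stated in the thread


def tcp_port_open(host, port=RDP_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `tcp_port_open("tableau-worker1.metrics.scl3.mozilla.com")` would return `False` if the flow from the calling network is still blocked.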
Try it out and let me know.
CJK
Assignee
Comment 38•11 years ago
Haven't heard anything recently, though given the complexity, I'm unwilling to simply resolve this - is there anything else that the VM folks can do to help get this working?
CJK
Assignee
Comment 39•11 years ago
Alright, I'm hoping that complete radio silence indicates all is well. Please reopen if there are concerns with the VMs themselves.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → Infrastructure & Operations