Closed Bug 917567 Opened 12 years ago Closed 11 years ago

Please create HA and DR environments for Tableau

Categories

(Infrastructure & Operations :: Virtualization, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bsullins, Assigned: cknowles)

Details

Please create the following machines to support High Availability and Disaster Recovery for Tableau. These machines can be virtual, assuming the requested resources can be dedicated to them. AWS is also an option, which might make the DR scenario a bit easier to configure.

A diagram of these architectures can be found here: http://prezi.com/2czygd2zhwue/?utm_campaign=share&utm_medium=copy

Details about Tableau's HA/DR support can be found here: http://onlinehelp.tableausoftware.com/v8.0/server/en-us/distrib_ha_intro.htm

**HA Machines**
* All in same data center.
* Existing Tableau server lives in SCL3.
* Please copy netflows from tableau1.metrics.scl3.mozilla.com

Gateway Machine x1:
* 4 cores, 16GB RAM, 100GB Disk
* OS: Windows Server (latest available)

Worker Machines x2:
* 4 Cores, 16GB RAM, 300GB Disk
* OS: Windows Server (latest available)

**DR Machine**
* In a geographically different data center than the above HA machines
* Please copy netflows from tableau1.metrics.scl3.mozilla.com

Primary Tableau Server x1:
* 8 Cores, 32GB RAM, 600GB Disk
* OS: Windows Server (latest available)
Ben,

Before we take any action on this, we need to find hardware that we can monitor OR move you to a VM based infrastructure, so we don't need to deal with issues that arise out of hardware we can't monitor because this needs to run on Windows. We're still having trouble monitoring everything we need to wrt Tableau.

My personal preference would be VMs, since that's easier to manage (from an SRE perspective) and is already pretty redundant (based on how our VM infra is designed) but I'll defer to Dan/Greg since they know how much we can add on to the current infrastructure we have.

What's your timeline on this? So we can plan and move forward accordingly.

Thanks!
We could possibly do the 1x gateway machine and 2x worker machines as VMs in scl3, but with the need for 4 cores and 16GB RAM per VM, the preference would be a physical server. If necessary, we could do VMs though. For the primary server (with 8 cores and 32GB RAM), we would definitely suggest using a physical server for that.
Dan: After looking at this more closely, I realize we could scale down the gateway machine to 2 cores and 8GB of RAM. Also, I should add that these machines should be 64bit (I inadvertently assumed this as the default). I'll update the diagram accordingly. I'm open to whichever option your team prefers, physical or VM, as long as the appropriate resources are allocated.

Shyam: Maybe we can chat offline about the monitoring. I was under the impression this is all up and running as noted here: https://bugzilla.mozilla.org/show_bug.cgi?id=874203 I worked closely with rbryce to get nagios monitoring for the services (found here: https://dataviz.mozilla.org/admin/systeminfo) and I believe he also set up some Windows system level monitors.

I'm happy to help with anything I can here and ultimately defer to your teams for the appropriate course of action. Thanks for jumping on this so quickly!
How heavy will the disk I/O be?
Dan: Random I/O is what we'll see, which has led me in the past to use solid-state drives, although I don't believe that is required here. It will be significant, but the typical bottleneck is RAM since Tableau caches a ton of data in-memory when people are viewing dashboards.

Shyam: Just revisited that bug, I see the frustrations now. My mistake, I thought it was taken care of. The good news is Tableau is working on a Linux server version; however, it probably won't be released until this time next year at best. Let me know how I can help.
(In reply to bsullins from comment #5)
> Shyam: Just revisited that bug, I see the frustrations now. My mistake, I
> thought it was taken care of. The good news is Tableau is working on a linux
> server version however it probably won't be released until this time next
> year at best. Let me know how I can help.

Ideal world scenario - we virtualize everything and are in a better state with peace of mind :)

Less ideal scenario - we find better hardware (that we can monitor properly) and then move to that.
Given the combination of high random disk I/O, high RAM and high CPU needs, this is a classic example of when physical hardware is a better solution.
(In reply to Dan Parsons [:lerxst] from comment #7)
> Given the combination of high random disk I/O, high RAM and high CPU needs,
> this is a classic example of when physical hardware is a better solution.

Ben,

Knowing that hardware > virtual in this case, what's your timeline for this DR project? :)
(In reply to Dan Parsons [:lerxst] from comment #7)
> Given the combination of high random disk I/O, high RAM and high CPU needs,
> this is a classic example of when physical hardware is a better solution.

Agreed. The gateway node at 2 cores and 8GB of RAM w/o the same I/O needs as the workers could certainly be virtual still in this scenario though.

(In reply to Shyam Mani [:fox2mike] from comment #8)
> Ben,
> Knowing that hardware > virtual in this case, what's your timeline for this DR project ? :)

I'm too new to know how long these type of things take here :) I defer to your judgement as to what is reasonable. Also, I've not heard hard and fast requirements from Sylvie on this. Would you be able to propose a reasonable time estimate for this?

Once the machines are stood up I'll have the work of doing the config and installations of Tableau that I would estimate at about 1-2 weeks to be fully deployed.
(In reply to bsullins from comment #9)
> I'm too new to know how long these type of things take here :) I defer to
> your judgement as to what is reasonable. Also, I've not heard hard and fast
> requirements from Sylvie on this. Would you be able to propose a reasonable
> time estimate for this?

My biggest time sink is going to be testing Windows monitoring on the machines we want to use _before_ we use them. I can make this a Q4 goal for my team and get stuff done in Q4. Does that sound okay to you?
(In reply to Shyam Mani [:fox2mike] from comment #10)
> (In reply to bsullins from comment #9)
>
> > I'm too new to know how long these type of things take here :) I defer to
> > your judgement as to what is reasonable. Also, I've not heard hard and fast
> > requirements from Sylvie on this. Would you be able to propose a reasonable
> > time estimate for this?
>
> My biggest time sink is going to be testing Windows monitoring on the
> machines we want to use _before_ we use them. I can make this a Q4 goal for
> my team and get stuff done in Q4. Does that sound okay to you?

Sounds reasonable to me. I will let Annie and Sylvie know on my end.
(In reply to Shyam Mani [:fox2mike] from comment #1)
> Ben,
>
> Before we take any action on this, we need to find hardware that we can
> monitor OR move you to a VM based infrastructure, so we don't need to deal
> with issues that arise out of hardware we can't monitor because this needs
> to run on Windows. We're still having trouble monitoring everything we need
> to wrt Tableau.
>
> My personal preference would be VMs, since that's easier to manage (from an
> SRE perspective) and is already pretty redundant (based on how our VM infra
> is designed) but I'll defer to Dan/Greg since they know how much we can add
> on to the current infrastructure we have.
>
> What's your timeline on this? So we can plan and move forward accordingly.
>
> Thanks!

Agreed that a VM is best; we should get basic monitoring in place so it can be failed over as needed, and then look at any other components that are scheduled or need to run for that service to be monitored. But before doing that, we need Dan to look at the workload characteristics of this server and determine whether it is a good candidate for VMs (i.e. with very high I/O and memory demands, it may not work well).
I think the gateway VM @ 2 cores and 8GB RAM with low I/O needs is a perfect candidate for virtualization. Should I set that up now or should I wait until the rest of this project is fleshed out, in case other considerations pop up that change how we build the whole system? I'm opting for the latter, but I can do it either way.
The monitoring issues discussed earlier are no longer a blocker for going bare-metal. Ashish and I worked last night and found an easy way to query iLO 4 via SNMP to get system health. This works regardless of the OS installed on the box, since the entire health information is now available through iLO. It can take a few days of Nagios tuning, but IMHO, if a physical server is needed, we can get it, as long as it's an HP Gen8.
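A minimal sketch of how an iLO health reading could be turned into a Nagios result. It assumes the SNMP query has already returned an integer condition code, and that the code follows the convention 1 = other, 2 = ok, 3 = degraded, 4 = failed. That mapping and the function name `ilo_condition_to_nagios` are assumptions for illustration and are not taken from this bug; only the Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are standard.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Assumed condition codes returned by the iLO SNMP query (hypothetical mapping).
CONDITION_TO_STATE = {
    1: (UNKNOWN, "other"),
    2: (OK, "ok"),
    3: (WARNING, "degraded"),
    4: (CRITICAL, "failed"),
}


def ilo_condition_to_nagios(condition):
    """Return (exit_code, status_line) for a raw iLO condition value."""
    # Anything outside the assumed mapping is reported as UNKNOWN.
    code, label = CONDITION_TO_STATE.get(condition, (UNKNOWN, "unrecognized"))
    line = f"{STATE_NAMES[code]} - iLO system health: {label} ({condition})"
    return code, line
```

A wrapper plugin would print the status line and exit with the returned code, so the check behaves the same whatever OS is installed on the box.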
(In reply to Dan Parsons [:lerxst] from comment #13)
> I think the gateway VM @ 2 cores and 8GB RAM with low I/O needs is a perfect
> candidate for virtualization. Should I set that up now or should I wait
> until the rest of this project is fleshed out, in case other considerations
> pop up that change how we build the whole system? I'm opting for the latter,
> but I can do it either way.

My preference would be to wait until we have the other servers in place.
Final specs:

**HA Machines**
* All in same data center.
* Existing Tableau server lives in SCL3.
* Please copy netflows from tableau1.metrics.scl3.mozilla.com

Gateway Machine x1:
* Virtual Machine
* 2 cores, 8GB RAM, 100GB Disk
* OS: Windows Server 64bit (latest available)

Worker Machines x2:
* Physical Machines
* 4 Cores, 16GB RAM, 512GB Disk
* OS: Windows Server 64bit (latest available)

**DR Machine**
* In a geographically different data center than the above HA machines (PHX1 perhaps?)
* Please copy netflows from tableau1.metrics.scl3.mozilla.com

Primary Tableau Server x1:
* 8 Cores, 32GB RAM, 600GB Disk
* OS: Windows Server 64bit (latest available)
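For capacity planning, the final specs can be written down as data and summed. This is only a sketch: the `MachineSpec` class, its field names, and the `totals` helper are hypothetical, while the numbers are copied from the specs in this comment. The DR machine is left as `virtual=None` because the comment does not say whether it is physical or virtual.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MachineSpec:
    role: str
    count: int
    cores: int
    ram_gb: int
    disk_gb: int
    virtual: Optional[bool]  # None means the comment does not specify


# Values copied from the final specs above.
HA_SPECS = [
    MachineSpec("gateway", count=1, cores=2, ram_gb=8, disk_gb=100, virtual=True),
    MachineSpec("worker", count=2, cores=4, ram_gb=16, disk_gb=512, virtual=False),
]
DR_SPECS = [
    MachineSpec("primary", count=1, cores=8, ram_gb=32, disk_gb=600, virtual=None),
]


def totals(specs):
    """Sum cores, RAM and disk across a list of specs, honouring counts."""
    return {
        "cores": sum(s.count * s.cores for s in specs),
        "ram_gb": sum(s.count * s.ram_gb for s in specs),
        "disk_gb": sum(s.count * s.disk_gb for s in specs),
    }


print(totals(HA_SPECS))  # {'cores': 10, 'ram_gb': 40, 'disk_gb': 1124}
print(totals(DR_SPECS))  # {'cores': 8, 'ram_gb': 32, 'disk_gb': 600}
```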
We're trying to get the DR machine in place before Nov 16th.
Assignee: server-ops → shyam
How soon do you need the gateway VM up, and what should its full hostname be?
For the HA machines, they all need to be live before we can migrate from our current production box (tableau1.metrics.scl3.mozilla.com) to them. For the full hostname can it be: tableau-portal.metrics.scl3.mozilla.com
I should add, can the worker naming convention be the following? Down the road if we need to add more workers we can just continue to increment them and add to the cluster.

* tableau-worker1.metrics.scl3.mozilla.com
* tableau-worker2.metrics.scl3.mozilla.com
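The incrementing convention proposed here can be sketched as a small helper. The function name `worker_hostnames` and its parameters are hypothetical; the hostname pattern and domain come from this comment.

```python
def worker_hostnames(count, start=1, domain="metrics.scl3.mozilla.com"):
    """Return `count` worker hostnames numbered upward from `start`."""
    if count < 0 or start < 1:
        raise ValueError("count must be >= 0 and start must be >= 1")
    return [f"tableau-worker{i}.{domain}" for i in range(start, start + count)]
```

For example, `worker_hostnames(2)` yields the two names listed above, and `worker_hostnames(1, start=3)` would give the next worker to add to the cluster.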
tableau-portal.metrics.scl3.mozilla.com is online and a basic (ping) nagios check has been added. Windows is currently NOT activated, as it needs a license. The license was requested in https://mozilla.service-now.com/nav_to.do?uri=com.glideapp.servicecatalog_checkout_view.do?sysparm_sys_id=94a4da3580a4110015184d74e7a0c86b

Password is in the metrics gpg file (please svn up to see it).
License key applied to tableau-portal.metrics.scl3.mozilla.com.

I am not sure if the other hardware is ready and/or who will be handling that (VM task was assigned to me).
@shyam, any update on the HA boxes?
Flags: needinfo?(shyam)
(In reply to bsullins from comment #23)
> @shyam, any update on the HA boxes?

Nope, we haven't gotten any hardware. Is the DR box working as expected? If so, I can get something of a similar spec for your HA.
Flags: needinfo?(shyam)
@shyam
- The DR box is working, with the exception of the netflows, which are missing. Bug 939298 has been submitted for this but we've not seen any progress.
- Re: HA boxes, we need to keep them spec'd as above so we don't have any licensing issues. The only thing I think we might change is physical versus virtual boxes, but I'll defer to you guys on whichever you'd like to go with.
- I believe the DR box is also currently spec'd too large (more than 8 cores), and we may need to disable some of the cores to stay within what our licensing will allow.
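The core-count concern reduces to simple arithmetic. The sketch below assumes the licensed limit is 8 cores, which is inferred from the "more than 8 cores" remark rather than stated outright; the function name `cores_to_disable` is hypothetical.

```python
def cores_to_disable(active_cores, licensed_cores=8):
    """Return how many cores must be disabled to fit within the license."""
    if active_cores < 0 or licensed_cores < 0:
        raise ValueError("core counts must be non-negative")
    # Nothing to disable when the box is already at or under the limit.
    return max(0, active_cores - licensed_cores)
```

For a box with 12 active cores, `cores_to_disable(12)` returns 4; a box with 8 or fewer returns 0.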
Discussion with Dan Parsons and Annie. New specs for HA are as follows:

Worker Machines x2:
* Virtual Machines
* 4 Cores, 16GB RAM, 512GB Disk
* OS: Windows Server 64bit (latest available)
Assignee: shyam → dparsons
Component: Server Operations → Server Operations: Virtualization
QA Contact: shyam → dparsons
Passing to :cknowles to get these VMs deployed
Assignee: dparsons → cknowles
A few questions -

* Would you like the 512 G all on C drive, or just left for you to slice up as needed?
* Does this need to be in Nagios? (Probably yes, but I thought I'd be certain)
* Anything else you need from us?

If the answers to these questions are no/nothing, then the devices are up right now, answering to the passwords from the sysadmins/gpg/passwords.txt file for the windows VM templates.

Don't hesitate to let me know if you need anything further.

CJK
1 - All on the C drive is good
2 - Yes, and we have service checks already set up as well for dataviz.m.o
3 - Automated backups to a separate location
My group doesn't (yet) provide backup services - so we can handle everything else you've asked, but for backups you'll need to speak to Usul, who will know more about getting a Windows system added to the backup system he maintains.
cool, will submit a separate bug once we have host names for these
Correction, ludo is the contact for backups.
Alright...

1) C drive is expanded to the full extent of the virtual 512G disk.
2) Initial barebones Nagios is in place.
3) Dan has provided contact info for backups.

Do you need anything else here?

CJK
Can you join these machines to our Windows domain and also provide login credentials?
Certainly - can you provide documentation on the AD and logins to allow me to do that? Poke me in IRC if you need to convey passwords. CJK
filed 967131 for the netflows to allow this to join the AD. CJK
Alright, with the recent work for the netflows, these have been joined to the AD and I've verified that the RDP works and it responds to AD accounts. Try it out and let me know. CJK
Haven't heard anything recently, though given the complexity, I'm unwilling to simply resolve this - is there anything else that the VM folks can do to help get this working? CJK
Alright, I'm hoping that complete radio silence indicates all is well. Please reopen if there are concerns with the VMs themselves.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations