Closed Bug 849041 Opened 11 years ago Closed 10 years ago

Create a load-balanced cluster of machines to replace bm-remote-talos-webhosts

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Platform: x86 Linux
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: arich)

References

Details

(Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/96] )

Once the tegras are chassis-mounted in a new colo (bug 666044), we'll need to stand up a replacement for the current set of bm-remote-talos-webhost machines that serve talos pages for performance tests on the tegras.

The webhosts are currently Mac minis (r2?), so it's probably in everyone's best interest to use a set of simple Linux machines as replacements. Aki says the only requirement is that the machines behind the load balancer are as identical configuration-wise as possible to avoid skew in the tegra performance numbers depending on which machine serves the pages.
After discussion, we're waiting on more specific requirements for this web cluster from releng: does it need to be colocated with the pandas or tegras, what is needed other than a vanilla web app, can we just put it on the generic web cluster, etc.
Assignee: server-ops-releng → arich
Flags: needinfo?(coop)
(In reply to Amy Rich [:arich] [:arr] from comment #1)
> After discussion, we're waiting on more specific requirements for this web
> cluster from releng: does it need to be colocated with the pandas or
> tegras, what is needed other than a vanilla web app, can we just put it on
> the generic web cluster, etc.

This is a web cluster that will serve Talos webpages.

The /var/www/html/ from the current bm-remote-talos-webhost-01 should work.  We occasionally need to update the repo in /var/www/html/talos-repo or add a new tp pageset (currently /var/www/html/tp4).
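
For concreteness, that occasional maintenance amounts to something like the sketch below. This is only an illustration, not an actual maintenance script; the paths are the ones mentioned in this comment, and the tp5 tarball name is purely hypothetical.

import subprocess

DOCROOT = "/var/www/html"

# Pull and update the existing talos clone inside the docroot.
subprocess.check_call(["hg", "pull", "-u"], cwd=DOCROOT + "/talos-repo")

# Adding a new tp pageset would just mean unpacking it alongside tp4, e.g.:
# subprocess.check_call(["tar", "xjf", "tp5.tar.bz2", "-C", DOCROOT])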

All that's needed on the serving side is Apache pointed at that docroot; it's just serving webpages.

The main concern here is that this serves webpages for a performance test: any fluctuation in response time will skew the test's numbers. For that reason we went with identical hardware, load balanced, in the same physical location as the devices.

Adding it to the generic web cluster sounds like the numbers could be affected by traffic to the other sites hosted there; is that correct? If so, that would make the numbers useless.

Also, we have had multiple talks with legal about whether we can make the tpX pagesets public; last I heard we shouldn't, for tp4 at least. So, at the very least, the tp4/ URL should be restricted to internal access.
Flags: needinfo?(coop)
So anytime you go through the load balancer, there's going to be some possible timing/load discrepancy.  I'm ccing jakem and cshields on this since this is likely going to be more web than relops.
Also, trying to figure out what code runs here.  This wiki page details what information is needed to stand up a web service (at the very least, we're going to do something real and supported instead of throwing three desktop machines behind a cheap lb).  Could you please take a look (especially at the last three sections) and fill out the details to help us get an idea of what this web cluster will need to support?

https://wiki.mozilla.org/Websites/Processes/New_Website
This is going to be test infrastructure, not a human-visited website, so I'm not sure how much that applies.

We basically just softlinked the web directories/files from http://hg.mozilla.org/build/talos into webroot.

Looks like the revision of talos currently in use on bm-remote is 97cbf16e9846.

So talos-repo is a clone of talos at the above revision:

lrwxrwxrwx   1 root root   18 2010-03-18 01:32 getInfo.html -> talos/getInfo.html
lrwxrwxrwx   1 root root   20 2010-03-18 01:32 page_load_test -> talos/page_load_test
lrwxrwxrwx   1 root root   13 2010-10-08 16:10 scripts -> talos/scripts
lrwxrwxrwx   1 root root   18 2010-03-18 01:35 startup_test -> talos/startup_test
lrwxrwxrwx   1 root root   16 2011-12-01 09:03 talos -> talos-repo/talos
drwxr-xr-x   5 root root 4096 2012-10-04 11:12 talos-repo

(we also have a softlink to tp4 from page_load_test/).
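
A rough sketch of how that layout could be reproduced on a replacement host, assuming a fresh clone of http://hg.mozilla.org/build/talos pinned to the revision noted above. This is a reconstruction from the listing, not the script actually used, and the non-distributable directories still have to be copied in by hand:

import os
import subprocess

DOCROOT = "/var/www/html"
TALOS_REV = "97cbf16e9846"

# Clone talos at the pinned revision into the docroot.
subprocess.check_call(["hg", "clone", "-u", TALOS_REV,
                       "http://hg.mozilla.org/build/talos",
                       os.path.join(DOCROOT, "talos-repo")])

# Recreate the softlinks shown in the listing above.
os.symlink("talos-repo/talos", os.path.join(DOCROOT, "talos"))
for name in ("getInfo.html", "page_load_test", "scripts", "startup_test"):
    os.symlink(os.path.join("talos", name), os.path.join(DOCROOT, name))

# tp4/, mobile_tp4/, and tegra/ are copied over separately (not
# redistributable / not in the repo); page_load_test/ then gets its
# tp4 softlink.
os.symlink(os.path.join(DOCROOT, "tp4"),
           os.path.join(DOCROOT, "page_load_test", "tp4"))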


These next two directories are for Tp; they contain the scrubbed internet pagesets that we can't distribute. We can zip them up, or try to find the original source tarballs.

drwxr-xr-x  23 root root 4096 2011-04-12 03:56 mobile_tp4
drwxr-xr-x 102 root root 4096 2010-03-18 01:23 tp4


Looks like this directory has the host_utils zip, which the tegras download and unzip.  This directory predates tooltool, so we could potentially move this there.

drwxr-xr-x   2 root root 4096 2012-07-17 15:17 tegra
Jakem: from discussion with Dustin, it seems like the pertinent point here is that it's all static pages these hosts are serving, no dynamic content at all.  So it's just a vanilla web server cluster.  It would be a straight forklift of the docroot onto a set of other machines.
Basically what Aki's saying is that this site is just static data - no code executes here.

We should set this up the same way we do for any other website, to be consistent.  That means dev/stage/prod.

I'll leave discussion of holding down timing jitter to the webops folks.  The requirement is that, from the perspective of the test devices, the time for each HTTP request is stable.

Aki, what is the timing resolution for the tests as a whole?  That will give a better window for acceptable timing variance in the resulting implementation.  We should also run some timing tests against the existing bm-remote-talos-webhost stuff as a baseline, since that's been doing fine for years.
(In reply to Dustin J. Mitchell [:dustin] from comment #7)
> Aki, what is the timing resolution for the tests as a whole?  That will give
> a better window for acceptable timing variance in the resulting
> implementation.  We should also run some timing tests against the existing
> bm-remote-talos-webhost stuff as a baseline, since that's been doing fine
> for years.

Aiui:

I *think* each individual pageload measures down to milliseconds, since that's JavaScript's smallest time measurement.

Test suites, each made up of a great many of these page loads, run in the tens of minutes.

CCing Joel, who knows way more about Talos than I do.
For a little back-of-the-envelope math, I did some sampling against a zlb-fronted static page:

>>> import math, time, urllib2
>>> def average(xs):   # average() wasn't defined in the transcript; a plain mean suffices
...   return sum(xs) / len(xs)
...
>>> def t():
...   start = time.time()
...   urllib2.urlopen('http://update-tests.pub.build.mozilla.org/pub/mozilla.org/firefox/nightly/2.0.0.16-candidates/build1/linux_info.txt').read()
...   return time.time() - start
...
>>> times = [ t() for _ in xrange(1000) ]
>>> avg = average(times)
>>> math.sqrt(average([(x-avg)**2 for x in times]))
0.31554827951269065

Interestingly, three of those were a few ms over 5s -- perhaps some kind of DDOS protection?  Without those outliers:

>>> times = [ x for x in times if x < 5 ]   # drop the three ~5s outliers
>>> avg = average(times)
>>> math.sqrt(average([(x-avg)**2 for x in times]))
0.0050805448143233711

So outliers aside, this zlb is showing a 5ms standard deviation right now.
We measure each page in ms and have accepted that there will be some slight variation when loading pages over the network.

I think the original design was to have bm-remote as close, network-wise, to the tegras as possible; now that we have panda boards hitting them too, we really should rethink that :)

5ms variance doesn't seem so bad, as long as that is the variance with 500 users hitting it!
Cool, good to know.  That was from the same datacenter but different VLAN, by the way.

We should do some more realistic measuring of both the existing bm-remote setup and any proposed solution (a rough sketch of such a test follows below).  I think 'ab' (ApacheBench) can do this, but I'm sure webops will know more.

Webops folks: relops is happy to help with implementation, but we need your expertise before we get to that stage :)
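
Not ab itself, but here's a rough Python 2 sketch of the kind of "more realistic" measurement meant above: a number of concurrent clients fetching one static URL and reporting the latency distribution. The URL and client counts are placeholders, not the values we'd actually use.

import math
import threading
import time
import urllib2

URL = "http://bm-remote-talos-webhost-01/getInfo.html"   # placeholder
CLIENTS = 50
REQUESTS_PER_CLIENT = 200

times = []
lock = threading.Lock()

def worker():
    for _ in xrange(REQUESTS_PER_CLIENT):
        start = time.time()
        urllib2.urlopen(URL).read()
        elapsed = time.time() - start
        with lock:
            times.append(elapsed)

threads = [threading.Thread(target=worker) for _ in xrange(CLIENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

times.sort()
avg = sum(times) / len(times)
stddev = math.sqrt(sum((x - avg) ** 2 for x in times) / len(times))
print "n=%d avg=%.4fs stddev=%.4fs p95=%.4fs max=%.4fs" % (
    len(times), avg, stddev, times[int(0.95 * len(times))], times[-1])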
Component: Server Operations: RelEng → Server Operations: Web Operations
QA Contact: arich → nmaul
It occurs to me that those 5s outliers may correspond to lost SYNs.  Normal TCP tuning puts that retransmit delay at about 5s.
I ran a similar script from the admin hosts in mtv1 against the current bm-remote cluster.  The results are:

n = 191000
0.00540089607239
0.00542402267456
0.0054349899292
0.00544500350952
0.00545597076416
0.00545811653137
0.0054669380188
0.00546789169312
0.00547790527344
0.0054829120636
..
3.08795404434
3.0880279541
3.08832812309
3.0947868824
3.10169887543
3.10897707939
3.11571097374
3.11717009544
3.12011480331
3.12180399895
avg 0.0119746297105
std dev 0.0523291610228

It's worth noting that the std dev didn't settle down until well over n=10000. I don't know what to make of the 3s+ times.  It looks like a std dev of ~50ms, but I'm guessing the distribution is highly bimodal, so that's not particularly representative.  Sadly, I didn't capture the set of times for further analysis.

So, I'll stop throwing up statistics here, unless someone would like me to repeat this analysis in a different way.  As I said before, I think ab can do a better job of generating such statistics, but I'll let the pros handle that.
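
If anyone does want to repeat this, here is a rough sketch of what a re-run could look like while keeping the raw samples and a running standard deviation (Welford's algorithm), so the slow convergence and the 3s+ tail can be examined afterwards. It is not the script that produced the numbers above, and the URL is a placeholder for whichever host is being measured.

import time
import urllib2

URL = "http://bm-remote-talos-webhost-01/getInfo.html"   # placeholder
N = 191000

n = 0
mean = 0.0
m2 = 0.0    # running sum of squared deviations (Welford's algorithm)

with open("samples.txt", "w") as out:
    for _ in xrange(N):
        start = time.time()
        urllib2.urlopen(URL).read()
        x = time.time() - start
        out.write("%.6f\n" % x)    # keep every sample for later analysis

        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)

        if n % 10000 == 0:
            print "n=%d mean=%.6f stddev=%.6f" % (n, mean, (m2 / n) ** 0.5)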
What is the app code here?  I'm guessing it's something simple.  Could we have someone from webops work with the app developer on pushing it to the dev PaaS instance, so that some of these same timing tests can be run against it and we can see whether it's a viable candidate for this use?

I'm hesitant to build something dedicated when a lot of our effort is going into moving to that system.
There's no code - the content is static.  Jake and I just talked about this and I think we have some ideas on how to proceed, but we'll definitely need to do some prototyping and measuring.
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/96]
We moved this to the relengweb cluster a while back.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard