Closed Bug 882988 Opened 11 years ago Closed 11 years ago

Install WSGI app on new DXR webheads

Categories

(Developer Services :: General, task)

Hardware: x86_64, Other
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: erik, Assigned: fubar)

References

Details

Attachments

(2 files)

Now that we have our new stage and prod webheads, the next step is to get the DXR web app running on them. This involves:

* Installing DXR in a layout that'll be compatible with continuous deployment. See /data on the current prod box for something you can basically copy. This will involve getting `make` to run at the root of DXR's checkout, which means the trilite and clang plugin build deps will have to be around. (This will be an important step beyond what we have on prod right now--I can't get the sqlite version conflicts worked out on that old RHEL, so we can never update trilite.)

* Configuring Apache well. You can crib my config off the prod box. It's pretty solid as far as WSGI config goes, modulo the hardware params on the new boxes. The WSGI reload configuration is already set up just so: it supports CD nicely. Beyond what I already have set up, we should turn on mod_gzip (mod_deflate on Apache 2.2) and whatever other niceties are standard for us. Our users pull big textish docs and will appreciate the speedup. (A quick curl check for the compression bit is below this list.)

* Grabbing the "/data/instances/.../target" folder off the current prod box or just building a small, proof-of-concept one. Getting the real index on there is a separate problem from this bug.
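
(A quick way to confirm the compression bit is actually on once it's configured -- just a sanity check from any client box; any search URL works:)

$ # should print "Content-Encoding: gzip" if compression is enabled
$ curl -s -o /dev/null -D - -H 'Accept-Encoding: gzip' \
    'http://dxr.mozilla.org/mozilla-central/search?q=mainScreen' | grep -i content-encoding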

As I said, a lot of this can be cribbed from the work I already did on the current dxr.mozilla.org. Take a look in /data, in mozbuild's crontab, and in the Apache config. I think those places are comprehensive.
Actually, especially since we have 2 prod webheads, we shouldn't do the building (make) on the servers. We should do it on the Jenkins box or on the admin node (the latter of which might be easier since we'll need RHEL or at least CentOS). The same work still has to be done, but it moves from the webheads to a different box.
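
Roughly, the off-webhead build would look something like this (a sketch only; the paths and the webhead hostname are illustrative):

$ cd /data/dxr-prod/dxr      # wherever the CD checkout lands
$ git pull
$ make                       # needs the trilite and clang plugin build deps on this box
$ # the webheads then only need the built checkout -- via a shared mount,
$ # or an explicit push:
$ rsync -a --delete /data/dxr-prod/dxr/ dxr-web1:/data/dxr-prod/dxr/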
Blocks: 882991
Would someone mind saying where this stands in the queue so I can update my quarterly goals list? Many thanks!
Coincidences aside, I was just digging into this today. Since we still have an open question about serving data from NFS, I was going to set up a test site on the build box (which has a test NFS mount available as well as sufficient local disk), using both NFS and local disk, to see how they compare and whether the build-box tarball looks the same as the one generated by buildbot.
Ah, I forgot all about the NFS/local question. Kudos to you for having a look.
After a bit of faffing about with python versions, I'm running into the following error:

[Tue Sep 03 20:35:37 2013] [error] [client 127.0.0.1] mod_wsgi (pid=23429): Target WSGI script '/data/dxr-prod/dxr/dxr/wsgi.py' cannot be loaded as Python module.
[Tue Sep 03 20:35:37 2013] [error] [client 127.0.0.1] mod_wsgi (pid=23429): Exception occurred processing WSGI script '/data/dxr-prod/dxr/dxr/wsgi.py'.
[Tue Sep 03 20:35:37 2013] [error] [client 127.0.0.1] Traceback (most recent call last):
[Tue Sep 03 20:35:37 2013] [error] [client 127.0.0.1]   File "/data/dxr-prod/dxr/dxr/wsgi.py", line 1, in <module>
[Tue Sep 03 20:35:37 2013] [error] [client 127.0.0.1]     from dxr.app import make_app
[Tue Sep 03 20:35:37 2013] [error] [client 127.0.0.1] ImportError: No module named dxr.app

I *think* all of the stuff in /data on dxr-processor1 is set up similarly to the old box, so I'm not sure what's tripping it up. Any ideas?
Belay that; something was wonky with the virtualenv. The WSGI app is currently working on dxr-processor1 with a locally-built m-c tree.

Is there something in the dxr code that's checking for dxr.m.o? I can't access it via the hostname, but fudging dxr.m.o in my /etc/hosts makes it work. Mostly just curious, though it'd make it easier to compare to prod if I didn't have to do that.
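
(For posterity, a quick way to rule the virtualenv in or out the next time that ImportError shows up -- the venv path here is a guess:)

$ # install dxr into the venv if it isn't already there
$ /data/dxr-prod/venv/bin/pip install -e /data/dxr-prod/dxr
$ # this is exactly the import wsgi.py does; silence means success
$ /data/dxr-prod/venv/bin/python -c 'from dxr.app import make_app'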
Attached file prod.tsv
Raw tab-separated data in the attached prod.tsv; data from production dxr.mozilla.org.

sekrit$ ab -g prod.tsv -c 4 -n 400 "http://dxr.mozilla.org/mozilla-central/search?q=mainScreen&redirect=false"

Server Software:        Apache/2.2.3
Server Hostname:        dxr.mozilla.org
Server Port:            80

Document Path:          /mozilla-central/search?q=mainScreen&redirect=false
Document Length:        24131 bytes

Concurrency Level:      4
Time taken for tests:   61.955 seconds
Complete requests:      400
Failed requests:        0
Write errors:           0
Total transferred:      9742400 bytes
HTML transferred:       9652400 bytes
Requests per second:    6.46 [#/sec] (mean)
Time per request:       619.548 [ms] (mean)
Time per request:       154.887 [ms] (mean, across all concurrent requests)
Transfer rate:          153.56 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      111  539 317.4    445    2196
Processing:     0   81  65.7     63     432
Waiting:        0   76  65.1     58     432
Total:        468  619 363.6    508    2426

Percentage of the requests served within a certain time (ms)
  50%    508
  66%    515
  75%    526
  80%    535
  90%    726
  95%   1538
  98%   2172
  99%   2339
 100%   2426 (longest request)
Attached file buildbox.tsv
Data from the test site on dxr-processor1 (using /etc/hosts tricks), served from NFS (prod was local disk); raw data in the attached buildbox.tsv.

sekrit$ ab -g buildbox.tsv -c 4 -n 400 "http://dxr.mozilla.org/mozilla-central/search?q=mainScreen&redirect=false" 


Server Software:        Apache
Server Hostname:        dxr.mozilla.org
Server Port:            80

Document Path:          /mozilla-central/search?q=mainScreen&redirect=false
Document Length:        22055 bytes

Concurrency Level:      4
Time taken for tests:   56.001 seconds
Complete requests:      400
Failed requests:        0
Write errors:           0
Total transferred:      8916400 bytes
HTML transferred:       8822000 bytes
Requests per second:    7.14 [#/sec] (mean)
Time per request:       560.010 [ms] (mean)
Time per request:       140.002 [ms] (mean, across all concurrent requests)
Transfer rate:          155.49 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       99  107   7.2    105     204
Processing:   362  452  34.6    451     609
Waiting:      259  343  32.5    341     463
Total:        462  559  36.3    556     713

Percentage of the requests served within a certain time (ms)
  50%    556
  66%    573
  75%    579
  80%    582
  90%    604
  95%    618
  98%    645
  99%    688
 100%    713 (longest request)
I ran a longer test (8 concurrent, 1600 requests) where prod took
argh. 

I ran a longer test (8 concurrent, 1600 requests) where prod took ~5 times longer than the new box. All of my other, shorter, tests were very close, but that one stood out. I may run another few large tests, but the tl;dr is that it looks like NFS will be fine for serving data. Concur?
Thanks for running the benches!

You've established that the new box has a lot more CPU than the old. That's awesome.

Have we nailed down anything about IO speed, though? Wouldn't all but the first hit use the disk cache and avoid the network altogether? I think we'd need to run different queries for each request to hit new parts of the DB each time.
I'll dig up more numbers for you, but a first pass shows consistent I/O numbers for NFS traffic, which would indicate no caching (and I'm not sure that RHEL does any sort of caching for NFS traffic by default). OTOH, comparing it against the local HW RAID array, which DOES cache, the NFS test shows almost exactly the same numbers.
Ok, more tests run; one minor issue, though, is that I can't get iostat to report any data for I/O to the RAID array. There's cache, but not enough to cache *everything*, so I'm not sure what's up. In any case, I do have useful numbers from nfsiostat as well as ab. As far as NFS caching goes, unless fs-cache/cachefs are set up, the client only caches metadata and pending writes.

This test is three concurrent loops of single queries, each using a random search string (I grabbed 3 arbitrary files from m-c, pulled out words > 4 chars, and randomly chose 50 words from each), so 150 queries total, e.g.:

for i in `cat nsGUIEvent.txt`; do sleep 1; ab "http://dxr.mozilla.org/mozilla-central/search?q=$i" | perl -ne 'if (/^Time per.*\s+(\d+\.\d+).*mean\)/) {print $1."\n";}' >> nsGUIEvent.txt.time8 & done;

Run once against local disk and once against NFS, with httpd restarted before running.
Excluding the sleep created crazy long times (24s!) due to hammering the snot out of my client, sadly.
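
(Restarting httpd doesn't clear the OS page cache, so if we want to be certain the local-disk runs aren't just being served out of RAM, dropping the cache between runs would settle it -- needs root:)

$ sync
$ echo 3 > /proc/sys/vm/drop_caches   # drop page cache, dentries and inodes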
 
$wc -l *.time[89]
      50 events.txt.time8
      50 events.txt.time9
      50 nsGUIEvent.txt.time8
      50 nsGUIEvent.txt.time9
      50 nsHtml5Tokenizer.txt.time8
      50 nsHtml5Tokenizer.txt.time9
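
(stat.pl isn't attached; it just prints mean/median/min/max of the per-request times, one value per input line -- roughly this:)

$ sort -n events.txt.time8 | awk '{a[NR]=$1; s+=$1} END {
    printf "mean is:   %.1f\nmedian is: %.1f\nmin is:    %.1f\nmax is:    %.1f\n",
    s/NR, (NR%2 ? a[(NR+1)/2] : (a[NR/2]+a[NR/2+1])/2), a[1], a[NR]}'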

Times (in ms) for local disk:
$./stat.pl < events.txt.time8
mean is:   602.4
median is: 543.1
min is:    281.6
max is:    1058.1
$./stat.pl < nsGUIEvent.txt.time8
mean is:   569.8
median is: 391.1
min is:    265.7
max is:    1166.3
$./stat.pl < nsHtml5Tokenizer.txt.time8
mean is:   410.4
median is: 315.9
min is:    283.9
max is:    993.3

Times (in ms) for NFS:
$./stat.pl < events.txt.time9
mean is:   652.8
median is: 720.2
min is:    288.2
max is:    1463.1
$./stat.pl < nsGUIEvent.txt.time9
mean is:   573.7
median is: 364.7
min is:    288.8
max is:    1385.6
$./stat.pl < nsHtml5Tokenizer.txt.time9
mean is:   430.3
median is: 309.5
min is:    251.0
max is:    1490.1


On the NFS end, nfsiostat reports avg RTT of less than 0.2ms(!) throughout the test, with a whopping max of 72 ops/sec, e.g.:

10.8.75.249:/dxr_test mounted on /mnt/dxr_test:

   op/s         rpc bklog
  72.00            0.00
getattr:          ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                 72.000          18.281           0.254        0 (0.0%)           0.181           0.208
access:           ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                  0.000           0.000           0.000        0 (0.0%)           0.000           0.000

56 inode revalidations, hitting in cache -28.57% of the time
16 open operations (mandatory GETATTR requests)
0.00% of GETATTRs resulted in data cache invalidations


So the lack of iostat info from the raid array is bothersome, but the ab numbers are all close enough that I really don't think we have an issue. Eric, how do you feel about it? If you're good, we can get prod VM requests in today.
Go, go, go! I'm glad the NFS stuff works out, as it'll make our deployments less complicated.

Thanks for doing such thorough benchmarks! It's interesting to know that about NFS caching, and those are some truly impressive RTTs. I bet the next big speedup for DXR will be moving away from SQLite, which does an oddly large number of IO ops when you open a DB. Thanks again!
Ok, I think we're done here except for the prod VMs, and once they're provisioned it'll only take a puppet run to finish the web bits.

All of the deployment pieces are on the NFS mount, which is /data on the admin node and webheads; it's /data/www on the build box. Two cron jobs run on the admin node, doing automatic deploys to prod and staging just like the old box. The only difference is that the jobs run every 5 minutes rather than every minute.

(Separate from the WSGI piece, another cron job builds the m-c data and drops it onto the NFS volume; it currently runs at 8am UTC daily as the dxr user.)
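
(For reference, the cron entries amount to something like this -- script names and paths are placeholders, the cadence is as described above, and the boxes are assumed to be on UTC:)

# admin node crontab: automatic deploys
*/5 * * * *  /data/bin/dxr-deploy prod
*/5 * * * *  /data/bin/dxr-deploy staging
# dxr user's crontab: daily m-c index build onto the NFS volume
0 8 * * *    /data/bin/dxr-build-index mozilla-central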

I also ran another quick series of ab tests to make sure the timings weren't off significantly with a VM; numbers were in the same ballpark as before.

If it's ok with you, I'd like to repoint dxr.allizom.org at the new staging server so we can see if there's anything different/broken compared to prod.
(In reply to Kendall Libby [:fubar] from comment #15)
> Ok, I think we're done here except for the prod VMs, and once they're
> provisioned it'll only take a puppet run to finish the web bits.

Anything I can do to help with these? Is there a bug? :)
(In reply to Shyam Mani [:fox2mike] from comment #16)
> (In reply to Kendall Libby [:fubar] from comment #15)
> > Ok, I think we're done here except for the prod VMs, and once they're
> > provisioned it'll only take a puppet run to finish the web bits.
> 
> Anything I can do to help with these? Is there a bug? :)

All good; lerxst popped them into existence last night (bug 916852).

OTOH, maybe you can answer a question about the current dxr: I had presumed that dxr1.pub.phx1 was on zeus, but if it is I can't find it, and the pub IP isn't on the host itself. Is there some other NAT device in between, or are these cold meds really messing me up? :)
(In reply to Kendall Libby [:fubar] from comment #17)

> OTOH, maybe you can answer a question about the current dxr: I had presumed
> that dxr1.pub.phx1 was on zeus, but if it is I can't find it, and the pub IP
> isn't on the host itself. Is there some other NAT device in between, or are
> these cold meds really messing me up? :)

shyam@katniss ~/mozilla/repos/svn/sysadmins $ host dxr.mozilla.org
dxr.mozilla.org is an alias for dxr1.pub.phx1.mozilla.com.
dxr1.pub.phx1.mozilla.com has address 63.245.216.215

3rd octet is the key. 

214 = NAT in scl3
215 = Zeus in scl3 
216 = NAT in phx1
217 = Zeus in phx1

So it's totally your cold meds :p

And we should move this to Zeus.
Prod VMs are up, configured, and running the WSGI app; available via Zeus at dxr.vips.phx1.mozilla.com (you'll need to fudge /etc/hosts).
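
(The /etc/hosts fudge is just pointing the prod name at the new VIP, e.g., as root on whatever box you're testing from:)

$ echo "$(dig +short dxr.vips.phx1.mozilla.com | head -1)  dxr.mozilla.org" >> /etc/hosts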
Wrapping up. The WSGI app is installed and running on the new prod and stage VMs.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services