Bug 882988 (Closed)
Opened 11 years ago
Closed 11 years ago
Install WSGI app on new DXR webheads
Categories: Developer Services :: General, task
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: erik; Assigned: fubar
Attachments: 2 files
Now that we have our new stage and prod webheads, the next step is to get the DXR web app running on them. This involves:

* Installing DXR in a layout that'll be compatible with continuous deployment. See /data on the current prod box for something you can basically copy. This will involve getting `make` to run at the root of DXR's checkout, which means the trilite and clang plugin build deps will have to be around. (This will be an important step beyond what we have on prod right now: I can't get the sqlite version conflicts worked out on that old RHEL, so we can never update trilite.)

* Configuring Apache well. You can crib my config off the prod box. It's pretty solid as far as WSGI config goes, modulo the hardware params on the new boxes. The WSGI reload configuration is just so: it supports CD nicely. Beyond what I already have set up, we should turn on mod_gzip and whatever other niceties are standard for us. Our users pull big textish docs and will appreciate the speedup.

* Grabbing the "/data/instances/.../target" folder off the current prod box, or just building a small proof-of-concept one. Getting the real index on there is a separate problem from this bug.

As I said, a lot of this can be cribbed from the work I already did on the current dxr.mozilla.org. Take a look in /data, in mozbuild's crontab, and in the Apache config. I think those places are comprehensive.
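For reference, a minimal sketch of what the mod_wsgi side of that Apache config might look like. This is an assumption-laden illustration, not the actual prod config: the paths, process/thread counts, and daemon-process name are all made up, and mod_deflate stands in for the compression piece.

```
# Hypothetical sketch only; the real config lives on the prod box.
WSGIDaemonProcess dxr processes=4 threads=8 python-path=/data/dxr-prod/dxr
WSGIProcessGroup dxr
WSGIScriptAlias / /data/dxr-prod/dxr/dxr/wsgi.py
# In daemon mode, touching the .wsgi file reloads the app, which is what
# makes the continuous-deploy workflow cheap.
WSGIScriptReloading On
# Compress the big textish docs (mod_deflate on Apache 2.x):
AddOutputFilterByType DEFLATE text/html text/plain text/css application/javascript
```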
Reporter
Comment 1 • 11 years ago
Actually, especially since we have 2 prod webheads, we shouldn't do the building (make) on the servers. We should do it on the Jenkins box or on the admin node (the latter of which might be easier since we'll need RHEL or at least CentOS). The same work still has to be done, but it moves from the webheads to a different box.
Reporter
Comment 2 • 11 years ago
Would someone mind saying where this stands in the queue so I can update my quarterly goals list? Many thanks!
Assignee
Comment 3 • 11 years ago
Coincidences aside, I was just digging into this today. Since we still have an open question about serving data from NFS, I was going to set up a test site on the build box (since we have a test NFS mount available there, as well as sufficient local disk) using both NFS and local disk to see how they compare, and to see if the build-box tarball looks the same as the one generated by buildbot.
Reporter
Comment 4 • 11 years ago
Ah, I forgot all about the NFS/local question. Kudos to you for having a look.
Assignee
Comment 5 • 11 years ago
After a bit of faffing about with python versions, I'm running into the following error:

[Tue Sep 03 20:35:37 2013] [error] [client 127.0.0.1] mod_wsgi (pid=23429): Target WSGI script '/data/dxr-prod/dxr/dxr/wsgi.py' cannot be loaded as Python module.
[Tue Sep 03 20:35:37 2013] [error] [client 127.0.0.1] mod_wsgi (pid=23429): Exception occurred processing WSGI script '/data/dxr-prod/dxr/dxr/wsgi.py'.
[Tue Sep 03 20:35:37 2013] [error] [client 127.0.0.1] Traceback (most recent call last):
[Tue Sep 03 20:35:37 2013] [error] [client 127.0.0.1]   File "/data/dxr-prod/dxr/dxr/wsgi.py", line 1, in <module>
[Tue Sep 03 20:35:37 2013] [error] [client 127.0.0.1]     from dxr.app import make_app
[Tue Sep 03 20:35:37 2013] [error] [client 127.0.0.1] ImportError: No module named dxr.app

I *think* all of the stuff in /data on dxr-processor1 is set up similarly to the old box, so I'm not sure what's tripping it up. Any ideas?
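A quick way to take Apache out of the picture when chasing an ImportError like this is to run the same import with whatever interpreter mod_wsgi is pointed at; a broken virtualenv fails identically from the shell. A sketch (the `probe` helper is made up, and `python3` here stands in for the virtualenv's interpreter):

```shell
# Hypothetical helper: attempt an import with a given interpreter and report.
probe() {
  "$1" -c "import $2" 2>/dev/null && echo "ok: $2" || echo "missing: $2"
}
probe python3 sys      # stdlib module, importable anywhere: prints "ok: sys"
probe python3 dxr.app  # fails unless DXR's virtualenv is on sys.path
```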
Assignee
Comment 6 • 11 years ago
Belay that; something was wonky with the virtualenv. WSGI app is currently working on dxr-processor1 with a locally-built m-c tree. Is there something in the dxr code that's checking for dxr.m.o? I can't access it via the hostname, but fudging dxr.m.o in my /etc/hosts makes it work. Mostly just curious, though it'd make it easier to compare to prod if I didn't have to do that.
Assignee
Comment 7 • 11 years ago
Raw tab-separated data in attached prod.tsv; data from production dxr.mozilla.org.

sekrit$ ab -g prod.tsv -c 4 -n 400 "http://dxr.mozilla.org/mozilla-central/search?q=mainScreen&redirect=false"

Server Software:        Apache/2.2.3
Server Hostname:        dxr.mozilla.org
Server Port:            80

Document Path:          /mozilla-central/search?q=mainScreen&redirect=false
Document Length:        24131 bytes

Concurrency Level:      4
Time taken for tests:   61.955 seconds
Complete requests:      400
Failed requests:        0
Write errors:           0
Total transferred:      9742400 bytes
HTML transferred:       9652400 bytes
Requests per second:    6.46 [#/sec] (mean)
Time per request:       619.548 [ms] (mean)
Time per request:       154.887 [ms] (mean, across all concurrent requests)
Transfer rate:          153.56 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:      111  539  317.4    445    2196
Processing:     0   81   65.7     63     432
Waiting:        0   76   65.1     58     432
Total:        468  619  363.6    508    2426

Percentage of the requests served within a certain time (ms)
  50%    508
  66%    515
  75%    526
  80%    535
  90%    726
  95%   1538
  98%   2172
  99%   2339
 100%   2426 (longest request)
Assignee
Comment 8 • 11 years ago
Data from test site on dxr-processor1 (using /etc/hosts tricks), on NFS (prod was local disk); raw data in buildbox.tsv.

sekrit$ ab -g buildbox.tsv -c 4 -n 400 "http://dxr.mozilla.org/mozilla-central/search?q=mainScreen&redirect=false"

Server Software:        Apache
Server Hostname:        dxr.mozilla.org
Server Port:            80

Document Path:          /mozilla-central/search?q=mainScreen&redirect=false
Document Length:        22055 bytes

Concurrency Level:      4
Time taken for tests:   56.001 seconds
Complete requests:      400
Failed requests:        0
Write errors:           0
Total transferred:      8916400 bytes
HTML transferred:       8822000 bytes
Requests per second:    7.14 [#/sec] (mean)
Time per request:       560.010 [ms] (mean)
Time per request:       140.002 [ms] (mean, across all concurrent requests)
Transfer rate:          155.49 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       99  107    7.2    105     204
Processing:   362  452   34.6    451     609
Waiting:      259  343   32.5    341     463
Total:        462  559   36.3    556     713

Percentage of the requests served within a certain time (ms)
  50%    556
  66%    573
  75%    579
  80%    582
  90%    604
  95%    618
  98%    645
  99%    688
 100%    713 (longest request)
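For eyeballing runs like these side by side, the headline number can be pulled straight out of ab's report. A small convenience sketch (the `mean_tpr` name is made up; nothing like it was used for the tests above):

```shell
# Print the mean time-per-request (ms) from an ab report on stdin,
# skipping the "across all concurrent requests" variant.
mean_tpr() {
  awk -F: '/^Time per request.*\(mean\)$/ {
    sub(/\[ms\].*/, "", $2); gsub(/ /, "", $2); print $2; exit
  }'
}
printf 'Time per request:       619.548 [ms] (mean)\n' | mean_tpr   # prints 619.548
```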
Assignee
Comment 9 • 11 years ago
I ran a longer test (8 concurrent, 1600 requests) where prod took
Assignee
Comment 10 • 11 years ago
argh. I ran a longer test (8 concurrent, 1600 requests) where prod took ~5 times longer than the new box. All of my other, shorter, tests were very close, but that one stood out. I may run another few large tests, but the tl;dr is that it looks like NFS will be fine for serving data. Concur?
Reporter
Comment 11 • 11 years ago
Thanks for running the benches! You've established that the new box has a lot more CPU than the old. That's awesome. Have we nailed down anything about IO speed, though? Wouldn't all but the first hit use the disk cache and avoid the network altogether? I think we'd need to run different queries for each request to hit new parts of the DB each time.
Assignee
Comment 12 • 11 years ago
I'll dig up more numbers for you, but a first pass shows consistent i/o numbers for NFS traffic, which would indicate no caching (and I'm not sure that RHEL does any sort of caching for NFS traffic by default). OTOH, comparing it against the local HW raid array, which DOES cache, the NFS test has almost exactly the same numbers.
Assignee
Comment 13 • 11 years ago
Ok, more tests run; one minor issue, though, is that I can't get iostat to report any data for i/o to the raid array. There's cache, but not enough to cache *everything*, so I'm not sure what's up. In any case, I do have useful numbers from nfsiostat as well as ab. As far as NFS caching goes, unless fs-cache/cachefs are set up, the client only caches metadata or pending writes.

This test is three concurrent loops of single queries, each using a random search string (I grabbed 3 arbitrary files from m-c, pulled out words > 4 chars, and randomly chose 50 words from each), so 150 queries total, e.g.:

for i in `cat nsGUIEvent.txt`; do sleep 1; ab "http://dxr.mozilla.org/mozilla-central/search?q=$i" | perl -ne 'if (/^Time per.*\s+(\d+\.\d+).*mean\)/) {print $1."\n";}' >> nsGUIEvent.txt.time8 & done;

Run once against local disk and once against NFS, with httpd restarted before each run. Excluding the sleep created crazy long times (24s!) due to hammering the snot out of my client, sadly.

$ wc -l *.time[89]
  50 events.txt.time8
  50 events.txt.time9
  50 nsGUIEvent.txt.time8
  50 nsGUIEvent.txt.time9
  50 nsHtml5Tokenizer.txt.time8
  50 nsHtml5Tokenizer.txt.time9

Times (in ms) for local disk:

$ ./stat.pl < events.txt.time8
mean is: 602.4
median is: 543.1
min is: 281.6
max is: 1058.1

$ ./stat.pl < nsGUIEvent.txt.time8
mean is: 569.8
median is: 391.1
min is: 265.7
max is: 1166.3

$ ./stat.pl < nsHtml5Tokenizer.txt.time8
mean is: 410.4
median is: 315.9
min is: 283.9
max is: 993.3

Times (in ms) for NFS:

$ ./stat.pl < events.txt.time9
mean is: 652.8
median is: 720.2
min is: 288.2
max is: 1463.1

$ ./stat.pl < nsGUIEvent.txt.time9
mean is: 573.7
median is: 364.7
min is: 288.8
max is: 1385.6

$ ./stat.pl < nsHtml5Tokenizer.txt.time9
mean is: 430.3
median is: 309.5
min is: 251.0
max is: 1490.1

On the NFS end, nfsiostat reports avg RTT of less than 0.2 ms(!) throughout the test, with a whopping max of 72 ops/sec, e.g.:

10.8.75.249:/dxr_test mounted on /mnt/dxr_test:

   op/s         rpc bklog
  72.00            0.00

getattr:  ops/s    kB/s    kB/op   retrans    avg RTT (ms)   avg exe (ms)
         72.000  18.281    0.254   0 (0.0%)          0.181          0.208
access:   ops/s    kB/s    kB/op   retrans    avg RTT (ms)   avg exe (ms)
          0.000   0.000    0.000   0 (0.0%)          0.000          0.000

56 inode revalidations, hitting in cache -28.57% of the time
16 open operations (mandatory GETATTR requests)
0.00% of GETATTRs resulted in data cache invalidations

So the lack of iostat info from the raid array is bothersome, but the ab numbers are all close enough that I really don't think we have an issue. Eric, how do you feel about it? If you're good, we can get prod VM requests in today.
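stat.pl itself isn't attached to the bug; assuming it does nothing more than print the four summary lines shown above, a rough shell/awk equivalent would be:

```shell
# Assumed stand-in for stat.pl: mean/median/min/max of one number per line.
stats() {
  sort -n | awk '{ a[NR] = $1; s += $1 }
    END {
      med = (NR % 2) ? a[(NR + 1) / 2] : (a[NR / 2] + a[NR / 2 + 1]) / 2
      printf "mean is: %.1f\nmedian is: %.1f\nmin is: %.1f\nmax is: %.1f\n",
             s / NR, med, a[1], a[NR]
    }'
}
printf '281.6\n543.1\n1058.1\n' | stats
# mean is: 627.6
# median is: 543.1
# min is: 281.6
# max is: 1058.1
```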
Reporter
Comment 14 • 11 years ago
Go, go, go! I'm glad the NFS stuff works out, as it'll make our deployments less complicated. Thanks for doing such thorough benchmarks! It's interesting to know that about NFS caching, and those are some truly impressive RTTs. I bet the next big speedup for DXR will be moving away from SQLite, which does an oddly large number of IO ops when you open a DB. Thanks again!
Assignee
Comment 15 • 11 years ago
Ok, I think we're done here except for the prod VMs, and once they're provisioned it'll only take a puppet run to finish the web bits. All of the deployment pieces are on the NFS mount, which is /data on the admin node and webheads; it's /data/www on the build box. Two cron jobs run on the admin node, doing automatic deploys to prod and staging just like the old box. The only difference is that the jobs run every 5 minutes, rather than every minute. (Separate from the wsgi piece, another cron job is building the m-c data and dropping it onto the NFS volume; currently run at 8am UTC daily by the dxr user.) I also ran another quick series of ab tests to make sure the timings weren't off significantly with a VM; numbers were in the same ballpark as before. If it's ok with you, I'd like to repoint dxr.allizom.org at the new staging server so we can see if there's anything different/broken compared to prod.
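As a sketch of the schedule described above (only the timings come from this comment; the script paths and names are invented placeholders):

```
# Admin node, hypothetical crontab:
*/5 * * * *  /data/bin/deploy.sh prod    # auto-deploy prod
*/5 * * * *  /data/bin/deploy.sh stage   # auto-deploy staging
# dxr user:
0 8 * * *    /data/bin/build-index.sh    # daily m-c index build, 08:00 UTC
```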
Comment 16 • 11 years ago
(In reply to Kendall Libby [:fubar] from comment #15)
> Ok, I think we're done here except for the prod VMs, and once they're
> provisioned it'll only take a puppet run to finish the web bits.

Anything I can do to help with these? Is there a bug? :)
Assignee
Comment 17 • 11 years ago
(In reply to Shyam Mani [:fox2mike] from comment #16)
> (In reply to Kendall Libby [:fubar] from comment #15)
> > Ok, I think we're done here except for the prod VMs, and once they're
> > provisioned it'll only take a puppet run to finish the web bits.
>
> Anything I can do to help with these? Is there a bug? :)

All good; lerxst popped them into existence last night (bug 916852). OTOH, maybe you can answer a question about the current dxr: I had presumed that dxr1.pub.phx1 was on zeus, but if it is I can't find it, and the pub IP isn't on the host itself. Is there some other NAT device in between, or are these cold meds really messing me up? :)
Comment 18 • 11 years ago
(In reply to Kendall Libby [:fubar] from comment #17)
> OTOH, maybe you can answer a question about the current dxr: I had presumed
> that dxr1.pub.phx1 was on zeus, but if it is I can't find it, and the pub IP
> isn't on the host itself. Is there some other NAT device in between, or are
> these cold meds really messing me up? :)

shyam@katniss ~/mozilla/repos/svn/sysadmins $ host dxr.mozilla.org
dxr.mozilla.org is an alias for dxr1.pub.phx1.mozilla.com.
dxr1.pub.phx1.mozilla.com has address 63.245.216.215

3rd octet is the key:
214 = NAT in scl3
215 = Zeus in scl3
216 = NAT in phx1
217 = Zeus in phx1

So it's totally your cold meds :p And we should move this to Zeus.
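The third-octet convention is easy to misread, so here it is as a tiny lookup, using only the mapping from this comment (the function name is made up):

```shell
# Map the third octet of a public mozilla.com address to its role.
octet_role() {
  case "$(printf '%s' "$1" | cut -d. -f3)" in
    214) echo "NAT in scl3"  ;;
    215) echo "Zeus in scl3" ;;
    216) echo "NAT in phx1"  ;;
    217) echo "Zeus in phx1" ;;
    *)   echo "unknown"      ;;
  esac
}
octet_role 63.245.216.215   # prints "NAT in phx1": NAT, not Zeus
```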
Assignee
Comment 19 • 11 years ago
Prod VMs are up, configured, and running the wsgi app. Available via zeus at dxr.vips.phx1.mozilla.com (you'll need to fudge /etc/hosts).
Assignee
Comment 20 • 11 years ago
Wrapping up. The wsgi app is installed and running on the new prod and stage VMs.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated • 10 years ago
Component: Server Operations: Developer Services → General
Product: mozilla.org → Developer Services