Closed Bug 1304171 · Opened 8 years ago · Closed 6 years ago

Direct REST and BzAPI requests to the new 6th webhead for a day or two and measure performance

Categories
(bugzilla.mozilla.org :: Infrastructure, task)
Version: Production
Type: task
Priority: Not set
Severity: normal

Tracking
Status: RESOLVED INVALID

People
(Reporter: dylan, Assigned: fubar)

Per https://wiki.mozilla.org/BMO/Meetings/2016-09-20: we'll direct REST and BzAPI calls to one webhead for a few hours to a day, in order to verify that UI performance improves and that API performance does not significantly degrade.
Created a new Zeus pool called 'bugzilla.mozilla.org-api' with just web6 in it, and the bugzilla.mozilla.org-https pool as failover. Created a new rule, 'bugzilla-api', that checks the URL path for /bzapi/ or /rest/ and sends those requests to the new pool. Currently enabled on prod, and traffic is flowing happily to web6.
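For reference, the routing decision the 'bugzilla-api' rule makes is just a path check. A minimal sketch of that logic in Python (illustrative only; the real rule lives in the Zeus config, and treating the check as a prefix match is an assumption based on the description above):

# Illustrative sketch of the 'bugzilla-api' routing decision, not the actual Zeus rule.
API_POOL = "bugzilla.mozilla.org-api"        # web6 only
DEFAULT_POOL = "bugzilla.mozilla.org-https"  # full webhead pool, also used as failover

def choose_pool(path):
    # REST and BzAPI calls go to the dedicated API pool; everything else stays on the default pool.
    if path.startswith("/rest/") or path.startswith("/bzapi/"):
        return API_POOL
    return DEFAULT_POOL

assert choose_pool("/rest/bug/1304171") == API_POOL
assert choose_pool("/show_bug.cgi?id=1304171") == DEFAULT_POOL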
Assignee: nobody → klibby
10:23:49 <@nagios-scl3> Thu 07:23:49 PDT [5215] web6.bugs.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 28% free (559 MB out of 2046 MB)
There weren't a huge number of clients connected (i.e. more idle workers than not), but lots of httpd processes were swapped out. Load was fairly spiky, too. If this is going to be a longer-term model, we might want to tune things better for handling API calls, or add a second node.
Cool, I'm going to take a look at the Apache size-limit settings.
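For context: the idea behind a per-worker size limit is that each httpd worker checks its own memory footprint after serving a request and exits once it passes a cap, so the parent forks a fresh process in its place. A rough Python sketch of that check follows; the 700 MB cap is made up and Linux /proc is assumed, the real cap lives in the mod_perl configuration.

# Rough sketch of a per-worker size limit; the cap below is hypothetical, for illustration only.
import sys

MAX_RSS_KB = 700 * 1024  # assumed cap

def rss_kb():
    # Resident set size of the current process, in KB (Linux /proc assumed).
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

def after_request():
    # Run once per request: a bloated worker exits cleanly so the parent can spawn a fresh one.
    if rss_kb() > MAX_RSS_KB:
        sys.exit(0)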
Graphite shows a marked difference in memory usage between the new node and the others, suggesting a memory leak in bzapi/rest request handling. dylan is investigating, but there will likely be ongoing swap alerts to the MOC.
I've returned API traffic to the entire pool; it's the end of the day and I don't want the MOC to get paged over and over again as swap fills up (and it just ate everything on web6!). I think there's low enough "user" traffic that we could split the cluster 4/2 and be pretty happy, more so once dylan tracks down why SizeLimit isn't firing on API traffic (kill_pigs is culling those workers instead).
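For contrast with the in-process size limit sketched above: an external culler like kill_pigs presumably scans for oversized workers from outside and signals them. A generic Python sketch of that kind of pass; the threshold, process matching, and signal choice are assumptions for illustration, not what the real script does.

# Generic sketch of an external culling pass over oversized httpd workers (not kill_pigs itself).
import os, signal

RSS_LIMIT_KB = 700 * 1024  # hypothetical threshold

def oversized_httpd_pids():
    # Walk /proc, yield PIDs of httpd processes whose resident set size exceeds the limit.
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/comm" % pid) as f:
                if f.read().strip() != "httpd":
                    continue
            with open("/proc/%s/status" % pid) as f:
                for line in f:
                    if line.startswith("VmRSS:") and int(line.split()[1]) > RSS_LIMIT_KB:
                        yield int(pid)
                        break
        except (IOError, OSError):
            pass  # process exited between listing and reading

for pid in oversized_httpd_pids():
    os.kill(pid, signal.SIGTERM)  # ask the worker to finish; the real script may behave differently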
:dylan: did you get enough info when we ran the test, or do we need another couple of days of redirecting to get more info?
Flags: needinfo?(dylan)
Another 24-hr period would be useful, especially if I can add more instrumentation to try to track down the remaining memory leaks.
Flags: needinfo?(dylan)
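A concrete form that extra instrumentation could take, purely as a sketch: log how much RSS each request adds, tagged with the request path, so leaking endpoints stand out in aggregate. The wrapper name and threshold below are made up; real instrumentation would hook into Bugzilla's mod_perl handlers.

# Hypothetical per-request RSS logging to help localize the leak (names and threshold made up).
import logging

def current_rss_kb():
    # Resident set size of this worker, in KB (Linux /proc assumed).
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

def instrument(handler):
    # Wrap a request handler and log any request that grows the worker by more than ~1 MB.
    def wrapped(path, *args, **kwargs):
        before = current_rss_kb()
        result = handler(path, *args, **kwargs)
        grew = current_rss_kb() - before
        if grew > 1024:
            logging.warning("RSS grew %d KB serving %s", grew, path)
        return result
    return wrapped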
I've re-enabled this with a small change: web5 and web6 are now serving API requests while web1-4 handle all other traffic. Hopefully having two nodes handling API traffic will keep nagios from throwing swap alerts while the test is running.
Apache on web5 stopped responding entirely and was graceful'd.
Had web5 page for swap today, and restarted Apache. My apologies if I interrupted any testing :(
19:22:45 <@nagios-scl3> Mon 19:22:45 PST [5605] web5.bugs.scl3.mozilla.com:Swap is CRITICAL: SWAP CRITICAL - 22% free (450 MB out of 2047 MB) (http://m.mozilla.org/Swap)
<@nagios-scl3:#sysadmins> (IRC) Tue 03:43:26 PST [5340] web6.bugs.scl3.mozilla.com:Out of memory - killed process is WARNING: WARNING: Log errors found: Jan 10 11:42:46 web6.bugs.scl3.mozilla.com apache[8172]: Out of memory! (http://m.mozilla.org/Out+of+memory+-+killed+process)
<@nagios-scl3:#sysadmins> Tue 03:49:56 PST [5346] web6.bugs.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 30% free (597 MB out of 2046 MB) (http://m.mozilla.org/Swap)

Apache Server Status for localhost
Server Version: Apache/2.2.15 (Unix) mod_perl/2.0.4 Perl/v5.10.1
Server Built: Nov 3 2016 10:35:25
--------------------------------------------------------------------------
Current Time: Tuesday, 10-Jan-2017 11:54:39 UTC
Restart Time: Tuesday, 10-Jan-2017 03:12:19 UTC
Parent Server Generation: 0
Server uptime: 8 hours 42 minutes 20 seconds
Total accesses: 70782 - Total Traffic: 7.7 MB
CPU Usage: u11562.5 s318.54 cu0 cs0 - 37.9% CPU load
2.26 requests/sec - 258 B/second - 114 B/request
4 requests currently being processed, 44 idle workers

______W_____._____.............._......................__...._._
_.........K........__......_............._........._._........._
.K...K..._._...__._...._................_...._.__..._....__.....
....__..........................................................
....

Scoreboard Key: "_" Waiting for Connection, "S" Starting up, "R" Reading Request, "W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup, "C" Closing connection, "L" Logging, "G" Gracefully finishing, "I" Idle cleanup of worker, "." Open slot with no current process

So it appears that one process, the killed one, was taking the remainder of virtual memory on the host. The large number of processes are taking up the rest; no other individual memory hogs.

Mem: 16334188k total, 15002416k used, 1331772k free, 19900k buffers
Swap: 2096124k total, 1318792k used, 777332k free, 295232k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5606 apache 30 10 801m 481m 3152 S 0.0 3.0 2:03.25 httpd
8348 apache 30 10 777m 475m 3276 S 0.0 3.0 3:51.39 httpd
8406 apache 30 10 776m 424m 3144 S 0.0 2.7 3:32.19 httpd
8344 apache 30 10 714m 415m 3220 S 0.0 2.6 3:49.42 httpd
5633 apache 30 10 707m 407m 3172 S 0.0 2.6 2:00.41 httpd
8260 apache 30 10 688m 382m 3248 S 0.0 2.4 3:59.79 httpd
5604 apache 30 10 696m 366m 3680 S 0.0 2.3 2:02.61 httpd
8342 apache 30 10 682m 362m 3200 S 0.0 2.3 3:52.23 httpd
8315 apache 30 10 641m 339m 3240 S 0.0 2.1 3:47.86 httpd
8318 apache 30 10 635m 335m 3184 S 0.0 2.1 4:08.25 httpd
8280 apache 30 10 631m 329m 3328 S 0.0 2.1 4:06.92 httpd
8385 apache 30 10 646m 324m 3380 S 0.0 2.0 3:47.81 httpd
8382 apache 30 10 622m 320m 3304 S 0.0 2.0 3:32.26 httpd
5567 apache 30 10 616m 318m 3212 S 0.0 2.0 1:51.18 httpd
8357 apache 30 10 629m 317m 3192 S 0.0 2.0 3:48.72 httpd
8257 apache 30 10 620m 314m 3156 S 0.0 2.0 3:35.99 httpd
8384 apache 30 10 619m 309m 3320 S 0.0 1.9 3:38.04 httpd
8193 apache 30 10 611m 308m 3276 S 0.0 1.9 3:54.77 httpd
8329 apache 30 10 640m 308m 3208 S 0.0 1.9 4:32.50 httpd
8250 apache 30 10 632m 303m 3200 S 0.0 1.9 4:02.22 httpd
8352 apache 30 10 604m 298m 3396 S 0.0 1.9 3:55.86 httpd
5607 apache 30 10 611m 296m 3328 S 8.3 1.9 2:11.14 httpd
5634 apache 30 10 605m 295m 3232 S 0.0 1.9 1:49.98 httpd
8405 apache 30 10 635m 295m 3212 S 0.0 1.9 3:42.68 httpd
8304 apache 30 10 618m 294m 3164 S 0.0 1.8 3:10.71 httpd
8252 apache 30 10 623m 291m 3164 S 0.0 1.8 3:34.02 httpd
8349 apache 30 10 645m 285m 3208 S 0.0 1.8 4:30.29 httpd
5569 apache 30 10 609m 285m 3416 S 0.0 1.8 1:44.09 httpd
5628 apache 30 10 585m 285m 3176 S 0.0 1.8 2:06.90 httpd
8389 apache 30 10 603m 282m 3144 S 0.0 1.8 3:44.04 httpd
8331 apache 30 10 597m 281m 3436 S 0.0 1.8 4:19.56 httpd
8259 apache 30 10 635m 281m 3184 S 0.0 1.8 4:16.38 httpd
8187 apache 30 10 602m 279m 3212 S 7.6 1.8 3:47.85 httpd
8377 apache 30 10 575m 278m 3212 S 0.3 1.7 3:06.77 httpd
8289 apache 30 10 606m 277m 3364 S 0.0 1.7 3:42.44 httpd
8281 apache 30 10 577m 276m 3192 S 0.0 1.7 3:36.99 httpd
5605 apache 30 10 600m 275m 3256 S 0.0 1.7 1:47.27 httpd
5566 apache 30 10 591m 274m 3308 S 0.0 1.7 2:08.32 httpd
8394 apache 30 10 597m 272m 3244 S 0.0 1.7 3:51.43 httpd
8271 apache 30 10 616m 269m 3212 S 0.0 1.7 4:00.11 httpd
8336 apache 30 10 593m 265m 3336 S 0.0 1.7 3:21.47 httpd
5631 apache 30 10 583m 261m 3172 S 0.0 1.6 1:55.20 httpd
8225 apache 30 10 590m 256m 3164 S 0.0 1.6 3:46.66 httpd
5632 apache 30 10 581m 251m 3252 S 0.0 1.6 1:51.46 httpd
8395 apache 30 10 604m 236m 3520 S 0.0 1.5 4:20.09 httpd
8203 apache 30 10 568m 233m 3256 S 0.0 1.5 3:22.72 httpd
5568 apache 30 10 538m 230m 3188 S 0.0 1.4 1:56.39 httpd
5629 apache 30 10 524m 222m 3204 S 0.0 1.4 1:38.02 httpd
7953 root 30 10 311m 17m 2588 S 0.0 0.1 0:02.99 httpd

Bounced httpd to recover memory/swap as everyone else has.
Thu 06:23:36 PST [5684] web5.bugs.scl3.mozilla.com:httpd max clients is CRITICAL: (Service Check Timed Out) (http://m.mozilla.org/httpd+max+clients)
I restarted Apache.
That looks like too many processes. My math says each webhead should have about 25, but that's clearly almost 50. Why?
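For what it's worth, a back-of-the-envelope version of that sizing, using the figures from the top output pasted above; the 8 GB reserve for the OS and everything non-Apache is an assumption, not a measured value.

# Rough MaxClients sizing from the numbers in the pasted top output; the reserve is assumed.
total_ram_mb  = 16 * 1024  # ~16 GB on the webhead
reserved_mb   = 8 * 1024   # assumed headroom for the OS, caches, and non-Apache processes
avg_worker_mb = 300        # typical httpd RSS seen above (roughly 220-480 MB per worker)

max_clients = (total_ram_mb - reserved_mb) // avg_worker_mb
print(max_clients)  # 27, in line with "about 25"; close to 50 workers at this size spills into swap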
Reverting to the standard Zeus config so that we're not pestering the MOC all (long) weekend; sadly, two nodes still isn't quite enough.
See Also: → 1359570
Type: defect → task
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID