Closed Bug 1304171 Opened 4 years ago Closed 2 years ago

Direct rest and bzapi requests to the new 6th webhead for a day or two and measure performance

Categories

(bugzilla.mozilla.org :: Infrastructure, task)

Production
task
Not set
normal

RESOLVED INVALID

People

(Reporter: dylan, Assigned: fubar)

Per: https://wiki.mozilla.org/BMO/Meetings/2016-09-20

We'll try routing REST and BZAPI calls to a single webhead for a few hours or a day,
in order to verify that
    UI performance is improved, and
    API performance does not significantly degrade.
Created a new zeus pool called 'bugzilla.mozilla.org-api' containing only web6, with the bugzilla.mozilla.org-https pool as failover. Created a new rule, 'bugzilla-api', that checks the URL path for /bzapi/ or /rest/ and sends matching requests to the new pool.

Currently enabled on prod, and traffic is flowing happily to web6.
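
For readers unfamiliar with this kind of rule, the routing decision described above can be sketched in Python. This is only an illustration of the logic, not the actual zeus rule: the function name `choose_pool` is hypothetical, and it assumes the check is a prefix match on the path (the comment above only says the rule "checks URL path for /bzapi/ or /rest/").

```python
from urllib.parse import urlsplit

# Pool names are taken from the comment above.
API_POOL = "bugzilla.mozilla.org-api"
DEFAULT_POOL = "bugzilla.mozilla.org-https"

# Assumption: the rule matches these as path prefixes.
API_PREFIXES = ("/bzapi/", "/rest/")


def choose_pool(url: str) -> str:
    """Return the pool name a request for `url` would be sent to."""
    path = urlsplit(url).path
    if path.startswith(API_PREFIXES):
        return API_POOL
    return DEFAULT_POOL
```

With this sketch, `choose_pool("https://bugzilla.mozilla.org/rest/bug/1304171")` selects the API pool, while a request for `/show_bug.cgi?id=1304171` stays on the default pool.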
Assignee: nobody → klibby
10:23:49 <@nagios-scl3> Thu 07:23:49 PDT [5215] web6.bugs.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 28% free (559 MB out of 2046 MB)
There weren't a huge number of clients connected (i.e. more idle workers than not), but lots of httpd processes were swapped out. Load was fairly spiky, too.

If this is going to be a longer term model, we might want to tune things better for handling API calls, or add a second node.
cool, I'm going to take a look at apache size limit stuff.
graphite shows a marked difference in memory usage between the new node and the others, suggesting a memory leak in bzapi/rest requests. dylan is investigating, but there will likely be ongoing swap alerts to the MOC.
I've returned API traffic back to the entire pool; it's the end of the day and I don't want the MOC to get paged over and over again as swap fills up (and it just ate everything on web6!). 

I think there's low enough "user" traffic that we could split the cluster 4/2 and be pretty happy. More so once dylan tracks down why SizeLimit isn't firing on API traffic (kill_pigs is culling them instead).
:dylan: did you get enough info when we ran the test, or do we need another couple of days of redirecting to get more info?
Flags: needinfo?(dylan)
Another 24-hr period would be useful. Especially if I can add more instrumentation to try to track down the remaining memory leaks.
Flags: needinfo?(dylan)
I've re-enabled this with a small change - web5 and web6 are now serving API requests while web1-4 are handling all other traffic. Hopefully having two nodes handling API traffic will prevent nagios from throwing swap alerts while the test is running.
Apache on web5 stopped responding entirely and was graceful'd.
web5 paged for swap today, so I restarted Apache. My apologies if I interrupted any testing :(

19:22:45 <@nagios-scl3> Mon 19:22:45 PST [5605] web5.bugs.scl3.mozilla.com:Swap is CRITICAL: SWAP CRITICAL - 22% free (450 MB out of 2047 MB) (http://m.mozilla.org/Swap)
<@nagios-scl3:#sysadmins> (IRC) Tue 03:43:26 PST [5340] 
  web6.bugs.scl3.mozilla.com:Out of memory - killed process is WARNING: 
  WARNING: Log errors found: Jan 10 11:42:46 web6.bugs.scl3.mozilla.com 
  apache[8172]: Out of memory! 
  (http://m.mozilla.org/Out+of+memory+-+killed+process)

<@nagios-scl3:#sysadmins> Tue 03:49:56 PST [5346] 
  web6.bugs.scl3.mozilla.com:Swap is WARNING: SWAP WARNING - 30% free (597 MB 
  out of 2046 MB) (http://m.mozilla.org/Swap)

                       Apache Server Status for localhost

   Server Version: Apache/2.2.15 (Unix) mod_perl/2.0.4 Perl/v5.10.1

   Server Built: Nov 3 2016 10:35:25

   --------------------------------------------------------------------------

   Current Time: Tuesday, 10-Jan-2017 11:54:39 UTC

   Restart Time: Tuesday, 10-Jan-2017 03:12:19 UTC

   Parent Server Generation: 0

   Server uptime: 8 hours 42 minutes 20 seconds

   Total accesses: 70782 - Total Traffic: 7.7 MB

   CPU Usage: u11562.5 s318.54 cu0 cs0 - 37.9% CPU load

   2.26 requests/sec - 258 B/second - 114 B/request

   4 requests currently being processed, 44 idle workers

 ______W_____._____.............._......................__...._._
 _.........K........__......_............._........._._........._
 .K...K..._._...__._...._................_...._.__..._....__.....
 ....__..........................................................
 ....

   Scoreboard Key:
   "_" Waiting for Connection, "S" Starting up, "R" Reading Request,
   "W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
   "C" Closing connection, "L" Logging, "G" Gracefully finishing,
   "I" Idle cleanup of worker, "." Open slot with no current process


So it appears that one process, the killed one, was taking the remainder of virtual memory on the host. The large number of remaining processes are taking up the rest; there are no other individual memory hogs.


Mem:  16334188k total, 15002416k used,  1331772k free,    19900k buffers
Swap:  2096124k total,  1318792k used,   777332k free,   295232k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                     
 5606 apache    30  10  801m 481m 3152 S  0.0  3.0   2:03.25 httpd                                                                                        
 8348 apache    30  10  777m 475m 3276 S  0.0  3.0   3:51.39 httpd                                                                                        
 8406 apache    30  10  776m 424m 3144 S  0.0  2.7   3:32.19 httpd                                                                                        
 8344 apache    30  10  714m 415m 3220 S  0.0  2.6   3:49.42 httpd                                                                                        
 5633 apache    30  10  707m 407m 3172 S  0.0  2.6   2:00.41 httpd                                                                                        
 8260 apache    30  10  688m 382m 3248 S  0.0  2.4   3:59.79 httpd                                                                                        
 5604 apache    30  10  696m 366m 3680 S  0.0  2.3   2:02.61 httpd                                                                                        
 8342 apache    30  10  682m 362m 3200 S  0.0  2.3   3:52.23 httpd                                                                                        
 8315 apache    30  10  641m 339m 3240 S  0.0  2.1   3:47.86 httpd                                                                                        
 8318 apache    30  10  635m 335m 3184 S  0.0  2.1   4:08.25 httpd                                                                                        
 8280 apache    30  10  631m 329m 3328 S  0.0  2.1   4:06.92 httpd                                                                                        
 8385 apache    30  10  646m 324m 3380 S  0.0  2.0   3:47.81 httpd                                                                                        
 8382 apache    30  10  622m 320m 3304 S  0.0  2.0   3:32.26 httpd                                                                                        
 5567 apache    30  10  616m 318m 3212 S  0.0  2.0   1:51.18 httpd                                                                                        
 8357 apache    30  10  629m 317m 3192 S  0.0  2.0   3:48.72 httpd                                                                                        
 8257 apache    30  10  620m 314m 3156 S  0.0  2.0   3:35.99 httpd                                                                                        
 8384 apache    30  10  619m 309m 3320 S  0.0  1.9   3:38.04 httpd                                                                                        
 8193 apache    30  10  611m 308m 3276 S  0.0  1.9   3:54.77 httpd                                                                                        
 8329 apache    30  10  640m 308m 3208 S  0.0  1.9   4:32.50 httpd                                                                                        
 8250 apache    30  10  632m 303m 3200 S  0.0  1.9   4:02.22 httpd                                                                                        
 8352 apache    30  10  604m 298m 3396 S  0.0  1.9   3:55.86 httpd                                                                                        
 5607 apache    30  10  611m 296m 3328 S  8.3  1.9   2:11.14 httpd                                                                                        
 5634 apache    30  10  605m 295m 3232 S  0.0  1.9   1:49.98 httpd                                                                                        
 8405 apache    30  10  635m 295m 3212 S  0.0  1.9   3:42.68 httpd                                                                                        
 8304 apache    30  10  618m 294m 3164 S  0.0  1.8   3:10.71 httpd                                                                                        
 8252 apache    30  10  623m 291m 3164 S  0.0  1.8   3:34.02 httpd                                                                                        
 8349 apache    30  10  645m 285m 3208 S  0.0  1.8   4:30.29 httpd                                                                                        
 5569 apache    30  10  609m 285m 3416 S  0.0  1.8   1:44.09 httpd                                                                                        
 5628 apache    30  10  585m 285m 3176 S  0.0  1.8   2:06.90 httpd                                                                                        
 8389 apache    30  10  603m 282m 3144 S  0.0  1.8   3:44.04 httpd                                                                                        
 8331 apache    30  10  597m 281m 3436 S  0.0  1.8   4:19.56 httpd                                                                                        
 8259 apache    30  10  635m 281m 3184 S  0.0  1.8   4:16.38 httpd                                                                                        
 8187 apache    30  10  602m 279m 3212 S  7.6  1.8   3:47.85 httpd                                                                                        
 8377 apache    30  10  575m 278m 3212 S  0.3  1.7   3:06.77 httpd                                                                                        
 8289 apache    30  10  606m 277m 3364 S  0.0  1.7   3:42.44 httpd                                                                                        
 8281 apache    30  10  577m 276m 3192 S  0.0  1.7   3:36.99 httpd                                                                                        
 5605 apache    30  10  600m 275m 3256 S  0.0  1.7   1:47.27 httpd                                                                                        
 5566 apache    30  10  591m 274m 3308 S  0.0  1.7   2:08.32 httpd                                                                                        
 8394 apache    30  10  597m 272m 3244 S  0.0  1.7   3:51.43 httpd                                                                                        
 8271 apache    30  10  616m 269m 3212 S  0.0  1.7   4:00.11 httpd                                                                                        
 8336 apache    30  10  593m 265m 3336 S  0.0  1.7   3:21.47 httpd                                                                                        
 5631 apache    30  10  583m 261m 3172 S  0.0  1.6   1:55.20 httpd                                                                                        
 8225 apache    30  10  590m 256m 3164 S  0.0  1.6   3:46.66 httpd                                                                                        
 5632 apache    30  10  581m 251m 3252 S  0.0  1.6   1:51.46 httpd                                                                                        
 8395 apache    30  10  604m 236m 3520 S  0.0  1.5   4:20.09 httpd                                                                                        
 8203 apache    30  10  568m 233m 3256 S  0.0  1.5   3:22.72 httpd                                                                                        
 5568 apache    30  10  538m 230m 3188 S  0.0  1.4   1:56.39 httpd                                                                                        
 5629 apache    30  10  524m 222m 3204 S  0.0  1.4   1:38.02 httpd                                                                                        
 7953 root      30  10  311m  17m 2588 S  0.0  0.1   0:02.99 httpd                                                                                        


Bounced httpd to recover memory/swap as everyone else has.
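
To put a number on "the large number of processes are taking up the rest", the RES column of the top output above can be summed. The following is a minimal sketch, assuming the column layout shown in that output (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND); the helper names `parse_size_kib` and `total_res_kib` are hypothetical. Note that summing RES counts shared pages once per process, so the result is an upper bound rather than an exact figure.

```python
def parse_size_kib(field: str) -> int:
    """Convert a top(1) size field such as '481m' or '3152' to KiB."""
    units = {"k": 1, "m": 1024, "g": 1024 * 1024}
    suffix = field[-1].lower()
    if suffix in units:
        return int(float(field[:-1]) * units[suffix])
    return int(field)  # bare numbers are already in KiB


def total_res_kib(top_lines, command="httpd"):
    """Sum the RES column over all rows whose COMMAND matches `command`."""
    total = 0
    for line in top_lines:
        cols = line.split()
        # Skip headers, summary lines, and anything that is not a process row.
        if len(cols) >= 12 and cols[0].isdigit() and cols[11] == command:
            total += parse_size_kib(cols[5])
    return total
```

Feeding the process rows from the top output into `total_res_kib` gives the combined resident size of all httpd processes, which can then be compared against the 16334188k total reported on the `Mem:` line.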
See Also: → 1326389
Thu 06:23:36 PST [5684] web5.bugs.scl3.mozilla.com:httpd max clients is CRITICAL: (Service Check Timed Out) (http://m.mozilla.org/httpd+max+clients)

I restarted Apache.
That looks like too many processes. My math says each webhead should have about 25, but that's clearly almost 50. Why?
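
The "almost 50" figure can be cross-checked against the server-status page quoted earlier: it reports 4 requests being processed plus 44 idle workers, i.e. 48 workers, which matches the 48 apache-owned httpd rows in the top output. A minimal sketch of counting workers from a scoreboard string follows; the function name `summarize_scoreboard` is hypothetical, and the character meanings come from the Scoreboard Key in that status output.

```python
from collections import Counter


def summarize_scoreboard(scoreboard: str) -> dict:
    """Count idle, busy, and empty slots in an Apache mod_status scoreboard."""
    counts = Counter(ch for ch in scoreboard if not ch.isspace())
    idle = counts["_"]        # "_" Waiting for Connection
    open_slots = counts["."]  # "." Open slot with no current process
    # Every other character (W, K, R, ...) is a worker doing something.
    busy = sum(n for ch, n in counts.items() if ch not in "_.")
    return {
        "idle": idle,
        "busy": busy,
        "open_slots": open_slots,
        "processes": idle + busy,
    }
```

The `processes` value is the number to compare against the expected per-webhead worker count.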
Reverting back to standard zeus config so that we're not pestering the MOC all (long) weekend; sadly, two nodes still isn't quite enough.
See Also: → 1330645
Duplicate of this bug: 1330645
See Also: → 1359570
Type: defect → task
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INVALID