Bug 1199721 (Closed) - Opened 9 years ago, Closed 9 years ago

occasional 503 errors from aus4

Categories

(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: nmaul)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1655] )

QE has been seeing occasional 503 errors during some of their update tests on releases. Eg:

2015-08-28 16:38:33.513380 UTC - -1373430976[b7274900]: http response [
2015-08-28 16:38:33.513419 UTC - -1373430976[b7274900]: HTTP/1.1 503 Service Temporarily Unavailable
2015-08-28 16:38:33.513426 UTC - -1373430976[b7274900]: Server: Apache
2015-08-28 16:38:33.513432 UTC - -1373430976[b7274900]: X-Backend-Server: aus1.webapp.phx1.mozilla.com
2015-08-28 16:38:33.513437 UTC - -1373430976[b7274900]: Cache-Control: max-age=60
2015-08-28 16:38:33.513443 UTC - -1373430976[b7274900]: Content-Type: text/html; charset=iso-8859-1
2015-08-28 16:38:33.513448 UTC - -1373430976[b7274900]: Date: Fri, 28 Aug 2015 16:38:33 GMT
2015-08-28 16:38:33.513452 UTC - -1373430976[b7274900]: Connection: close
2015-08-28 16:38:33.513468 UTC - -1373430976[b7274900]: X-Cache-Info: not cacheable; response code not cacheable
2015-08-28 16:38:33.513473 UTC - -1373430976[b7274900]: Content-Length: 323
2015-08-28 16:38:33.513478 UTC - -1373430976[b7274900]: ]

It doesn't look like it's widespread, but I'm a bit surprised to see any. New Relic tells me that the error rate over the course of the past week is quite low percentage-wise (0.0082%), but that still ends up being 1-5 per minute. Some of them are due to known issues causing 500s (eg, not handling certain input data correctly), so it's not clear to me how many are 503s. Is there anything we can do to investigate the 503s further?
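A quick back-of-the-envelope check on those figures (this is just arithmetic on the numbers quoted above, not additional data from New Relic): an error rate of 0.0082%, i.e. $8.2 \times 10^{-5}$, that still produces 1-5 errors per minute implies the service is handling on the order of

$1 / (8.2 \times 10^{-5}) \approx 12{,}000$ to $5 / (8.2 \times 10^{-5}) \approx 61{,}000$

requests per minute, which is why even a very small failure percentage shows up as a steady trickle of errors.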
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/1655]
Summary: occaisonal 503 errors from aus4 → occasional 503 errors from aus4
Assignee: server-ops-webops → nmaul
Looks like we see this intermittently with our ondemand update tests run with Mozmill. If you need any more networking info, please let us know, or tell us if there's anything else we can assist with.
The ZLB 503 errors won't show up in New Relic... only errors thrown by the web servers have a chance of landing there. I'll check the ZLB logs to see whether it's returning 503s because backend nodes are being detected as down, or for some other reason. We ran into a similar situation recently, but with 500s instead of 503s, which turned out to be related to some ZLB timeout/retry settings. I'll double-check those as well.
Are these checks happening over HTTP or HTTPS?
(In reply to Jake Maul [:jakem] from comment #3)
> Are these checks happening over HTTP or HTTPS?

HTTPS
Note that on average this happens about once per test run (each test run is 150-200 tests that perform two updates in different ways via the UI). The checks are definitely over HTTPS, since this is Firefox performing product updates through its UI.
Bug 1126825 is instructive for a history lesson on this service. The TL;DR is, we left things off as follows:

Apache using the "worker" MPM (instead of default "prefork") - this helped a lot

mod_wsgi configured as:
WSGIDaemonProcess aus4 processes=1 threads=8 maximum-requests=80000 display-name=aus4

I've looked over New Relic, and I'm able to correlate several things:

1) mod_wsgi restarts its process about every 20 minutes (varies over the day, obviously).

2) When that happens, "request queuing" response time spikes a little. Makes sense.

3) Also when that happens, server load average spikes a bit, as does CPU usage and memory. They return to normal a minute or two later. Network bandwidth spikes as well, just a little bit after load/CPU/mem... makes sense.

4) And finally, when all that happens, I *also* get an "INFO" message in the ZLB logs indicating that the fallback configuration was "modified". I think "modified" is a red herring, and really means "used". This implies that a node failed to respond in time and Zeus sent the fallback response body out, with a 503 error. This theory nicely fits the output in comment 0.

My attempted fix will be to fiddle with the mod_wsgi config some more. We haven't done this since switching Apache to the worker MPM. In particular, I'm switching from 1 proc / 8 threads to 2 procs / 4 threads. That way when maximum-requests is reached on one of the processes, there will still be another one ready to go. I'm also raising maximum-requests from 80k to an even 100k, which at the very least should reduce the frequency of potential problems.
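For concreteness, here is a sketch of how the before/after mod_wsgi settings described above would look in the Apache config. The WSGIDaemonProcess values are taken from this comment; the WSGIProcessGroup line and the commented-out "old" line are illustrative context, not copied from the actual production config:

# Old setup: one daemon process with 8 threads, recycled after 80k requests.
# A recycle briefly takes the only process out of service, which lines up with
# the ZLB hitting its fallback and returning 503s.
#WSGIDaemonProcess aus4 processes=1 threads=8 maximum-requests=80000 display-name=aus4

# New setup: two daemon processes with 4 threads each, recycled after 100k
# requests, so one process can keep answering while the other restarts.
WSGIDaemonProcess aus4 processes=2 threads=4 maximum-requests=100000 display-name=aus4
WSGIProcessGroup aus4

The key point of the change is that a maximum-requests recycle only ever takes down one of the two daemon processes at a time.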
(In reply to Jake Maul [:jakem] from comment #6)
> 3) Also when that happens, server load average spikes a bit, as does CPU
> usage and memory. They return to normal a minute or two later. Network
> bandwidth spikes as well, just a little bit after load/CPU/mem... makes
> sense.

This is probably due to the caches getting cold after the restart, no surprise there!

> In particular, I'm switching from 1 proc / 8 threads to 2 procs / 4 threads.
> That way when maximum-requests is reached on one of the processes, there
> will still be another one ready to go. I'm also raising maximum-requests
> from 80k to an even 100k, which at the very least should reduce the
> frequency of potential problems.

Thanks for having a go at this... here's hoping it helps! Would setting maximum-requests to different values on each node help at all? That way they would (hopefully) not all restart right around the same time?
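If the per-node staggering suggested above were tried, it would look something like this. The node labels and maximum-requests values are purely hypothetical for illustration; nothing here was actually deployed:

# e.g. on one web head (hypothetical value)
WSGIDaemonProcess aus4 processes=2 threads=4 maximum-requests=95000 display-name=aus4

# e.g. on another web head (hypothetical value)
WSGIDaemonProcess aus4 processes=2 threads=4 maximum-requests=105000 display-name=aus4

# With different recycle thresholds, the nodes' daemon processes drift apart
# over time instead of all hitting maximum-requests in the same window.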
I'm seeing multiple of those errors in a test run right now. I remember seeing these more often when running tests in my afternoon (Central Europe) - could this be load-dependent?
Could definitely be load-dependent. We made significant changes to this (and many other) clusters during the week of October 26. Namely, we moved them to a different datacenter. AUS4 in particular got some WSGI tweaks during that migration to help with database issues during that work. Those are still in place today:

- WSGIDaemonProcess aus4 processes=2 threads=4 maximum-requests=100000 display-name=aus4
+ WSGIDaemonProcess aus4 processes=2 threads=32 maximum-requests=100000 display-name=aus4

Based on New Relic graphing, things have looked good ever since. I'm going to close this out.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard