Bug 800897 (Closed): Opened 12 years ago, closed 12 years ago

bouncer returning ISE 500 errors after bug 800042

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: bhearsum, Assigned: bburton)

References

Details

(Whiteboard: [stub+][service-interrupt])

We have some tests in our release automation that make sure all bouncer links are working before we ship. When these ran for the ESR releases that happened overnight, we got a bunch of ISE 500 errors from bouncer itself, which is something we've never seen before:
Testing http://download.mozilla.org/?product=thunderbird-10.0.9esr-complete&os=win&lang=it&force=1
HTTP/1.0 500 Internal Server Error
Date: Fri, 12 Oct 2012 13:00:08 GMT
Server: Apache
X-Backend-Server: bouncer6.webapp.phx1.mozilla.com
Set-Cookie: dmo=10.8.81.217.1350046808979574; path=/; expires=Sat, 12-Oct-13 13:00:08 GMT
Connection: close
Content-Type: text/html; charset=UTF-8

I can intermittently reproduce this by hand.

I don't think this blocks releasing anything, but it's a pretty serious problem if we're serving ISE 500 to real requests.
I should be clear: this isn't 100% of the time. It appears to be happening 5-15% of the time.
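
For reference, a rough way to quantify that failure rate by hand (a hypothetical shell loop, not part of the release automation; the URL is the same one the tests hit):

# count how many of 100 identical requests come back as HTTP 500
url='http://download.mozilla.org/?product=thunderbird-10.0.9esr-complete&os=win&lang=it&force=1'
fails=0
for i in $(seq 1 100); do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  [ "$code" = "500" ] && fails=$((fails + 1))
done
echo "$fails/100 requests returned HTTP 500"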
Assignee: server-ops-webops → afernandez
Is it always X-Backend-Server: bouncer6.webapp.phx1.mozilla.com ? Bad server perhaps...

I'm in an airport gate about to get on a plane, can't troubleshoot myself.
(In reply to Jake Maul [:jakem] from comment #2)
> Is it always X-Backend-Server: bouncer6.webapp.phx1.mozilla.com ? Bad server
> perhaps...
> 
> I'm in an airport gate about to get on a plane, can't troubleshoot myself.

Happened on bouncer10, too, just now:

host-7-77:~ sdonner$ curl -ILk 'http://download.mozilla.org?product=thunderbird-10.0.9esr-complete&os=win&lang=it&force=1'
HTTP/1.0 500 Internal Server Error
Date: Fri, 12 Oct 2012 19:16:02 GMT
Server: Apache
X-Backend-Server: bouncer10.webapp.phx1.mozilla.com
I'm raising this to a blocker because it's happening on http://download.mozilla.org/?product=firefox-beta-stub&os=win&lang=en-US, which we need 100% working to do our stub installer test this weekend. Sorry.
Severity: critical → blocker
The only thing I can see in the HTTP error logs that is out of the ordinary is entries like:

  PHP Notice:  Undefined variable: client_region in /data/www/download.mozilla.org/tuxedo/bouncer/php/index.php on line 283


282: // If we're here we've fallen back to global
283: $fallback_global = getGlobalFallbackProhibited($client_region);
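
A minimal sketch of the kind of guard that would silence that notice, assuming the global-fallback path can be reached before anything sets $client_region (the null default here is an assumption, not what the bouncer code actually does):

// Hypothetical guard: make sure $client_region is always defined before
// the global-fallback lookup, even when the GeoIP step didn't set it.
if (!isset($client_region)) {
    $client_region = null;  // assumed sentinel for "no region known"
}
$fallback_global = getGlobalFallbackProhibited($client_region);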
I think Brandon wrote that code; he might be able to weigh in on whether it's related.
(In reply to Stephen Donner [:stephend] from comment #3)
> (In reply to Jake Maul [:jakem] from comment #2)
> > Is it always X-Backend-Server: bouncer6.webapp.phx1.mozilla.com ? Bad server
> > perhaps...
> > 
> > I'm in an airport gate about to get on a plane, can't troubleshoot myself.
> 
> Happened on bouncer10, too, just now:
> 
> host-7-77:~ sdonner$ curl -ILk
> 'http://download.mozilla.org?product=thunderbird-10.0.9esr-
> complete&os=win&lang=it&force=1'
> HTTP/1.0 500 Internal Server Error
> Date: Fri, 12 Oct 2012 19:16:02 GMT
> Server: Apache
> X-Backend-Server: bouncer10.webapp.phx1.mozilla.com

is this because you are missing a "/" before ?product=...
(In reply to Corey Shields [:cshields] from comment #7)

<snip>

> > host-7-77:~ sdonner$ curl -ILk
> > 'http://download.mozilla.org?product=thunderbird-10.0.9esr-
> > complete&os=win&lang=it&force=1'
> > HTTP/1.0 500 Internal Server Error
> > Date: Fri, 12 Oct 2012 19:16:02 GMT
> > Server: Apache
> > X-Backend-Server: bouncer10.webapp.phx1.mozilla.com
> 
> is this because you are missing a "/" before ?product=...

host-7-77:~ sdonner$ curl -ILk 'https://download.mozilla.org/?product=thunderbird=10.0.9esr-complete&os=win&lang=en-US&force=1'
HTTP/1.1 404 Not Found

Yes, that was it; thanks! Trust that the above is expected...
(In reply to Stephen Donner [:stephend] from comment #8)
> (In reply to Corey Shields [:cshields] from comment #7)
> 
> <snip>
> 
> > > host-7-77:~ sdonner$ curl -ILk
> > > 'http://download.mozilla.org?product=thunderbird-10.0.9esr-
> > > complete&os=win&lang=it&force=1'
> > > HTTP/1.0 500 Internal Server Error
> > > Date: Fri, 12 Oct 2012 19:16:02 GMT
> > > Server: Apache
> > > X-Backend-Server: bouncer10.webapp.phx1.mozilla.com
> > 
> > is this because you are missing a "/" before ?product=...
> 
> host-7-77:~ sdonner$ curl -ILk
> 'https://download.mozilla.org/?product=thunderbird=10.0.9esr-
> complete&os=win&lang=en-US&force=1'
> HTTP/1.1 404 Not Found
> 
> Yes, that was it; thanks! Trust that the above is expected...

Well, my original test was with http://download.mozilla.org/?product=thunderbird-10.0.9esr-complete&os=win&lang=it&force=1, but that's not 500'ing anymore either. But I can still reproduce with other URLs so that may just be a fluke:
curl -IL "http://download.mozilla.org/?product=firefox-16.0.1&os=win&lang=it&force=1"
HTTP/1.0 500 Internal Server Error
Date: Fri, 12 Oct 2012 19:57:07 GMT
Server: Apache
X-Backend-Server: bouncer10.webapp.phx1.mozilla.com
Set-Cookie: dmo=10.8.81.217.1350071827602061; path=/; expires=Sat, 12-Oct-13 19:57:07 GMT
Connection: close
Content-Type: text/html; charset=UTF-8
Looks like a problem with fallback, so I would guess it's happening when we see a CDN blip. Give Brandon a call to take a look.
Can we determine that this is in no way related to the brief bouncer outage we just had?
Whiteboard: [stub+]
cturra: how many of those PHP errors are you seeing? When did they start?
FWIW, I can only repro this on bouncer10 now:
(buildbot)➜  l10n  curl -IL "http://download.mozilla.org/?product=thunderbird-10.0.9esr-complete&os=win&lang=it&force=1"
HTTP/1.0 500 Internal Server Error
Date: Fri, 12 Oct 2012 20:45:03 GMT
Server: Apache
X-Backend-Server: bouncer10.webapp.phx1.mozilla.com
Set-Cookie: dmo=10.8.81.216.1350074703392938; path=/; expires=Sat, 12-Oct-13 20:45:03 GMT
Connection: close
Content-Type: text/html; charset=UTF-8


I did about 20 requests, 2 of which were on bouncer10 and returned ISE 500.
Can we try taking bouncer10 out of the pool?
I'm taking a look at the code in that area to see what I can find. Knowing when the first occurrence of that error happened would be good, as would knowing if it's happening across all machines or just bouncer10.
Hold off on looking into this until we get through the network issues we are dealing with right now (in case they are related).

thanks
Assignee: afernandez → bburton
Whiteboard: [stub+] → [stub+][service-interrupt]
I believe I've confirmed the notice is not the root cause of the ISE. I will hold off as I suspect network issues may play a role here.
update mirror_mirrors set active = 0 where id=259;
update mirror_mirrors set active = 0 where id=493;
update mirror_mirrors set active = 1 where id=486;

changed from active=1 to active=0
259 | Mozilla - Santa Clara 3 | http://ftp-zlb.vips.scl3.mozilla.com/pub/mozilla.org/
493 | FTP Cluster - SSL       | https://ftp-zlb.vips.scl3.mozilla.com/pub/mozilla.org

changed from active=0 to active=1
486 | Mozilla CDN's           | http://download.cdn.mozilla.net/pub/mozilla.org
Documenting how I disabled Sentry: I disabled all Sentry- and bouncer-related crons.

[root@sentry1 cron.d]# cat *
#0 17 10 * * root source /data/virtualenvs/tuxedo/bin/activate ; /var/www/django/tuxedo/manage.py updategeoip >/dev/null
# update bouncer v1
#17 * * * * root /bin/bash /usr/local/bin/sentry.sh > /dev/null
#*/5 * * * * root /bin/bash /usr/local/bin/sentry-multi.sh > /dev/null
#42 14 * * * root /root/gm/autocommit.sh >/dev/null 2>/dev/null
MAILTO="infra-notices@mozilla.com"
PATH="/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin"
*/30 * * * * root /usr/local/sbin/puppetctl run >/dev/null 2>&1
MAILTO=root
# run system activity accounting tool every 10 minutes
#*/10 * * * * root /usr/lib64/sa/sa1 1 1
# generate a daily summary of process accounting at 23:53
#53 23 * * * root /usr/lib64/sa/sa2 -A

#00 */4 * * * root cd /var/www/django/tuxedo;  /data/virtualenvs/tuxedo/bin/python26 manage.py update_product_details 

All crons are re-enabled at this point
Can anybody reproduce this post-outage?  I am unable to.
I have not been able to reproduce the issue. I recommend that we file a follow-up bug to improve the logging of the database layer, as I suspect that was the root cause here. Since the errors are suppressed, we have a condition where database connections can fail and nobody knows why. Now that we have a solid test plan in place, we should be able to move forward with releases more easily.
Please reopen if this becomes an issue again, but there were two things at play here: a network issue in SCL3, and passive monitoring enabled in the new bouncer prod pool, which caused Zeus to drain the pool unnecessarily without adding nodes back in, thus choking and DoS'ing the service.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
I've just rerun releng automation -- worked fine. Thanks a lot.
Status: RESOLVED → VERIFIED
The bouncer machines still have 17k-18k sockets in CLOSE_WAIT to the db machines.

The workaround I put in in bug 800894 is (a) a workaround, not a fix, and (b) not in Puppet, so if any new hosts get turned up/replaced/etc. they will not have the workaround and will throw 500s when we get high traffic.

My opinion is that this isn't fixed; just the symptoms are gone. There absolutely needs to be logging of failures at the DB level, but the app also needs to not open and close thousands upon thousands of connections to the DB in the first place; that doesn't scale.
Here's what I'm going to do, after discussions with Sheeri:
- Update the Bouncer code to use mysql_pconnect for persistent connections instead of mysql_connect (today; see the sketch below)
- Update the Bouncer code to use a more modern MySQL library (the current one is deprecated, and a newer one will get us better perf)
- Sheeri will upgrade MySQL from 5.0 to 5.1
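
A minimal sketch of what the first two items might look like, assuming a straightforward swap at the connect call ($server, $user, $pass and $dbname are placeholders, not the real bouncer config):

// Step 1 (today): persistent connections with the existing ext/mysql API.
// Old: $db = mysql_connect($server, $user, $pass);
$db = mysql_pconnect($server, $user, $pass);
if ($db === false) {
    error_log('bouncer: db connect failed: ' . mysql_error());
}

// Step 2 (later): move to mysqli, which is not deprecated and also supports
// persistent connections via the "p:" host prefix.
// $db = mysqli_connect('p:' . $server, $user, $pass, $dbname);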
Laura pulled me in on this. We can increase the thread caching on the slaves (this is a representative sample: over 100k threads have been created and the thread cache is only 500):

mysql> show variables like 'thread_cache_size';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| thread_cache_size | 500   | 
+-------------------+-------+
1 row in set (0.00 sec)

mysql> show status like 'Threads_c%';
+-------------------+--------+
| Variable_name     | Value  |
+-------------------+--------+
| Threads_cached    | 492    | 
| Threads_connected | 8      | 
| Threads_created   | 111049 | 
+-------------------+--------+
3 rows in set (0.00 sec)

The master is fine: thread_cache_size is 500, but just over 200 threads have been created total, and all but 5 are cached. I'll update the thread_cache_size on the slaves.

Also, bouncer prod is on MySQL 5.0, so I've linked bug 790391, the upgrade bug.
Set the thread_cache_size to 10,000 on all the slaves.
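
For reference, a sketch of how that change might be applied on a slave (assuming it is also made persistent in my.cnf so it survives a restart):

-- runtime change on each slave
SET GLOBAL thread_cache_size = 10000;

-- and, hypothetically, in the [mysqld] section of my.cnf:
-- thread_cache_size = 10000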
(In reply to Peter Radcliffe [:pir] from comment #25)
> The bouncer machines still have 17k-18k sockets in CLOSE_WAIT to the db
> machines.

I believe this is the "normal" level of connections for this app. It gets hit a lot, but 99% of the responses are a very simple 302 redirect so the bandwidth is very low.

GeoIP has in the past been the culprit for causing DB connection problems, especially during a release day/week. I turned it off on October 11 on the admin node, and I *think* I deployed it immediately thereafter... but I can't be sure. It's remotely possible that this was on and contributing to the problems. It's also possible that we ran into trouble in spite of GeoIP being disabled... it was a release week, and it could be that even with GeoIP off our sheer connection rate is still too high now. It could also be that something about the new cluster is triggering problems where there weren't any before.


> The workaround I put in, in bug 800894 is a) a workaround, not a fix b) not
> in puppet so if any new hosts get turned up/replace/etc they will not have
> the workaround and will throw 500s when we get high traffic.

Agreed, that isn't a real solution. However, it's a good start, and I think we should strongly consider putting 'tcp_tw_reuse = 1' in our base puppet modules... we've used it before, and it hasn't caused us any trouble that I'm aware of. Possibly also something like this:

net.ipv4.ip_local_port_range="10240 64000"

This widens the outgoing port range from 32768-61000 (about 28k ports) up to about 54k, without dropping so low that we cause problems with things on ports 9090 or 9000 (which we have some of).

Those two tweaks should gain us a considerable amount of additional outgoing connection capacity, across our whole infrastructure.
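
A rough sketch of what those two tweaks could look like rolled out together (the file placement is hypothetical; in practice this would live in the base Puppet modules mentioned above):

# /etc/sysctl.conf additions (hypothetical placement)
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10240 64000

# apply without a reboot
sysctl -p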
Filed related bug https://bugzilla.mozilla.org/show_bug.cgi?id=801777 to track Ganglia graphing for the bouncer DB cluster.
Now that some bouncer fixes have been made, the machines look considerably healthier:

[root@bouncer6.webapp.phx1 pradcliffe]# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
    123 ESTABLISHED
     20 FIN_WAIT2
      8 LISTEN
    307 TIME_WAIT

I think we should monitor TIME_WAIT levels on webheads and maybe other places too. Opinions?
I'm fine with that if we can set reasonable expectations on what they should be and how people should respond. An alert that could conceivably page someone should be actionable, and I'm not sure this really would be. In the case of a server with high TIME_WAITs, what functionality is affected, and what action should be taken?

I guess my underlying concern is that a high level of TIME_WAITs, like a high load average, is not necessarily a problem in and of itself. The real problem I think we ought to care about is "I can't open a new socket" (or, "I'm close to not being able to open a new socket"). If we turn on "tcp_tw_reuse" everywhere, it should be rather difficult for us to get into this situation due to TIME_WAITs alone.
If we don't alert on it (and I do agree with you about actionable alerts; I don't like the load pages for similar reasons), can we maybe graph it in ganglia so we at least have a reference point for when a problem starts if we see issues?

(I'd also like to see queries per second on web servers, DB servers, etc. graphed, since that makes it far easier to tell whether a load issue is from higher input or changed behaviour.)
We already do have TIME_WAIT graphed in ganglia (I was looking at scl3 rather than phx1 ganglia).

With that, an IRC alert set at a high threshold, with a low repeat rate and decent documentation (alert docs saying to look at ganglia, see whether the change can be correlated with anything, and talk to the devs about it?), could be useful, I think.
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard