Bug 800897 (Closed): Opened 12 years ago, closed 12 years ago

bouncer returning ISE 500 errors after bug 800042

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: bhearsum, Assigned: bburton)

References

Details

(Whiteboard: [stub+][service-interrupt])

We have some tests in our release automation that make sure all bouncer links are working before we ship. When these ran for the ESR releases that happened overnight, we got a bunch of ISE 500 errors from bouncer itself, which is something we've never seen before:
Testing http://download.mozilla.org/?product=thunderbird-10.0.9esr-complete&os=win&lang=it&force=1
HTTP/1.0 500 Internal Server Error
Date: Fri, 12 Oct 2012 13:00:08 GMT
Server: Apache
X-Backend-Server: bouncer6.webapp.phx1.mozilla.com
Set-Cookie: dmo=10.8.81.217.1350046808979574; path=/; expires=Sat, 12-Oct-13 13:00:08 GMT
Connection: close
Content-Type: text/html; charset=UTF-8

I can intermittently reproduce this by hand.

I don't think this blocks releasing anything, but it's a pretty serious problem if we're serving ISE 500 to real requests.
I should be clear: this isn't 100% of the time. It appears to be happening 5-15% of the time.
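
For reference, a rough way to quantify that failure rate by hand (a hypothetical shell loop, not part of the release automation; the URL is the same one the tests hit):

# count how many of 100 identical requests come back as HTTP 500
url='http://download.mozilla.org/?product=thunderbird-10.0.9esr-complete&os=win&lang=it&force=1'
fails=0
for i in $(seq 1 100); do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  [ "$code" = "500" ] && fails=$((fails + 1))
done
echo "$fails/100 requests returned HTTP 500"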
Assignee: server-ops-webops → afernandez
Is it always X-Backend-Server: bouncer6.webapp.phx1.mozilla.com ? Bad server perhaps...

I'm in an airport gate about to get on a plane, can't troubleshoot myself.
(In reply to Jake Maul [:jakem] from comment #2)
> Is it always X-Backend-Server: bouncer6.webapp.phx1.mozilla.com ? Bad server
> perhaps...
> 
> I'm in an airport gate about to get on a plane, can't troubleshoot myself.

Happened on bouncer10, too, just now:

host-7-77:~ sdonner$ curl -ILk 'http://download.mozilla.org?product=thunderbird-10.0.9esr-complete&os=win&lang=it&force=1'
HTTP/1.0 500 Internal Server Error
Date: Fri, 12 Oct 2012 19:16:02 GMT
Server: Apache
X-Backend-Server: bouncer10.webapp.phx1.mozilla.com
I'm raising this to a blocker because it's happening on http://download.mozilla.org/?product=firefox-beta-stub&os=win&lang=en-US, which we need 100% working to do our stub installer test this weekend. Sorry.
Severity: critical → blocker
The only thing I can see in the HTTP error logs that is out of the ordinary is entries like:

  PHP Notice:  Undefined variable: client_region in /data/www/download.mozilla.org/tuxedo/bouncer/php/index.php on line 283


282: // If we're here we've fallen back to global
283: $fallback_global = getGlobalFallbackProhibited($client_region);
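
A minimal sketch of the kind of guard that would silence that notice, assuming the global-fallback path can be reached before anything sets $client_region (the null default here is an assumption, not what the bouncer code actually does):

// Hypothetical guard: make sure $client_region is always defined before
// the global-fallback lookup, even when the GeoIP step didn't set it.
if (!isset($client_region)) {
    $client_region = null;  // assumed sentinel for "no region known"
}
$fallback_global = getGlobalFallbackProhibited($client_region);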
I think Brandon wrote that code; he might be able to weigh in on whether it's related.
(In reply to Stephen Donner [:stephend] from comment #3)
> (In reply to Jake Maul [:jakem] from comment #2)
> > Is it always X-Backend-Server: bouncer6.webapp.phx1.mozilla.com ? Bad server
> > perhaps...
> > 
> > I'm in an airport gate about to get on a plane, can't troubleshoot myself.
> 
> Happened on bouncer10, too, just now:
> 
> host-7-77:~ sdonner$ curl -ILk
> 'http://download.mozilla.org?product=thunderbird-10.0.9esr-
> complete&os=win&lang=it&force=1'
> HTTP/1.0 500 Internal Server Error
> Date: Fri, 12 Oct 2012 19:16:02 GMT
> Server: Apache
> X-Backend-Server: bouncer10.webapp.phx1.mozilla.com

is this because you are missing a "/" before ?product=...
(In reply to Corey Shields [:cshields] from comment #7)

<snip>

> > host-7-77:~ sdonner$ curl -ILk
> > 'http://download.mozilla.org?product=thunderbird-10.0.9esr-
> > complete&os=win&lang=it&force=1'
> > HTTP/1.0 500 Internal Server Error
> > Date: Fri, 12 Oct 2012 19:16:02 GMT
> > Server: Apache
> > X-Backend-Server: bouncer10.webapp.phx1.mozilla.com
> 
> is this because you are missing a "/" before ?product=...

host-7-77:~ sdonner$ curl -ILk 'https://download.mozilla.org/?product=thunderbird=10.0.9esr-complete&os=win&lang=en-US&force=1'
HTTP/1.1 404 Not Found

Yes, that was it; thanks! Trust that the above is expected...
(In reply to Stephen Donner [:stephend] from comment #8)
> (In reply to Corey Shields [:cshields] from comment #7)
> 
> <snip>
> 
> > > host-7-77:~ sdonner$ curl -ILk
> > > 'http://download.mozilla.org?product=thunderbird-10.0.9esr-
> > > complete&os=win&lang=it&force=1'
> > > HTTP/1.0 500 Internal Server Error
> > > Date: Fri, 12 Oct 2012 19:16:02 GMT
> > > Server: Apache
> > > X-Backend-Server: bouncer10.webapp.phx1.mozilla.com
> > 
> > is this because you are missing a "/" before ?product=...
> 
> host-7-77:~ sdonner$ curl -ILk
> 'https://download.mozilla.org/?product=thunderbird=10.0.9esr-
> complete&os=win&lang=en-US&force=1'
> HTTP/1.1 404 Not Found
> 
> Yes, that was it; thanks! Trust that the above is expected...

Well, my original test was with http://download.mozilla.org/?product=thunderbird-10.0.9esr-complete&os=win&lang=it&force=1, but that's not 500'ing anymore either. But I can still reproduce with other URLs so that may just be a fluke:
curl -IL "http://download.mozilla.org/?product=firefox-16.0.1&os=win&lang=it&force=1"
HTTP/1.0 500 Internal Server Error
Date: Fri, 12 Oct 2012 19:57:07 GMT
Server: Apache
X-Backend-Server: bouncer10.webapp.phx1.mozilla.com
Set-Cookie: dmo=10.8.81.217.1350071827602061; path=/; expires=Sat, 12-Oct-13 19:57:07 GMT
Connection: close
Content-Type: text/html; charset=UTF-8
Looks like a problem with fallback, so I would guess it's happening when we see a CDN blip. Give Brandon a call to take a look.
Can we determine that this is in no way related to the brief bouncer outage we just had?
Whiteboard: [stub+]
cturra: how many of those PHP errors are you seeing? When did they start?
FWIW, I can only repro this on bouncer10 now:
(buildbot)➜  l10n  curl -IL "http://download.mozilla.org/?product=thunderbird-10.0.9esr-complete&os=win&lang=it&force=1"
HTTP/1.0 500 Internal Server Error
Date: Fri, 12 Oct 2012 20:45:03 GMT
Server: Apache
X-Backend-Server: bouncer10.webapp.phx1.mozilla.com
Set-Cookie: dmo=10.8.81.216.1350074703392938; path=/; expires=Sat, 12-Oct-13 20:45:03 GMT
Connection: close
Content-Type: text/html; charset=UTF-8


I did about 20 requests, 2 of which were on bouncer10 and returned ISE 500.
Can we try taking bouncer10 out of the pool?
I'm taking a look at the code in that area to see what I can find. Knowing when the first occurrence of that error happened would be good, as would knowing if it's happening across all machines or just bouncer10.
Hold off on looking into this until we get through the network issues we are dealing with right now (in case they are related).

thanks
Assignee: afernandez → bburton
Whiteboard: [stub+] → [stub+][service-interrupt]
I believe I've confirmed the notice is not the root cause of the ISE. I will hold off as I suspect network issues may play a role here.
update mirror_mirrors set active = 0 where id=259;
update mirror_mirrors set active = 0 where id=493;
update mirror_mirrors set active = 1 where id=486;

changed from active=1 to active=0
259 | Mozilla - Santa Clara 3 | http://ftp-zlb.vips.scl3.mozilla.com/pub/mozilla.org/
493 | FTP Cluster - SSL       | https://ftp-zlb.vips.scl3.mozilla.com/pub/mozilla.org

changed from active=0 to active=1
486 | Mozilla CDN's           | http://download.cdn.mozilla.net/pub/mozilla.org
Documenting how I disabled Sentry: I disabled all Sentry- and bouncer-related crons.

[root@sentry1 cron.d]# cat *
#0 17 10 * * root source /data/virtualenvs/tuxedo/bin/activate ; /var/www/django/tuxedo/manage.py updategeoip >/dev/null
# update bouncer v1
#17 * * * * root /bin/bash /usr/local/bin/sentry.sh > /dev/null
#*/5 * * * * root /bin/bash /usr/local/bin/sentry-multi.sh > /dev/null
#42 14 * * * root /root/gm/autocommit.sh >/dev/null 2>/dev/null
MAILTO="infra-notices@mozilla.com"
PATH="/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin"
*/30 * * * * root /usr/local/sbin/puppetctl run >/dev/null 2>&1
MAILTO=root
# run system activity accounting tool every 10 minutes
#*/10 * * * * root /usr/lib64/sa/sa1 1 1
# generate a daily summary of process accounting at 23:53
#53 23 * * * root /usr/lib64/sa/sa2 -A

#00 */4 * * * root cd /var/www/django/tuxedo;  /data/virtualenvs/tuxedo/bin/python26 manage.py update_product_details 

All crons are re-enabled at this point
Can anybody reproduce this post-outage?  I am unable to.
I have not been able to reproduce the issue. I recommend that we file a follow-up bug to improve the logging of the database layer, as I suspect that was the root cause here. Since the errors are suppressed, we have a condition where database connections can fail and nobody knows why. Now that we have a solid test plan in place, we should be able to move forward with releases more easily.
Please reopen if this becomes an issue again, but there were two things at play here: a network issue in SCL3, and passive monitoring enabled in the new bouncer prod pool, which caused Zeus to drain the pool unnecessarily without adding nodes back in, thus choking and DoS'ing the service.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
I've just rerun releng automation -- worked fine. Thanks a lot.
Status: RESOLVED → VERIFIED
The bouncer machines still have 17k-18k sockets in CLOSE_WAIT to the db machines.

The workaround I put in in bug 800894 is (a) a workaround, not a fix, and (b) not in Puppet, so if any new hosts get turned up/replaced/etc. they will not have the workaround and will throw 500s when we get high traffic.

My opinion is that this isn't fixed; just the symptoms are gone. There absolutely needs to be logging of failures at the DB level, but the app also needs to not open and close thousands upon thousands of connections to the DB in the first place; that doesn't scale.
Here's what I'm going to do, after discussions with Sheeri:
- Update the Bouncer code to use mysql_pconnect for persistent connections instead of mysql_connect (today; see the sketch below)
- Update the Bouncer code to use a more modern MySQL library (the current one is deprecated, and a newer one will get us better perf)
- Sheeri will upgrade MySQL from 5.0 to 5.1
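
A minimal sketch of what the first two items might look like, assuming a straightforward swap at the connect call ($server, $user, $pass and $dbname are placeholders, not the real bouncer config):

// Step 1 (today): persistent connections with the existing ext/mysql API.
// Old: $db = mysql_connect($server, $user, $pass);
$db = mysql_pconnect($server, $user, $pass);
if ($db === false) {
    error_log('bouncer: db connect failed: ' . mysql_error());
}

// Step 2 (later): move to mysqli, which is not deprecated and also supports
// persistent connections via the "p:" host prefix.
// $db = mysqli_connect('p:' . $server, $user, $pass, $dbname);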
Laura pulled me in on this. We can increase the thread caching on the slaves (this is a representative sample: over 100k threads have been created and the thread cache is only 500):

mysql> show variables like 'thread_cache_size';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| thread_cache_size | 500   | 
+-------------------+-------+
1 row in set (0.00 sec)

mysql> show status like 'Threads_c%';
+-------------------+--------+
| Variable_name     | Value  |
+-------------------+--------+
| Threads_cached    | 492    | 
| Threads_connected | 8      | 
| Threads_created   | 111049 | 
+-------------------+--------+
3 rows in set (0.00 sec)

The master is fine: thread_cache_size is 500, but just over 200 threads have been created total, and all but 5 are cached. I'll update the thread_cache_size on the slaves.

Also, bouncer prod is on MySQL 5.0, so I've linked bug 790391, the upgrade bug.
Set the thread_cache_size to 10,000 on all the slaves.
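
For reference, a sketch of how that change might be applied on a slave (assuming it is also made persistent in my.cnf so it survives a restart):

-- runtime change on each slave
SET GLOBAL thread_cache_size = 10000;

-- and, hypothetically, in the [mysqld] section of my.cnf:
-- thread_cache_size = 10000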
(In reply to Peter Radcliffe [:pir] from comment #25)
> The bouncer machines still have 17k-18k sockets in CLOSE_WAIT to the db
> machines.

I believe this is the "normal" level of connections for this app. It gets hit a lot, but 99% of the responses are a very simple 302 redirect so the bandwidth is very low.

GeoIP has in the past been the culprit for causing DB connection problems, especially during a release day/week. I turned it off on October 11 on the admin node, and I *think* I deployed it immediately thereafter... but I can't be sure. It's remotely possible that this was on and contributing to the problems. It's also possible that we ran into trouble in spite of GeoIP being disabled... it was a release week, and it could be that even with GeoIP off our sheer connection rate is still too high now. It could also be that something about the new cluster is triggering problems where there weren't any before.


> The workaround I put in, in bug 800894 is a) a workaround, not a fix b) not
> in puppet so if any new hosts get turned up/replace/etc they will not have
> the workaround and will throw 500s when we get high traffic.

Agreed, that isn't a real solution. However, it's a good start, and I think we should strongly consider putting 'tcp_tw_reuse = 1' in our base puppet modules... we've used it before, and it hasn't caused us any trouble that I'm aware of. Possibly also something like this:

net.ipv4.ip_local_port_range="10240 64000"

This widens the outgoing port range from 32768-61000 (about 28k ports) up to about 54k, without dropping so low that we cause problems with things on ports 9090 or 9000 (which we have some of).

Those two tweaks should gain us a considerable amount of additional outgoing connection capacity, across our whole infrastructure.
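
A rough sketch of what those two tweaks could look like rolled out together (the file placement is hypothetical; in practice this would live in the base Puppet modules mentioned above):

# /etc/sysctl.conf additions (hypothetical placement)
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10240 64000

# apply without a reboot
sysctl -p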
Filed related bug https://bugzilla.mozilla.org/show_bug.cgi?id=801777 to track Ganglia graphing for the bouncer DB cluster.
Now that some bouncer fixes have been made, the machines look considerably healthier:

[root@bouncer6.webapp.phx1 pradcliffe]# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c
    123 ESTABLISHED
     20 FIN_WAIT2
      8 LISTEN
    307 TIME_WAIT

I think we should monitor TIME_WAIT levels on webheads and maybe other places too. Opinions?
I'm fine with that if we can set reasonable expectations on what they should be and how people should respond. An alert that could conceivably page someone should be actionable, and I'm not sure this really would be. In the case of a server with high TIME_WAITs, what functionality is affected, and what action should be taken?

I guess my underlying concern is that a high level of TIME_WAITs, like a high load average, is not necessarily a problem in and of itself. The real problem I think we ought to care about is "I can't open a new socket" (or, "I'm close to not being able to open a new socket"). If we turn on "tcp_tw_reuse" everywhere, it should be rather difficult for us to get into this situation due to TIME_WAITs alone.
If we don't alert on it (and I do agree with you about actionable alerts; I don't like the load pages for similar reasons), can we maybe graph it in ganglia so we at least have a reference point for when a problem starts if we see issues?

(I'd also like to see queries per second on web servers, DB servers, etc. graphed, since that makes it far easier to tell whether a load issue is from higher input or changed behaviour.)
We already do have TIME_WAIT graphed in ganglia (I was looking at scl3 rather than phx1 ganglia).

With that, an IRC alert set at a high threshold, with a low repeat rate and decent documentation (alert docs saying to look at ganglia, see whether the change can be correlated with anything, and talk to the devs about it?), could be useful, I think.
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard