Closed
Bug 800897
Opened 12 years ago
Closed 12 years ago
bouncer returning ISE 500 errors after bug 800042
Categories
(Infrastructure & Operations Graveyard :: WebOps: Other, task)
Infrastructure & Operations Graveyard
WebOps: Other
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: bhearsum, Assigned: bburton)
References
Details
(Whiteboard: [stub+][service-interrupt])
We have some tests in our release automation that make sure all bouncer links are working before we ship. When these ran for the ESR releases that happened overnight, we got a bunch of ISE 500 from bouncer itself, which is something we've never seen before: Testing http://download.mozilla.org/?product=thunderbird-10.0.9esr-complete&os=win&lang=it&force=1 HTTP/1.0 500 Internal Server Error Date: Fri, 12 Oct 2012 13:00:08 GMT Server: Apache X-Backend-Server: bouncer6.webapp.phx1.mozilla.com Set-Cookie: dmo=10.8.81.217.1350046808979574; path=/; expires=Sat, 12-Oct-13 13:00:08 GMT Connection: close Content-Type: text/html; charset=UTF-8 I can intermittently reproduce by hand. I don't think this blocks releasing anything, but it's a pretty serious problem if we're serving ISE 500 to real requests.
Reporter | ||
Comment 1•12 years ago
|
||
I should be clear: this isn't 100% of the time. It appears to be happening 5-15% of the time.
Updated•12 years ago
|
Assignee: server-ops-webops → afernandez
Comment 2•12 years ago
|
||
Is it always X-Backend-Server: bouncer6.webapp.phx1.mozilla.com ? Bad server perhaps... I'm in an airport gate about to get on a plane, can't troubleshoot myself.
Comment 3•12 years ago
|
||
(In reply to Jake Maul [:jakem] from comment #2) > Is it always X-Backend-Server: bouncer6.webapp.phx1.mozilla.com ? Bad server > perhaps... > > I'm in an airport gate about to get on a plane, can't troubleshoot myself. Happened on bouncer10, too, just now: host-7-77:~ sdonner$ curl -ILk 'http://download.mozilla.org?product=thunderbird-10.0.9esr-complete&os=win&lang=it&force=1' HTTP/1.0 500 Internal Server Error Date: Fri, 12 Oct 2012 19:16:02 GMT Server: Apache X-Backend-Server: bouncer10.webapp.phx1.mozilla.com
Reporter | ||
Comment 4•12 years ago
|
||
I'm raising this to a blocker because it's happening on http://download.mozilla.org/?product=firefox-beta-stub&os=win&lang=en-US, which we need 100% working to do our stub installer test this weekend. Sorry.
Severity: critical → blocker
Comment 5•12 years ago
|
||
the only thing i can see in the http error logs that is out of the ordinary are entries like: PHP Notice: Undefined variable: client_region in /data/www/download.mozilla.org/tuxedo/bouncer/php/index.php on line 283 282: // If we're here we've fallen back to global 283: $fallback_global = getGlobalFallbackProhibited($client_region);
Reporter | ||
Comment 6•12 years ago
|
||
I think Brandon wrote that code, it might be able to weigh in on if it's related?
Comment 7•12 years ago
|
||
(In reply to Stephen Donner [:stephend] from comment #3) > (In reply to Jake Maul [:jakem] from comment #2) > > Is it always X-Backend-Server: bouncer6.webapp.phx1.mozilla.com ? Bad server > > perhaps... > > > > I'm in an airport gate about to get on a plane, can't troubleshoot myself. > > Happened on bouncer10, too, just now: > > host-7-77:~ sdonner$ curl -ILk > 'http://download.mozilla.org?product=thunderbird-10.0.9esr- > complete&os=win&lang=it&force=1' > HTTP/1.0 500 Internal Server Error > Date: Fri, 12 Oct 2012 19:16:02 GMT > Server: Apache > X-Backend-Server: bouncer10.webapp.phx1.mozilla.com is this because you are missing a "/" before ?product=...
Comment 8•12 years ago
|
||
(In reply to Corey Shields [:cshields] from comment #7) <snip> > > host-7-77:~ sdonner$ curl -ILk > > 'http://download.mozilla.org?product=thunderbird-10.0.9esr- > > complete&os=win&lang=it&force=1' > > HTTP/1.0 500 Internal Server Error > > Date: Fri, 12 Oct 2012 19:16:02 GMT > > Server: Apache > > X-Backend-Server: bouncer10.webapp.phx1.mozilla.com > > is this because you are missing a "/" before ?product=... host-7-77:~ sdonner$ curl -ILk 'https://download.mozilla.org/?product=thunderbird=10.0.9esr-complete&os=win&lang=en-US&force=1' HTTP/1.1 404 Not Found Yes, that was it; thanks! Trust that the above is expected...
Reporter | ||
Comment 9•12 years ago
|
||
(In reply to Stephen Donner [:stephend] from comment #8) > (In reply to Corey Shields [:cshields] from comment #7) > > <snip> > > > > host-7-77:~ sdonner$ curl -ILk > > > 'http://download.mozilla.org?product=thunderbird-10.0.9esr- > > > complete&os=win&lang=it&force=1' > > > HTTP/1.0 500 Internal Server Error > > > Date: Fri, 12 Oct 2012 19:16:02 GMT > > > Server: Apache > > > X-Backend-Server: bouncer10.webapp.phx1.mozilla.com > > > > is this because you are missing a "/" before ?product=... > > host-7-77:~ sdonner$ curl -ILk > 'https://download.mozilla.org/?product=thunderbird=10.0.9esr- > complete&os=win&lang=en-US&force=1' > HTTP/1.1 404 Not Found > > Yes, that was it; thanks! Trust that the above is expected... Well, my original test was with http://download.mozilla.org/?product=thunderbird-10.0.9esr-complete&os=win&lang=it&force=1, but that's not 500'ing anymore either. But I can still reproduce with other URLs so that may just be a fluke: curl -IL "http://download.mozilla.org/?product=firefox-16.0.1&os=win&lang=it&force=1" HTTP/1.0 500 Internal Server Error Date: Fri, 12 Oct 2012 19:57:07 GMT Server: Apache X-Backend-Server: bouncer10.webapp.phx1.mozilla.com Set-Cookie: dmo=10.8.81.217.1350071827602061; path=/; expires=Sat, 12-Oct-13 19:57:07 GMT Connection: close Content-Type: text/html; charset=UTF-8
Comment 10•12 years ago
|
||
Looks like a problem with fallback so I would guess it's happening when we see a cdn blip. give Brandon a call to take a look.
Comment 11•12 years ago
|
||
Can we determine that this is in no way related to the brief bouncer we just had?
Reporter | ||
Updated•12 years ago
|
Whiteboard: [stub+]
Comment 12•12 years ago
|
||
cturra: how many of those php errors are you seeing? when did they start?
Reporter | ||
Comment 14•12 years ago
|
||
FWIW, I can only repro this on bouncer10 now: (buildbot)➜ l10n curl -IL "http://download.mozilla.org/?product=thunderbird-10.0.9esr-complete&os=win&lang=it&force=1" HTTP/1.0 500 Internal Server Error Date: Fri, 12 Oct 2012 20:45:03 GMT Server: Apache X-Backend-Server: bouncer10.webapp.phx1.mozilla.com Set-Cookie: dmo=10.8.81.216.1350074703392938; path=/; expires=Sat, 12-Oct-13 20:45:03 GMT Connection: close Content-Type: text/html; charset=UTF-8 I did about 20 requests, 2 of which were on bouncer10 and returned ISE 500.
Comment 15•12 years ago
|
||
can we try taking 10 out of the pool?
Comment 16•12 years ago
|
||
I'm taking a look at the code in that area to see what I can find. Knowing when the first occurrence of that error happened would be good, as would knowing if it's happening across all machines or just bouncer10.
Comment 17•12 years ago
|
||
hold on looking into this until we get through the network issues we are in right now (in case they are related) thanks
Assignee | ||
Updated•12 years ago
|
Assignee: afernandez → bburton
Whiteboard: [stub+] → [stub+][service-interrupt]
Comment 18•12 years ago
|
||
I believe I've confirmed the notice is not the root cause of the ISE. I will hold off as I suspect network issues may play a role here.
Comment 19•12 years ago
|
||
update mirror_mirrors set active = 0 where id=259; update mirror_mirrors set active = 0 where id=493; update mirror_mirrors set active = 1 where id=486; changed from active=1 to active =0 259 | Mozilla - Santa Clara 3 | http://ftp-zlb.vips.scl3.mozilla.com/pub/mozilla.org/ 493 | FTP Cluster - SSL | https://ftp-zlb.vips.scl3.mozilla.com/pub/mozilla.org changed from active=0 to active=1 486 | Mozilla CDN's | http://download.cdn.mozilla.net/pub/mozilla.org
Assignee | ||
Comment 20•12 years ago
|
||
Documenting how I disabled sentry, disabled all sentry and bouncer related crons [root@sentry1 cron.d]# cat * #0 17 10 * * root source /data/virtualenvs/tuxedo/bin/activate ; /var/www/django/tuxedo/manage.py updategeoip >/dev/null # update bouncer v1 #17 * * * * root /bin/bash /usr/local/bin/sentry.sh > /dev/null #*/5 * * * * root /bin/bash /usr/local/bin/sentry-multi.sh > /dev/null #42 14 * * * root /root/gm/autocommit.sh >/dev/null 2>/dev/null MAILTO="infra-notices@mozilla.com" PATH="/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin" */30 * * * * root /usr/local/sbin/puppetctl run >/dev/null 2>&1 MAILTO=root # run system activity accounting tool every 10 minutes #*/10 * * * * root /usr/lib64/sa/sa1 1 1 # generate a daily summary of process accounting at 23:53 #53 23 * * * root /usr/lib64/sa/sa2 -A #00 */4 * * * root cd /var/www/django/tuxedo; /data/virtualenvs/tuxedo/bin/python26 manage.py update_product_details All crons are re-enabled at this point
Comment 21•12 years ago
|
||
Can anybody reproduce this post-outage? I am unable to.
Comment 22•12 years ago
|
||
I have not been able to reproduce the issue. I recommend that we file a follow up bug to improve the logging of the database layer, as I suspect that was the root cause here. Since the errors are suppressed, we have a condition where database connections can fail and nobody knows why. Now that we have a solid test plan in place, we should be able to move forward with releases more easily.
Comment 23•12 years ago
|
||
Please reopen if this becomes an issue again, but there were 2 things at play here: a network issue in SCL3 and passive monitoring enabled in the new bouncer prod pool which causes zeus to drain the pool unnecessarily without adding nodes back in, thus choking and DoS'ing the service.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 24•12 years ago
|
||
I've just rerun releng automation -- worked fine. Thanks a lot.
Status: RESOLVED → VERIFIED
Comment 25•12 years ago
|
||
The bouncer machines still have 17k-18k sockets in CLOSE_WAIT to the db machines. The workaround I put in, in bug 800894 is a) a workaround, not a fix b) not in puppet so if any new hosts get turned up/replace/etc they will not have the workaround and will throw 500s when we get high traffic. My opinion is that this isn't fixed, just the symptoms are gone. There absolutely needs to be logging of failures at the DB level but it also needs to not open and close thousands upon thousands of connections to the DB in the first place, that doesn't scale.
Comment 26•12 years ago
|
||
Here's what I'm going to do, after discussions with Sheeri: - Update the Bouncer code to use mysql_pconnect for persistent connections instead of mysql_connect (today) - Update the Bouncer code to use a more modern mysql library (current one is deprecated and a new one will get us better perf) - Sheeri will upgrade MySQL from 5.0 to 5.1
Comment 27•12 years ago
|
||
laura pulled me in on this. We can increase the thread caching on the slaves (this is a representative sample, over 100k threads created and the thread cache is only 500): mysql> show variables like 'thread_cache_size'; +-------------------+-------+ | Variable_name | Value | +-------------------+-------+ | thread_cache_size | 500 | +-------------------+-------+ 1 row in set (0.00 sec) mysql> show status like 'Threads_c%'; +-------------------+--------+ | Variable_name | Value | +-------------------+--------+ | Threads_cached | 492 | | Threads_connected | 8 | | Threads_created | 111049 | +-------------------+--------+ 3 rows in set (0.00 sec) The master is fine, thread_cache_size is 500 but just over 200 threads have been created total, and all but 5 are cached. I'll update the thread_cache_size on the slaves. Also, bouncer prod is on MySQL 5.0, so I've linked bug 790391, the upgrade bug.
Comment 28•12 years ago
|
||
Set the thread_cache_size to 10,000 on all the slaves.
Comment 29•12 years ago
|
||
1. Use pconnect: https://bugzilla.mozilla.org/show_bug.cgi?id=801685 2. Use a better mysql lib: https://bugzilla.mozilla.org/show_bug.cgi?id=801687
Comment 30•12 years ago
|
||
(In reply to Peter Radcliffe [:pir] from comment #25) > The bouncer machines still have 17k-18k sockets in CLOSE_WAIT to the db > machines. I believe this is the "normal" level of connections for this app. It gets hit a lot, but 99% of the responses are a very simple 302 redirect so the bandwidth is very low. GeoIP has in the past been the culprit for causing DB connection problems, especially during a release day/week. I turned it off on October 11 on the admin node, and I *think* I deployed it immediately thereafter... but I can't be sure. It's remotely possible that this was on and contributing to the problems. It's also possible that we ran into trouble in spite of GeoIP being disabled... it was a release week, and it could be that even with GeoIP off our sheer connection rate is still too high now. It could also be that something about the new cluster is triggering problems where there weren't any before. > The workaround I put in, in bug 800894 is a) a workaround, not a fix b) not > in puppet so if any new hosts get turned up/replace/etc they will not have > the workaround and will throw 500s when we get high traffic. Agreed, that isn't a real solution. However, it's a good start, and I think we should strongly consider putting 'tcp_tw_reuse = 1' in our base puppet modules... we've used it before, and it hasn't caused us any trouble that I'm aware of. Possibly also something like this: net.ipv4.ip_local_port_range="10240 64000" This widens the outgoing port range from 32768-61000 (28k up to 54k... without dropping so low that we cause problems with things on port 9090 or 9000 (which we have some of). Those two tweaks should gain us a considerable amount of additional outgoing connection capacity, across our whole infrastructure.
Comment 31•12 years ago
|
||
Made related bug https://bugzilla.mozilla.org/show_bug.cgi?id=801777 for tracking ganglia graphing for the bouncer db cluster.
Comment 32•12 years ago
|
||
Bouncer graphs are at https://ganglia-phx1.mozilla.org/ganglia/?r=day&cs=&ce=&m=load_one&s=by+name&c=TP-Bouncer+DB&h=tp-bouncer01-slave01.phx.mozilla.com&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
Comment 33•12 years ago
|
||
After some bouncer fixes have been made the machines look considerably healthier: [root@bouncer6.webapp.phx1 pradcliffe]# netstat -an|awk '/tcp/ {print $6}'|sort|uniq -c 123 ESTABLISHED 20 FIN_WAIT2 8 LISTEN 307 TIME_WAIT I think we should monitor TIME_WAIT levels on webheads and maybe other places too. Opinions?
Comment 34•12 years ago
|
||
I'm fine with that if we can set reasonable expectations on what they should be and how people should respond. An alert that could conceivably page someone should be actionable, and I'm not sure this really would be. In the case of a server with high TIME_WAITs, what functionality is affected, and what action should be taken? I guess my underlying concern is that a high level of TIME_WAITs, like a high load average, is not necessarily a problem in and of itself. The real problem I think we ought to care about is "I can't open a new socket" (or, "I'm close to not being able to open a new socket"). If we turn on "tcp_tw_reuse" everywhere, it should be rather difficult for us to get into this situation due to TIME_WAITs alone.
Comment 35•12 years ago
|
||
If not alert on it (and I do agree with you about actionable alerts, I don't like the load pages for similar reasons) can we maybe graph them in ganglia so we at least have a reference for when a problem starts if we see issues? (I'd also like to see queries per second on webservers, db servers, etc, graphed since that makes it far easier to tell if a load issue is from higher input or changed behaviour, etc).
Comment 36•12 years ago
|
||
We already do have TIME_WAIT graphed in ganglia (I was looking at scl3 rather than phx1 ganglia). With that, an alert in irc set to a high number, with a low alert repeat rate and decent documentation (alert docs to go look at ganglia, see if the change can be correlated with anything and to talk to devs about it?) could be useful, I think.
Updated•11 years ago
|
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Updated•5 years ago
|
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•