Bug 737144: 3/19 AMO degradation
Opened 13 years ago · Closed 13 years ago
Categories: Cloud Services :: Operations: Marketplace (task)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: cshields; Assigned: jason
Description
At 12:27, zeus started pulling AMO nodes because they were timing out or refusing connections. Still working on determining the root cause.
Reporter • Comment 2 • 13 years ago
web7 and web14 are showing the following errors (truncated):
[Mon Mar 19 12:27:46 2012] [alert] (11)Resource temporarily unavailable: setuid: unable to change to uid: 48
[Mon Mar 19 12:27:47 2012] [emerg] (22)Invalid argument: couldn't grab the accept mutex
This may be unrelated, given that more nodes than these became unavailable to zeus.
Reporter • Comment 4 • 13 years ago
(In reply to Corey Shields [:cshields] from comment #2)
> this may be unrelated given that more nodes than this became unavailable to
> zeus.
I should take that comment back; the timing of the outage lines up.
Comment 5 • 13 years ago
I reviewed the events from AMO and found nothing unusual. There was one CEF event for a CSP violation from 31.63.160.63; however, this appears to be a red herring. I have no other event/log sources to explain this behavior. Thanks
Reporter • Comment 6 • 13 years ago
web7 and web14 were restarted. Jason has also restarted apache across the cluster (a rolling restart), just to be on the safe side here.
Hit rates through zeus and the netscaler were normal during this time (no obvious increase or attack).
The web heads became inaccessible to both zeus and the netscaler.
Assignee • Updated • 13 years ago
Assignee: server-ops → jthomas
Assignee • Comment 7 • 13 years ago
(In reply to Corey Shields [:cshields] from comment #2)
> web7 and web14 are showing the following errors (truncated):
>
> [Mon Mar 19 12:27:46 2012] [alert] (11)Resource temporarily unavailable:
> setuid: unable to change to uid: 48
> [Mon Mar 19 12:27:47 2012] [emerg] (22)Invalid argument: couldn't grab the
> accept mutex
>
> this may be unrelated given that more nodes than this became unavailable to
> zeus.
This also occurred on web14, web15, and again on web7 around 13:30 PDT.
We were able to replicate this issue using apache benchmark with 200+ concurrent requests to a web node. Apache's AcceptMutex defaults to sysvsem, which is not recommended for our setup. Jeremy modified the apache configuration to use posixsem instead. This looks to have fixed this specific issue, and apache was stable with 400+ concurrent requests.
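For reference, a rough sketch of the reproduction and the config change (the hostname, request counts, and config location below are illustrative assumptions, not the exact values used):

  # hammer a single web node with ~200 concurrent requests via apache benchmark
  ab -n 10000 -c 200 http://web7.addons.phx1.mozilla.com/

  # httpd.conf (prefork MPM): switch the accept-serialization lock from the
  # SysV semaphore default to a POSIX semaphore
  AcceptMutex posixsem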
Comment 8 • 13 years ago
This happened again at 04:19; oddly, all of AMO returned Service Unavailable just because one node wasn't responding correctly:
04:19:28 < nagios-phx1> [109] web14.addons.phx1:web14.addons.phx1.mozilla.com - zamboni monitor is CRITICAL: CRITICAL: http://web14.addons.phx1:81/z/services/monitor.json is broken
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
httpd[7981]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca65590 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[8034]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca61190 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[7993]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca5f510 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[8030]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca5b210 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[7969]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca66510 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[8009]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca6a410 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[7963]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca69bd0 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[7967]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca4b390 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[8007]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca61e50 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
Reporter • Comment 9 • 13 years ago
(In reply to Ashish Vijayaram [:ashish] from comment #8)
> This happened again at 0419, oddly all of AMO returned Service Unavailable
> just because 1 node wasn't responding correctly:
...and this is again something to do with the web nodes (as opposed to a zeus issue); the netscaler pulled them all too.
Reporter • Comment 10 • 13 years ago
(In reply to Ashish Vijayaram [:ashish] from comment #8)
> nf_conntrack: table full, dropping packet.
Jason increased the max for conntrack.
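A minimal sketch of that kind of nf_conntrack_max bump (the exact value chosen isn't recorded here; 262144 below is just an illustrative number):

  # check current usage against the limit
  sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

  # raise the ceiling at runtime
  sysctl -w net.netfilter.nf_conntrack_max=262144

  # persist across reboots
  echo 'net.netfilter.nf_conntrack_max = 262144' >> /etc/sysctl.conf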
Reporter • Comment 11 • 13 years ago
We are now looking into core dumps present on most of the webheads; many of them coincide with these outages, and they have been occurring since 3/10.
Reporter • Comment 12 • 13 years ago
(In reply to Corey Shields [:cshields] from comment #11)
> We are now looking into core dumps that are apparent across most of the
> webheads, many of which coincide with these outages, and have been occurring
> since 3/10.
"since 3/10" is misleading - dumps were cleared out of /tmp before 10 days ago. yet subdirs were left alone... herp derp
Assignee • Comment 13 • 13 years ago
During AMO timeouts we noticed that redis was spiking.
A redis cleanup cron (bug 616865) that normally runs every 4 hours began to fail due to timeouts on connections to redis. This started after a settings change to the redis socket_timeout (bug 734322).
I have reverted the change and manually run the cleanup.
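For illustration only, roughly how a tight socket_timeout breaks a long-running cleanup pass; this is a hypothetical redis-py sketch, not the actual AMO settings change from bug 734322:

  import redis

  # With a short socket_timeout, any command that blocks longer than the
  # timeout raises an error, so a cleanup job walking a large keyspace can
  # start failing part-way through once such a change lands.
  tight = redis.StrictRedis(host="localhost", port=6379, socket_timeout=0.5)

  # Reverting to no explicit timeout restores the previous behavior and lets
  # the 4-hourly cron run to completion.
  relaxed = redis.StrictRedis(host="localhost", port=6379, socket_timeout=None)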
Comment 14 • 13 years ago
I'm seeing lots of translations updates for things that look like recent additions/modifications.
| 11161303 | addons_prod | 10.8.70.203:51714 | addons_mozilla_org | Query | 29 | Updating | UPDATE `translations` SET `created` = '2011-03-06 09:16:11', `modified` = '2012-03-21 09:07:08', `id` = 1772015, `locale` = 'en-us', `localized_string` = 'If you write a blog for your students, Zemanta will look for additional resources to add to your blog on the topic that you are writing about. Use this to find additional resources for Assignment Pages and research projects, as well. Even works in some email programs like GMail.', `localized_string_clean` = 'If you write a blog for your students, Zemanta will look for additional resources to add to your blog on the topic that you are writing about. Use this to find additional resources for Assignment Pages and research projects, as well. Even works in some email programs like GMail.' WHERE `translations`.`autoid` = 1858557 |
| 11163068 | addons_prod | 10.8.70.203:55886 | addons_mozilla_org | Query | 34 | Updating | UPDATE `translations` SET `created` = '2012-02-19 15:48:21', `modified` = '2012-03-21 09:07:03', `id` = 2969753, `locale` = 'en-us', `localized_string` = 'Preview and navigate tab contents through popup.', `localized_string_clean` = 'Preview and navigate tab contents through popup.' WHERE `translations`.`autoid` = 3088134 |
| 11163691 | addons_prod | 10.8.70.203:57298 | addons_mozilla_org | Query | 9 | Updating | UPDATE `translations` SET `created` = '2012-02-19 15:48:21', `modified` = '2012-03-21 09:07:28', `id` = 2969753, `locale` = 'en-us', `localized_string` = 'Preview and navigate tab contents through popup.', `localized_string_clean` = 'Preview and navigate tab contents through popup.' WHERE `translations`.`autoid` = 3088134
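For reference, rows like the above come straight from the server's process list; a quick way to keep an eye on these (plain MySQL, nothing AMO-specific):

  SHOW FULL PROCESSLIST;

  -- or just the long-running UPDATEs
  SELECT ID, TIME, INFO
  FROM information_schema.PROCESSLIST
  WHERE STATE = 'Updating' AND TIME > 5;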
Comment 15 • 13 years ago
Note: there's nothing locking the translations table as far as I can see, no iowait or anything like that.
Comment 16 • 13 years ago
I analyzed some core dumps that were being generated on the webheads.
Nothing is 100% definitive, but the core dumps lead me to believe there is a flaw in CakePHP's use of session_write_close combined with the APC cache handler.
My theory is that CakePHP tears down variables prematurely, so that when the APC cache handler later tries to act on the variable, active_opline->opcode is already null.
This is the line referenced in the core dump:
if (active_opline->opcode != ZEND_FETCH_CLASS) {
spl.c line 330
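For anyone re-checking these, a rough sketch of how a dump can be inspected (the core file path below is an assumed example based on the PIDs in the segfault log, with debuginfo packages installed):

  # open one of the httpd cores with the matching binary
  gdb /usr/sbin/httpd /tmp/core.7981

  # inside gdb: full backtrace, then the faulting frame around the spl.c line
  (gdb) bt full
  (gdb) frame 0
  (gdb) list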
Reporter • Comment 17 • 13 years ago
We were down a DB, and as of 11:00 that db was put back into rotation. GUID hits are no longer lagging, and I think this should solve a lot of problems.
We need more redundancy here and will be adding 2 more db slaves to the pool today (bug 737935).
Reporter • Comment 18 • 13 years ago
(In reply to Corey Shields [:cshields] from comment #17)
> We were down a DB and as of 11:00 that db was put back into rotation.. GUID
> hits are no longer lagging and I think this should solve a lot of problems.
That db slave went down yesterday close to 03:00. It needed a full restore from backup and a consistency check (and fix), and we ran it for a while out of production before putting it back in, hence the long time bringing it back into service. That was reported and tracked in bug 737360.
Reporter • Comment 19 • 13 years ago
Update for today: AMO/Marketplace performance has been fine all day, and has been that way since the DB was put back into rotation at 11:00 yesterday.
Today we have added 2 more DB slaves to the pool.
We have also increased the depth of monitoring of the database servers by adding cacti-mysql-template graphs to cacti.
Assignee • Comment 20 • 13 years ago
AMO/Marketplace has been fine the last couple of days.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Component: Server Operations: AMO Operations → Operations: Marketplace
Product: mozilla.org → Mozilla Services