Bug 737144: 3/19 AMO degradation
Opened 13 years ago · Closed 13 years ago
Categories: Cloud Services :: Operations: Marketplace (task)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: cshields; Assigned: jason
Description
At 12:27, zeus started pulling AMO nodes because they were timing out or refusing connections. Still working on determining the root cause.
Reporter • Comment 2 • 13 years ago
web7 and web14 are showing the following errors (truncated):
[Mon Mar 19 12:27:46 2012] [alert] (11)Resource temporarily unavailable: setuid: unable to change to uid: 48
[Mon Mar 19 12:27:47 2012] [emerg] (22)Invalid argument: couldn't grab the accept mutex
This may be unrelated, given that more nodes than these became unavailable to zeus.
Reporter • Comment 4 • 13 years ago
(In reply to Corey Shields [:cshields] from comment #2)
> this may be unrelated given that more nodes than this became unavailable to
> zeus.
I should take that comment back; the timing of the outage lines up.
Comment 5 • 13 years ago
I reviewed the events from AMO and found nothing unusual. There was one CEF event for a CSP violation from 31.63.160.63; however, this appears to be a red herring. I have no other event/log sources to explain this behavior. Thanks
Reporter • Comment 6 • 13 years ago
web7 and web14 were restarted. Jason has also restarted apache across the cluster (a rolling restart), just to be on the safe side here.
Hit rates through zeus and the netscaler were normal during this time (no obvious increase or attack).
The web heads became inaccessible to both zeus and the netscaler.
Assignee • Updated • 13 years ago
Assignee: server-ops → jthomas
Assignee • Comment 7 • 13 years ago
(In reply to Corey Shields [:cshields] from comment #2)
> web7 and web14 are showing the following errors (truncated):
>
> [Mon Mar 19 12:27:46 2012] [alert] (11)Resource temporarily unavailable:
> setuid: unable to change to uid: 48
> [Mon Mar 19 12:27:47 2012] [emerg] (22)Invalid argument: couldn't grab the
> accept mutex
>
> this may be unrelated given that more nodes than this became unavailable to
> zeus.
This also occurred on web14, web15, and again on web7 around 13:30 PDT.
We were able to replicate this issue using apache benchmark with 200+ concurrent requests to a web node. Apache's AcceptMutex defaults to sysvsem, which is not recommended for our setup. Jeremy modified the apache configuration to use posixsem instead. This looks to have fixed this specific issue, and apache was stable with 400+ concurrent requests.
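For reference, a rough sketch of the reproduction and the config change (the hostname, request counts, and config location below are illustrative assumptions, not the exact values used):

  # hammer a single web node with ~200 concurrent requests via apache benchmark
  ab -n 10000 -c 200 http://web7.addons.phx1.mozilla.com/

  # httpd.conf (prefork MPM): switch the accept-serialization lock from the
  # SysV semaphore default to a POSIX semaphore
  AcceptMutex posixsem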
Comment 8 • 13 years ago
This happened again at 04:19; oddly, all of AMO returned Service Unavailable just because one node wasn't responding correctly:
04:19:28 < nagios-phx1> [109] web14.addons.phx1:web14.addons.phx1.mozilla.com - zamboni monitor is CRITICAL: CRITICAL: http://web14.addons.phx1:81/z/services/monitor.json is broken
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
nf_conntrack: table full, dropping packet.
httpd[7981]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca65590 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[8034]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca61190 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[7993]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca5f510 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[8030]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca5b210 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[7969]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca66510 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[8009]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca6a410 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[7963]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca69bd0 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[7967]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca4b390 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
httpd[8007]: segfault at 30 ip 00007f88d5abef67 sp 00007fff5ca61e50 error 4 in libapr-1.so.0.3.9[7f88d5aab000+2b000]
Reporter • Comment 9 • 13 years ago
(In reply to Ashish Vijayaram [:ashish] from comment #8)
> This happened again at 0419, oddly all of AMO returned Service Unavailable
> just because 1 node wasn't responding correctly:
...and this is again something to do with the web nodes (as opposed to a zeus issue); the netscaler pulled them all too.
Reporter • Comment 10 • 13 years ago
(In reply to Ashish Vijayaram [:ashish] from comment #8)
> nf_conntrack: table full, dropping packet.
Jason increased the max for conntrack.
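A minimal sketch of that kind of nf_conntrack_max bump (the exact value chosen isn't recorded here; 262144 below is just an illustrative number):

  # check current usage against the limit
  sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max

  # raise the ceiling at runtime
  sysctl -w net.netfilter.nf_conntrack_max=262144

  # persist across reboots
  echo 'net.netfilter.nf_conntrack_max = 262144' >> /etc/sysctl.conf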
Reporter • Comment 11 • 13 years ago
We are now looking into core dumps present on most of the webheads; many of them coincide with these outages, and they have been occurring since 3/10.
Reporter • Comment 12 • 13 years ago
(In reply to Corey Shields [:cshields] from comment #11)
> We are now looking into core dumps that are apparent across most of the
> webheads, many of which coincide with these outages, and have been occurring
> since 3/10.
"since 3/10" is misleading - dumps were cleared out of /tmp before 10 days ago. yet subdirs were left alone... herp derp
Assignee • Comment 13 • 13 years ago
During AMO timeouts we noticed that redis was spiking.
A redis cleanup cron (bug 616865) that normally runs every 4 hours began to fail due to timeouts on connections to redis. This started after a settings change to the redis socket_timeout (bug 734322).
I have reverted the change and manually run the cleanup.
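For illustration only, roughly how a tight socket_timeout breaks a long-running cleanup pass; this is a hypothetical redis-py sketch, not the actual AMO settings change from bug 734322:

  import redis

  # With a short socket_timeout, any command that blocks longer than the
  # timeout raises an error, so a cleanup job walking a large keyspace can
  # start failing part-way through once such a change lands.
  tight = redis.StrictRedis(host="localhost", port=6379, socket_timeout=0.5)

  # Reverting to no explicit timeout restores the previous behavior and lets
  # the 4-hourly cron run to completion.
  relaxed = redis.StrictRedis(host="localhost", port=6379, socket_timeout=None)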
Comment 14 • 13 years ago
I'm seeing lots of translations updates for things that look like recent additions/modifications.
| 11161303 | addons_prod | 10.8.70.203:51714 | addons_mozilla_org | Query | 29 | Updating | UPDATE `translations` SET `created` = '2011-03-06 09:16:11', `modified` = '2012-03-21 09:07:08', `id` = 1772015, `locale` = 'en-us', `localized_string` = 'If you write a blog for your students, Zemanta will look for additional resources to add to your blog on the topic that you are writing about. Use this to find additional resources for Assignment Pages and research projects, as well. Even works in some email programs like GMail.', `localized_string_clean` = 'If you write a blog for your students, Zemanta will look for additional resources to add to your blog on the topic that you are writing about. Use this to find additional resources for Assignment Pages and research projects, as well. Even works in some email programs like GMail.' WHERE `translations`.`autoid` = 1858557 |
| 11163068 | addons_prod | 10.8.70.203:55886 | addons_mozilla_org | Query | 34 | Updating | UPDATE `translations` SET `created` = '2012-02-19 15:48:21', `modified` = '2012-03-21 09:07:03', `id` = 2969753, `locale` = 'en-us', `localized_string` = 'Preview and navigate tab contents through popup.', `localized_string_clean` = 'Preview and navigate tab contents through popup.' WHERE `translations`.`autoid` = 3088134 |
| 11163691 | addons_prod | 10.8.70.203:57298 | addons_mozilla_org | Query | 9 | Updating | UPDATE `translations` SET `created` = '2012-02-19 15:48:21', `modified` = '2012-03-21 09:07:28', `id` = 2969753, `locale` = 'en-us', `localized_string` = 'Preview and navigate tab contents through popup.', `localized_string_clean` = 'Preview and navigate tab contents through popup.' WHERE `translations`.`autoid` = 3088134
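For reference, rows like the above come straight from the server's process list; a quick way to keep an eye on these (plain MySQL, nothing AMO-specific):

  SHOW FULL PROCESSLIST;

  -- or just the long-running UPDATEs
  SELECT ID, TIME, INFO
  FROM information_schema.PROCESSLIST
  WHERE STATE = 'Updating' AND TIME > 5;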
Comment 15 • 13 years ago
Note: there's nothing locking the translations table as far as I can see, no iowait or anything like that.
Comment 16 • 13 years ago
I analyzed some core dumps that were being generated on the webheads.
Nothing is 100% definitive, but the core dumps lead me to believe there is a flaw in CakePHP's use of session_write_close combined with the APC cache handler.
My theory is that CakePHP tears down variables prematurely, so that when the APC cache handler later tries to act on the variable, active_opline->opcode is already null.
This is the line referenced in the core dump:
if (active_opline->opcode != ZEND_FETCH_CLASS) {
spl.c line 330
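For anyone re-checking these, a rough sketch of how a dump can be inspected (the core file path below is an assumed example based on the PIDs in the segfault log, with debuginfo packages installed):

  # open one of the httpd cores with the matching binary
  gdb /usr/sbin/httpd /tmp/core.7981

  # inside gdb: full backtrace, then the faulting frame around the spl.c line
  (gdb) bt full
  (gdb) frame 0
  (gdb) list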
Reporter • Comment 17 • 13 years ago
We were down a DB, and as of 11:00 that db was put back into rotation. GUID hits are no longer lagging, and I think this should solve a lot of problems.
We need more redundancy here and will be adding 2 more db slaves to the pool today (bug 737935).
Reporter • Comment 18 • 13 years ago
(In reply to Corey Shields [:cshields] from comment #17)
> We were down a DB and as of 11:00 that db was put back into rotation.. GUID
> hits are no longer lagging and I think this should solve a lot of problems.
That db slave went down yesterday close to 03:00. It needed a full restore from backup and a consistency check (and fix), and we ran it for a while out of production before putting it back in, hence the long time bringing it back into service. That was reported and tracked in bug 737360.
Reporter • Comment 19 • 13 years ago
Update for today: AMO/Marketplace performance has been fine all day, and has been that way since the DB was put back into rotation at 11:00 yesterday.
Today we have added 2 more DB slaves to the pool.
We have also increased the depth of monitoring of the database servers by adding cacti-mysql-template graphs to cacti.
Assignee • Comment 20 • 13 years ago
AMO/Marketplace has been fine the last couple of days.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Component: Server Operations: AMO Operations → Operations: Marketplace
Product: mozilla.org → Mozilla Services