Closed Bug 770271 (Opened 12 years ago, Closed 12 years ago)

Help investigate timeouts on developer-new.mozilla.org

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Platform: All
OS: Other
Type: task
Priority: Not set
Severity: critical
Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 772468

People

(Reporter: lorchard, Assigned: nmaul)

References

Details

Bug 769990 was filed this weekend, and it looks like there are a lot of delays and timeouts on developer-new. I'm waiting on a fix for SSH access (bug 766580) before I can read the logs, so I'm not sure what's up.

Is there anything obvious going on with the servers that would tie things up?
Blocks: 769990
Assignee: server-ops-webops → nmaul
Severity: minor → normal
I commented on that bug... I suspect this had a lot to do with the "leap second" bugs we encountered over the weekend. Those hit MySQL and Java apps heavily (and not just Mozilla infra), so it seems likely Node.js was affected too (we don't run enough Node.js to say for sure).

I restarted Kumascript and Apache, and all seems well again.

I'm inclined to leave it at that. If you need something more in-depth, let me know.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
This is still happening. I don't think it's related to the Kumascript service, though. For example, this URL never touches Kumascript:

https://developer-new.mozilla.org/en-US/docs/HTML/HTML5?raw=1

That request just fetches the raw page content from the Django app, and it seems to take a really long time. If anything, Kumascript is having issues because it can't fetch resources quickly from the Django side.
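
For reference, that check can be scripted so it also records which webhead answered. A minimal Python 3 sketch, standard library only, with the URL taken verbatim from above (nothing else assumed):

# Time one request to the raw-content URL (bypasses Kumascript entirely)
# and note which webhead answered via the X-Backend-Server header.
import time
import urllib.request

URL = "https://developer-new.mozilla.org/en-US/docs/HTML/HTML5?raw=1"

start = time.time()
with urllib.request.urlopen(URL, timeout=60) as resp:
    body = resp.read()
    backend = resp.headers.get("X-Backend-Server", "unknown")
    status = resp.status
print("HTTP %d from %s in %.1fs (%d bytes)"
      % (status, backend, time.time() - start, len(body)))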
Severity: normal → minor
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Severity: minor → normal
I'm trying to tell whether this might have something to do with any particular webhead.

Sometimes requests came back fast, and when they did, the response carried the header X-Backend-Server: developer1.webapp.scl3.mozilla.com.

Otherwise, requests never seemed to come back at all and eventually timed out.
I tried curl against the 3 nodes directly, and at first couldn't replicate slowness on any of them. I then disabled nodes 1 and 2 in the LB, leaving only 3 alive. At that point, curl requests directly to any of the 3 nodes became slow... but only temporarily. 1 and 2 became "fast" again after a couple of attempts; 3 never did.

HOWEVER, when I changed this to query only 2 instead (disabling 1,3), then 2 started showing the problem and 3 was okay again.

When I changed the LB to have only 1 enabled (2,3 disabled), then 1 started to be slow and 2,3 were fast.

At other times, I would still encounter occasional slowness that didn't match any of the above configurations. In the end, I believe the particular node being used has no significant impact. I can't state that with 100% confidence, but the testing I've done shows no clear pattern pointing at one or more "bad" nodes.
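
For illustration, the per-node checks were along these lines. A rough Python 3 sketch: developer1.webapp.scl3.mozilla.com comes from the X-Backend-Server header above, while the developer2/developer3 hostnames and plain-HTTP access to the nodes with a Host header are assumptions, not confirmed details:

# Rough per-node timing check: hit each webhead directly, bypassing the LB.
import time
import urllib.request

NODES = [
    "developer1.webapp.scl3.mozilla.com",  # seen in X-Backend-Server
    "developer2.webapp.scl3.mozilla.com",  # assumed hostname
    "developer3.webapp.scl3.mozilla.com",  # assumed hostname
]
PATH = "/en-US/docs/HTML/HTML5?raw=1"

for node in NODES:
    req = urllib.request.Request(
        "http://%s%s" % (node, PATH),  # assumes plain HTTP is open internally
        headers={"Host": "developer-new.mozilla.org"},
    )
    start = time.time()
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            resp.read()
        print("%-45s %.1fs" % (node, time.time() - start))
    except Exception as exc:
        print("%-45s failed after %.1fs (%s)" % (node, time.time() - start, exc))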



During this testing I noticed that the error_log for at least one of the nodes was filled with this:

[Mon Jul 02 14:43:06 2012] [error] [client 10.22.81.210] Script timed out before returning headers: kuma.wsgi
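
For what it's worth, a quick way to quantify those entries is to bucket them per hour. A small Python 3 sketch; the error_log path is assumed, not the node's actual configuration:

# Count "Script timed out" mod_wsgi errors per hour in an Apache error_log.
import collections
import re

LOG_PATH = "/var/log/httpd/error_log"  # assumed location; adjust per node
# Captures "Mon Jul 02 14" from "[Mon Jul 02 14:43:06 2012] [error] ..."
STAMP = re.compile(r"^\[(\w+ \w+ \d+ \d+):\d+:\d+ \d+\]")

counts = collections.Counter()
with open(LOG_PATH) as log:
    for line in log:
        if "Script timed out before returning headers" in line:
            match = STAMP.match(line)
            if match:
                counts[match.group(1)] += 1

for hour, count in sorted(counts.items()):  # sorted by bucket string
    print("%s: %d timeouts" % (hour, count))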


:groovecoder tells me that the only dependencies for a URL like that are MySQL and memcache... not even Kumascript is involved. Both of these are working, and I can't find or replicate any slowness in either. There was a MySQL issue over the weekend (bug 769936), but that should be resolved (the bug is open for further analysis, but the issue itself is resolved)... and it would have affected current-prod as well, not just -new. Multiple people have confirmed that current-prod is *not* affected.
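
For completeness, checking those two dependencies directly looks roughly like this. A sketch only: it assumes the python-memcached and MySQLdb client libraries, and the hostnames and credentials are placeholders rather than the real ones:

# Quick latency check of the two dependencies behind a ?raw=1 request:
# memcached and MySQL. Hostnames and credentials are placeholders.
import time

import memcache   # python-memcached
import MySQLdb    # MySQL-python / mysqlclient

start = time.time()
mc = memcache.Client(["memcache1.example.scl3.mozilla.com:11211"])  # placeholder
mc.set("webops-healthcheck", "ok", time=60)
assert mc.get("webops-healthcheck") == "ok"
print("memcache set/get: %.3fs" % (time.time() - start))

start = time.time()
db = MySQLdb.connect(host="db1.example.scl3.mozilla.com",  # placeholder
                     user="kuma", passwd="...", db="kuma")  # placeholders
cur = db.cursor()
cur.execute("SELECT 1")
cur.fetchone()
print("mysql connect+query: %.3fs" % (time.time() - start))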
Group: infra
Depends on: 770410
Do we have any progress here? Starting to make me nervous. :)
Blocks: 770410
No longer depends on: 770410
Yeah, I'm out of ideas myself, but this is a flat-out blocker for the soft launch we've been planning, which is 2 days away.
Marking this as critical. We are planning to do a soft launch on July 5th and a full launch on the 15th, and development work is being severely hindered by this issue.
Severity: normal → critical
This appears to be much improved now. We disabled CACHE_BACKEND on developer-new, meaning that Kuma no longer uses memcache.

For a while it felt like this hurt the "really fast" requests that did work, but it also removed the chance of a request taking a ridiculously long time. Lately, however, things seem to be generally pretty fast for me across the board.

Also, we are currently running on just one of the 3 web nodes, although this is almost certainly not a factor, as problems have been seen in every configuration of active nodes we tried. However, in the interest of stability, I'm leaving this as-is for now. We can deal with it later.


Here is the set of links I've been using to judge response time (a small timing sketch for checking them follows the list). None of these should take more than 15 seconds now, compared to before, when they regularly took upwards of 30 seconds and sometimes never returned at all.

https://developer-new.mozilla.org/media/revision.txt (insanely fast, never slow)
https://developer-new.mozilla.org/en-US/some-random-file (should 404)
https://developer-new.mozilla.org/en-US/docs/HTML/HTML5?raw=1 (no Kumascript)
https://developer-new.mozilla.org/admin/
https://developer-new.mozilla.org/en-US/learn/ (very simple code powering this)
https://developer-new.mozilla.org/en-US/
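
A minimal Python 3 loop to spot-check the whole list against that 15-second bar (standard library only; the URLs are exactly the ones listed above):

# Spot-check each test URL against the ~15 second bar mentioned above.
import time
import urllib.error
import urllib.request

URLS = [
    "https://developer-new.mozilla.org/media/revision.txt",
    "https://developer-new.mozilla.org/en-US/some-random-file",      # expected 404
    "https://developer-new.mozilla.org/en-US/docs/HTML/HTML5?raw=1",
    "https://developer-new.mozilla.org/admin/",
    "https://developer-new.mozilla.org/en-US/learn/",
    "https://developer-new.mozilla.org/en-US/",
]

for url in URLS:
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as err:   # e.g. the intentional 404
        status = err.code
    except Exception as exc:
        print("FAIL  %-60s %s" % (url, exc))
        continue
    elapsed = time.time() - start
    flag = "SLOW" if elapsed > 15 else "ok  "
    print("%s  %-60s %3d  %.1fs" % (flag, url, status, elapsed))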


If this does indeed remain fast and stable, then it would appear the issue is somehow related to memcache... bad keys, a middleware bug, a connectivity issue, etc. However, if Kumascript also stays healthy, that should rule out most infra-level issues with memcache, since Kumascript is still configured to use it.
"locmem" is what Django is falling back to in the absence of a CACHE_BACKEND in settings_local.py. This works but causes cache discrepancy issues- it's per-process, and so doesn't scale to multiple processes or multiple servers.

We tried the db cache backend (https://docs.djangoproject.com/en/1.2/topics/cache/), and it works, but it appears to cause occasional deadlocks... not sure why, exactly. We then tried a MEMORY MySQL table instead of the default InnoDB, and that didn't appear to work at all... I suspect some limitation on row length or column types, but I don't know.
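
For context, in Django 1.2-era settings all of this is toggled with a single CACHE_BACKEND string in settings_local.py. Roughly, the variants discussed here look like the following sketch; the hostnames and table name are placeholders, not Kuma's actual values:

# settings_local.py (Django 1.2-style) -- cache backend variants discussed above.
# Hostnames and the table name are placeholders.

# Memcached (the original configuration):
CACHE_BACKEND = "memcached://memcache1.example:11211;memcache2.example:11211/"

# Database cache (worked, but showed occasional deadlocks); the table is
# created with: python manage.py createcachetable cache_table
# CACHE_BACKEND = "db://cache_table"

# Local-memory cache -- the fallback Django uses when CACHE_BACKEND is unset;
# per-process, so it isn't shared across workers or servers.
# CACHE_BACKEND = "locmem://"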

We found an issue with current-prod and -new both writing to the same memcached nodes. This was fixed.

We found that there were too few mod_wsgi worker processes and that they were being recycled too quickly. We increased the settings from 4 workers / 200 max requests to 16 workers / 5000 max requests, which is more in line with what SUMO and Bedrock use. Going higher is possible, but benchmarking indicates this is already a major improvement in capacity... more thorough testing would be needed to say for sure whether even higher settings are beneficial.
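
To give a sense of what the benchmarking covered, a minimal concurrent-load check along these lines shows whether the larger worker pool holds up under parallel requests. This is a sketch only: the concurrency level and URL are arbitrary choices, and it is not the benchmark that was actually run:

# Minimal concurrent-load check: fire N parallel requests and report timings.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://developer-new.mozilla.org/en-US/docs/HTML/HTML5?raw=1"
CONCURRENCY = 16  # matches the new worker count, purely as a starting point

def fetch(_):
    start = time.time()
    with urllib.request.urlopen(URL, timeout=60) as resp:
        resp.read()
    return time.time() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    times = list(pool.map(fetch, range(CONCURRENCY)))

print("requests: %d  fastest: %.1fs  slowest: %.1fs  mean: %.1fs"
      % (len(times), min(times), max(times), sum(times) / len(times)))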


Current status:

Cache backend: memcache
Active web nodes: 1
mod_wsgi: 16 processes, 5000 max requests

I don't know whether we've had any reports of poor performance since all of this was put in place and I finished benchmarking (the benchmarking itself may have affected performance a bit, although I tried to minimize that).
Closing this bug in favor of bug 772468, which is largely the same.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard