Closed Bug 770271 (Opened 12 years ago, Closed 12 years ago)

Help investigate timeouts on developer-new.mozilla.org

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

Platform: All
OS: Other
Type: task
Priority: Not set
Severity: critical
Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 772468

People

(Reporter: lorchard, Assigned: nmaul)

References

Details

Bug 769990 was filed this weekend, and it looks like there are a lot of delays and timeouts on developer-new. I'm waiting on a fix for SSH access (bug 766580) before I can read the logs, so I'm not sure what's up.

Is there anything obvious going on with the servers that would tie things up?
Blocks: 769990
Assignee: server-ops-webops → nmaul
Severity: minor → normal
I commented on that bug... I suspect this had a lot to do with the "leap second" bugs we encountered over the weekend. Those hit MySQL and Java apps heavily (and not just Mozilla infra), so it seems likely Node.js was affected too (we don't run enough Node.js to say for sure).

I restarted Kumascript and Apache, and all seems well again.

I'm inclined to leave it at that. If you need something more in-depth, let me know.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
This is still happening. I don't think it's related to the Kumascript service, though. For example, this URL never touches Kumascript:

https://developer-new.mozilla.org/en-US/docs/HTML/HTML5?raw=1

That request just fetches the raw page content from the Django app, and it seems to take a really long time. If anything, Kumascript is having issues because it can't fetch resources quickly from the Django side.
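
For reference, that check can be scripted so it also records which webhead answered. A minimal Python 3 sketch, standard library only, with the URL taken verbatim from above (nothing else assumed):

# Time one request to the raw-content URL (bypasses Kumascript entirely)
# and note which webhead answered via the X-Backend-Server header.
import time
import urllib.request

URL = "https://developer-new.mozilla.org/en-US/docs/HTML/HTML5?raw=1"

start = time.time()
with urllib.request.urlopen(URL, timeout=60) as resp:
    body = resp.read()
    backend = resp.headers.get("X-Backend-Server", "unknown")
    status = resp.status
print("HTTP %d from %s in %.1fs (%d bytes)"
      % (status, backend, time.time() - start, len(body)))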
Severity: normal → minor
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Severity: minor → normal
I'm trying to tell whether this might have something to do with any particular webhead.

Sometimes requests came back fast, and when they did, the response carried the header X-Backend-Server: developer1.webapp.scl3.mozilla.com.

Otherwise, requests never seemed to come back at all and eventually timed out.
I tried curl against the 3 nodes directly, and at first couldn't replicate slowness on any of them. I then disabled nodes 1 and 2 in the LB, leaving only 3 alive. At that point, curl requests directly to any of the 3 nodes became slow... but only temporarily. 1 and 2 became "fast" again after a couple of attempts; 3 never did.

HOWEVER, when I changed this to query only 2 instead (disabling 1,3), then 2 started showing the problem and 3 was okay again.

When I changed the LB to have only 1 enabled (2,3 disabled), then 1 started to be slow and 2,3 were fast.

At other times, I would still encounter occasional slowness that didn't match any of the above configurations. In the end, I believe the particular node being used has no significant impact. I can't state that with 100% confidence, but the testing I've done shows no clear pattern pointing at one or more "bad" nodes.
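
For illustration, the per-node checks were along these lines. A rough Python 3 sketch: developer1.webapp.scl3.mozilla.com comes from the X-Backend-Server header above, while the developer2/developer3 hostnames and plain-HTTP access to the nodes with a Host header are assumptions, not confirmed details:

# Rough per-node timing check: hit each webhead directly, bypassing the LB.
import time
import urllib.request

NODES = [
    "developer1.webapp.scl3.mozilla.com",  # seen in X-Backend-Server
    "developer2.webapp.scl3.mozilla.com",  # assumed hostname
    "developer3.webapp.scl3.mozilla.com",  # assumed hostname
]
PATH = "/en-US/docs/HTML/HTML5?raw=1"

for node in NODES:
    req = urllib.request.Request(
        "http://%s%s" % (node, PATH),  # assumes plain HTTP is open internally
        headers={"Host": "developer-new.mozilla.org"},
    )
    start = time.time()
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            resp.read()
        print("%-45s %.1fs" % (node, time.time() - start))
    except Exception as exc:
        print("%-45s failed after %.1fs (%s)" % (node, time.time() - start, exc))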



During this testing I noticed that the error_log for at least one of the nodes was filled with this:

[Mon Jul 02 14:43:06 2012] [error] [client 10.22.81.210] Script timed out before returning headers: kuma.wsgi
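
For what it's worth, a quick way to quantify those entries is to bucket them per hour. A small Python 3 sketch; the error_log path is assumed, not the node's actual configuration:

# Count "Script timed out" mod_wsgi errors per hour in an Apache error_log.
import collections
import re

LOG_PATH = "/var/log/httpd/error_log"  # assumed location; adjust per node
# Captures "Mon Jul 02 14" from "[Mon Jul 02 14:43:06 2012] [error] ..."
STAMP = re.compile(r"^\[(\w+ \w+ \d+ \d+):\d+:\d+ \d+\]")

counts = collections.Counter()
with open(LOG_PATH) as log:
    for line in log:
        if "Script timed out before returning headers" in line:
            match = STAMP.match(line)
            if match:
                counts[match.group(1)] += 1

for hour, count in sorted(counts.items()):  # sorted by bucket string
    print("%s: %d timeouts" % (hour, count))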


:groovecoder tells me that the only dependencies for a URL like that are MySQL and memcache... not even Kumascript is involved. Both of these are working, and I can't find or replicate any slowness in either. There was a MySQL issue over the weekend (bug 769936), but that should be resolved (the bug is open for further analysis, but the issue itself is resolved)... and it would have affected current-prod as well, not just -new. Multiple people have confirmed that current-prod is *not* affected.
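
For completeness, checking those two dependencies directly looks roughly like this. A sketch only: it assumes the python-memcached and MySQLdb client libraries, and the hostnames and credentials are placeholders rather than the real ones:

# Quick latency check of the two dependencies behind a ?raw=1 request:
# memcached and MySQL. Hostnames and credentials are placeholders.
import time

import memcache   # python-memcached
import MySQLdb    # MySQL-python / mysqlclient

start = time.time()
mc = memcache.Client(["memcache1.example.scl3.mozilla.com:11211"])  # placeholder
mc.set("webops-healthcheck", "ok", time=60)
assert mc.get("webops-healthcheck") == "ok"
print("memcache set/get: %.3fs" % (time.time() - start))

start = time.time()
db = MySQLdb.connect(host="db1.example.scl3.mozilla.com",  # placeholder
                     user="kuma", passwd="...", db="kuma")  # placeholders
cur = db.cursor()
cur.execute("SELECT 1")
cur.fetchone()
print("mysql connect+query: %.3fs" % (time.time() - start))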
Group: infra
Depends on: 770410
Do we have any progress here? Starting to make me nervous. :)
Blocks: 770410
No longer depends on: 770410
Yeah, I'm out of ideas myself, but this is a flat-out blocker for the soft launch we've been planning, which is 2 days away.
Marking this as critical. We are planning to do a soft launch on July 5th and a full launch on the 15th, and development work is being severely hindered by this issue.
Severity: normal → critical
This appears to be much improved now. We disabled CACHE_BACKEND on developer-new, meaning that Kuma no longer uses memcache.

For a while it felt like this hurt the "really fast" requests that did work, but it also removed the chance of a request taking a ridiculously long time. Lately, however, things seem to be generally pretty fast for me across the board.

Also, we are currently running on just one of the 3 web nodes, although this is almost certainly not a factor, as problems have been seen in every configuration of active nodes we tried. However, in the interest of stability, I'm leaving this as-is for now. We can deal with it later.


Here is the set of links I've been using to judge response time (a small timing sketch for checking them follows the list). None of these should take more than 15 seconds now, compared to before, when they regularly took upwards of 30 seconds and sometimes never returned at all.

https://developer-new.mozilla.org/media/revision.txt (insanely fast, never slow)
https://developer-new.mozilla.org/en-US/some-random-file (should 404)
https://developer-new.mozilla.org/en-US/docs/HTML/HTML5?raw=1 (no Kumascript)
https://developer-new.mozilla.org/admin/
https://developer-new.mozilla.org/en-US/learn/ (very simple code powering this)
https://developer-new.mozilla.org/en-US/
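
A minimal Python 3 loop to spot-check the whole list against that 15-second bar (standard library only; the URLs are exactly the ones listed above):

# Spot-check each test URL against the ~15 second bar mentioned above.
import time
import urllib.error
import urllib.request

URLS = [
    "https://developer-new.mozilla.org/media/revision.txt",
    "https://developer-new.mozilla.org/en-US/some-random-file",      # expected 404
    "https://developer-new.mozilla.org/en-US/docs/HTML/HTML5?raw=1",
    "https://developer-new.mozilla.org/admin/",
    "https://developer-new.mozilla.org/en-US/learn/",
    "https://developer-new.mozilla.org/en-US/",
]

for url in URLS:
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=60) as resp:
            resp.read()
            status = resp.status
    except urllib.error.HTTPError as err:   # e.g. the intentional 404
        status = err.code
    except Exception as exc:
        print("FAIL  %-60s %s" % (url, exc))
        continue
    elapsed = time.time() - start
    flag = "SLOW" if elapsed > 15 else "ok  "
    print("%s  %-60s %3d  %.1fs" % (flag, url, status, elapsed))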


If this does indeed remain fast and stable, then it would appear the issue is somehow related to memcache... bad keys, a middleware bug, a connectivity issue, etc. However, if Kumascript also stays healthy, that should rule out most infra-level issues with memcache, since Kumascript is still configured to use it.
"locmem" is what Django is falling back to in the absence of a CACHE_BACKEND in settings_local.py. This works but causes cache discrepancy issues- it's per-process, and so doesn't scale to multiple processes or multiple servers.

We tried the db cache backend (https://docs.djangoproject.com/en/1.2/topics/cache/), and it works, but it appears to cause occasional deadlocks... not sure why, exactly. We then tried a MEMORY MySQL table instead of the default InnoDB, and that didn't appear to work at all... I suspect some limitation on row length or column types, but I don't know.
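
For context, in Django 1.2-era settings all of this is toggled with a single CACHE_BACKEND string in settings_local.py. Roughly, the variants discussed here look like the following sketch; the hostnames and table name are placeholders, not Kuma's actual values:

# settings_local.py (Django 1.2-style) -- cache backend variants discussed above.
# Hostnames and the table name are placeholders.

# Memcached (the original configuration):
CACHE_BACKEND = "memcached://memcache1.example:11211;memcache2.example:11211/"

# Database cache (worked, but showed occasional deadlocks); the table is
# created with: python manage.py createcachetable cache_table
# CACHE_BACKEND = "db://cache_table"

# Local-memory cache -- the fallback Django uses when CACHE_BACKEND is unset;
# per-process, so it isn't shared across workers or servers.
# CACHE_BACKEND = "locmem://"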

We found an issue with current-prod and -new both writing to the same memcached nodes. This was fixed.

We found that there were too few mod_wsgi worker processes and that they were being recycled too quickly. We increased the settings from 4 workers / 200 max requests to 16 workers / 5000 max requests, which is more in line with what SUMO and Bedrock use. Going higher is possible, but benchmarking indicates this is already a major improvement in capacity... more thorough testing would be needed to say for sure whether even higher settings are beneficial.
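
To give a sense of what the benchmarking covered, a minimal concurrent-load check along these lines shows whether the larger worker pool holds up under parallel requests. This is a sketch only: the concurrency level and URL are arbitrary choices, and it is not the benchmark that was actually run:

# Minimal concurrent-load check: fire N parallel requests and report timings.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://developer-new.mozilla.org/en-US/docs/HTML/HTML5?raw=1"
CONCURRENCY = 16  # matches the new worker count, purely as a starting point

def fetch(_):
    start = time.time()
    with urllib.request.urlopen(URL, timeout=60) as resp:
        resp.read()
    return time.time() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    times = list(pool.map(fetch, range(CONCURRENCY)))

print("requests: %d  fastest: %.1fs  slowest: %.1fs  mean: %.1fs"
      % (len(times), min(times), max(times), sum(times) / len(times)))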


Current status:

Cache backend: memcache
Active web nodes: 1
mod_wsgi: 16 processes, 5000 max requests

I don't know whether we've had any reports of poor performance since all of this was put in place and I finished benchmarking (the benchmarking itself may have affected performance a bit, although I tried to minimize that).
Closing this bug in favor of bug 772468, which is largely the same.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → DUPLICATE
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard