l10n.nl.mozilla.org spiders mercurial too quickly

Status: RESOLVED FIXED
Type: defect
Opened: 11 years ago
Last modified: 6 years ago

People

(Reporter: chizu, Assigned: benjamin)

Tracking

Firefox Tracking Flags

(Not tracked)

Attachments

(1 attachment)

Upwards of 30 simultaneous requests are overloading hg.mozilla.org daily.

Example request:
"GET /l10n-central/en-GB/index.cgi/pushlog HTTP/1.0" 200 34927 "-" "Twisted PageGetter"
What's wrong with 30 simultaneous requests?
Sometimes this takes down hg while these run.
What does "takes down hg" mean?: other connections are slow for a few msecs/secs/minutes? You have to manually restart some process?

The pushlog requests used to be pretty trivial database queries... Ted, now that we're requesting file information and other stuff for the feed, is the request a lot more involved?
I expect to optimize out the file stuff as soon as we get bug 449381 fixed, at which point the responses should generally be empty and only touch the db, not the repo at all (including generating the file lists).
PS: bug 443600, which we resolved as WORKSFORME, has more ideas, with harder dependencies on the server side.
To reiterate what Benjamin said, unless we know what to optimize for, it's hard to invest cycles in the right thing. So more data would be valuable.
Requests to many locales at hg.mozilla.org/l10n-central/LOCALE/index.cgi/pushlog are made at once. Requests start to time out from high server load and IT is paged. It recovers a few minutes later.

We can watch for which locales are causing the most load next time it happens, in case that's useful.
It would also be helpful to know which parts of hg/hgweb are taking a lot of time. We might be able to optimize them; I don't think much optimization has been done on hgweb, since few people seem to need it (though of course this is mostly about the pushlog, which is a Mozilla thing).
I am working on getting the WSGI interface for Mercurial to function correctly. That should be able to handle a bunch of simultaneous requests. Will comment here once things start to work faster.
Assignee: nobody → server-ops
Component: Infrastructure → Server Operations
Product: Mozilla Localizations → mozilla.org
QA Contact: infrastructure → mrz
Version: unspecified → other
Assignee: server-ops → aravind
Aravind: I'd be interested to hear what you're changing. Are you moving to Apache's mod_wsgi? That is AFAIK the most performant solution for deploying WSGI applications (such as hgweb).
(In reply to comment #10)
> Aravind: I'd be interested to hear what you're changing. Are you moving to
> Apache's mod_wsgi? That is AFAIK the most performant solution for deploying
> WSGI applications (such as hgweb).
> 

Yup, I am moving to mod_wsgi. I am not sure if it's from Apache; the code I have is from http://code.google.com/p/modwsgi/. I don't think it's an official Apache module. It should be ready sometime tomorrow.
I don't mean *from* Apache, but wanted to distinguish it from nginx's mod_wsgi (which might be another option), and I think there's one more now.

Are you also getting rid of the "index.cgi" in the URLs, or is that now set in stone forever?
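For reference, a minimal hgweb entry point for mod_wsgi looks roughly like the sketch below; this is not the actual hg.mozilla.org setup, and the config path and Apache alias are placeholders.

    # hgweb.wsgi -- sketch of an hgweb entry point for mod_wsgi, loosely
    # following the example script shipped with Mercurial; paths are placeholders.
    from mercurial import demandimport
    demandimport.enable()  # defer module imports until they are actually needed
    from mercurial.hgweb import hgweb
    # hgweb() takes either a single repository path or an hgweb config file
    application = hgweb('/var/hg/hgweb.config')
    # Apache side (httpd.conf): WSGIScriptAlias / /var/hg/hgweb.wsgi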
On my buildbot I have made the following customization, which throttles the calls so there are never more than 4 simultaneous requests: http://hg.mozilla.org/users/bsmedberg_mozilla.com/buildbotcustom/rev/b8da2f63697c
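The general shape of such a throttle (a sketch, not the actual buildbotcustom change; the locale list, URL pattern, and limit of 4 are only illustrative) is a Twisted DeferredSemaphore gating the page fetches:

    # Sketch: cap the number of in-flight pushlog requests at 4.
    from twisted.internet import defer
    from twisted.web.client import getPage

    sem = defer.DeferredSemaphore(4)  # at most 4 requests running at once

    def poll(locales):
        base = 'http://hg.mozilla.org/l10n-central/%s/pushlog'
        fetches = [sem.run(getPage, base % loc) for loc in locales]
        return defer.DeferredList(fetches, consumeErrors=True)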
I was wondering today if it would help if there were changegroup hooks instead of pollers (not sure if the pollers actually poll?).
Here is what I did this morning:

I removed the index.cgi from my links to get rid of the redirect requests. That didn't help a whole lot.

I backed off my buildmaster's poll interval from 3 to 5 minutes, which, at least over the day today (my time), seemed to help a bit.

I have more ideas in my head, namely moving the polls on the individual feeds into the LoopingCall, with some statistics on the response time and a given timeout for getPage. But that's a different bug.

Re comment 14, I don't see how changegroup hooks would help, as we need to bridge the gap between the hg server and the buildbot masters. The main point here is that we want to keep the server up while having as many masters asking for changes as we see fit. The changegroup hook method basically means we would need to change the hook setup each time we set up a staging environment or the like. That doesn't sound too scalable.
You could have the hook ping some script and have that script kick off all kinds of stuff (and the script would be easier to modify)? At least you're not polling every few minutes, then. And you could put some of the intelligence in the intermediary script, so you can select which buildbots to start and whether you want to start any at all.

Maybe introducing an extra layer makes it too complex, though.
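A hypothetical in-process hook along those lines might look like this sketch; the notify URL, file name, and function name are made up, and the hgrc lines show where it would be wired up on the server:

    # notify_hook.py -- hypothetical changegroup hook that pings an
    # intermediary script over HTTP; the endpoint is illustrative only.
    import urllib2

    def changegroup_ping(ui, repo, node=None, **kwargs):
        # 'node' is the first new changeset brought in by the push
        try:
            urllib2.urlopen('http://example.com/new-push?node=%s' % node, timeout=10)
        except Exception:
            ui.warn('push notification failed\n')

    # server-side hgrc:
    # [hooks]
    # changegroup.notify = python:/path/to/notify_hook.py:changegroup_ping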
Nah, that ain't gonna work. We should make polling cheap first.

Then we can load balance the polling. And then we can talk about some mirror. Or about aggregating the l10n feeds on the server side.
Is the new RR setup holding up, or are you guys still noticing problems pulling in multiple l10n trees at the same time?
I ran into a stale poller again, I'll need to add timeouts.
Patch in preparation: I got timeouts working and de-parallelized the l10n poller.

Polling all locales in parallel makes the individual queries take up to 6 seconds to respond; when I bring that down to two parallel requests, they take somewhere between 1 and 2 seconds.

I still need to set up my new hardware to actually be able to test this locally, though.
This patch should fix the problem at hand as well as bug 453457 (HgPoller sticking without an errback).
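The shape of that fix, as a sketch rather than the attached patch (names, the URL list, and the intervals are illustrative): every getPage call gets a timeout and an errback, and a semaphore keeps only two requests in flight, so one hung request can no longer wedge the poller.

    # Sketch: timeout-aware, de-parallelized polling loop.
    from twisted.internet import defer, task
    from twisted.python import log
    from twisted.web.client import getPage

    sem = defer.DeferredSemaphore(2)  # only two pushlog requests in parallel

    def poll_one(url):
        d = getPage(url, timeout=30)  # give up on hung requests
        d.addErrback(log.err)         # always errback, so the poller never sticks
        return d

    def poll_all(urls):
        return defer.DeferredList([sem.run(poll_one, u) for u in urls])

    # PUSHLOG_URLS is assumed to be built elsewhere from the locale list
    loop = task.LoopingCall(poll_all, PUSHLOG_URLS)
    loop.start(300)  # poll every 5 minutes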
Assignee: aravind → benjamin
Status: NEW → ASSIGNED
Attachment #337473 - Flags: review?(bhearsum)
Attachment #337473 - Flags: review?(bhearsum) → review+
Fixed in CVS.
Assignee: benjamin → nobody
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Component: Server Operations → Release Engineering
QA Contact: mrz → release
Resolution: --- → FIXED
Assignee: nobody → benjamin
Product: mozilla.org → Release Engineering