Closed Bug 851784 Opened 12 years ago Closed 12 years ago

Automated tests downloading builds from ftp.m.o hitting intermittent "503 Server Too Busy" and timeout errors

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: afernandez)

Details

Attachments

(1 file)

We were getting just a few of these, a couple an hour, after the fix for bug 851705, but now it's more like tens and rapidly increasing. Trees are already closed, but this would keep them closed if the other closer gets fixed. Example logs of the two sorts of failure (these URLs are not the problem, they are logs of the problem happening - I always confuse people into thinking I'm saying that the logs don't load, when what I mean is "open this URL and read this log of the failure"): https://tbpl.mozilla.org/php/getParsedLog.php?id=20718014&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=20718222&tree=Mozilla-Inbound
Severity: blocker → major
Do you have a subnet (or list of subnets) that we could possibly whitelist so that the measures taken in bug 851705 don't apply to the tree?
:philor I added a list of subnets to the whitelist that should alleviate the issue. Please let us know if you still experience issues, thank you. Feel free to bump up importance if it occurs again.
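For readers unfamiliar with subnet whitelisting, the check behind such an exemption can be sketched as below. This is only an illustration: the subnets are made-up placeholders (the real list is not given in this bug), `is_whitelisted` is a hypothetical helper, and the actual ftp cluster configuration is not shown here.

```python
import ipaddress

# Placeholder subnets for illustration only; the real whitelist is not in this bug.
WHITELIST = [ipaddress.ip_network(n) for n in ("10.12.0.0/16", "10.26.48.0/20")]


def is_whitelisted(client_ip, networks=WHITELIST):
    """Return True if client_ip falls inside any whitelisted subnet."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in networks)


if __name__ == "__main__":
    for ip in ("10.12.3.4", "192.0.2.1"):
        print(ip, "exempt from limits" if is_whitelisted(ip) else "subject to limits")
```

A request from a whitelisted address would skip the limits from bug 851705; everything else would still be subject to them.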
Severity: major → normal
Looks like despite most trees being closed we did still get a couple of them at 06:41, probably when load would have picked up a little bit from nightly builds being tested, so I'd guess it's just a ticking timebomb waiting for us to say "see, it's fine when there's nothing happening, we should make things happen again."
But I got talked into reopening, and within 90 minutes we'd built up enough load that we hit 10 of these.
More like 30 by now. The trees are still open, because I can retrigger the failing jobs, but that means I have to close them 4-6 hours before I'm going to leave, so I might as well page now as page 4-6 hours before I go to sleep (which isn't all that long from now).
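Retriggering by hand is essentially a manual retry. A client-side equivalent, which is not something proposed in this bug, would be to retry the download on 503 or timeout with an increasing delay. A minimal sketch, assuming a hypothetical `download_with_retry` helper and arbitrary attempt counts and delays:

```python
import socket
import time
import urllib.error
import urllib.request


def download_with_retry(url, attempts=5, base_delay=2.0,
                        opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying on HTTP 503 and on timeouts with exponential backoff.

    attempts and base_delay are illustrative defaults, not values from this bug.
    """
    for attempt in range(attempts):
        last_try = attempt == attempts - 1
        try:
            with opener(url, timeout=60) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Only "503 Server Too Busy" is treated as retryable.
            if err.code != 503 or last_try:
                raise
        except (urllib.error.URLError, socket.timeout):
            # Connection problems and timeouts are retried as well.
            if last_try:
                raise
        # Wait 2s, 4s, 8s, ... before the next attempt.
        sleep(base_delay * 2 ** attempt)
```

This only hides occasional failures; it would not help while the server is at its bandwidth cap for long stretches.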
Severity: normal → blocker
Assignee: server-ops → afernandez
Attached image ftp-http bandwidth
We have increased the previously set bandwidth cap from 500 Mbps to 700 Mbps. This should fix the "ERROR 503: Server Too Busy." errors. If you still see issues, please let us know.
Assignee: afernandez → server-ops
Severity: blocker → normal
I was watching the current bandwidth activity, and it seems we reached the 700 Mbps cap at times. I increased the cap by another 300 Mbps, for a total of 1000 Mbps. At random times we do reach the new 1G cap, but the graph doesn't flat-line there, so it should be much better now. We could possibly increase it further, but we prefer gradual increases so as not to cause load issues on the ftp cluster.
Doubled the cap to 2G
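For reference, the cap went 500 -> 700 -> 1000 -> 2000 Mbps over these comments. The relative size of each step can be worked out with a small sketch (`cap_steps` is a hypothetical helper; the cap values are the ones quoted above, with 2G taken as 2000 Mbps):

```python
def cap_steps(caps_mbps):
    """Return (old, new, percent increase) for each consecutive pair of caps."""
    return [
        (prev, new, round(100 * (new - prev) / prev))
        for prev, new in zip(caps_mbps, caps_mbps[1:])
    ]


if __name__ == "__main__":
    for prev, new, pct in cap_steps([500, 700, 1000, 2000]):
        print(f"{prev} Mbps -> {new} Mbps (+{pct}%)")
```

The first two steps are each a bit over 40%, and the last one is a straight doubling.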
:philor have you experienced the same errors/issues? Are we stable now? Please advise, thank you.
We survived Sunday just fine, and have survived Monday morning fine, but we're not up to full weekday load, 664 jobs running when full capacity is around 1200.
:philor, OK, please keep us updated. We looked at the bandwidth history for the past 90 days, and it seems we never reached 2G. It passed the 1.5G mark only twice.
We did hit the full weekday load several times today without incident (well, with lots of incidents which were not this), so much as I hate to jinx it I think we can call this fixed. Thanks!
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Assignee: server-ops → afernandez
Product: mozilla.org → mozilla.org Graveyard