Closed Bug 767762 Opened 13 years ago Closed 13 years ago

HG.m.o still consistently returning HTTP 500 errors

Categories: mozilla.org Graveyard :: Server Operations (task)
Platform: x86_64 Windows 7
Type: task
Priority: Not set
Severity: major
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: Callek
Assignee: dumitru

Details

So, as of now, we've been hitting hg.m.o 500 errors, primarily when cloning hg.m.o/build/tools (which is a *small* repo and normally takes almost no time), though a few tegra jobs also failed while wget'ing a raw file from hg.

This has been ongoing since 2012-06-23 19:44:57 (PT) up until now (last failed job with this issue: 2012-06-23 20:58:30). It has not affected every job in this range, but roughly 25-50% of the total job attempts, according to https://build.mozilla.org/buildapi/recent. You can see the large swath of errors on tbpl across the Aurora, Trunk, Inbound, Beta, etc. trees.

Likely related: Bug 767739, possibly Bug 767657, and nagios alerts blaming hgweb3 for high CPU and maxed connections within the last hour (though nagios has been reporting OK since well before the most recent error).
I am verifying this since it was reported, but I can't find any issues:

- load on the webheads did spike a little between 2050 and 2110, but it recovered shortly
- no errors in the error_log
- Zeus internal checks don't report anything
- nagios didn't page about any problems
- I was able to successfully do `hg clone http://hg.mozilla.org/build/tools` several times from boris and natasha

Could this be because of bug 767745 adding extra load to the main hg.m.o? I will keep an eye on the ganglia graphs and Zeus...
Assignee: server-ops → dgherman
[00:24:44] <Callek> dumitru: ok, slightly good news
[00:24:54] <Callek> no (current) issues beyond the "2012-06-23 20:57:08" one I mentioned
[00:25:07] <Callek> so this past (almost) half hour has been relatively sane
[00:25:18] <Callek> but multiple hours before that
[00:25:38] <Callek> but there is still some underlying issue that "we" need to find and correct...
[00:27:04] <Callek> "releng not using releng hg servers" is not the issue; it's a mitigation of a symptom
[00:27:10] <Callek> fwiw
[00:27:33] <Callek> that we are not using, *for l10n repacks only*
[00:27:55] <Callek> and we *can't* (currently) use them for the tools clone without a lot of man-hours of work.
[00:30:49] <Callek> dumitru: perhaps we need to add better (or more frequent) health checks, with more details when one fails?
[00:32:00] <dumitru> I think that's a good start
[00:32:20] <dumitru> throwing more resources at the pool would help, too
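To illustrate the "better health checks, with more details when one fails" idea from the IRC exchange above, here is a minimal sketch of an HTTP probe that records status, latency, and a failure detail. This is a hypothetical example, not the actual Zeus or nagios check configuration; the example URL in the comment is only illustrative.

```python
import time
import urllib.request
import urllib.error

def probe(url, timeout=10):
    """Fetch url and report (ok, status, elapsed_seconds, detail).

    A failing probe returns the HTTP status (e.g. 500) or the raised
    exception, so the failure leaves more of a trail than a bare alert.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, resp.status, time.monotonic() - start, "ok"
    except urllib.error.HTTPError as e:
        # Server answered, but with an error status such as 500.
        return False, e.code, time.monotonic() - start, str(e.reason)
    except Exception as e:
        # Timeout, connection refused, DNS failure, etc.
        return False, None, time.monotonic() - start, repr(e)

# Example target (illustrative only):
# probe("https://hg.mozilla.org/build/tools/raw-file/default/README")
```

Running such a probe every minute against a raw-file URL would have caught the intermittent 500s that the existing checks missed.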
21:34 <dumitru> so, I might have a theory, but it needs to be proven right or wrong
21:34 <dumitru> we used the "round robin" algorithm for the hg pool. this just "blindly" assigns connections to each node in turn, without caring about latency, current connections, etc.
21:35 <dumitru> I changed that to "least connections". Zeus now keeps track of how many connections each node in a pool has
21:35 <dumitru> so if a node is not closing some long-running requests while others are idle, Zeus will not throw new connections at that node
21:35 <dumitru> and will keep the connection counts even
21:35 <dumitru> whether this helps, we shall see
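The two pool algorithms described above can be contrasted in a few lines. This is a toy sketch, not Zeus's actual implementation, and the node names and connection counts are made up.

```python
from itertools import cycle

nodes = ["hgweb1", "hgweb2", "hgweb3"]
active = {n: 0 for n in nodes}   # open connections per node (hypothetical)

rotation = cycle(nodes)

def pick_round_robin():
    """Blindly hand the next connection to each node in turn."""
    return next(rotation)

def pick_least_connections():
    """Prefer the node currently holding the fewest open connections."""
    return min(active, key=active.get)

# Suppose hgweb3 is stuck holding long-running requests:
active.update({"hgweb1": 2, "hgweb2": 3, "hgweb3": 40})

print(pick_round_robin())        # hgweb1 now, but the rotation will still
                                 # land on the backed-up hgweb3 in turn
print(pick_least_connections())  # hgweb1: the backed-up node is avoided
```

Round robin keeps cycling onto the overloaded node, which matches the symptom seen here: a backlogged hgweb kept receiving fresh clone requests and returned 500s.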
per irc w/callek: so far things look OK since dumitru's change. However, let's keep this open for another day while we watch for recurring 500 errors during these chemspills.
Are we good to close?
I'm slightly worried about using a least-connections algorithm for load balancing. In my experience it's easy for this to turn into a stampeding problem (the balancer floods the server with the fewest connections, making it very slow, then floods another, then another). I don't have a better suggestion offhand, but it's something to be aware of. It's one reason RR and WRR are so widely used: the Law of Large Numbers indicates that all the servers will trend toward even usage, which is not the case with LC or WLC. Perhaps a better long-term solution is simply to add nodes to the pool, to reduce the likelihood of problems.
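The stampede concern above can be shown with a toy example (a sketch, not a model of the real pool; node names and counts are hypothetical): under least connections, a node that rejoins the pool with zero open connections absorbs every new request until its count catches up with the rest.

```python
# hgweb3 has just rejoined the pool with no open connections:
active = {"hgweb1": 30, "hgweb2": 30, "hgweb3": 0}
burst = []

for _ in range(20):                      # 20 new client connections arrive
    node = min(active, key=active.get)   # least-connections choice
    active[node] += 1                    # connection opened, not yet closed
    burst.append(node)

# Every one of the 20 new connections lands on the freshly-joined node;
# round robin would have spread them evenly across all three.
print(burst.count("hgweb3"))  # → 20
```

If the rejoining node is cold (empty caches, still warming up), this burst is exactly the kind of flood that can make it the next slow node.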
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #5)
> Are we good to close?

Yes, no 500s have been reported since Saturday night.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard