Closed Bug 767762 Opened 13 years ago Closed 13 years ago

HG.m.o still consistently returning HTTP 500 errors

Categories: mozilla.org Graveyard :: Server Operations (task)
Platform: x86_64 Windows 7
Type: task
Priority: Not set
Severity: major
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: Callek
Assignee: dumitru

Details

So, as of now, we've been hitting hg.m.o 500 errors, primarily when cloning hg.m.o/build/tools (which is a *small* repo and normally takes almost no time), though a few tegra jobs also failed while wget'ing a raw file from hg.

This has been ongoing since 2012-06-23 19:44:57 (PT) up until now (last failed job with this issue: 2012-06-23 20:58:30). It has not affected every job in this range, but roughly 25-50% of the total job attempts, according to https://build.mozilla.org/buildapi/recent. You can see the large swath of errors on tbpl across the Aurora, Trunk, Inbound, Beta, etc. trees.

Likely related: Bug 767739, possibly Bug 767657, and nagios alerts blaming hgweb3 for high CPU and maxed connections within the last hour (though nagios has been reporting OK since well before the most recent error).
I am verifying this since it was reported, but I can't find any issues:

- load on the webheads did spike a little between 2050 and 2110, but it recovered shortly
- no errors in the error_log
- Zeus internal checks don't report anything
- nagios didn't page about any problems
- I was able to successfully do `hg clone http://hg.mozilla.org/build/tools` several times from boris and natasha

Could this be because of bug 767745 adding extra load to the main hg.m.o? I will keep an eye on the ganglia graphs and Zeus...
Assignee: server-ops → dgherman
[00:24:44] <Callek> dumitru: ok, slightly good news
[00:24:54] <Callek> no (current) issues beyond the "2012-06-23 20:57:08" one I mentioned
[00:25:07] <Callek> so this past (almost) half hour has been relatively sane
[00:25:18] <Callek> but multiple hours before that
[00:25:38] <Callek> but there is still some underlying issue that "we" need to find and correct...
[00:27:04] <Callek> "releng not using releng hg servers" is not the issue; it's a mitigation of a symptom
[00:27:10] <Callek> fwiw
[00:27:33] <Callek> that we are not using, *for l10n repacks only*
[00:27:55] <Callek> and we *can't* (currently) use them for the tools clone without a lot of man-hours of work.
[00:30:49] <Callek> dumitru: perhaps we need to add better (or more frequent) health checks, with more details when one fails?
[00:32:00] <dumitru> I think that's a good start
[00:32:20] <dumitru> throwing more resources at the pool would help, too
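To illustrate the "better health checks, with more details when one fails" idea from the IRC exchange above, here is a minimal sketch of an HTTP probe that records status, latency, and a failure detail. This is a hypothetical example, not the actual Zeus or nagios check configuration; the example URL in the comment is only illustrative.

```python
import time
import urllib.request
import urllib.error

def probe(url, timeout=10):
    """Fetch url and report (ok, status, elapsed_seconds, detail).

    A failing probe returns the HTTP status (e.g. 500) or the raised
    exception, so the failure leaves more of a trail than a bare alert.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, resp.status, time.monotonic() - start, "ok"
    except urllib.error.HTTPError as e:
        # Server answered, but with an error status such as 500.
        return False, e.code, time.monotonic() - start, str(e.reason)
    except Exception as e:
        # Timeout, connection refused, DNS failure, etc.
        return False, None, time.monotonic() - start, repr(e)

# Example target (illustrative only):
# probe("https://hg.mozilla.org/build/tools/raw-file/default/README")
```

Running such a probe every minute against a raw-file URL would have caught the intermittent 500s that the existing checks missed.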
21:34 <dumitru> so, I might have a theory, but it needs to be proven right or wrong
21:34 <dumitru> we used the "round robin" algorithm for the hg pool. this just "blindly" assigns connections to each node in turn, without caring about latency, current connections, etc.
21:35 <dumitru> I changed that to "least connections". Zeus now keeps track of how many connections each node in a pool has
21:35 <dumitru> so if a node is not closing some long-running requests while others are idle, Zeus will not throw new connections at that node
21:35 <dumitru> and will keep the connection counts even
21:35 <dumitru> whether this helps, we shall see
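The two pool algorithms described above can be contrasted in a few lines. This is a toy sketch, not Zeus's actual implementation, and the node names and connection counts are made up.

```python
from itertools import cycle

nodes = ["hgweb1", "hgweb2", "hgweb3"]
active = {n: 0 for n in nodes}   # open connections per node (hypothetical)

rotation = cycle(nodes)

def pick_round_robin():
    """Blindly hand the next connection to each node in turn."""
    return next(rotation)

def pick_least_connections():
    """Prefer the node currently holding the fewest open connections."""
    return min(active, key=active.get)

# Suppose hgweb3 is stuck holding long-running requests:
active.update({"hgweb1": 2, "hgweb2": 3, "hgweb3": 40})

print(pick_round_robin())        # hgweb1 now, but the rotation will still
                                 # land on the backed-up hgweb3 in turn
print(pick_least_connections())  # hgweb1: the backed-up node is avoided
```

Round robin keeps cycling onto the overloaded node, which matches the symptom seen here: a backlogged hgweb kept receiving fresh clone requests and returned 500s.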
per irc w/callek: so far things look OK since dumitru's change. However, let's keep this open for another day while we watch for recurring 500 errors during these chemspills.
Are we good to close?
I'm slightly worried about using a least-connections algorithm for load balancing. In my experience it's easy for this to turn into a stampeding problem (the balancer floods the server with the fewest connections, making it very slow, then floods another, then another). I don't have a better suggestion offhand, but it's something to be aware of. It's one reason RR and WRR are so widely used: the Law of Large Numbers indicates that all the servers will trend toward even usage, which is not the case with LC or WLC. Perhaps a better long-term solution is simply to add nodes to the pool, to reduce the likelihood of problems.
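The stampede concern above can be shown with a toy example (a sketch, not a model of the real pool; node names and counts are hypothetical): under least connections, a node that rejoins the pool with zero open connections absorbs every new request until its count catches up with the rest.

```python
# hgweb3 has just rejoined the pool with no open connections:
active = {"hgweb1": 30, "hgweb2": 30, "hgweb3": 0}
burst = []

for _ in range(20):                      # 20 new client connections arrive
    node = min(active, key=active.get)   # least-connections choice
    active[node] += 1                    # connection opened, not yet closed
    burst.append(node)

# Every one of the 20 new connections lands on the freshly-joined node;
# round robin would have spread them evenly across all three.
print(burst.count("hgweb3"))  # → 20
```

If the rejoining node is cold (empty caches, still warming up), this burst is exactly the kind of flood that can make it the next slow node.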
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #5)
> Are we good to close?

Yes, no 500s have been reported since Saturday night.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard