Bug 767762: HG.m.o still consistently returning HTTP 500 errors
Opened 13 years ago • Closed 13 years ago
Categories: mozilla.org Graveyard :: Server Operations, task
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: Callek; Assigned: dumitru
So, as of now, we've been hitting hg.m.o 500 errors, primarily when cloning hg.m.o/build/tools (which is a *small* repo and normally takes almost no time), though a few tegra jobs also failed while wget'ing a raw file from hg.
This has been ongoing since "2012-06-23 19:44:57" (PT) up until now (most recent failed job with this issue: "2012-06-23 20:58:30").
This has not affected every job in this range, but has affected roughly 25-50% of the total job attempts, according to https://build.mozilla.org/buildapi/recent
You can see the large swath of errors on tbpl, across the Aurora, Trunk, Inbound, Beta, etc. trees.
Likely related: Bug 767739, and possibly Bug 767657. There were also nagios alerts blaming hgweb3 for high CPU and maxed connections within the last hour (though nagios had returned to an OK state well before the most recent error).
Comment 1 (Assignee) • 13 years ago
I've been looking into this since it was reported, but I can't find any issues:
- load on the webheads did spike a little between 2050 and 2110, but it recovered shortly.
- no errors in the error_log
- Zeus internal checks don't report anything
- nagios didn't page about any problems
- I was able to successfully do a hg clone http://hg.mozilla.org/build/tools several times from boris and natasha
Could this be because of bug 767745, which would add extra load to the main hg.m.o?
I will keep an eye on ganglia graphs and Zeus...
Assignee: server-ops → dgherman
Comment 2 (Reporter) • 13 years ago
[00:24:44] Callek dumitru: ok, slightly good news
[00:24:54] Callek no (current) issues beyond the | 2012-06-23 20:57:08| one I mentioned
[00:25:07] Callek so this past (almost) half hour has been relatively sane
[00:25:18] Callek but multiple hours before that
[00:25:38] Callek but there is still some underlying issue that "we" need to find and correct
...
[00:27:04] Callek "releng not using releng hg servers" is not the issue; it's a mitigation of a symptom
[00:27:10] Callek fwiw
[00:27:33] Callek that we are not using, *for l10n repacks only*
[00:27:55] Callek and we *can't* (currently) use them for the tools clone without a lot of man-hours of work.
[00:30:49] Callek dumitru: perhaps we need to add better (or more frequent) health checks, with more details when one fails?
[00:32:00] dumitru I think that's a good start
[00:32:20] dumitru throwing more resources to the pool would help, too
Comment 3 (Assignee) • 13 years ago
21:34 < dumitru> | so, I might have a theory, but needs to be proven right or wrong
21:34 < dumitru> | we used the "round robin" algorithm for hg pool. this is just "blindly" assigning connections to each node in turn, without caring about latency, current connections etc
21:35 < dumitru> | I changed that to "least connections". Zeus is now keeping track of how many connections a node in a pool has
21:35 < dumitru> | so if a pool is not closing some requests that take long, and others are idle, Zeus will not throw new connections to that node
21:35 < dumitru> | and will keep the connections count even
21:35 < dumitru> | if this will help, we shall see
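The two strategies dumitru contrasts can be sketched roughly as follows. This is a hypothetical Python illustration, not Zeus's actual implementation; the class and method names are made up for the example:

```python
# Sketch of the two load-balancing strategies discussed above
# (illustrative only; not how Zeus implements them internally).
from itertools import count


class RoundRobin:
    """Blindly hands each new connection to the next node in turn,
    ignoring latency and the number of open connections per node."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._counter = count()

    def pick(self):
        return self.nodes[next(self._counter) % len(self.nodes)]


class LeastConnections:
    """Tracks how many connections each node currently has open and
    sends new connections to the least-loaded node, so a node stuck
    on slow requests stops receiving new ones."""

    def __init__(self, nodes):
        self.active = {node: 0 for node in nodes}

    def pick(self):
        node = min(self.active, key=self.active.get)
        self.active[node] += 1  # connection opened
        return node

    def release(self, node):
        self.active[node] -= 1  # connection closed
```

With round robin, a node whose requests hang keeps receiving its full share of new connections; with least connections, its open-connection count stays high, so Zeus steers new traffic elsewhere, which is the behavior the change aims for.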
Comment 4 • 13 years ago
Per IRC w/Callek: so far things look OK since dumitru's change. However, let's keep this open for another day while we watch for recurring 500 errors during these chemspills.
Comment 5 • 13 years ago
Are we good to close?
Comment 6 • 13 years ago
I'm slightly worried about using a least-connections algorithm for load balancing. In my experience it can easily turn into a stampeding problem: the server with the fewest connections gets flooded and becomes very slow, then the next one gets flooded, then another.
I don't have a better suggestion offhand, but it's something to be aware of. It's one reason why RR and WRR are so widely used: the Law of Large Numbers means all the servers trend toward even usage, which is not the case with LC or WLC. Perhaps a better long-term solution is simply to add nodes to the pool, to reduce the likelihood of problems.
Comment 7 (Assignee) • 13 years ago
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #5)
> Are we good to close?
Yes, no 500s have been reported since Saturday night.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Product: mozilla.org → mozilla.org Graveyard