Closed
Bug 687633
Opened 13 years ago
Closed 13 years ago
issues with varnish cause high load on dm-vcview* and cause hg clone commands to fail
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Unassigned)
References
Details
hg clone http://hg.mozilla.org/build/tools tools
 in dir /home/cltbld/talos-slave/test/. (timeout 1320 secs)
 watching logfiles {}
 argv: ['hg', 'clone', 'http://hg.mozilla.org/build/tools', 'tools']
 environment:
  CVS_RSH=ssh
  DISPLAY=:0.0
  G_BROKEN_FILENAMES=1
  HISTCONTROL=ignoreboth
  HISTSIZE=1000
  HOME=/home/cltbld
  HOSTNAME=talos-r3-fed64-057
  LANG=en_US.UTF-8
  LESSOPEN=|/usr/bin/lesspipe.sh %s
  LOGNAME=cltbld
  MAIL=/var/spool/mail/cltbld
  PATH=/home/cltbld/bin:/tools/buildbot-0.8.0/bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/lib64/ccache:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
  PWD=/home/cltbld/talos-slave/test
  QTDIR=/usr/lib64/qt-3.3
  QTINC=/usr/lib64/qt-3.3/include
  QTLIB=/usr/lib64/qt-3.3/lib
  SHELL=/bin/bash
  SHLVL=1
  SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
  TERM=xterm
  USER=cltbld
  _=/home/cltbld/bin/python
 using PTY: False

abort: HTTP Error 502: Bad Gateway
program finished with exit code 255
Reporter
Comment 1•13 years ago
This is hitting most of the builds, causing them to retry.
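For illustration, a minimal sketch of the kind of retry that papers over a transient 502 like the one above (per this comment the builds already retry on their own; the command is taken from the description, while the attempt count and delay below are assumptions):

#!/bin/bash
# Sketch only: retry the clone a few times before giving up, since the 502s
# appear to be transient.
for attempt in 1 2 3; do
    if hg clone http://hg.mozilla.org/build/tools tools; then
        exit 0
    fi
    echo "clone attempt $attempt failed; retrying in 30s" >&2
    rm -rf tools
    sleep 30
done
exit 1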
Assignee: server-ops-releng → server-ops
Severity: normal → critical
Component: Server Operations: RelEng → Server Operations
QA Contact: zandr → cshields
Comment 2•13 years ago
This seems to have fixed itself. Around this time there were some alerts in #sysadmins for dm-vcview02, which appear to have been load issues. Those have since resolved, and running curl in a loop against that URL returns "200 Script output follows" very reliably. As for what caused the load spike, I'm not entirely sure.
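For reference, a loop along these lines is enough to repeat that check (a sketch only; the exact URL and curl flags used at the time aren't recorded in this bug):

# Hypothetical re-check of the backend: print each response's status line.
# A healthy hgweb answers "HTTP/1.1 200 Script output follows"; the failure
# mode in this bug was a 502 from varnish.
for i in $(seq 1 30); do
    curl -s -D - -o /dev/null http://hg.mozilla.org/build/tools | head -n 1
    sleep 1
done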
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Reporter
Comment 3•13 years ago
This is still going on. Jakem and I are looking at it.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 4•13 years ago
Is this still happening?

[shyam@boris ~]$ hg clone http://hg.mozilla.org/build/tools tools
requesting all changes
adding changesets
adding manifests
adding file changes
added 1807 changesets with 3853 changes to 823 files
updating to branch default
367 files updated, 0 files merged, 0 files removed, 0 files unresolved

Seems to be working fine.
Updated•13 years ago
Assignee: server-ops → shyam
Comment 5•13 years ago
This looks a lot better. If you go back in time in https://tbpl.mozilla.org/ or https://tbpl.mozilla.org/?tree=Mozilla-Inbound you can see what was happening. I'm going to guess this was completely load-based and load has dropped off.
Comment 6•13 years ago
Removing the 3.6.23 blocking flag and lowering the priority. I wish we had a better answer than "let's hope it doesn't happen again", but I'm not sure what we can do right now.
No longer blocks: 686888
Severity: critical → normal
Updated•13 years ago
Assignee: shyam → server-ops
Comment 7•13 years ago
(In reply to Aki Sasaki [:aki] from comment #6)
> Removing 3.6.23 blocking + lowering priority.
> I wish we had a better answer than "let's hope it doesn't happen again" but
> not sure what we can do right now.

Varnish is the culprit here. Looks like either an admin restarted varnish or it cycled on its own, but from the graphs (and previous issues that have caused errors) it seems that varnish choked. It is fixed now, however.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Comment 8•13 years ago
Do we have a way to know when varnish goes into such a state?
Comment 9•13 years ago
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #8)
> Do we have a way to know when varnish goes into such state?

Yes, I mentioned seeing it in graphs (ganglia). Proactively? No, not today.
Comment 10•13 years ago
I didn't mean proactively. What I am trying to figure out is whether there could be a nagios check (for the next time) that would help us look into the state of varnish. If that can't be done, that's OK.
Comment 11•13 years ago
Right, a nagios check is a proactive one. We can probably hack something together, but all I am going by is a metric here, one that might not always be indicative of a fault.
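As a concrete (hypothetical) example of "hacking something together", a nagios-style plugin could simply probe the front end and map the response to the standard plugin exit codes. This is a sketch only: the URL and timeout are placeholders, and a plain HTTP probe is at best a rough proxy for "varnish is healthy".

#!/bin/bash
# Hypothetical check script (sketch): probe hgweb through varnish and report
# using nagios plugin exit codes (0=OK, 1=WARNING, 2=CRITICAL).
URL="http://hg.mozilla.org/build/tools"
code=$(curl -s -o /dev/null -m 15 -w '%{http_code}' "$URL")
case "$code" in
    200) echo "OK - hgweb answered 200"; exit 0 ;;
    502) echo "CRITICAL - 502 Bad Gateway from varnish"; exit 2 ;;
    000) echo "CRITICAL - no response within 15s"; exit 2 ;;
    *)   echo "WARNING - unexpected status $code"; exit 1 ;;
esac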
Comment 13•13 years ago
I was the one who restarted varnish, and we saw a recurrence of the problem *after* the restart. Granted, this could have been due to varnish having an empty cache, but I'm skeptical of that: the server load spikes were not really dampened at all compared to before the restart, even though the cache hit rate on varnish was over 50% by the time the problem hit.

Next time we need to check the output of 'varnishstat' before restarting it; this should give us some insight into varnish's behavior. The problem is, we don't have a good way of defining what the failure state is in terms we can check for. All we know currently is that server load on the backends starts to get bad. It could be an issue with varnish not effectively making room for new requests in the cache and the hit rate going to hell... we'll just have to watch it more closely next time and see if we can figure out specifically what is going wrong with varnish.
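As an example of the kind of data worth capturing next time, the overall hit rate can be derived from varnishstat's one-shot output (a sketch; the counter names are the varnish 2.x/3.x ones and may differ on the version deployed here):

# Sketch: compute the cache hit rate from varnishstat counters.
varnishstat -1 | awk '
    $1 == "cache_hit"  { hit  = $2 }
    $1 == "cache_miss" { miss = $2 }
    END {
        if (hit + miss > 0)
            printf "cache hit rate: %.1f%%\n", 100 * hit / (hit + miss)
        else
            print "no cache lookups recorded"
    }'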
Reporter
Comment 14•13 years ago
I'm re-opening this since this is still an open question, although there do not appear to be failures at the moment, so I'm leaving the priority at "normal". Moving to server ops releng as a holding pen for the moment. Ideally, we can figure out both what happened and how to detect this kind of failure in the future.
Assignee: server-ops → server-ops-releng
Status: RESOLVED → REOPENED
Component: Server Operations → Server Operations: RelEng
QA Contact: cshields → zandr
Resolution: FIXED → ---
Updated•13 years ago
Summary: hg clones failing with 502 Bad Gateway → issues with varnish cause high load on dm-vcview* and cause hg clone commands to fail
Reporter
Comment 15•13 years ago
Haven't seen this again, and given that there's not a lot of certainty about what went wrong, I don't think monitoring for a repeat is practical. I'll close this for now, but if we see this again it's still in bugzilla's history for linking against.
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Reporter
Comment 16•13 years ago
Something similar happened in bug 693202. Not clear if it's the same problem.
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations