Bug 687633 (Closed): Opened 13 years ago, Closed 13 years ago

issues with varnish cause high load on dm-vcview* and cause hg clone commands to fail

Categories: Infrastructure & Operations :: RelOps: General
Type: task
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED FIXED

People

(Reporter: dustin, Unassigned)

References

Details

hg clone http://hg.mozilla.org/build/tools tools
 in dir /home/cltbld/talos-slave/test/. (timeout 1320 secs)
 watching logfiles {}
 argv: ['hg', 'clone', 'http://hg.mozilla.org/build/tools', 'tools']
 environment:
  CVS_RSH=ssh
  DISPLAY=:0.0
  G_BROKEN_FILENAMES=1
  HISTCONTROL=ignoreboth
  HISTSIZE=1000
  HOME=/home/cltbld
  HOSTNAME=talos-r3-fed64-057
  LANG=en_US.UTF-8
  LESSOPEN=|/usr/bin/lesspipe.sh %s
  LOGNAME=cltbld
  MAIL=/var/spool/mail/cltbld
  PATH=/home/cltbld/bin:/tools/buildbot-0.8.0/bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/lib64/ccache:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
  PWD=/home/cltbld/talos-slave/test
  QTDIR=/usr/lib64/qt-3.3
  QTINC=/usr/lib64/qt-3.3/include
  QTLIB=/usr/lib64/qt-3.3/lib
  SHELL=/bin/bash
  SHLVL=1
  SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
  TERM=xterm
  USER=cltbld
  _=/home/cltbld/bin/python
 using PTY: False
abort: HTTP Error 502: Bad Gateway
program finished with exit code 255
This is hitting most of the builds, causing them to retry.
Assignee: server-ops-releng → server-ops
Severity: normal → critical
Component: Server Operations: RelEng → Server Operations
QA Contact: zandr → cshields
This seems to have fixed itself. Around this time there were some alerts in #sysadmins for dm-vcview02, which appear to be load issues. Those have resolved, and running curl in a loop against that URL results in "200 Script output follows" very reliably.
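(For whoever hits this next: a loop along the following lines is one way to do that kind of spot check. This is only a sketch; the URL, iteration count, and sleep interval are illustrative.)

  # Poll the hgweb URL and print the HTTP status code each time.
  for i in $(seq 1 20); do
    curl -s -o /dev/null -w "%{http_code}\n" http://hg.mozilla.org/build/tools
    sleep 1
  done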

As for what caused the load spike, I'm not entirely sure.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
This is still going on.  Jakem and I are looking at it.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 686888
Is this still happening? 

[shyam@boris ~]$ hg clone http://hg.mozilla.org/build/tools tools
requesting all changes
adding changesets
adding manifests
adding file changes
added 1807 changesets with 3853 changes to 823 files
updating to branch default
367 files updated, 0 files merged, 0 files removed, 0 files unresolved

Seems to be working fine.
Assignee: server-ops → shyam
This looks a lot better.
If you go back in time in https://tbpl.mozilla.org/ or https://tbpl.mozilla.org/?tree=Mozilla-Inbound you can see what was happening.
I'm going to guess this was completely load-based and load has dropped off.
Removing 3.6.23 blocking + lowering priority.
I wish we had a better answer than "let's hope it doesn't happen again" but not sure what we can do right now.
No longer blocks: 686888
Severity: critical → normal
Assignee: shyam → server-ops
(In reply to Aki Sasaki [:aki] from comment #6)
> Removing 3.6.23 blocking + lowering priority.
> I wish we had a better answer than "let's hope it doesn't happen again" but
> not sure what we can do right now.

Varnish is the culprit here.  Looks like either an admin restarted varnish or it cycled on its own, but from the graphs (and previous issues that have caused errors) it seems that varnish choked.  It is fixed now, however.
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Do we have a way to know when varnish goes into such a state?
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #8)
> Do we have a way to know when varnish goes into such a state?

Yes, I mentioned seeing it in graphs (ganglia).

Proactively?  No, not today.
I didn't mean proactively. What I am trying to figure out is whether there could be a nagios check (for the next time) that would help us look into the state of varnish. If that can't be done, that's OK.
Right, the nagios check is a proactive one.

We can probably hack something together, but all I am going by is a metric here, one that might not always be indicative of a fault.
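(One possible shape for "hacking something together" would be a standard Nagios-style plugin wrapped around a varnishstat counter. The sketch below is purely illustrative: backend_fail is the classic Varnish 2.x counter name and may differ by version, and the thresholds are made up.)

  #!/bin/bash
  # Hypothetical check: flag Varnish when backend connection failures pile up.
  # Note: varnishstat counters are cumulative since startup, so a real check
  # would track the delta between runs rather than the absolute value.
  WARN=10
  CRIT=100
  fails=$(varnishstat -1 | awk '/^backend_fail/ {print $2}')
  if [ -z "$fails" ]; then
    echo "UNKNOWN - could not read backend_fail from varnishstat"; exit 3
  elif [ "$fails" -ge "$CRIT" ]; then
    echo "CRITICAL - $fails backend connection failures"; exit 2
  elif [ "$fails" -ge "$WARN" ]; then
    echo "WARNING - $fails backend connection failures"; exit 1
  fi
  echo "OK - $fails backend connection failures"; exit 0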
I was the one who restarted varnish, and we saw a recurrence of the problem *after* the restart. Granted, this could have been due to varnish having an empty cache, but I'm skeptical of that. The server load spikes were not dampened at all compared to before the restart, even though the cache hit rate on varnish was over 50% by the time the problem hit.

Next time we need to check the output of 'varnishstat' before restarting it; this should give us some insight into varnish's behavior.

The problem is, we don't have a good way of defining what the failure state is in terms that we can check for. All we know currently is that server load on the backends starts to get bad. It could be an issue with varnish not effectively making room for new requests in the cache and the hit rate going to hell... we'll just have to watch it more closely next time and see if we can figure out specifically what is going wrong with varnish.
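(As a starting point for watching it, something like the following would snapshot the hit rate being referred to. This is a sketch: cache_hit/cache_miss are the classic Varnish 2.x counter names and may need adjusting, and since the counters are cumulative, comparing two snapshots taken a minute apart gives the recent rate rather than the lifetime one.)

  # Rough cache hit rate from varnishstat's cumulative counters.
  varnishstat -1 | awk '/^cache_hit /  {hit=$2}
                        /^cache_miss / {miss=$2}
                        END {if (hit+miss) printf "hit rate: %.1f%%\n", 100*hit/(hit+miss)}'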
I'm re-opening this since this is still an open question, although there do not appear to be failures at the moment, so I'm leaving the priority at "normal".  Moving to server ops releng as a holding pen for the moment.

Ideally, we can figure out both what happened and how to detect this kind of failure in the future.
Assignee: server-ops → server-ops-releng
Status: RESOLVED → REOPENED
Component: Server Operations → Server Operations: RelEng
QA Contact: cshields → zandr
Resolution: FIXED → ---
Summary: hg clones failing with 502 Bad Gateway → issues with varnish cause high load on dm-vcview* and cause hg clone commands to fail
Haven't seen this again, and given that there's not a lot of certainty about what went wrong, I don't think monitoring for a repeat is practical.  I'll close this for now, but if we see this again it's still in bugzilla's history for linking against.
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Something similar happened in bug 693202.  Not clear if it's the same problem.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations