Bug 687633 (Closed): Opened 13 years ago, Closed 13 years ago

issues with varnish cause high load on dm-vcview* and cause hg clone commands to fail

Categories: Infrastructure & Operations :: RelOps: General
Type: task
Priority: Not set
Severity: normal

Tracking: Not tracked
Status: RESOLVED FIXED

People

(Reporter: dustin, Unassigned)

References

Details

hg clone http://hg.mozilla.org/build/tools tools
 in dir /home/cltbld/talos-slave/test/. (timeout 1320 secs)
 watching logfiles {}
 argv: ['hg', 'clone', 'http://hg.mozilla.org/build/tools', 'tools']
 environment:
  CVS_RSH=ssh
  DISPLAY=:0.0
  G_BROKEN_FILENAMES=1
  HISTCONTROL=ignoreboth
  HISTSIZE=1000
  HOME=/home/cltbld
  HOSTNAME=talos-r3-fed64-057
  LANG=en_US.UTF-8
  LESSOPEN=|/usr/bin/lesspipe.sh %s
  LOGNAME=cltbld
  MAIL=/var/spool/mail/cltbld
  PATH=/home/cltbld/bin:/tools/buildbot-0.8.0/bin:/usr/lib64/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/lib64/ccache:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin
  PWD=/home/cltbld/talos-slave/test
  QTDIR=/usr/lib64/qt-3.3
  QTINC=/usr/lib64/qt-3.3/include
  QTLIB=/usr/lib64/qt-3.3/lib
  SHELL=/bin/bash
  SHLVL=1
  SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
  TERM=xterm
  USER=cltbld
  _=/home/cltbld/bin/python
 using PTY: False
abort: HTTP Error 502: Bad Gateway
program finished with exit code 255
This is hitting most of the builds, causing them to retry.
Assignee: server-ops-releng → server-ops
Severity: normal → critical
Component: Server Operations: RelEng → Server Operations
QA Contact: zandr → cshields
This seems to have fixed itself. Around this time there were some alerts in #sysadmins for dm-vcview02, which appear to be load issues. Those have resolved, and running curl in a loop against that URL results in "200 Script output follows" very reliably.
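(For whoever hits this next: a loop along the following lines is one way to do that kind of spot check. This is only a sketch; the URL, iteration count, and sleep interval are illustrative.)

  # Poll the hgweb URL and print the HTTP status code each time.
  for i in $(seq 1 20); do
    curl -s -o /dev/null -w "%{http_code}\n" http://hg.mozilla.org/build/tools
    sleep 1
  done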

As for what caused the load spike, I'm not entirely sure.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
This is still going on.  Jakem and I are looking at it.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 686888
Is this still happening? 

[shyam@boris ~]$ hg clone http://hg.mozilla.org/build/tools tools
requesting all changes
adding changesets
adding manifests
adding file changes
added 1807 changesets with 3853 changes to 823 files
updating to branch default
367 files updated, 0 files merged, 0 files removed, 0 files unresolved

Seems to be working fine.
Assignee: server-ops → shyam
This looks a lot better.
If you go back in time in https://tbpl.mozilla.org/ or https://tbpl.mozilla.org/?tree=Mozilla-Inbound you can see what was happening.
I'm going to guess this was completely load-based and load has dropped off.
Removing 3.6.23 blocking + lowering priority.
I wish we had a better answer than "let's hope it doesn't happen again" but not sure what we can do right now.
No longer blocks: 686888
Severity: critical → normal
Assignee: shyam → server-ops
(In reply to Aki Sasaki [:aki] from comment #6)
> Removing 3.6.23 blocking + lowering priority.
> I wish we had a better answer than "let's hope it doesn't happen again" but
> not sure what we can do right now.

Varnish is the culprit here.  Looks like either an admin restarted varnish or it cycled on its own, but from the graphs (and previous issues that have caused errors) it seems that varnish choked.  It is fixed now, however.
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Do we have a way to know when varnish goes into such a state?
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #8)
> Do we have a way to know when varnish goes into such a state?

Yes, I mentioned seeing it in graphs (ganglia).

Proactively?  No, not today.
I didn't mean proactively. What I am trying to figure out is whether there could be a nagios check (for the next time) that would help us look into the state of varnish. If that can't be done, that's OK.
Right, the nagios check is a proactive one.

We can probably hack something together, but all I am going by is a metric here, one that might not always be indicative of a fault.
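(One possible shape for "hacking something together" would be a standard Nagios-style plugin wrapped around a varnishstat counter. The sketch below is purely illustrative: backend_fail is the classic Varnish 2.x counter name and may differ by version, and the thresholds are made up.)

  #!/bin/bash
  # Hypothetical check: flag Varnish when backend connection failures pile up.
  # Note: varnishstat counters are cumulative since startup, so a real check
  # would track the delta between runs rather than the absolute value.
  WARN=10
  CRIT=100
  fails=$(varnishstat -1 | awk '/^backend_fail/ {print $2}')
  if [ -z "$fails" ]; then
    echo "UNKNOWN - could not read backend_fail from varnishstat"; exit 3
  elif [ "$fails" -ge "$CRIT" ]; then
    echo "CRITICAL - $fails backend connection failures"; exit 2
  elif [ "$fails" -ge "$WARN" ]; then
    echo "WARNING - $fails backend connection failures"; exit 1
  fi
  echo "OK - $fails backend connection failures"; exit 0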
I was the one who restarted varnish, and we saw a recurrence of the problem *after* the restart. Granted, this could have been due to varnish having an empty cache, but I'm skeptical of that. The server load spikes were not dampened at all compared to before the restart, even though the cache hit rate on varnish was over 50% by the time the problem hit.

Next time we need to check the output of 'varnishstat' before restarting it; this should give us some insight into varnish's behavior.

The problem is, we don't have a good way of defining what the failure state is in terms that we can check for. All we know currently is that server load on the backends starts to get bad. It could be an issue with varnish not effectively making room for new requests in the cache and the hit rate going to hell... we'll just have to watch it more closely next time and see if we can figure out specifically what is going wrong with varnish.
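(As a starting point for watching it, something like the following would snapshot the hit rate being referred to. This is a sketch: cache_hit/cache_miss are the classic Varnish 2.x counter names and may need adjusting, and since the counters are cumulative, comparing two snapshots taken a minute apart gives the recent rate rather than the lifetime one.)

  # Rough cache hit rate from varnishstat's cumulative counters.
  varnishstat -1 | awk '/^cache_hit /  {hit=$2}
                        /^cache_miss / {miss=$2}
                        END {if (hit+miss) printf "hit rate: %.1f%%\n", 100*hit/(hit+miss)}'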
I'm re-opening this since this is still an open question, although there do not appear to be failures at the moment, so I'm leaving the priority at "normal".  Moving to server ops releng as a holding pen for the moment.

Ideally, we can figure out both what happened and how to detect this kind of failure in the future.
Assignee: server-ops → server-ops-releng
Status: RESOLVED → REOPENED
Component: Server Operations → Server Operations: RelEng
QA Contact: cshields → zandr
Resolution: FIXED → ---
Summary: hg clones failing with 502 Bad Gateway → issues with varnish cause high load on dm-vcview* and cause hg clone commands to fail
Haven't seen this again, and given that there's not a lot of certainty about what went wrong, I don't think monitoring for a repeat is practical.  I'll close this for now, but if we see this again it's still in bugzilla's history for linking against.
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Something similar happened in bug 693202.  Not clear if it's the same problem.
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations