Closed Bug 862174 Opened 11 years ago Closed 11 years ago

Lots of Command Queue nagios alerts from new AWS masters

Categories

(Release Engineering :: General, defect)

x86
All
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Unassigned)

References

Details

Attachments

(2 files)

Typically:
Mon 18:44:17 PDT [487] buildbot-master52.srv.releng.use1.mozilla.com:Command Queue is CRITICAL: 21 new items

which means they are processing statusdb, log upload, and/or pulse slowly.
I verified they're running the same nagios command as older masters, tools/buildbot-helpers/check_queuedir.py, which has defaults built in. bm51 and bm17 have different versions of that file but they only differ by some PEP8 changes - http://hg.mozilla.org/build/tools/rev/47b77f5fcd44.

Is it expected that /builds/buildbot/queue/tools isn't getting updated ?
(In reply to Nick Thomas [:nthomas] from comment #1)
> Is it expected that /builds/buildbot/queue/tools isn't getting updated ?

Kind of expected. We explicitly update it: http://hg.mozilla.org/build/tools/file/676b39366098/lib/python/util/fabric/actions.py#l205
Quieted nagios with https://hg.mozilla.org/build/tools/rev/70f9ebef61dc and am about to deploy this.

r=aki over IRC
Hrm this is actually interested -- notice dates:

[cltbld@buildbot-master77 ~]$ ls -l /dev/shm/queue/commands/cur/
total 4
-rw------- 1 cltbld cltbld 437 Apr 25 16:18 1365642642-12531-2161wsPJyn
[cltbld@buildbot-master77 ~]$ ls -l /dev/shm/queue/commands/new/
total 420
-rw------- 1 cltbld cltbld 441 Apr 25 16:03 1365642642-12532-2161mRWBS4
-rw------- 1 cltbld cltbld 441 Apr 25 16:03 1365642642-12533-2161RrSq3R
-rw------- 1 cltbld cltbld 455 Apr 25 16:04 1365642642-12534-2161OAuAbK
-rw------- 1 cltbld cltbld 458 Apr 25 16:04 1365642642-12535-2161b6ys9a
-rw------- 1 cltbld cltbld 445 Apr 25 16:07 1365642642-12536-2161qfI2xr
-rw------- 1 cltbld cltbld 459 Apr 25 16:09 1365642642-12537-2161FfPnBr
-rw------- 1 cltbld cltbld 457 Apr 25 16:10 1365642642-12538-2161YRDMBd
-rw------- 1 cltbld cltbld 457 Apr 25 16:12 1365642642-12539-2161AxqKJJ
-rw------- 1 cltbld cltbld 447 Apr 25 16:12 1365642642-12540-216194UXzf
-rw------- 1 cltbld cltbld 445 Apr 25 16:13 1365642642-12541-2161hZv8a1
-rw------- 1 cltbld cltbld 459 Apr 25 16:13 1365642642-12542-21618udYw1
-rw------- 1 cltbld cltbld 455 Apr 25 16:14 1365642642-12543-2161B9SBbu
-rw------- 1 cltbld cltbld 445 Apr 25 16:15 1365642642-12544-2161UU4zJo
-rw------- 1 cltbld cltbld 447 Apr 25 16:16 1365642642-12545-2161gh6UdN
-rw------- 1 cltbld cltbld 459 Apr 25 16:16 1365642642-12546-2161xv0roI
-rw------- 1 cltbld cltbld 442 Apr 25 16:17 1365642642-12547-2161tklFv8
-rw------- 1 cltbld cltbld 454 Apr 25 16:18 1365642642-12548-2161gdLGhz
-rw------- 1 cltbld cltbld 657 Apr 25 16:02 1366930958-0-29624uAKyPD
-rw------- 1 cltbld cltbld 490 Apr 25 16:02 1366930958-1-2962496OV_9
-rw------- 1 cltbld cltbld 677 Apr 25 16:03 1366930973-0-29639DciNRb
-rw------- 1 cltbld cltbld 629 Apr 25 16:03 1366930986-0-29646eN0zb6
-rw------- 1 cltbld cltbld 462 Apr 25 16:03 1366930986-1-29646PjzT6A
-rw------- 1 cltbld cltbld 658 Apr 25 16:03 1366931001-0-29659gk2m2e
-rw------- 1 cltbld cltbld 643 Apr 25 16:03 1366931022-0-29671Eh2Suy
-rw------- 1 cltbld cltbld 476 Apr 25 16:03 1366931022-1-29671v5KpVq
-rw------- 1 cltbld cltbld 678 Apr 25 16:04 1366931035-0-296864_NXgd
-rw------- 1 cltbld cltbld 627 Apr 25 16:04 1366931048-0-29698naFhhc
-rw------- 1 cltbld cltbld 460 Apr 25 16:04 1366931048-1-296980_UNdj
-rw------- 1 cltbld cltbld 646 Apr 25 16:04 1366931061-0-29713tQywSG
-rw------- 1 cltbld cltbld 635 Apr 25 16:04 1366931075-0-29730eJxQqq
-rw------- 1 cltbld cltbld 468 Apr 25 16:04 1366931075-1-297306lmjJX
-rw------- 1 cltbld cltbld 668 Apr 25 16:04 1366931087-0-297431Jxri6
-rw------- 1 cltbld cltbld 671 Apr 25 16:05 1366931103-0-29759QYAbJ4
-rw------- 1 cltbld cltbld 504 Apr 25 16:05 1366931103-1-297591MTfaU
-rw------- 1 cltbld cltbld 703 Apr 25 16:05 1366931117-0-29772lL2w4g
-rw------- 1 cltbld cltbld 629 Apr 25 16:05 1366931135-0-29783Bcl_Go
-rw------- 1 cltbld cltbld 462 Apr 25 16:05 1366931135-1-29783mJN169
-rw------- 1 cltbld cltbld 627 Apr 25 16:05 1366931147-0-29796dfxsYk
-rw------- 1 cltbld cltbld 460 Apr 25 16:05 1366931147-1-2979690ltuH
-rw------- 1 cltbld cltbld 668 Apr 25 16:06 1366931161-0-29811vTtiBu
-rw------- 1 cltbld cltbld 631 Apr 25 16:06 1366931176-0-29814wNlEVZ
-rw------- 1 cltbld cltbld 464 Apr 25 16:06 1366931176-1-29814T398ch
-rw------- 1 cltbld cltbld 635 Apr 25 16:06 1366931188-0-29827BKSpdm
-rw------- 1 cltbld cltbld 468 Apr 25 16:06 1366931188-1-298277y51Y3
-rw------- 1 cltbld cltbld 668 Apr 25 16:06 1366931202-0-29840Asn1yF
-rw------- 1 cltbld cltbld 631 Apr 25 16:07 1366931218-0-29856jdqWRF
-rw------- 1 cltbld cltbld 464 Apr 25 16:07 1366931218-1-29856KCBWvY
-rw------- 1 cltbld cltbld 668 Apr 25 16:07 1366931234-0-29871JLBcd3
-rw------- 1 cltbld cltbld 648 Apr 25 16:07 1366931247-0-29874zslCHI
-rw------- 1 cltbld cltbld 481 Apr 25 16:07 1366931247-1-29874S9yEu0
-rw------- 1 cltbld cltbld 699 Apr 25 16:07 1366931260-0-29887St51qv
-rw------- 1 cltbld cltbld 619 Apr 25 16:08 1366931281-0-29897wiVIUO
-rw------- 1 cltbld cltbld 452 Apr 25 16:08 1366931281-1-29897imkQaH
-rw------- 1 cltbld cltbld 695 Apr 25 16:08 1366931293-0-29910TAhMMc
-rw------- 1 cltbld cltbld 669 Apr 25 16:08 1366931316-0-299208d1zCl
-rw------- 1 cltbld cltbld 637 Apr 25 16:09 1366931330-0-29925guk0Cc
-rw------- 1 cltbld cltbld 470 Apr 25 16:09 1366931330-1-29925D6PkRy
-rw------- 1 cltbld cltbld 626 Apr 25 16:09 1366931345-0-29938exlRSv
-rw------- 1 cltbld cltbld 459 Apr 25 16:09 1366931345-1-29938tz0fP8
-rw------- 1 cltbld cltbld 631 Apr 25 16:09 1366931361-0-29963iUipsc
-rw------- 1 cltbld cltbld 464 Apr 25 16:09 1366931361-1-29963mGBpJd
-rw------- 1 cltbld cltbld 705 Apr 25 16:09 1366931373-0-29990UisnB4
-rw------- 1 cltbld cltbld 654 Apr 25 16:10 1366931388-0-30015DR26Q5
-rw------- 1 cltbld cltbld 641 Apr 25 16:10 1366931403-0-30022lwKigT
-rw------- 1 cltbld cltbld 474 Apr 25 16:10 1366931403-1-30022wELkcW
-rw------- 1 cltbld cltbld 627 Apr 25 16:10 1366931417-0-30035jJRLnF
-rw------- 1 cltbld cltbld 460 Apr 25 16:10 1366931417-1-300355U0Hzz
-rw------- 1 cltbld cltbld 660 Apr 25 16:10 1366931436-0-30098OFlnJY
-rw------- 1 cltbld cltbld 694 Apr 25 16:11 1366931450-0-30103pfAt_Z
-rw------- 1 cltbld cltbld 652 Apr 25 16:11 1366931468-0-30121lfO4Wa
-rw------- 1 cltbld cltbld 678 Apr 25 16:11 1366931483-0-301243phPbW
-rw------- 1 cltbld cltbld 652 Apr 25 16:11 1366931498-0-301318Fl8D_
-rw------- 1 cltbld cltbld 621 Apr 25 16:12 1366931511-0-30134eWvWBX
-rw------- 1 cltbld cltbld 454 Apr 25 16:12 1366931511-1-30134pDdffV
-rw------- 1 cltbld cltbld 703 Apr 25 16:12 1366931523-0-30147h0y8oB
-rw------- 1 cltbld cltbld 701 Apr 25 16:12 1366931540-0-301501BJ_Vh
-rw------- 1 cltbld cltbld 630 Apr 25 16:12 1366931564-0-30159Nj9SHg
-rw------- 1 cltbld cltbld 463 Apr 25 16:12 1366931564-1-30159ez2Y5Q
-rw------- 1 cltbld cltbld 648 Apr 25 16:13 1366931576-0-30172Lt9UUD
-rw------- 1 cltbld cltbld 629 Apr 25 16:13 1366931598-0-301831g1Vyp
-rw------- 1 cltbld cltbld 462 Apr 25 16:13 1366931598-1-30183WnEyal
-rw------- 1 cltbld cltbld 705 Apr 25 16:13 1366931616-0-30199ghvL6j
-rw------- 1 cltbld cltbld 656 Apr 25 16:14 1366931631-0-30204Pq396N
-rw------- 1 cltbld cltbld 705 Apr 25 16:14 1366931647-0-30210QHyPcO
-rw------- 1 cltbld cltbld 693 Apr 25 16:14 1366931667-0-30219NrDjZx
-rw------- 1 cltbld cltbld 691 Apr 25 16:15 1366931682-0-302223wQ9Dp
-rw------- 1 cltbld cltbld 709 Apr 25 16:15 1366931706-0-30232AgHCu1
-rw------- 1 cltbld cltbld 638 Apr 25 16:15 1366931723-0-30235_zRG50
-rw------- 1 cltbld cltbld 471 Apr 25 16:15 1366931723-1-30235OoxZdN
-rw------- 1 cltbld cltbld 693 Apr 25 16:15 1366931737-0-302839sSoSg
-rw------- 1 cltbld cltbld 703 Apr 25 16:16 1366931754-0-30289Gm2o0A
-rw------- 1 cltbld cltbld 676 Apr 25 16:16 1366931775-0-30295_A5I3J
-rw------- 1 cltbld cltbld 697 Apr 25 16:16 1366931789-0-30308jtUf70
-rw------- 1 cltbld cltbld 631 Apr 25 16:16 1366931803-0-30311W7sA9t
-rw------- 1 cltbld cltbld 464 Apr 25 16:16 1366931803-1-30311iI87yH
-rw------- 1 cltbld cltbld 695 Apr 25 16:17 1366931818-0-30324RmRDHx
-rw------- 1 cltbld cltbld 623 Apr 25 16:17 1366931836-0-30327gxo2W1
-rw------- 1 cltbld cltbld 456 Apr 25 16:17 1366931836-1-303272IMPA4
-rw------- 1 cltbld cltbld 650 Apr 25 16:17 1366931848-0-30354mduv9V
-rw------- 1 cltbld cltbld 624 Apr 25 16:17 1366931867-0-30362IMo8Ma
-rw------- 1 cltbld cltbld 457 Apr 25 16:17 1366931867-1-30362Z07cdU
-rw------- 1 cltbld cltbld 662 Apr 25 16:18 1366931879-0-303776JQoUB
-rw------- 1 cltbld cltbld 658 Apr 25 16:18 1366931899-0-30418V1ppYT
-rw------- 1 cltbld cltbld 631 Apr 25 16:18 1366931915-0-30424vTdxDp
-rw------- 1 cltbld cltbld 464 Apr 25 16:18 1366931915-1-30424BAAn2T
Callek meant that we might not be processing the oldest jobs first, but I think it's just that the file is opened for writing in cur/ so the timestamp is off.

Some things I noticed:
* concurrency defaults to 1 in command_runner.py, we could try turning this up if we'd not be blocked on table locking in statusdb
* we don't log how long the statusdb updates or pulse messaging take

* on bm77, log_uploader.py takes 10-20 seconds to compress and upload a log to stage, no info on which is the slow part of that
* stage doesn't look like the bottleneck, not many concurrent connections or high load
* contacting stage from bm77 in use1 takes significantly longer: ssh to stage to run hostname takes 2s, vs 0.2s from bm12 in scl1. log uploading makes lots of these connections so this is a bit painful
Temporary for debugging.
Attachment #742153 - Flags: review?(catlee)
Comment on attachment 742153 [details] [diff] [review]
[buildbotcustom] More verbose logging in postrun

r=Callek on irc

https://hg.mozilla.org/build/buildbotcustom/rev/684d4241af5f + production-0.8

Updated masters (no reconfig needed).
Attachment #742153 - Flags: review?(catlee) → review+
Comment on attachment 742151 [details] [diff] [review]
[tools] Log total time for queue job

r=Callek on irc.
https://hg.mozilla.org/build/tools/rev/4580ea535a5d

Updated the checkouts on the masters and restarted the command_runner service.
Attachment #742151 - Flags: review?(catlee)
Attachment #742151 - Flags: review+
Attachment #742151 - Flags: checked-in+
Attachment #742153 - Flags: checked-in+
(In reply to Nick Thomas [:nthomas] from comment #10)
> Updated the checkouts on the masters and restarted the command_runner
> service.

I did this by being root, and going 'service command_runner restart'. Puppet has decided that it doesn't like the running service and does
Apr 25 22:20:33 buildbot-master53 puppet-agent[27342]: (/Stage[main]/Buildmaster::Queue/Service[command_runner]/ensure) ensure changed 'stopped' to 'running'

For extra fun it does that every 30 minutes, so something has confused it. This is on old and AWS masters, which is both puppet systems AIUI.
Fixed the nagios alerts about more than command_runner process. I suspect restart doesn't always leave the pid file in the right state for status to respond correctly, resulting in puppet starting it up again. If you kill a duplicate process nicely then the pid file will be wiped and status will say nothing is running; kill it with -9, or put back the pid file's contents, and you're OK.

Anyway, to get back on track ... I've set up rsync's in screen from /dev/shm/queue/command/logs/ to ~cltbld/bug862174/ on
 buildbot-master13.build.scl1.mozilla.com        (build, as a baseline)
 buildbot-master57.srv.releng.use1.mozilla.com   (build)
 buildbot-master58.srv.releng.usw2.mozilla.com   (build)
 buildbot-master51.srv.releng.use1.mozilla.com   (tests1-linux)
to give us some history to dig into.

StatusDB logging comes from 
 http://hg.mozilla.org/build/buildbotcustom/file/default/bin/postrun.py#l206
bm67 was alerting, so I hopped on and took a look

$ cd /dev/shm/queue/commands/logs
# Log uploading is first; how long does it take?
$ grep -l 'post_upload' *.log | xargs grep Elapsed
1366731323-15948-300183dtwD0.log:Result: 0, Elapsed: 18.0 seconds
1366731323-15949-30018rOcuCD.log:Result: 0, Elapsed: 12.0 seconds
1366731323-15950-30018W5ZLTJ.log:Result: 0, Elapsed: 12.0 seconds
1366731323-15951-30018XORHax.log:Result: 0, Elapsed: 12.0 seconds
1366731323-15952-30018KKEYyb.log:Result: 0, Elapsed: 13.0 seconds
1366731323-15953-30018J453jf.log:Result: 0, Elapsed: 12.0 seconds
1366731323-15954-30018kMJJCl.log:Result: 0, Elapsed: 13.0 seconds

# Updating db is next; how long does it take?
$ grep -l 'adding to statusdb' *.log | xargs grep Elapsed
1367531149-0-16595j7a89X.log:Result: 0, Elapsed: 15.0 seconds
1367531865-0-17624gd4nDR.log:Result: 0, Elapsed: 14.0 seconds
1367531894-0-17645pU2wvP.log:Result: 0, Elapsed: 14.0 seconds
1367531926-0-17665UC9Z9J.log:Result: 0, Elapsed: 17.0 seconds
1367531939-0-176788GmZS4.log:Result: 0, Elapsed: 15.0 seconds
1367531965-0-17694UTCslb.log:Result: 0, Elapsed: 14.0 seconds
1367532010-0-1773198EUpO.log:Result: 0, Elapsed: 15.0 seconds
1367532148-0-17780jcuXJK.log:Result: 0, Elapsed: 15.0 seconds
1367532280-0-17819ugtNCc.log:Result: 0, Elapsed: 18.0 seconds
1367532292-0-17836IlbeTS.log:Result: 0, Elapsed: 22.0 seconds
1367532330-0-17859UKdA_B.log:Result: 0, Elapsed: 14.0 seconds

# Pulse is last; how long does it take?
$ grep -l pulse *.log | xargs grep Elapsed
1367531164-0-16608dh5q7Q.log:Result: 0, Elapsed: 3.0 seconds
1367531951-0-17691Eu8bpS.log:Result: 0, Elapsed: 3.0 seconds
1367531981-0-17710QwOQZy.log:Result: 0, Elapsed: 3.0 seconds
1367531995-0-17717jbU1p8.log:Result: 0, Elapsed: 3.0 seconds
1367532314-0-17856tLKl1u.log:Result: 0, Elapsed: 3.0 seconds
1367532343-0-17880RuydUs.log:Result: 0, Elapsed: 3.0 seconds

So our bottlenecks here are log uploading and updating the db. It looks like the bulk of the log uploading time is spent transferring the data.

From "Updating statusdb" to "searching for build" takes ~5s
http://hg.mozilla.org/build/buildbotcustom/file/13967138af38/bin/postrun.py#l207
http://hg.mozilla.org/build/buildbotcustom/file/13967138af38/bin/postrun.py#l225

From "creating new build" to "committed" here takes ~11s for updating db
http://hg.mozilla.org/build/buildbotcustom/file/13967138af38/bin/postrun.py#l234
http://hg.mozilla.org/build/buildbotcustom/file/13967138af38/bin/postrun.py#l241
Depends on: 868241
Committed https://hg.mozilla.org/build/puppet/rev/5f497d58cf3c to run the command queue with -j4.
Depends on: 869749
Haven't seen any of these since the changes landed on Tuesday.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Found and stopped the rsync commands running from comment 12
deleted the backup directories
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: