Closed
Bug 862174
Opened 11 years ago
Closed 11 years ago
Lots of Command Queue nagios alerts from new AWS masters
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Unassigned)
References
Details
Attachments
(2 files)
877 bytes,
patch
|
nthomas
:
review+
nthomas
:
checked-in+
|
Details | Diff | Splinter Review |
676 bytes,
patch
|
nthomas
:
review+
nthomas
:
checked-in+
|
Details | Diff | Splinter Review |
Typically: Mon 18:44:17 PDT [487] buildbot-master52.srv.releng.use1.mozilla.com:Command Queue is CRITICAL: 21 new items which means they are processing statusdb, log upload, and/or pulse slowly.
Reporter | ||
Comment 1•11 years ago
|
||
I verified they're running the same nagios command as older masters, tools/buildbot-helpers/check_queuedir.py, which has defaults built in. bm51 and bm17 have different versions of that file but they only differ by some PEP8 changes - http://hg.mozilla.org/build/tools/rev/47b77f5fcd44. Is it expected that /builds/buildbot/queue/tools isn't getting updated ?
Comment 2•11 years ago
|
||
(In reply to Nick Thomas [:nthomas] from comment #1) > Is it expected that /builds/buildbot/queue/tools isn't getting updated ? Kind of expected. We explicitly update it: http://hg.mozilla.org/build/tools/file/676b39366098/lib/python/util/fabric/actions.py#l205
Comment 4•11 years ago
|
||
Quieted nagios with https://hg.mozilla.org/build/tools/rev/70f9ebef61dc and am about to deploy this. r=aki over IRC
Comment 5•11 years ago
|
||
Hrm this is actually interested -- notice dates: [cltbld@buildbot-master77 ~]$ ls -l /dev/shm/queue/commands/cur/ total 4 -rw------- 1 cltbld cltbld 437 Apr 25 16:18 1365642642-12531-2161wsPJyn [cltbld@buildbot-master77 ~]$ ls -l /dev/shm/queue/commands/new/ total 420 -rw------- 1 cltbld cltbld 441 Apr 25 16:03 1365642642-12532-2161mRWBS4 -rw------- 1 cltbld cltbld 441 Apr 25 16:03 1365642642-12533-2161RrSq3R -rw------- 1 cltbld cltbld 455 Apr 25 16:04 1365642642-12534-2161OAuAbK -rw------- 1 cltbld cltbld 458 Apr 25 16:04 1365642642-12535-2161b6ys9a -rw------- 1 cltbld cltbld 445 Apr 25 16:07 1365642642-12536-2161qfI2xr -rw------- 1 cltbld cltbld 459 Apr 25 16:09 1365642642-12537-2161FfPnBr -rw------- 1 cltbld cltbld 457 Apr 25 16:10 1365642642-12538-2161YRDMBd -rw------- 1 cltbld cltbld 457 Apr 25 16:12 1365642642-12539-2161AxqKJJ -rw------- 1 cltbld cltbld 447 Apr 25 16:12 1365642642-12540-216194UXzf -rw------- 1 cltbld cltbld 445 Apr 25 16:13 1365642642-12541-2161hZv8a1 -rw------- 1 cltbld cltbld 459 Apr 25 16:13 1365642642-12542-21618udYw1 -rw------- 1 cltbld cltbld 455 Apr 25 16:14 1365642642-12543-2161B9SBbu -rw------- 1 cltbld cltbld 445 Apr 25 16:15 1365642642-12544-2161UU4zJo -rw------- 1 cltbld cltbld 447 Apr 25 16:16 1365642642-12545-2161gh6UdN -rw------- 1 cltbld cltbld 459 Apr 25 16:16 1365642642-12546-2161xv0roI -rw------- 1 cltbld cltbld 442 Apr 25 16:17 1365642642-12547-2161tklFv8 -rw------- 1 cltbld cltbld 454 Apr 25 16:18 1365642642-12548-2161gdLGhz -rw------- 1 cltbld cltbld 657 Apr 25 16:02 1366930958-0-29624uAKyPD -rw------- 1 cltbld cltbld 490 Apr 25 16:02 1366930958-1-2962496OV_9 -rw------- 1 cltbld cltbld 677 Apr 25 16:03 1366930973-0-29639DciNRb -rw------- 1 cltbld cltbld 629 Apr 25 16:03 1366930986-0-29646eN0zb6 -rw------- 1 cltbld cltbld 462 Apr 25 16:03 1366930986-1-29646PjzT6A -rw------- 1 cltbld cltbld 658 Apr 25 16:03 1366931001-0-29659gk2m2e -rw------- 1 cltbld cltbld 643 Apr 25 16:03 1366931022-0-29671Eh2Suy -rw------- 1 cltbld cltbld 476 Apr 25 16:03 1366931022-1-29671v5KpVq -rw------- 1 cltbld cltbld 678 Apr 25 16:04 1366931035-0-296864_NXgd -rw------- 1 cltbld cltbld 627 Apr 25 16:04 1366931048-0-29698naFhhc -rw------- 1 cltbld cltbld 460 Apr 25 16:04 1366931048-1-296980_UNdj -rw------- 1 cltbld cltbld 646 Apr 25 16:04 1366931061-0-29713tQywSG -rw------- 1 cltbld cltbld 635 Apr 25 16:04 1366931075-0-29730eJxQqq -rw------- 1 cltbld cltbld 468 Apr 25 16:04 1366931075-1-297306lmjJX -rw------- 1 cltbld cltbld 668 Apr 25 16:04 1366931087-0-297431Jxri6 -rw------- 1 cltbld cltbld 671 Apr 25 16:05 1366931103-0-29759QYAbJ4 -rw------- 1 cltbld cltbld 504 Apr 25 16:05 1366931103-1-297591MTfaU -rw------- 1 cltbld cltbld 703 Apr 25 16:05 1366931117-0-29772lL2w4g -rw------- 1 cltbld cltbld 629 Apr 25 16:05 1366931135-0-29783Bcl_Go -rw------- 1 cltbld cltbld 462 Apr 25 16:05 1366931135-1-29783mJN169 -rw------- 1 cltbld cltbld 627 Apr 25 16:05 1366931147-0-29796dfxsYk -rw------- 1 cltbld cltbld 460 Apr 25 16:05 1366931147-1-2979690ltuH -rw------- 1 cltbld cltbld 668 Apr 25 16:06 1366931161-0-29811vTtiBu -rw------- 1 cltbld cltbld 631 Apr 25 16:06 1366931176-0-29814wNlEVZ -rw------- 1 cltbld cltbld 464 Apr 25 16:06 1366931176-1-29814T398ch -rw------- 1 cltbld cltbld 635 Apr 25 16:06 1366931188-0-29827BKSpdm -rw------- 1 cltbld cltbld 468 Apr 25 16:06 1366931188-1-298277y51Y3 -rw------- 1 cltbld cltbld 668 Apr 25 16:06 1366931202-0-29840Asn1yF -rw------- 1 cltbld cltbld 631 Apr 25 16:07 1366931218-0-29856jdqWRF -rw------- 1 cltbld cltbld 464 Apr 25 16:07 1366931218-1-29856KCBWvY -rw------- 1 cltbld cltbld 668 Apr 25 16:07 1366931234-0-29871JLBcd3 -rw------- 1 cltbld cltbld 648 Apr 25 16:07 1366931247-0-29874zslCHI -rw------- 1 cltbld cltbld 481 Apr 25 16:07 1366931247-1-29874S9yEu0 -rw------- 1 cltbld cltbld 699 Apr 25 16:07 1366931260-0-29887St51qv -rw------- 1 cltbld cltbld 619 Apr 25 16:08 1366931281-0-29897wiVIUO -rw------- 1 cltbld cltbld 452 Apr 25 16:08 1366931281-1-29897imkQaH -rw------- 1 cltbld cltbld 695 Apr 25 16:08 1366931293-0-29910TAhMMc -rw------- 1 cltbld cltbld 669 Apr 25 16:08 1366931316-0-299208d1zCl -rw------- 1 cltbld cltbld 637 Apr 25 16:09 1366931330-0-29925guk0Cc -rw------- 1 cltbld cltbld 470 Apr 25 16:09 1366931330-1-29925D6PkRy -rw------- 1 cltbld cltbld 626 Apr 25 16:09 1366931345-0-29938exlRSv -rw------- 1 cltbld cltbld 459 Apr 25 16:09 1366931345-1-29938tz0fP8 -rw------- 1 cltbld cltbld 631 Apr 25 16:09 1366931361-0-29963iUipsc -rw------- 1 cltbld cltbld 464 Apr 25 16:09 1366931361-1-29963mGBpJd -rw------- 1 cltbld cltbld 705 Apr 25 16:09 1366931373-0-29990UisnB4 -rw------- 1 cltbld cltbld 654 Apr 25 16:10 1366931388-0-30015DR26Q5 -rw------- 1 cltbld cltbld 641 Apr 25 16:10 1366931403-0-30022lwKigT -rw------- 1 cltbld cltbld 474 Apr 25 16:10 1366931403-1-30022wELkcW -rw------- 1 cltbld cltbld 627 Apr 25 16:10 1366931417-0-30035jJRLnF -rw------- 1 cltbld cltbld 460 Apr 25 16:10 1366931417-1-300355U0Hzz -rw------- 1 cltbld cltbld 660 Apr 25 16:10 1366931436-0-30098OFlnJY -rw------- 1 cltbld cltbld 694 Apr 25 16:11 1366931450-0-30103pfAt_Z -rw------- 1 cltbld cltbld 652 Apr 25 16:11 1366931468-0-30121lfO4Wa -rw------- 1 cltbld cltbld 678 Apr 25 16:11 1366931483-0-301243phPbW -rw------- 1 cltbld cltbld 652 Apr 25 16:11 1366931498-0-301318Fl8D_ -rw------- 1 cltbld cltbld 621 Apr 25 16:12 1366931511-0-30134eWvWBX -rw------- 1 cltbld cltbld 454 Apr 25 16:12 1366931511-1-30134pDdffV -rw------- 1 cltbld cltbld 703 Apr 25 16:12 1366931523-0-30147h0y8oB -rw------- 1 cltbld cltbld 701 Apr 25 16:12 1366931540-0-301501BJ_Vh -rw------- 1 cltbld cltbld 630 Apr 25 16:12 1366931564-0-30159Nj9SHg -rw------- 1 cltbld cltbld 463 Apr 25 16:12 1366931564-1-30159ez2Y5Q -rw------- 1 cltbld cltbld 648 Apr 25 16:13 1366931576-0-30172Lt9UUD -rw------- 1 cltbld cltbld 629 Apr 25 16:13 1366931598-0-301831g1Vyp -rw------- 1 cltbld cltbld 462 Apr 25 16:13 1366931598-1-30183WnEyal -rw------- 1 cltbld cltbld 705 Apr 25 16:13 1366931616-0-30199ghvL6j -rw------- 1 cltbld cltbld 656 Apr 25 16:14 1366931631-0-30204Pq396N -rw------- 1 cltbld cltbld 705 Apr 25 16:14 1366931647-0-30210QHyPcO -rw------- 1 cltbld cltbld 693 Apr 25 16:14 1366931667-0-30219NrDjZx -rw------- 1 cltbld cltbld 691 Apr 25 16:15 1366931682-0-302223wQ9Dp -rw------- 1 cltbld cltbld 709 Apr 25 16:15 1366931706-0-30232AgHCu1 -rw------- 1 cltbld cltbld 638 Apr 25 16:15 1366931723-0-30235_zRG50 -rw------- 1 cltbld cltbld 471 Apr 25 16:15 1366931723-1-30235OoxZdN -rw------- 1 cltbld cltbld 693 Apr 25 16:15 1366931737-0-302839sSoSg -rw------- 1 cltbld cltbld 703 Apr 25 16:16 1366931754-0-30289Gm2o0A -rw------- 1 cltbld cltbld 676 Apr 25 16:16 1366931775-0-30295_A5I3J -rw------- 1 cltbld cltbld 697 Apr 25 16:16 1366931789-0-30308jtUf70 -rw------- 1 cltbld cltbld 631 Apr 25 16:16 1366931803-0-30311W7sA9t -rw------- 1 cltbld cltbld 464 Apr 25 16:16 1366931803-1-30311iI87yH -rw------- 1 cltbld cltbld 695 Apr 25 16:17 1366931818-0-30324RmRDHx -rw------- 1 cltbld cltbld 623 Apr 25 16:17 1366931836-0-30327gxo2W1 -rw------- 1 cltbld cltbld 456 Apr 25 16:17 1366931836-1-303272IMPA4 -rw------- 1 cltbld cltbld 650 Apr 25 16:17 1366931848-0-30354mduv9V -rw------- 1 cltbld cltbld 624 Apr 25 16:17 1366931867-0-30362IMo8Ma -rw------- 1 cltbld cltbld 457 Apr 25 16:17 1366931867-1-30362Z07cdU -rw------- 1 cltbld cltbld 662 Apr 25 16:18 1366931879-0-303776JQoUB -rw------- 1 cltbld cltbld 658 Apr 25 16:18 1366931899-0-30418V1ppYT -rw------- 1 cltbld cltbld 631 Apr 25 16:18 1366931915-0-30424vTdxDp -rw------- 1 cltbld cltbld 464 Apr 25 16:18 1366931915-1-30424BAAn2T
Reporter | ||
Comment 6•11 years ago
|
||
Callek meant that we might not be processing the oldest jobs first, but I think it's just that the file is opened for writing in cur/ so the timestamp is off. Some things I noticed: * concurrency defaults to 1 in command_runner.py, we could try turning this up if we'd not be blocked on table locking in statusdb * we don't log how long the statusdb updates or pulse messaging take * on bm77, log_uploader.py takes 10-20 seconds to compress and upload a log to stage, no info on which is the slow part of that * stage doesn't look like the bottleneck, not many concurrent connections or high load * contacting stage from bm77 in use1 takes significantly longer: ssh to stage to run hostname takes 2s, vs 0.2s from bm12 in scl1. log uploading makes lots of these connections so this is a bit painful
Reporter | ||
Comment 7•11 years ago
|
||
Attachment #742151 -
Flags: review?(catlee)
Reporter | ||
Comment 8•11 years ago
|
||
Temporary for debugging.
Attachment #742153 -
Flags: review?(catlee)
Reporter | ||
Comment 9•11 years ago
|
||
Comment on attachment 742153 [details] [diff] [review] [buildbotcustom] More verbose logging in postrun r=Callek on irc https://hg.mozilla.org/build/buildbotcustom/rev/684d4241af5f + production-0.8 Updated masters (no reconfig needed).
Attachment #742153 -
Flags: review?(catlee) → review+
Reporter | ||
Comment 10•11 years ago
|
||
Comment on attachment 742151 [details] [diff] [review] [tools] Log total time for queue job r=Callek on irc. https://hg.mozilla.org/build/tools/rev/4580ea535a5d Updated the checkouts on the masters and restarted the command_runner service.
Attachment #742151 -
Flags: review?(catlee)
Attachment #742151 -
Flags: review+
Attachment #742151 -
Flags: checked-in+
Reporter | ||
Updated•11 years ago
|
Attachment #742153 -
Flags: checked-in+
Reporter | ||
Comment 11•11 years ago
|
||
(In reply to Nick Thomas [:nthomas] from comment #10) > Updated the checkouts on the masters and restarted the command_runner > service. I did this by being root, and going 'service command_runner restart'. Puppet has decided that it doesn't like the running service and does Apr 25 22:20:33 buildbot-master53 puppet-agent[27342]: (/Stage[main]/Buildmaster::Queue/Service[command_runner]/ensure) ensure changed 'stopped' to 'running' For extra fun it does that every 30 minutes, so something has confused it. This is on old and AWS masters, which is both puppet systems AIUI.
Reporter | ||
Comment 12•11 years ago
|
||
Fixed the nagios alerts about more than command_runner process. I suspect restart doesn't always leave the pid file in the right state for status to respond correctly, resulting in puppet starting it up again. If you kill a duplicate process nicely then the pid file will be wiped and status will say nothing is running; kill it with -9, or put back the pid file's contents, and you're OK. Anyway, to get back on track ... I've set up rsync's in screen from /dev/shm/queue/command/logs/ to ~cltbld/bug862174/ on buildbot-master13.build.scl1.mozilla.com (build, as a baseline) buildbot-master57.srv.releng.use1.mozilla.com (build) buildbot-master58.srv.releng.usw2.mozilla.com (build) buildbot-master51.srv.releng.use1.mozilla.com (tests1-linux) to give us some history to dig into. StatusDB logging comes from http://hg.mozilla.org/build/buildbotcustom/file/default/bin/postrun.py#l206
Comment 13•11 years ago
|
||
bm67 was alerting, so I hopped on and took a look $ cd /dev/shm/queue/commands/logs # Log uploading is first; how long does it take? $ grep -l 'post_upload' *.log | xargs grep Elapsed 1366731323-15948-300183dtwD0.log:Result: 0, Elapsed: 18.0 seconds 1366731323-15949-30018rOcuCD.log:Result: 0, Elapsed: 12.0 seconds 1366731323-15950-30018W5ZLTJ.log:Result: 0, Elapsed: 12.0 seconds 1366731323-15951-30018XORHax.log:Result: 0, Elapsed: 12.0 seconds 1366731323-15952-30018KKEYyb.log:Result: 0, Elapsed: 13.0 seconds 1366731323-15953-30018J453jf.log:Result: 0, Elapsed: 12.0 seconds 1366731323-15954-30018kMJJCl.log:Result: 0, Elapsed: 13.0 seconds # Updating db is next; how long does it take? $ grep -l 'adding to statusdb' *.log | xargs grep Elapsed 1367531149-0-16595j7a89X.log:Result: 0, Elapsed: 15.0 seconds 1367531865-0-17624gd4nDR.log:Result: 0, Elapsed: 14.0 seconds 1367531894-0-17645pU2wvP.log:Result: 0, Elapsed: 14.0 seconds 1367531926-0-17665UC9Z9J.log:Result: 0, Elapsed: 17.0 seconds 1367531939-0-176788GmZS4.log:Result: 0, Elapsed: 15.0 seconds 1367531965-0-17694UTCslb.log:Result: 0, Elapsed: 14.0 seconds 1367532010-0-1773198EUpO.log:Result: 0, Elapsed: 15.0 seconds 1367532148-0-17780jcuXJK.log:Result: 0, Elapsed: 15.0 seconds 1367532280-0-17819ugtNCc.log:Result: 0, Elapsed: 18.0 seconds 1367532292-0-17836IlbeTS.log:Result: 0, Elapsed: 22.0 seconds 1367532330-0-17859UKdA_B.log:Result: 0, Elapsed: 14.0 seconds # Pulse is last; how long does it take? $ grep -l pulse *.log | xargs grep Elapsed 1367531164-0-16608dh5q7Q.log:Result: 0, Elapsed: 3.0 seconds 1367531951-0-17691Eu8bpS.log:Result: 0, Elapsed: 3.0 seconds 1367531981-0-17710QwOQZy.log:Result: 0, Elapsed: 3.0 seconds 1367531995-0-17717jbU1p8.log:Result: 0, Elapsed: 3.0 seconds 1367532314-0-17856tLKl1u.log:Result: 0, Elapsed: 3.0 seconds 1367532343-0-17880RuydUs.log:Result: 0, Elapsed: 3.0 seconds So our bottlenecks here are log uploading and updating the db. It looks like the bulk of the log uploading time is spent transferring the data. From "Updating statusdb" to "searching for build" takes ~5s http://hg.mozilla.org/build/buildbotcustom/file/13967138af38/bin/postrun.py#l207 http://hg.mozilla.org/build/buildbotcustom/file/13967138af38/bin/postrun.py#l225 From "creating new build" to "committed" here takes ~11s for updating db http://hg.mozilla.org/build/buildbotcustom/file/13967138af38/bin/postrun.py#l234 http://hg.mozilla.org/build/buildbotcustom/file/13967138af38/bin/postrun.py#l241
Comment 14•11 years ago
|
||
Committed https://hg.mozilla.org/build/puppet/rev/5f497d58cf3c to run the command queue with -j4.
Comment 15•11 years ago
|
||
Haven't seen any of these since the changes landed on Tuesday.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
Comment 16•10 years ago
|
||
Found and stopped the rsync commands running from comment 12
Comment 17•10 years ago
|
||
deleted the backup directories
Assignee | ||
Updated•6 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•