Closed Bug 1158729 Opened 9 years ago Closed 8 years ago

manage_masters.py retry_dead_queue should run periodically

Categories

(Release Engineering :: General, defect)

Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: massimo, Unassigned)

Details

Attachments

(4 files, 19 obsolete files)

976 bytes, patch
kmoir
: review+
kmoir
: checked-in+
Details | Diff | Splinter Review
860 bytes, patch
kmoir
: review+
Details | Diff | Splinter Review
2.34 KB, patch
kmoir
: review+
kmoir
: checked-in+
Details | Diff | Splinter Review
533 bytes, patch
arich
: review+
Details | Diff | Splinter Review
This morning, the #buildduty channel had several notifications of this kind:

"buildbot-master116.bb.releng.usw2.mozilla.com:Command Queue is CRITICAL: 1 dead item" 

and since this happened during the weekend, we had "dead items" for days. The errors were cleared by a simple manual run of manage_masters.py retry_dead_queue.

I don't see any reason why retry_dead_queue shouldn't be executed periodically from a crontab.
Summary: manage_masters.py retry_dead_queue should run in periodically → manage_masters.py retry_dead_queue should run periodically
we could also adjust the command runner to retry for longer, with more backoff
Which machine could we use to set up a cronjob that runs "manage_masters.py" with the "retry_dead_queue" option periodically? Thanks.
Flags: needinfo?(rail)
We have (had?) a habit of running experimental stuff on cruncher, but I don't think this is a good idea - it's not properly puppetized. I'd probably use some of the relengAPI machines, but dustin or Callek may have better ideas.
Flags: needinfo?(rail) → needinfo?(bugspam.Callek)
(In reply to Rail Aliiev [:rail] from comment #3)
> We have (had?) a habit of running experimental stuff on cruncher, but I don't
> think this is a good idea - it's not properly puppetized. I'd probably use
> some of the relengAPI machines, but dustin or Callek may have better ideas.

I'm not a fan of using relengapi for this (at this time) - we don't have a need, flow-wise or credential-wise, for relengapi to connect directly via ssh to the masters, and I don't necessarily want to change that (it would need a stronger RRA/sec-audit, I think).

I also don't necessarily think this on cruncher would be a good idea.

Any reason we can't/shouldn't run this directly on the masters themselves as a cron job, similar to the reconfig script? With Papertrail we have a good way to bubble info out of the master itself to us, including via SNS if necessary.
Flags: needinfo?(bugspam.Callek)
VladC: if you look at puppet/modules/buildmaster/manifests/buildbot_master directory in http://hg.mozilla.org/build/puppet you should see some examples of how we configure cron jobs on the master.  Let me know if you have further questions :-)
Attached is the patch that adds a cron job on the buildmasters to run the clean command if the /dev/shm/queue/pulse/dead directory has dead jobs.
There were no dead jobs on buildbot-master01, so I only used echo to verify that the command works.
Did you verify that the resulting puppet changes generate the right crontab? (i.e. connect a test server to a test puppet instance and run the command) (note: don't test this on a production server)
For example, from what I saw on the following master, buildbot-master74.bb.releng.usw2.mozilla.com, the fabric.api module doesn't exist.
The error that I have received:
Traceback (most recent call last):
  File "/builds/buildbot/queue/tools/buildfarm/maintenance/manage_masters.py", line 6, in <module>
    from fabric.api import env
ImportError: No module named fabric.api
Hmm, so the fabric package isn't installed on the masters. It is installed on aws-manager2. So perhaps a better approach would be to set up a cronjob on the AWS manager instead of installing fabric on all the buildbot masters. Callek, what are your thoughts?
Flags: needinfo?(bugspam.Callek)
I think this should be a cron on all the masters, I don't think we should install fabric on the masters.

Specifically look at what http://mxr.mozilla.org/build/source/tools/lib/python/util/fabric/actions.py#286 does

(copy/pasting here more for posterity)

def action_retry_dead_queue(host):
    for q in 'commands', 'pulse':
        cmd = "find /dev/shm/queue/%s/dead -type f" % q
        for f in run(cmd).split("\n"):
            f = f.strip()
            if not f:
                continue
            with show('running'):
                if f.endswith(".log"):
                    run("rm %s" % f)
                else:
                    run("mv %s /dev/shm/queue/%s/new" % (f, q))

I think we can write a simple bash script to do this without python, or write a python script to do this without fabric; either way we don't need fabric, and this is something that can easily run on the master itself.
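A fabric-free version of the same logic could look roughly like this (a minimal sketch only: the function name, the optional base-directory argument, and the demo invocation are assumptions added here for testability, not the reviewed patch; the queue layout is taken from the fabric action above):

```shell
#!/bin/bash
# Sketch: requeue dead queue items and drop their stale .log files.
# The base-directory argument is hypothetical, defaulting to the
# /dev/shm/queue layout used by the fabric action.
retry_dead_queue() {
    local base="${1:-/dev/shm/queue}"
    local q f
    for q in commands pulse; do
        [ -d "$base/$q/dead" ] || continue
        find "$base/$q/dead" -type f | while read -r f; do
            case "$f" in
                *.log) rm -f "$f" ;;               # discard the old log
                *)     mv "$f" "$base/$q/new/" ;;  # requeue the dead item
            esac
        done
    done
}

retry_dead_queue "$@"
```

Run on a master with no arguments it would operate on /dev/shm/queue directly; pointing it at a scratch directory makes it easy to test without touching the real queues.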
Flags: needinfo?(bugspam.Callek)
Attached file retry_dead_queue.sh (obsolete) —
Attached you can find the bash script that moves jobs from the dead directory to the new directory.
Assignee: nobody → vlad.ciobancai
Comments on the script
You don't need to echo anything. It will just generate content in the logs when the puppet command runs, and we don't need that.

other than that it looks good

fyi, for the future you can ask for review or feedback on the patch by setting the feedback or review flag when you attach the patch

https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/How_to_Submit_a_Patch#Getting_the_patch_reviewed
Attached file retry_dead_queue.sh (obsolete) —
Updated the bash script by removing echo commands
Attachment #8656437 - Flags: review?(kmoir)
Comment on attachment 8656437 [details]
retry_dead_queue.sh

script looks good, but needs to be included in a puppet patch to be landed
Attachment #8656437 - Flags: feedback+
Regarding where to update puppet
If you look on a master, for instance buildbot-master81, there are cronjob entries here
cltbld@buildbot-master81.bb.releng.scl3.mozilla.com cron.d]$ pwd
/etc/cron.d
[cltbld@buildbot-master81.bb.releng.scl3.mozilla.com cron.d]$ ls
0hourly  bm81-build_scheduler  bm81-tests_scheduler  buildbot_db_maintenance  mig-agent  puppetcheck  raid-check  sa-update  sysstat


These files are created by puppet
For instance,
puppet/modules/buildmaster/manifests/db_maintenance.pp 

creates
buildbot_db_maintenance

which looks like this on the server
[root@buildbot-master81.bb.releng.scl3.mozilla.com cron.d]# cat buildbot_db_maintenance 
MAILTO=release@mozilla.com
@weekly cltbld /builds/buildbot/db_maint/bin/python /builds/buildbot/db_maint/tools/buildfarm/maintenance/cleanup_db.py --cutoff $(date -d 'last year' +\%Y-\%m-\%d) -l /builds/buildbot/db_maint/cleanup_db.log --config /builds/buildbot/db_maint/config.ini


So you need to add definitions in puppet to run your script.  I would suggest adding a new one instead of adding it to an existing cron definition
Once you have your puppet patch written, you can use bm69 to test. It is a master that is not currently in production. I brought up buildbot-master69 in the AWS console; I see from the logs that it has puppetized. It is a production master but is not enabled. You'll have to pin the puppet environment to your puppet staging env

like this
https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/HowTo/Set_up_a_user_environment#Pinning

However, since this machine is not a build slave like the example above, you'll have to define it as a master, or puppet will install the wrong packages when you run it

so 
node "buildbot-master69.bb.releng.use1.mozilla.com" {
    $node_security_level = 'high'
    buildmaster::buildbot_master::mozilla {
        "bm69-tests1-windows":
            http_port => 8201,
            master_type => "tests",
            basedir => "tests1-windows";
    }
    $pin_puppet_server = "releng-puppet2.build.scl1.mozilla.com"
    $pin_puppet_env = "vladciobancai"
    include toplevel::server::buildmaster::mozilla
}

Once you are done with it, we'll have to reimage the master and repuppetize it since it was used for testing
Attached patch mozilla.pp.patch (obsolete) — Splinter Review
Attached is the patch that I have created for buildmaster/manifests/buildbot_master/mozilla.pp
Attachment #8657045 - Flags: review?(kmoir)
Attachment #8657045 - Flags: review?(bugspam.Callek)
Attached patch mozilla.pp.patch (obsolete) — Splinter Review
Attached is the correct patch for buildmaster/manifests/buildbot_master/mozilla.pp. Sorry for the inconvenience.
Attachment #8657045 - Attachment is obsolete: true
Attachment #8657045 - Flags: review?(kmoir)
Attachment #8657045 - Flags: review?(bugspam.Callek)
Attachment #8657053 - Flags: review?(kmoir)
Attachment #8657053 - Flags: review?(bugspam.Callek)
Attached the template file buildmaster/templates/buildmaster-retry_dead_queue.erb
Attachment #8657054 - Flags: review?(kmoir)
Attached the bash script (reviewed by :kmoir): buildmaster/files/buildmaster-retry_dead_queue.sh
Attachment #8655994 - Attachment is obsolete: true
Attachment #8656437 - Attachment is obsolete: true
Attachment #8656437 - Flags: review?(kmoir)
I'll review the patches once they are tested and working on bm69.  It looks like Amy has been helping you fix up your testing environment, thanks Amy!
I have tested the changes on bm69 and they worked successfully; please review when you have time.
Comment on attachment 8657053 [details] [diff] [review]
mozilla.pp.patch

I think it would be better if the shell script (buildmaster-retry_dead_queue.sh) were stored in /usr/local/bin.

You can see this was done for 
/etc/cron.d/puppetcheck

We don't store shell scripts in /etc/cron.d; in that directory you just specify the cron definitions.

Also, the permissions on buildmaster/buildmaster-retry_dead_queue should be set to rw-r--r-- like the other files in this dir

I think you need require => Exec["setup-${basedir}"] for buildmaster-retry_dead_queue too

I won't review the other patches yet because they will require some refactoring after these changes.

Good progress so far!

To test make sure you remove the old files and then run the puppet command with the new patches.
Attachment #8657053 - Flags: review?(kmoir) → review-
Attached patch mozilla.pp.patch (obsolete) — Splinter Review
Updated the mozilla.pp file with the proposals provided by kmoir
Attachment #8657053 - Attachment is obsolete: true
Attachment #8657053 - Flags: review?(bugspam.Callek)
Attachment #8657118 - Flags: review?(kmoir)
Updated the template file with the proposals provided by kmoir
Attachment #8657054 - Attachment is obsolete: true
Attachment #8657054 - Flags: review?(kmoir)
Comment on attachment 8657118 [details] [diff] [review]
mozilla.pp.patch

This patch looks good.

However in buildmaster-retry_dead_queue
I don't think you need to set the MAILTO=release@mozilla.com, that is too much email
just set it to 
MAILTO=""

Did the cronjob actually run as expected?  (Change the time locally (not in puppet) to a current time to check)
Updated the template file with the proposals provided by kmoir
Attachment #8657120 - Attachment is obsolete: true
Updated the template file; I forgot to add the root user in the cronjob command
Attachment #8657139 - Attachment is obsolete: true
Applied the changes on bm69. Changed the cronjob schedule to run every 5 minutes; I checked the cron logs, and from what I see the cronjob runs successfully.
Attachment #8657118 - Flags: review?(kmoir) → review+
Comment on attachment 8654910 [details] [diff] [review]
clean_dead_jobs.patch

Review of attachment 8654910 [details] [diff] [review]:
-----------------------------------------------------------------

Added kmoir for review
Attachment #8654910 - Flags: review?(kmoir)
Comment on attachment 8657055 [details] [diff] [review]
buildmaster-retry_dead_queue.sh.patch

Review of attachment 8657055 [details] [diff] [review]:
-----------------------------------------------------------------

Added kmoir for review
Attachment #8657055 - Flags: review?(kmoir)
Comment on attachment 8657157 [details] [diff] [review]
template-buildmaster-retry_dead_queue.erb.patch

Review of attachment 8657157 [details] [diff] [review]:
-----------------------------------------------------------------

Added kmoir for review
Attachment #8657157 - Flags: review?(kmoir)
Comment on attachment 8654910 [details] [diff] [review]
clean_dead_jobs.patch

I thought this patch was not needed any more because we went with another approach, so I'm going to mark it obsolete.
Attachment #8654910 - Attachment is obsolete: true
Attachment #8654910 - Flags: review?(kmoir)
Attachment #8657055 - Flags: review?(kmoir) → review+
Attachment #8657157 - Flags: review?(kmoir) → review+
Attachment #8657055 - Flags: checked-in+
Attachment #8657118 - Flags: checked-in+
Attachment #8657157 - Flags: checked-in+
Puppet run is failing with this message, vlad could you please retest and submit an updated patch

Mon Sep 14 06:36:12 -0700 2015 Puppet (err): Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: File[/usr/local/bin/buildmaster-retry_dead_queue.sh] is already declared in file /etc/puppet/production/modules/buildmaster/manifests/buildbot_master/mozilla.pp:71; cannot redeclare at /etc/puppet/production/modules/buildmaster/manifests/buildbot_master/mozilla.pp:71 on node buildbot-master81.bb.releng.scl3.mozilla.com
Mon Sep 14 06:36:12 -0700 2015 Puppet (err): Could not retrieve catalog; skipping run
Flags: needinfo?(vlad.ciobancai)
This will be because bm81 (and only bm81 at present) installs two masters, since it is the scheduling master.
Attachment #8657055 - Flags: checked-in+ → checked-in-
Attachment #8657118 - Flags: checked-in+ → checked-in-
Attachment #8657157 - Flags: checked-in+ → checked-in-
I should have mentioned that this error occurred on bm81; bm81 is a scheduling master. So you might want to exclude this from running on scheduling masters, but run it on all the other masters.
Flags: needinfo?(vlad.ciobancai)
Attached patch mozilla.pp.patch (obsolete) — Splinter Review
Attachment #8657118 - Attachment is obsolete: true
Attachment #8660772 - Flags: review?(kmoir)
Comment on attachment 8660772 [details] [diff] [review]
mozilla.pp.patch

Should this be under the file { block starting on line 71, with the closing bracket } at line 79 moved to line 87?
Attached patch mozilla.pp.patchSplinter Review
Attachment #8660772 - Attachment is obsolete: true
Attachment #8660772 - Flags: review?(kmoir)
Attachment #8661095 - Flags: review?(kmoir)
Attachment #8661095 - Flags: checked-in+
Attachment #8661095 - Flags: review?(kmoir) → review+
Attachment #8657055 - Flags: checked-in- → checked-in+
Attachment #8657157 - Flags: checked-in- → checked-in+
I just saw this land. Instead of sending email, can we please send output to syslog and then we can set up alerts in papertrail if need be?
Updated the bash script
Attachment #8657055 - Attachment is obsolete: true
Attachment #8662339 - Flags: review?(kmoir)
Attachment #8662339 - Flags: review?(arich)
Updated the cronjob script to send the script's output to syslog
Attachment #8662339 - Attachment is obsolete: true
Attachment #8662339 - Flags: review?(kmoir)
Attachment #8662339 - Flags: review?(arich)
Attachment #8662340 - Flags: review?(kmoir)
Attachment #8662340 - Flags: review?(arich)
I tested the changes on bm69
Attachment #8662340 - Attachment is obsolete: true
Attachment #8662340 - Flags: review?(kmoir)
Attachment #8662340 - Flags: review?(arich)
Attachment #8662351 - Flags: review?(kmoir)
Attachment #8662351 - Flags: review?(arich)
Attachment #8657157 - Attachment is obsolete: true
Attachment #8662352 - Flags: review?(kmoir)
Attachment #8662352 - Flags: review?(arich)
Comment on attachment 8662351 [details] [diff] [review]
buildmaster-retry_dead_queue.sh.patch

Is that a spurious "\ No newline at end of file" actually in the file, or is it just the diff? If the former, be sure to remove it. If the latter, putting a newline at the end of the file doesn't hurt.
Attachment #8662351 - Flags: review?(arich) → review+
Comment on attachment 8662352 [details] [diff] [review]
buildmaster-retry_dead_queue.erb.patch

Same comment about the newline warning.
Attachment #8662352 - Flags: review?(arich) → review+
Comment on attachment 8662352 [details] [diff] [review]
buildmaster-retry_dead_queue.erb.patch

Oh, one thing that should change. You should be sure to redirect stdout and stderr to logger:

* */6 * * * root /usr/local/bin/buildmaster-retry_dead_queue.sh 2>&1 | logger -t retry_dead_queue
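The ordering in that cron line matters: `2>&1` must come before the pipe so stderr is merged into stdout and both streams reach logger. A quick way to sanity-check the redirection locally, substituting `cat` for `logger` so nothing is written to syslog (the sample strings here are arbitrary):

```shell
#!/bin/bash
# Verify that '2>&1 | ...' carries both stdout and stderr through the pipe.
# 'cat' stands in for 'logger -t retry_dead_queue'.
out=$( { echo "to stdout"; echo "to stderr" >&2; } 2>&1 | cat )
echo "$out"
```

Both lines appear in the pipe's output; without the `2>&1`, the stderr line would bypass the pipe and go straight to the terminal (or, under cron, into mail).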
Attachment #8662352 - Flags: review+ → review-
Updated the bash script by removing the "\ No newline at end of file"
Attachment #8662351 - Attachment is obsolete: true
Attachment #8662351 - Flags: review?(kmoir)
Attachment #8662360 - Flags: review?(kmoir)
Attachment #8662360 - Flags: review?(arich)
Updated the template by removing the "\ No newline at end of file"
Attachment #8662352 - Attachment is obsolete: true
Attachment #8662352 - Flags: review?(kmoir)
Attachment #8662361 - Flags: review?(kmoir)
Attachment #8662361 - Flags: review?(arich)
Attachment #8662360 - Flags: review?(arich) → review+
Attachment #8662361 - Flags: review?(arich) → review+
Attachment #8662360 - Flags: review?(kmoir) → review+
Attachment #8662361 - Flags: review?(kmoir) → review+
Attachment #8662360 - Flags: checked-in+
Comment on attachment 8662361 [details] [diff] [review]
buildmaster-retry_dead_queue.erb.patch

merged them to production too
Attachment #8662361 - Flags: checked-in+
Updated the bash script to search the log files for errors before they are deleted
Attachment #8662360 - Attachment is obsolete: true
Attachment #8668333 - Flags: review?(kmoir)
Attachment #8668333 - Flags: review?(kmoir) → review+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Hi Vlad, thanks for working on this. To clean up your development environment could you please
* unpin bm69 from your puppet env
* shutdown the machine
* ack the nagios alert for PING (which will appear shortly after step 2)


Any reason we don't run the cron hourly? Results won't appear on treeherder until they've done all the steps in the command queue.
Just an FYI: Vlad is currently away on PTO and will return Oct 21. bm69 should also be reimaged.
Updated cron template to run the script every hour
Attachment #8662361 - Attachment is obsolete: true
Attachment #8676777 - Flags: review?(kmoir)
(In reply to Nick Thomas [:nthomas] from comment #56)
> Hi Vlad, thanks for working on this. To clean up your development
> environment could you please
> * unpin bm69 from your puppet env
> * shutdown the machine
> * ack the nagios alert for PING (which will appear shortly after step 2)
> 
> 
> Any reason we don't run the cron hourly ? Results won't appear on treeherder
> until they've done all the steps in the command queue.

I have updated my puppet environment by removing the unnecessary lines from the bm69 node. I stopped the machine. I searched for how to re-image a buildmaster EC2 instance but was not able to find anything; can somebody show me how to do it?
Attachment #8676777 - Flags: review?(kmoir) → review+
I spoke with Amy on IRC and she provided the details for creating a new buildbot-master in AWS: https://wiki.mozilla.org/ReleaseEngineering/How_To/Setup_buildbot_masters_in_AWS#Puppet ; I terminated the old bm69 and created a new one.
Comment on attachment 8676777 [details] [diff] [review]
buildmaster-retry_dead_queue.erb.patch

checked in and merged to production
Attachment #8676777 - Flags: checked-in+
VladC: To reimage the machine, go into the aws console and terminate the bm69 instance

Then bring it up again in puppet using the commands here
https://wiki.mozilla.org/ReleaseEngineering/AWS_Master_Setup#Puppet

under the section #create the instance
You will use the existing ip that is defined in inventory for this machine
You will need to substitute the us-west-2 parts of the command with us-east-1 because bm69 is defined in us-east-1
bm69 should already be in inventory
you will need to do the "Add master's SSH key to known_hosts" part since there will be new SSH keys
The master won't go into production so I think you can just stop the instance after it is puppetized
Attachment #8676840 - Flags: review?(kmoir) → review+
Comment on attachment 8676840 [details] [diff] [review]
known_hosts.erb.patch

and merged to production
Attachment #8676840 - Flags: checked-in+
(In reply to Kim Moir [:kmoir] from comment #62)
> VladC: To reimage the machine, go into the aws console and terminate the
> bm69 instance
> 
> Then bring it up again in puppet using the commands here
> https://wiki.mozilla.org/ReleaseEngineering/AWS_Master_Setup#Puppet
> 
> under the section #create the instance
> You will use the existing ip that is defined in inventory for this machine
> You will need to substitute the us-west-2 parts of the command with
> us-east-1 because bm69 is defined in us-east-1
> bm69 should already be in inventory
> you will need to do this part since there will be new ssh keys "Add master's
> SSH key to known_hosts"
> The master won't go into production so I think you can just stop the
> instance after it is puppetized

:kmoir, Amy suggested on IRC that it would be easier if the instance were terminated and created only when it is needed; otherwise the machine will get outdated.
okay, that's a good idea, wasn't sure when it was going to be used again
I will terminate the instance but will not delete the fqdn from inventory, so the IP assigned to the fqdn can be reused in the future.
sounds good, thanks!
Updated the template so it works correctly
Attachment #8676777 - Attachment is obsolete: true
Attachment #8690792 - Flags: review?(arich)
Attachment #8690792 - Flags: review?(arich) → review+
(In reply to Vlad Ciobancai [:vladC] from comment #55)
> Created attachment 8668333 [details] [diff] [review]
> buildmaster-retry_dead_queue.sh.patch
> 
> Updated the bash script in order to search for errors in logs files before
> them to be deleted

@Kim, can you please push this change? It is not in production yet, and the update will provide more error information.
Status: RESOLVED → REOPENED
Flags: needinfo?(kmoir)
Resolution: FIXED → ---
I've landed the change.

However, in the future, you should be able to land all your own changes after they have been reviewed

Since you have newly minted level 2 commit rights in bug 1229897 :-)

We can go over this process in Orlando
Flags: needinfo?(kmoir)
Assignee: vciobancai → nobody
Status: REOPENED → RESOLVED
Closed: 9 years ago8 years ago
Resolution: --- → FIXED
Component: General Automation → General