Runner dies before starting buildbot due to hg failures and leaves slave doing nothing

RESOLVED FIXED

Status

Release Engineering
General
RESOLVED FIXED
3 years ago
17 days ago

People

(Reporter: pmoore, Assigned: rail)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(6 attachments, 1 obsolete attachment)

(Reporter)

Description

3 years ago
Created attachment 8472937 [details]
runner.log

Today :arr found a bunch of slaves reported by nagios not to have a running buildbot. When I investigated, the first host I saw (b-linux64-hp-0004.try.releng.scl3.mozilla.com) had no buildbot running, due to a Runner failure. I've attached the /var/log/runner.log from this machine.

As you see, when it fails to update the tools checkout, it deletes it, in the hope of doing a fresh clone. When the fresh clone of tools repo also fails, it ends up without a tools repo at all.

This then causes the call to purge_builds.py to fail, since this script is run from the tools checkout.

There are a couple of issues here:
1) hgtools probably shouldn't actively delete a repo, just because it can't update (maybe clone somewhere else, and then move over in place)
2) we should probably gracefully accept it, if we can't update the hg repos when runner starts, and use dated checkout, until at least buildbot is on its feet
3) we didn't have any notifications in #buildduty about problems - probably we should have some runner monitoring
4) there is nothing "on the box" keeping runner alive. If it dies, it dies. Maybe we should have a cron to start it, rather than /etc/init.d/runner - and then start it every minute, and use a lock file - or have a flag on the filesystem to say if we don't want to run it. Something like this might not be so pretty, but ensures it doesn't die, and then wait for half a day to get rebooted by slaverebooter, because it wasn't taking jobs.
(Reporter)

Comment 1

3 years ago
So tackling the job of getting these machines up again, things are not off to a good start .....

Not sure (yet) what is causing this. Just tried to run runner (after checking it wasn't running, and nor was buildbot):

[root@b-linux64-hp-0004.try.releng.scl3.mozilla.com ~]# /etc/init.d/runner start
Starting runner
failed
[root@b-linux64-hp-0004.try.releng.scl3.mozilla.com ~]# cat /var/log/runner.log
2014-08-14 04:39:11,290 - INFO - iteration 1
Reporting hg version in use
command: START
command: hg -q version
command: cwd: .
command: output:
command: END (0.02s elapsed)

Traceback (most recent call last):
  File "/usr/local/bin/hgtool.py", line 104, in <module>
    autoPurge=options.auto_purge)
  File "<string>", line 358, in mercurial
  File "<string>", line 47, in run_cmd
  File "/usr/lib64/python2.6/subprocess.py", line 497, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 478, in call
    p = Popen(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 639, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1228, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied: '.'
Reporting hg version in use
command: START
command: hg -q version
command: cwd: .
command: output:
command: END (0.00s elapsed)

Traceback (most recent call last):
  File "/usr/local/bin/hgtool.py", line 104, in <module>
    autoPurge=options.auto_purge)
  File "<string>", line 358, in mercurial
  File "<string>", line 47, in run_cmd
  File "/usr/lib64/python2.6/subprocess.py", line 497, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 478, in call
    p = Popen(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 639, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1228, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied: '.'
Reporting hg version in use
command: START
command: hg -q version
command: cwd: .
command: output:
command: END (0.00s elapsed)

Traceback (most recent call last):
  File "/usr/local/bin/hgtool.py", line 104, in <module>
    autoPurge=options.auto_purge)
  File "<string>", line 358, in mercurial
  File "<string>", line 47, in run_cmd
  File "/usr/lib64/python2.6/subprocess.py", line 497, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 478, in call
    p = Popen(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 639, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1228, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied: '.'
Reporting hg version in use
command: START
command: hg -q version
command: cwd: .
command: output:
command: END (0.00s elapsed)

Traceback (most recent call last):
  File "/usr/local/bin/hgtool.py", line 104, in <module>
    autoPurge=options.auto_purge)
  File "<string>", line 358, in mercurial
  File "<string>", line 47, in run_cmd
  File "/usr/lib64/python2.6/subprocess.py", line 497, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 478, in call
    p = Popen(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 639, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1228, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied: '.'
Reporting hg version in use
command: START
command: hg -q version
command: cwd: .
command: output:
command: END (0.00s elapsed)

Traceback (most recent call last):
  File "/usr/local/bin/hgtool.py", line 104, in <module>
    autoPurge=options.auto_purge)
  File "<string>", line 358, in mercurial
  File "<string>", line 47, in run_cmd
  File "/usr/lib64/python2.6/subprocess.py", line 497, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 478, in call
    p = Popen(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 639, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1228, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied: '.'
Reporting hg version in use
command: START
command: hg -q version
command: cwd: .
command: output:
command: END (0.00s elapsed)

Traceback (most recent call last):
  File "/usr/local/bin/hgtool.py", line 104, in <module>
    autoPurge=options.auto_purge)
  File "<string>", line 358, in mercurial
  File "<string>", line 47, in run_cmd
  File "/usr/lib64/python2.6/subprocess.py", line 497, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 478, in call
    p = Popen(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 639, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1228, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied: '.'
Reporting hg version in use
command: START
command: hg -q version
command: cwd: .
command: output:
command: END (0.00s elapsed)

Traceback (most recent call last):
  File "/usr/local/bin/hgtool.py", line 104, in <module>
    autoPurge=options.auto_purge)
  File "<string>", line 358, in mercurial
  File "<string>", line 47, in run_cmd
  File "/usr/lib64/python2.6/subprocess.py", line 497, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 478, in call
    p = Popen(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 639, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1228, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied: '.'
Reporting hg version in use
command: START
command: hg -q version
command: cwd: .
command: output:
command: END (0.00s elapsed)

Traceback (most recent call last):
  File "/usr/local/bin/hgtool.py", line 104, in <module>
    autoPurge=options.auto_purge)
  File "<string>", line 358, in mercurial
  File "<string>", line 47, in run_cmd
  File "/usr/lib64/python2.6/subprocess.py", line 497, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 478, in call
    p = Popen(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 639, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1228, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied: '.'
Reporting hg version in use
command: START
command: hg -q version
command: cwd: .
command: output:
command: END (0.00s elapsed)

Traceback (most recent call last):
  File "/usr/local/bin/hgtool.py", line 104, in <module>
    autoPurge=options.auto_purge)
  File "<string>", line 358, in mercurial
  File "<string>", line 47, in run_cmd
  File "/usr/lib64/python2.6/subprocess.py", line 497, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 478, in call
    p = Popen(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 639, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1228, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied: '.'
Reporting hg version in use
command: START
command: hg -q version
command: cwd: .
command: output:
command: END (0.00s elapsed)

Traceback (most recent call last):
  File "/usr/local/bin/hgtool.py", line 104, in <module>
    autoPurge=options.auto_purge)
  File "<string>", line 358, in mercurial
  File "<string>", line 47, in run_cmd
  File "/usr/lib64/python2.6/subprocess.py", line 497, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 478, in call
    p = Popen(*popenargs, **kwargs)
  File "/usr/lib64/python2.6/subprocess.py", line 639, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1228, in _execute_child
    raise child_exception
OSError: [Errno 13] Permission denied: '.'
2014-08-14 04:39:20,308 - WARNING - maximum attempts reached
[root@b-linux64-hp-0004.try.releng.scl3.mozilla.com ~]# hg -q version
Mercurial Distributed SCM (version 2.5.4)
[root@b-linux64-hp-0004.try.releng.scl3.mozilla.com ~]# /etc/init.d/runner start
Starting runner
failed
[root@b-linux64-hp-0004.try.releng.scl3.mozilla.com ~]#
(Reporter)

Comment 2

3 years ago
Created attachment 8472946 [details]
b-linux64-hp-0006.try.releng.scl3.mozilla.com_runner.log
(Reporter)

Comment 3

3 years ago
Created attachment 8472947 [details]
b-linux64-hp-0025.build.releng.scl3.mozilla.com_runner.log

More of the same...
(Assignee)

Comment 4

3 years ago
I think this is why we've been seeing more reboots by aws_stop_idle.py these days. I was wondering what's going on, because the instances looked fine except not running buildbot.


runner can:
* reboot (ooh!)
* retry forever
* report about the failure (email?)
* ignore some errors and start buildbot

I think ignoring errors shouldn't be critical and not hard to implement.
(Assignee)

Updated

3 years ago
Severity: normal → major
(Assignee)

Comment 5

3 years ago
Created attachment 8472987 [details] [diff] [review]
runner.diff

AFAIK, runner.py checks the exit code of the scripts. Setting not critical ones to zero should help in the short term.
Attachment #8472987 - Flags: review?(iconnolly)
(Assignee)

Comment 6

3 years ago
Also, IMO the scripts should explicitly set their exit codes since "exit 2" (which can be set by some programs run inside the scripts implicitly) means HALT.
Comment on attachment 8472987 [details] [diff] [review]
runner.diff

Yep, this is fine!

You're right re: error codes, how to `write` a runner task is something that'll be part of my Black Box session on Friday.
Attachment #8472987 - Flags: review?(iconnolly) → review+
(Assignee)

Comment 8

3 years ago
Comment on attachment 8472987 [details] [diff] [review]
runner.diff

remote:   https://hg.mozilla.org/build/puppet/rev/6fa680c8ef2a
remote:   https://hg.mozilla.org/build/puppet/rev/03b2a5c6fe7f
Attachment #8472987 - Flags: checked-in+
(Reporter)

Comment 9

3 years ago
Please note that hgtool itself is deleting the tools repo when it can't update it, thinking that it can then clone it - which then failed afterwards - so we have a situation where the tools repo gets completely deleted. It might be that the tools repo is only needed for running purge_builds.py (?) but if it is needed for other things, it is now gone. Therefore maybe part of the fix for this should be not to allow hgtool to completely remove a checkout before cloning it to get it fresh - instead, cloning it, and only if the clone is successful, replacing the old version. If the clone is not successful, it should probably not remove the old checkout. Does that make sense?
bug 1047207 for hgtool being a bad^W suboptimal boy. The tools repo is also used for rebooting, which will become moot once we get all the way to runner, but needed in the meantime.
(Assignee)

Comment 11

3 years ago
aws_stop_idle.py looks much better today. The same applies to aws_sanity_checker emails.
Severity: major → normal
(Assignee)

Comment 12

3 years ago
I think the reported issue is resolved now.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(Assignee)

Comment 13

3 years ago
Need to make the script fail to make sure that the repos exist.
Assignee: nobody → rail
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 14

3 years ago
Created attachment 8480013 [details] [diff] [review]
runner-set-e.diff
Attachment #8480013 - Flags: review?(catlee)
Comment on attachment 8480013 [details] [diff] [review]
runner-set-e.diff

Review of attachment 8480013 [details] [diff] [review]:
-----------------------------------------------------------------

Does this address the permission denied problem?
Attachment #8480013 - Flags: review?(catlee) → review+
(Assignee)

Comment 16

3 years ago
(In reply to Chris AtLee [:catlee] from comment #15)
> Comment on attachment 8480013 [details] [diff] [review]
> runner-set-e.diff
> 
> Review of attachment 8480013 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Does this address the permission denied problem?

I don't think so, it should just exit non zero if something goes wrong inside the scripts.
(Assignee)

Comment 17

3 years ago
Comment on attachment 8480013 [details] [diff] [review]
runner-set-e.diff

remote:   https://hg.mozilla.org/build/puppet/rev/5aa298f2a2cb
remote:   https://hg.mozilla.org/build/puppet/rev/a9cbb7b70606
Attachment #8480013 - Flags: checked-in+
(Assignee)

Comment 18

3 years ago
cltbld@bld-linux64-spot-026.build.releng.use1.mozilla.com

from runner.log:

Updating hg repos in /builds/hg-shared
hg pull -R /builds/hg-shared/mozilla-central 
hg pull -R /builds/hg-shared/integration/b2g-inbound 
hg pull -R /builds/hg-shared/integration/mozilla-inbound 
hg pull -R /builds/hg-shared/build/mozharness 
abort: repository default not found!
xargs: hg: exited with status 255; aborting
warning: hg.mozilla.org certificate with fingerprint af:27:b9:34:47:4e:e5:98:01:f6:83:2b:51:c9:aa:d8:df:fb:1a:27 not verified (check hostfingerprints or web.cacerts config setting)
warning: hg.mozilla.org certificate with fingerprint af:27:b9:34:47:4e:e5:98:01:f6:83:2b:51:c9:aa:d8:df:fb:1a:27 not verified (check hostfingerprints or web.cacerts config setting)
warning: hg.mozilla.org certificate with fingerprint af:27:b9:34:47:4e:e5:98:01:f6:83:2b:51:c9:aa:d8:df:fb:1a:27 not verified (check hostfingerprints or web.cacerts config setting)
pulling from https://hg.mozilla.org/build/mozharness
searching for changes
no changes found
pulling from https://hg.mozilla.org/integration/mozilla-inbound
searching for changes
no changes found
pulling from https://hg.mozilla.org/mozilla-central
searching for changes
no changes found
Reporting hg version in use
command: START
command: hg -q version
command: cwd: .
command: output:
Mercurial Distributed SCM (version 2.5.4)
command: END (0.17s elapsed)

checking the repos:
 $ cd /builds/hg-shared/integration/b2g-inbound
 $ hg pull
 abort: repository default not found!
 $ echo $?
 255
 $ cat .hg/hgrc
 cat: .hg/hgrc: No such file or directory
(Assignee)

Comment 19

3 years ago
rm -rf /builds/hg-shared/integration/b2g-inbound on that machine lets buildbot start
(Assignee)

Comment 20

3 years ago
Another one with the same repo borken:

[cltbld@bld-linux64-spot-382.build.releng.usw2.mozilla.com tasks.d]$ find /builds/hg-shared -type d -name .hg -not -wholename '*l10n*' -not -wholename '*try*' -prune                                               
/builds/hg-shared/mozilla-central/.hg
/builds/hg-shared/integration/b2g-inbound/.hg
/builds/hg-shared/integration/mozilla-inbound/.hg
/builds/hg-shared/build/mozharness/.hg
/builds/hg-shared/build/tools/.hg

[cltbld@bld-linux64-spot-382.build.releng.usw2.mozilla.com tasks.d]$ find /builds/hg-shared -type f -wholename '*/.hg/hgrc' -not -wholename '*l10n*' -not -wholename '*try*' -prune                                 
/builds/hg-shared/mozilla-central/.hg/hgrc
/builds/hg-shared/integration/mozilla-inbound/.hg/hgrc
/builds/hg-shared/build/mozharness/.hg/hgrc
/builds/hg-shared/build/tools/.hg/hgrc
(Assignee)

Comment 21

3 years ago
Created attachment 8481378 [details] [diff] [review]
skip_borken_repos.diff
Attachment #8481378 - Flags: review?(catlee)
(Assignee)

Comment 22

3 years ago
Comment on attachment 8481378 [details] [diff] [review]
skip_borken_repos.diff

xargs -P makes it ||, need to fix it
Attachment #8481378 - Attachment is obsolete: true
Attachment #8481378 - Flags: review?(catlee)
(Assignee)

Comment 23

3 years ago
Created attachment 8481462 [details] [diff] [review]
skip_borken_repos.diff
Attachment #8481462 - Flags: review?(catlee)

Updated

3 years ago
Attachment #8481462 - Flags: review?(catlee) → review+
Duplicate of this bug: 1059950
(Assignee)

Comment 25

3 years ago
Comment on attachment 8481462 [details] [diff] [review]
skip_borken_repos.diff

remote:   https://hg.mozilla.org/build/puppet/rev/838747a6b699
remote:   https://hg.mozilla.org/build/puppet/rev/0444bd46fc4b
Attachment #8481462 - Flags: checked-in+
(Assignee)

Comment 26

3 years ago
No reboots since it's landed. \o/
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago3 years ago
Resolution: --- → FIXED
Component: Tools → General
Product: Release Engineering → Release Engineering
You need to log in before you can comment on or make changes to this bug.