Closed Bug 1103123 Opened 6 years ago Closed 5 years ago

Stop rebooting Linux slaves after buildbot jobs

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mrrrgn, Assigned: mrrrgn)

References

Details

Attachments

(16 files, 3 obsolete files)

42 bytes, text/plain
4.29 KB, patch (rail: review+, mrrrgn: checked-in+)
47 bytes, text/x-github-pull-request (rail: review+, mrrrgn: checked-in+)
2.75 KB, patch (rail: review+, dustin: review-, mrrrgn: checked-in+)
7.19 KB, patch (dustin: review+, mrrrgn: checked-in+)
2.19 KB, patch (mrrrgn: checked-in+)
1.95 KB, patch (coop: review+, mrrrgn: checked-in+)
1.54 KB, patch (rail: review+, mrrrgn: checked-in+)
4.45 KB, patch (rail: review+, mrrrgn: checked-in+)
47 bytes, text/x-github-pull-request (catlee: review+)
982 bytes, patch (rail: review+, mrrrgn: checked-in+)
1.70 KB, patch (rail: review+, mrrrgn: checked-in+)
1006 bytes, patch (catlee: review+, mrrrgn: checked-in+)
2.34 KB, patch (rail: review+, mrrrgn: checked-in+)
1.18 KB, patch (Callek: review+, mrrrgn: checked-in+)
1.28 KB, patch (philor: feedback+)
Two primary things need to happen:

- Modify the cleanup task so that it does not run before every job.
- Create a runner task which kills processes which were not present just after boot.
Blocks: 1028191
Attachment #8527733 - Flags: review?
Attachment #8527733 - Flags: feedback?(rail)
I built a tool called 'cleanslate' for killing all processes which weren't live just after system boot. This seems like a good start at cleaning up after jobs w/o rebooting.
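A minimal sketch of that approach, assuming the snapshot lives at /var/tmp/cleanslate (the marker file the later post-flight patch removes); the function bodies here are illustrative, not the actual tool:

    import os
    import signal

    SNAPSHOT = '/var/tmp/cleanslate'  # assumed snapshot/marker location

    def live_pids():
        # Every purely numeric entry under /proc is a running process.
        return set(int(p) for p in os.listdir('/proc') if p.isdigit())

    def snapshot():
        # Run once shortly after boot to record the baseline process set.
        with open(SNAPSHOT, 'w') as f:
            f.write('\n'.join(str(pid) for pid in sorted(live_pids())))

    def clean():
        # Run between jobs: terminate anything that was not alive at boot.
        # A real implementation would also spare its own PID and ancestors.
        with open(SNAPSHOT) as f:
            baseline = set(int(line) for line in f if line.strip())
        for pid in live_pids() - baseline:
            try:
                os.kill(pid, signal.SIGTERM)
            except OSError:
                pass  # already gone, or not ours to kill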
Deploy cleanslate to CentOS machines and create a cleanslate puppet module.
Attachment #8528563 - Flags: review?(rail)
Attached patch tools.clean_or_reboot.diff (obsolete) — Splinter Review
Attachment #8528565 - Flags: review?(catlee)
Depends on: 1074277
Comment on attachment 8528563 [details] [diff] [review]
cleanslate.puppet.diff

lgtm
Attachment #8528563 - Flags: review?(rail) → review+
Attachment #8528565 - Attachment is obsolete: true
Attachment #8528565 - Flags: review?(catlee)
Attachment #8528563 - Flags: checked-in+
Sorry to keep hounding you with reviews! >.<
Attachment #8529395 - Flags: review?(rail)
(In reply to Morgan Phillips [:mrrrgn] from comment #7)
> Created attachment 8529395 [details] [review]
> https://github.com/mozilla/build-runner/pull/16
> 
> Sorry to keep hounding you with reviews! >.<

No worries!
Attached patch runner.halt.diff — Splinter Review
This deploys the halt task to all runner platforms. It's turning out to be important in my local tests, since I've noticed some weird errors popping up after 4-5 runs in some cases.
Attachment #8529435 - Flags: review?(rail)
Attachment #8529435 - Flags: review?(rail) → review+
Attached patch idlelizer.diff — Splinter Review
Attachment #8530877 - Flags: review?(dustin)
Comment on attachment 8529435 [details] [diff] [review]
runner.halt.diff

Review of attachment 8529435 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/buildslave/manifests/startup/runner.pp
@@ +33,5 @@
>              }
>          }
>      }
>      include runner::tasks::buildbot
> +    include runner::tasks::halt

Is this class defined in a different, as-yet-unapplied, patch?
Attachment #8529435 - Flags: review-
Comment on attachment 8530877 [details] [diff] [review]
idlelizer.diff

Review of attachment 8530877 [details] [diff] [review]:
-----------------------------------------------------------------

::: slave/buildslave/idleizer.py
@@ +76,5 @@
> +            if err.check(AttributeError):
> +                log.msg("Master does not support slave initiated shutdown.  Upgrade master to 0.8.3 or later to use this feature.")
> +            else:
> +                log.msg('callRemote("shutdown") failed')
> +                log.err(err)

Our masters are running 0.8.2 .. does this even work?  If it does work (because we backported support, as I seem to recall we did), we could just remove this conditional; if it doesn't, then we have other problems!

@@ +100,5 @@
>              self._setTimer(self.max_disconnected_time)
>  
>      def registerActivity(self):
>          if self._state == CONNECTED:
> +            if sys.platform == 'linux2' and not self._remote_shutdown_requested:

Why just on Linuxes?  Is that temporary?
Attachment #8530877 - Flags: review?(dustin) → review+
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
> Comment on attachment 8530877 [details] [diff] [review]
> idlelizer.diff
> 
> Review of attachment 8530877 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: slave/buildslave/idleizer.py
> @@ +76,5 @@
> > +            if err.check(AttributeError):
> > +                log.msg("Master does not support slave initiated shutdown.  Upgrade master to 0.8.3 or later to use this feature.")
> > +            else:
> > +                log.msg('callRemote("shutdown") failed')
> > +                log.err(err)
> 
> Our masters are running 0.8.2 .. does this even work?  If it does work
> (because we backported support, as I seem to recall we did), we could just
> remove this conditional; if it doesn't, then we have other problems!
> 
> @@ +100,5 @@
> >              self._setTimer(self.max_disconnected_time)
> >  
> >      def registerActivity(self):
> >          if self._state == CONNECTED:
> > +            if sys.platform == 'linux2' and not self._remote_shutdown_requested:
> 
> Why just on Linuxes?  Is that temporary?

It's temporary. It will also work on OS X currently, but I didn't want to roll this out all at once.
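For illustration only (not part of the attached patch), widening that temporary gate later would just mean extending the platform check; a sketch with an assumed helper name:

    import sys

    # Python 2 reports 'linux2' on Linux; adding 'darwin' would enable OS X too.
    NO_REBOOT_PLATFORMS = ('linux2', 'darwin')

    def no_reboot_allowed(remote_shutdown_requested, platform=sys.platform):
        # Mirrors the idleizer condition: skip the reboot only on whitelisted
        # platforms, and only if the master hasn't already asked us to shut down.
        return platform in NO_REBOOT_PLATFORMS and not remote_shutdown_requested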
Attachment #8530916 - Flags: checked-in+
Attachment #8529435 - Flags: checked-in+
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
> Comment on attachment 8530877 [details] [diff] [review]
> idlelizer.diff
> 
> Review of attachment 8530877 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: slave/buildslave/idleizer.py
> @@ +76,5 @@
> > +            if err.check(AttributeError):
> > +                log.msg("Master does not support slave initiated shutdown.  Upgrade master to 0.8.3 or later to use this feature.")
> > +            else:
> > +                log.msg('callRemote("shutdown") failed')
> > +                log.err(err)
> 
> Our masters are running 0.8.2 .. does this even work?  If it does work
> (because we backported support, as I seem to recall we did), we could just
> remove this conditional; if it doesn't, then we have other problems!
> 
> @@ +100,5 @@
> >              self._setTimer(self.max_disconnected_time)
> >  
> >      def registerActivity(self):
> >          if self._state == CONNECTED:
> > +            if sys.platform == 'linux2' and not self._remote_shutdown_requested:
> 
> Why just on Linuxes?  Is that temporary?

I'll remove the unnecessary conditional in another commit.
Attachment #8529395 - Flags: checked-in+
Attachment #8530877 - Flags: checked-in+
Attachment #8531303 - Flags: review?(dustin)
Attachment #8531303 - Flags: review?(dustin) → review?(coop)
Attachment #8531303 - Flags: review?(coop) → review+
Attachment #8531303 - Flags: checked-in+
I accidentally submitted the 'update buildslave' patch with the wrong reviewer in the commit message. The reviewer was coop, not dustin.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #8533296 - Flags: review?(rail)
Attachment #8533296 - Flags: review?(rail) → review+
Attached patch postflight.diff (obsolete) — Splinter Review
This will allow us to turn no-reboots on and off selectively. In this version, only build machines will go into no-reboot mode.
Attachment #8533302 - Flags: review?(rail)
Comment on attachment 8533302 [details] [diff] [review]
postflight.diff

Review of attachment 8533302 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/runner/files/post_flight.py
@@ +16,5 @@
> +
> +
> +def get_hostname():
> +    sysname, nodename, release, version, machine = os.uname()
> +    return nodename.split('.')[0]

I think we use socket.gethostname() for this.

@@ +22,5 @@
> +
> +def get_recent_builds(slavename):
> +    url = '%s%s%s' % (BUILD_API_URL_BASE, slavename, BUILD_API_URL_SUFFIX)
> +    request = urllib2.Request(url)
> +    results = urllib2.urlopen(request)

Please set some reasonable timeout here. Otherwise this may hang forever.

@@ +23,5 @@
> +def get_recent_builds(slavename):
> +    url = '%s%s%s' % (BUILD_API_URL_BASE, slavename, BUILD_API_URL_SUFFIX)
> +    request = urllib2.Request(url)
> +    results = urllib2.urlopen(request)
> +    processed_results = json.loads(results.read())

This may raise at least ValueError and the script would exit(1) in this case. This means RETRY for runner. Not sure if this is what you want here.

@@ +37,5 @@
> +    rm_files = ['/var/tmp/cleanslate']
> +    for f in rm_files:
> +        if os.path.exists(f):
> +            print('removing: %s' % f)
> +            os.remove(f)

Possible race condition and uncaught exception (IOError and OSError IIRc). Can you wrap os.remove() with try/except and don't call os.path.exists?

@@ +45,5 @@
> +def get_hostname_blacklist():
> +    '''
> +    A list of hostname expressions which will coerce a halt.
> +    '''
> +    return ['^t.*']  # halt all test machines

this would also blacklist try-linux64. 
["^tst-.*", "^t-.*", "^talos-.*"] should match per http://hg.mozilla.org/build/buildbot-configs/file/f2255d82ce6d/mozilla-tests/production_config.py

@@ +56,5 @@
> +    '''
> +    for expression in blacklist:
> +        if re.match(expression, entry):
> +            return True
> +    return False

A nit, the same can be done as

return any(re.match(e, entry) for e in blacklist)

@@ +71,5 @@
> +        halt()
> +
> +    # stagger web requests in case of retries
> +    time.sleep(random.randint(1, 5))
> +    build_data = get_recent_builds(hostname)

I'm not sure how sleeping without retrying may help here... Also 5 seconds won't stop slaves from hammering if you add retrying.

@@ +75,5 @@
> +    build_data = get_recent_builds(hostname)
> +
> +    if not build_data:
> +        # The exception should coerce a retry
> +        raise Exception('Failed to find data for %s' % hostname)

This will retry the script, correct?
Attachment #8533302 - Flags: review?(rail) → review-
(In reply to Rail Aliiev [:rail] from comment #21)
> Comment on attachment 8533302 [details] [diff] [review]
> postflight.diff
> 
> Review of attachment 8533302 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: modules/runner/files/post_flight.py
> @@ +16,5 @@
> > +
> > +
> > +def get_hostname():
> > +    sysname, nodename, release, version, machine = os.uname()
> > +    return nodename.split('.')[0]
> 
> I think we use socket.gethostname() for this.
> 
> @@ +22,5 @@
> > +
> > +def get_recent_builds(slavename):
> > +    url = '%s%s%s' % (BUILD_API_URL_BASE, slavename, BUILD_API_URL_SUFFIX)
> > +    request = urllib2.Request(url)
> > +    results = urllib2.urlopen(request)
> 
> Please set some reasonable timeout here. Otherwise this may hang forever.
> 
> @@ +23,5 @@
> > +def get_recent_builds(slavename):
> > +    url = '%s%s%s' % (BUILD_API_URL_BASE, slavename, BUILD_API_URL_SUFFIX)
> > +    request = urllib2.Request(url)
> > +    results = urllib2.urlopen(request)
> > +    processed_results = json.loads(results.read())
> 
> This may raise at least ValueError and the script would exit(1) in this
> case. This means RETRY for runner. Not sure if this is what you want here.
> 
> @@ +37,5 @@
> > +    rm_files = ['/var/tmp/cleanslate']
> > +    for f in rm_files:
> > +        if os.path.exists(f):
> > +            print('removing: %s' % f)
> > +            os.remove(f)
> 
> Possible race condition and uncaught exception (IOError and OSError IIRc).
> Can you wrap os.remove() with try/except and don't call os.path.exists?
> 
> @@ +45,5 @@
> > +def get_hostname_blacklist():
> > +    '''
> > +    A list of hostname expressions which will coerce a halt.
> > +    '''
> > +    return ['^t.*']  # halt all test machines
> 
> this would also blacklist try-linux64. 
> ["^tst-.*", "^t-.*", "^talos-.*"] should match per
> http://hg.mozilla.org/build/buildbot-configs/file/f2255d82ce6d/mozilla-tests/
> production_config.py
> 
> @@ +56,5 @@
> > +    '''
> > +    for expression in blacklist:
> > +        if re.match(expression, entry):
> > +            return True
> > +    return False
> 
> A nit, the same can be done as
> 
> return any(re.match(e, entry) for e in blacklist)
> 
> @@ +71,5 @@
> > +        halt()
> > +
> > +    # stagger web requests in case of retries
> > +    time.sleep(random.randint(1, 5))
> > +    build_data = get_recent_builds(hostname)
> 
> I'm not sure how sleeping without retrying may help here... Also 5 seconds
> won't stop slaves from hammering if you add retrying.
> 
> @@ +75,5 @@
> > +    build_data = get_recent_builds(hostname)
> > +
> > +    if not build_data:
> > +        # The exception should coerce a retry
> > +        raise Exception('Failed to find data for %s' % hostname)
> 
> This will retry the script, correct?

It will.

Sleeping was a silly idea; there's now a timeout of 12 seconds for open sockets. I addressed the other changes as well.

Any failure outside of calling halt() should result in a retry, so leaving most of the exceptions unhandled is fine. Problems will show up in the logs and hit max retries if they're not something intermittent (like a single timeout).
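Roughly, the addressed points come down to this (Python 2 sketch; the names follow the patch, but these bodies are illustrative rather than the landed code, and 12 seconds is the timeout mentioned above):

    import errno
    import json
    import os
    import re
    import urllib2

    SOCKET_TIMEOUT = 12  # seconds

    def get_recent_builds(url):
        # An explicit timeout keeps a wedged buildapi from hanging the task forever;
        # a ValueError from bad JSON simply propagates, so runner retries the task.
        results = urllib2.urlopen(urllib2.Request(url), timeout=SOCKET_TIMEOUT)
        return json.loads(results.read())

    def remove_state_files(paths=('/var/tmp/cleanslate',)):
        # EAFP: remove and ignore "no such file" instead of racing exists()/remove().
        for path in paths:
            try:
                os.remove(path)
            except OSError as e:
                if e.errno != errno.ENOENT:
                    raise

    def matches_blacklist(entry, blacklist=('^tst-.*', '^t-.*', '^talos-.*')):
        # Patterns from the review; a match forces a reboot.
        return any(re.match(expr, entry) for expr in blacklist)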
"defensive programming" / "let-it-crash"
 ++!
(In reply to Dustin J. Mitchell [:dustin] from comment #23)
> "defensive programming" / "let-it-crash"
>  ++!

Defensive programming is always good; but in this case let it crash makes the most since. The idea is that, if anything seems even slightly awry we should consider forcing a reboot. Letting exceptions raise here will trigger a retry, then, eventually a reboot thanks to the fact that it's a process being managed by runner. The only cases worth handling are cases where we want to reboot immediately instead of retrying.
(In reply to Dustin J. Mitchell [:dustin] from comment #23)
> "defensive programming" / "let-it-crash"
>  ++!

*sense (& any other grammer derps) :( :( :(
Attached patch postflight.diff — Splinter Review
Attachment #8533302 - Attachment is obsolete: true
Attachment #8533386 - Flags: review?(rail)
Comment on attachment 8533386 [details] [diff] [review]
postflight.diff


                     ,@;@,
        ,@;@;@;@;@;@/ )@;@;
      ,;@;@;@;@;@;@|_/@' e\
     (|@;@:@\@;@;@;@:@(    \
       '@;@;@|@;@;@;@;'`"--'
        '@;@;/;@;/;@;' 
         ) //   | ||
         \ \\   | ||
          \ \\  ) \\  it!
           `"`  `"``
Attachment #8533386 - Flags: review?(rail) → review+
Attachment #8534378 - Flags: review?(rail)
Attachment #8534378 - Flags: review?(rail) → review+
Attachment #8534397 - Flags: review?(rail)
Attachment #8534397 - Flags: review?(rail) → review+
This will stop rebooting talos machines (should be applied on a Monday)
Attachment #8535844 - Flags: review?(catlee)
Attachment #8533296 - Flags: checked-in+
Attachment #8533386 - Flags: checked-in+
Comment on attachment 8535844 [details] [diff] [review]
runner.noreboot.talos.diff

Review of attachment 8535844 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm, as long as joel is ok with it!
Attachment #8535844 - Flags: review?(catlee) → review+
Comment on attachment 8533811 [details] [review]
https://github.com/mozilla/build-runner/pull/19

merged on github a while ago
Attachment #8533811 - Flags: review?(catlee) → review+
Attachment #8534378 - Flags: checked-in+
Attachment #8534397 - Flags: checked-in+
Attachment #8535844 - Flags: checked-in+
Attached patch runner.service_restart.diff (obsolete) — Splinter Review
Attachment #8538746 - Flags: review?(rail)
Comment on attachment 8538746 [details] [diff] [review]
runner.service_restart.diff

Review of attachment 8538746 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/runner/templates/tasks/restart_services.erb
@@ +4,5 @@
> +
> +RESTART_SERVICES="<%= @service_names %>"
> +
> +for service_name in ${RESTART_SERVICES}; do
> +  if [[ $(service ${service_name} status) != '' ]]; then

Can you check the exit code instead? The approach above may not work on Ubuntu since the command always prints something to stdout.
Attachment #8538746 - Flags: review?(rail) → review-
Attachment #8538746 - Attachment is obsolete: true
Attachment #8539342 - Flags: review?(rail)
(In reply to Rail Aliiev [:rail] from comment #35)
> Comment on attachment 8538746 [details] [diff] [review]
> runner.service_restart.diff
> 
> Review of attachment 8538746 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: modules/runner/templates/tasks/restart_services.erb
> @@ +4,5 @@
> > +
> > +RESTART_SERVICES="<%= @service_names %>"
> > +
> > +for service_name in ${RESTART_SERVICES}; do
> > +  if [[ $(service ${service_name} status) != '' ]]; then
> 
> Can you check the exit code instead? The approach above may not work on
> Ubuntu since the command always prints something to stdout.

Good point, I swapped it out with something more sensible. Thanks! :)
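The landed fix is in the shell/ERB template, but the principle (trust the exit status rather than stdout, since Ubuntu's 'service <name> status' always prints something) can be sketched in Python like so; purely illustrative, not the actual task:

    import os
    import subprocess

    def service_running(service_name):
        # Check 'service <name> status' by exit code instead of parsing stdout,
        # whose contents differ between CentOS and Ubuntu.
        with open(os.devnull, 'w') as devnull:
            return subprocess.call(['service', service_name, 'status'],
                                   stdout=devnull, stderr=devnull) == 0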
Attachment #8539342 - Flags: review?(rail) → review+
Assignee: nobody → winter2718
Attachment #8539342 - Flags: checked-in+
So, now with service restarts in place, I'm going to stop reboots on a large swath of testers before proceeding. Once the 't-' and 'tst-' machines are removed from the post_flight hostname blacklist, this task will be _finished_! \m/
To turn off tester rebooting again.
Attachment #8542648 - Flags: review?(bugspam.Callek)
Comment on attachment 8542648 [details] [diff] [review]
test-noreboot.diff

Review of attachment 8542648 [details] [diff] [review]:
-----------------------------------------------------------------

lets see what burns
Attachment #8542648 - Flags: review?(bugspam.Callek) → review+
Good luck telling what burns - between bug 989048 and bug 1109932 exploding and the bizarre "anything can fail and probably will" bug 1114541 and the month-long rash of bug 1073442 consistently happening once a slave started hitting it until the instance was terminated, which seems to have suddenly stopped at 7am Monday, I had myself convinced that we already were doing no-reboot on tst-linux*, and that it was the reason why absolutely no result on those slaves could be believed. So if this does wind up having any effect on the running of tests other than complete and permanent failure, you absolutely will not be able to tell that from sheriff reports, because we currently can't tell the difference between tst-linux* results and random() results.
Attachment #8542648 - Flags: checked-in+
Philor, I have this patch ready and am watching the trees. I added you for info just to call your attention to the way that reboots are working now. The buildbot logs say "reboot skipped" but that's a result of the configs not being changed to reflect our current situation.

Now, after every job, several checks are performed to decide whether a machine should be rebooted. If its hostname matches a blacklist regex, it will always be rebooted. An overall view of how many machines are rebooting at any time can be found here: https://stats.taskcluster.net/grafana/#/dashboard/db/runner
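For reference, the per-job decision reduces to something like this (helper names, the blacklist contents, and the halt mechanism are assumptions, not the exact landed script):

    import re
    import socket
    import subprocess
    import sys

    BLACKLIST = ('^tst-.*', '^t-.*', '^talos-.*')  # blacklisted hosts always reboot

    def post_flight(build_data):
        # build_data is whatever the buildapi query returned for this slave.
        hostname = socket.gethostname().split('.')[0]

        if any(re.match(expr, hostname) for expr in BLACKLIST):
            subprocess.call(['sudo', 'halt'])  # assumed halt mechanism
            return

        if not build_data:
            # Let it crash: a non-zero exit makes runner retry, and repeated
            # failures end in a reboot anyway.
            raise Exception('Failed to find data for %s' % hostname)

        sys.exit(0)  # looks healthy: skip the reboot and take another job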
Attachment #8542746 - Flags: feedback?(philringnalda)
Comment on attachment 8542746 [details] [diff] [review]
enable-test-reboots.diff

Got it, thanks.

(Other than an inability to see anything on taskcluster, since "login with your @mozilla.com address" would require that I first quit my current job, and then get hired by MoCo.)
Attachment #8542746 - Flags: feedback?(philringnalda) → feedback+
Status: REOPENED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
QA Contact: pmoore → mshal