Closed Bug 1103123 Opened 6 years ago Closed 5 years ago

Stop rebooting Linux slaves after buildbot jobs

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mrrrgn, Assigned: mrrrgn)

References

Details

Attachments

(16 files, 3 obsolete files)

42 bytes, text/plain
4.29 KB, patch (rail: review+, mrrrgn: checked-in+)
47 bytes, text/x-github-pull-request (rail: review+, mrrrgn: checked-in+)
2.75 KB, patch (rail: review+, dustin: review-, mrrrgn: checked-in+)
7.19 KB, patch (dustin: review+, mrrrgn: checked-in+)
2.19 KB, patch (mrrrgn: checked-in+)
1.95 KB, patch (coop: review+, mrrrgn: checked-in+)
1.54 KB, patch (rail: review+, mrrrgn: checked-in+)
4.45 KB, patch (rail: review+, mrrrgn: checked-in+)
47 bytes, text/x-github-pull-request (catlee: review+)
982 bytes, patch (rail: review+, mrrrgn: checked-in+)
1.70 KB, patch (rail: review+, mrrrgn: checked-in+)
1006 bytes, patch (catlee: review+, mrrrgn: checked-in+)
2.34 KB, patch (rail: review+, mrrrgn: checked-in+)
1.18 KB, patch (Callek: review+, mrrrgn: checked-in+)
1.28 KB, patch (philor: feedback+)
Two primary things need to happen:

- Modify the cleanup task so that it does not run before every job.
- Create a runner task which kills processes which were not present just after boot.
Blocks: 1028191
Attachment #8527733 - Flags: review?
Attachment #8527733 - Flags: feedback?(rail)
I built a tool called 'cleanslate' for killing all processes which weren't live just after system boot. This seems like a good start at cleaning up after jobs w/o rebooting.
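A minimal sketch of that approach, assuming the snapshot lives at /var/tmp/cleanslate (the marker file the later post-flight patch removes); the function bodies here are illustrative, not the actual tool:

    import os
    import signal

    SNAPSHOT = '/var/tmp/cleanslate'  # assumed snapshot/marker location

    def live_pids():
        # Every purely numeric entry under /proc is a running process.
        return set(int(p) for p in os.listdir('/proc') if p.isdigit())

    def snapshot():
        # Run once shortly after boot to record the baseline process set.
        with open(SNAPSHOT, 'w') as f:
            f.write('\n'.join(str(pid) for pid in sorted(live_pids())))

    def clean():
        # Run between jobs: terminate anything that was not alive at boot.
        # A real implementation would also spare its own PID and ancestors.
        with open(SNAPSHOT) as f:
            baseline = set(int(line) for line in f if line.strip())
        for pid in live_pids() - baseline:
            try:
                os.kill(pid, signal.SIGTERM)
            except OSError:
                pass  # already gone, or not ours to kill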
Deploy cleanslate to CentOS machines and create a cleanslate puppet module.
Attachment #8528563 - Flags: review?(rail)
Attached patch tools.clean_or_reboot.diff (obsolete) — Splinter Review
Attachment #8528565 - Flags: review?(catlee)
Depends on: 1074277
Comment on attachment 8528563 [details] [diff] [review]
cleanslate.puppet.diff

lgtm
Attachment #8528563 - Flags: review?(rail) → review+
Attachment #8528565 - Attachment is obsolete: true
Attachment #8528565 - Flags: review?(catlee)
Attachment #8528563 - Flags: checked-in+
Sorry to keep hounding you with reviews! >.<
Attachment #8529395 - Flags: review?(rail)
(In reply to Morgan Phillips [:mrrrgn] from comment #7)
> Created attachment 8529395 [details] [review]
> https://github.com/mozilla/build-runner/pull/16
> 
> Sorry to keep hounding you with reviews! >.<

No worries!
Attached patch runner.halt.diff — Splinter Review
This deploys the halt task to all runner platforms. It's turning out to be important in my local tests, since I've noticed some weird errors popping up after 4-5 runs in some cases.
Attachment #8529435 - Flags: review?(rail)
Attachment #8529435 - Flags: review?(rail) → review+
Attached patch idlelizer.diff — Splinter Review
Attachment #8530877 - Flags: review?(dustin)
Comment on attachment 8529435 [details] [diff] [review]
runner.halt.diff

Review of attachment 8529435 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/buildslave/manifests/startup/runner.pp
@@ +33,5 @@
>              }
>          }
>      }
>      include runner::tasks::buildbot
> +    include runner::tasks::halt

Is this class defined in a different, as-yet-unapplied, patch?
Attachment #8529435 - Flags: review-
Comment on attachment 8530877 [details] [diff] [review]
idlelizer.diff

Review of attachment 8530877 [details] [diff] [review]:
-----------------------------------------------------------------

::: slave/buildslave/idleizer.py
@@ +76,5 @@
> +            if err.check(AttributeError):
> +                log.msg("Master does not support slave initiated shutdown.  Upgrade master to 0.8.3 or later to use this feature.")
> +            else:
> +                log.msg('callRemote("shutdown") failed')
> +                log.err(err)

Our masters are running 0.8.2 .. does this even work?  If it does work (because we backported support, as I seem to recall we did), we could just remove this conditional; if it doesn't, then we have other problems!

@@ +100,5 @@
>              self._setTimer(self.max_disconnected_time)
>  
>      def registerActivity(self):
>          if self._state == CONNECTED:
> +            if sys.platform == 'linux2' and not self._remote_shutdown_requested:

Why just on Linuxes?  Is that temporary?
Attachment #8530877 - Flags: review?(dustin) → review+
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
> Comment on attachment 8530877 [details] [diff] [review]
> idlelizer.diff
> 
> Review of attachment 8530877 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: slave/buildslave/idleizer.py
> @@ +76,5 @@
> > +            if err.check(AttributeError):
> > +                log.msg("Master does not support slave initiated shutdown.  Upgrade master to 0.8.3 or later to use this feature.")
> > +            else:
> > +                log.msg('callRemote("shutdown") failed')
> > +                log.err(err)
> 
> Our masters are running 0.8.2 .. does this even work?  If it does work
> (because we backported support, as I seem to recall we did), we could just
> remove this conditional; if it doesn't, then we have other problems!
> 
> @@ +100,5 @@
> >              self._setTimer(self.max_disconnected_time)
> >  
> >      def registerActivity(self):
> >          if self._state == CONNECTED:
> > +            if sys.platform == 'linux2' and not self._remote_shutdown_requested:
> 
> Why just on Linuxes?  Is that temporary?

It's temporary. It will also work on OS X currently, but I didn't want to roll this out all at once.
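For illustration only (not part of the attached patch), widening that temporary gate later would just mean extending the platform check; a sketch with an assumed helper name:

    import sys

    # Python 2 reports 'linux2' on Linux; adding 'darwin' would enable OS X too.
    NO_REBOOT_PLATFORMS = ('linux2', 'darwin')

    def no_reboot_allowed(remote_shutdown_requested, platform=sys.platform):
        # Mirrors the idleizer condition: skip the reboot only on whitelisted
        # platforms, and only if the master hasn't already asked us to shut down.
        return platform in NO_REBOOT_PLATFORMS and not remote_shutdown_requested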
Attachment #8530916 - Flags: checked-in+
Attachment #8529435 - Flags: checked-in+
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
> Comment on attachment 8530877 [details] [diff] [review]
> idlelizer.diff
> 
> Review of attachment 8530877 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: slave/buildslave/idleizer.py
> @@ +76,5 @@
> > +            if err.check(AttributeError):
> > +                log.msg("Master does not support slave initiated shutdown.  Upgrade master to 0.8.3 or later to use this feature.")
> > +            else:
> > +                log.msg('callRemote("shutdown") failed')
> > +                log.err(err)
> 
> Our masters are running 0.8.2 .. does this even work?  If it does work
> (because we backported support, as I seem to recall we did), we could just
> remove this conditional; if it doesn't, then we have other problems!
> 
> @@ +100,5 @@
> >              self._setTimer(self.max_disconnected_time)
> >  
> >      def registerActivity(self):
> >          if self._state == CONNECTED:
> > +            if sys.platform == 'linux2' and not self._remote_shutdown_requested:
> 
> Why just on Linuxes?  Is that temporary?

I'll remove the unnecessary conditional in another commit.
Attachment #8529395 - Flags: checked-in+
Attachment #8530877 - Flags: checked-in+
Attachment #8531303 - Flags: review?(dustin)
Attachment #8531303 - Flags: review?(dustin) → review?(coop)
Attachment #8531303 - Flags: review?(coop) → review+
Attachment #8531303 - Flags: checked-in+
I accidentally submitted the 'update buildslave' patch with the wrong reviewer in the commit message. The reviewer was coop, not dustin.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #8533296 - Flags: review?(rail)
Attachment #8533296 - Flags: review?(rail) → review+
Attached patch postflight.diff (obsolete) — Splinter Review
This will allow us to turn no-reboots on and off selectively. In this version, only build machines will go into no-reboot mode.
Attachment #8533302 - Flags: review?(rail)
Comment on attachment 8533302 [details] [diff] [review]
postflight.diff

Review of attachment 8533302 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/runner/files/post_flight.py
@@ +16,5 @@
> +
> +
> +def get_hostname():
> +    sysname, nodename, release, version, machine = os.uname()
> +    return nodename.split('.')[0]

I think we use socket.gethostname() for this.

@@ +22,5 @@
> +
> +def get_recent_builds(slavename):
> +    url = '%s%s%s' % (BUILD_API_URL_BASE, slavename, BUILD_API_URL_SUFFIX)
> +    request = urllib2.Request(url)
> +    results = urllib2.urlopen(request)

Please set some reasonable timeout here. Otherwise this may hang forever.

@@ +23,5 @@
> +def get_recent_builds(slavename):
> +    url = '%s%s%s' % (BUILD_API_URL_BASE, slavename, BUILD_API_URL_SUFFIX)
> +    request = urllib2.Request(url)
> +    results = urllib2.urlopen(request)
> +    processed_results = json.loads(results.read())

This may raise at least ValueError and the script would exit(1) in this case. This means RETRY for runner. Not sure if this is what you want here.

@@ +37,5 @@
> +    rm_files = ['/var/tmp/cleanslate']
> +    for f in rm_files:
> +        if os.path.exists(f):
> +            print('removing: %s' % f)
> +            os.remove(f)

Possible race condition and uncaught exception (IOError and OSError IIRc). Can you wrap os.remove() with try/except and don't call os.path.exists?

@@ +45,5 @@
> +def get_hostname_blacklist():
> +    '''
> +    A list of hostname expressions which will coerce a halt.
> +    '''
> +    return ['^t.*']  # halt all test machines

this would also blacklist try-linux64. 
["^tst-.*", "^t-.*", "^talos-.*"] should match per http://hg.mozilla.org/build/buildbot-configs/file/f2255d82ce6d/mozilla-tests/production_config.py

@@ +56,5 @@
> +    '''
> +    for expression in blacklist:
> +        if re.match(expression, entry):
> +            return True
> +    return False

A nit, the same can be done as

return any(re.match(e, entry) for e in blacklist)

@@ +71,5 @@
> +        halt()
> +
> +    # stagger web requests in case of retries
> +    time.sleep(random.randint(1, 5))
> +    build_data = get_recent_builds(hostname)

I'm not sure how sleeping without retrying may help here... Also 5 seconds won't stop slaves from hammering if you add retrying.

@@ +75,5 @@
> +    build_data = get_recent_builds(hostname)
> +
> +    if not build_data:
> +        # The exception should coerce a retry
> +        raise Exception('Failed to find data for %s' % hostname)

This will retry the script, correct?
Attachment #8533302 - Flags: review?(rail) → review-
(In reply to Rail Aliiev [:rail] from comment #21)
> Comment on attachment 8533302 [details] [diff] [review]
> postflight.diff
> 
> Review of attachment 8533302 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: modules/runner/files/post_flight.py
> @@ +16,5 @@
> > +
> > +
> > +def get_hostname():
> > +    sysname, nodename, release, version, machine = os.uname()
> > +    return nodename.split('.')[0]
> 
> I think we use socket.gethostname() for this.
> 
> @@ +22,5 @@
> > +
> > +def get_recent_builds(slavename):
> > +    url = '%s%s%s' % (BUILD_API_URL_BASE, slavename, BUILD_API_URL_SUFFIX)
> > +    request = urllib2.Request(url)
> > +    results = urllib2.urlopen(request)
> 
> Please set some reasonable timeout here. Otherwise this may hang forever.
> 
> @@ +23,5 @@
> > +def get_recent_builds(slavename):
> > +    url = '%s%s%s' % (BUILD_API_URL_BASE, slavename, BUILD_API_URL_SUFFIX)
> > +    request = urllib2.Request(url)
> > +    results = urllib2.urlopen(request)
> > +    processed_results = json.loads(results.read())
> 
> This may raise at least ValueError and the script would exit(1) in this
> case. This means RETRY for runner. Not sure if this is what you want here.
> 
> @@ +37,5 @@
> > +    rm_files = ['/var/tmp/cleanslate']
> > +    for f in rm_files:
> > +        if os.path.exists(f):
> > +            print('removing: %s' % f)
> > +            os.remove(f)
> 
> Possible race condition and uncaught exception (IOError and OSError IIRc).
> Can you wrap os.remove() with try/except and don't call os.path.exists?
> 
> @@ +45,5 @@
> > +def get_hostname_blacklist():
> > +    '''
> > +    A list of hostname expressions which will coerce a halt.
> > +    '''
> > +    return ['^t.*']  # halt all test machines
> 
> this would also blacklist try-linux64. 
> ["^tst-.*", "^t-.*", "^talos-.*"] should match per
> http://hg.mozilla.org/build/buildbot-configs/file/f2255d82ce6d/mozilla-tests/
> production_config.py
> 
> @@ +56,5 @@
> > +    '''
> > +    for expression in blacklist:
> > +        if re.match(expression, entry):
> > +            return True
> > +    return False
> 
> A nit, the same can be done as
> 
> return any(re.match(e, entry) for e in blacklist)
> 
> @@ +71,5 @@
> > +        halt()
> > +
> > +    # stagger web requests in case of retries
> > +    time.sleep(random.randint(1, 5))
> > +    build_data = get_recent_builds(hostname)
> 
> I'm not sure how sleeping without retrying may help here... Also 5 seconds
> won't stop slaves from hammering if you add retrying.
> 
> @@ +75,5 @@
> > +    build_data = get_recent_builds(hostname)
> > +
> > +    if not build_data:
> > +        # The exception should coerce a retry
> > +        raise Exception('Failed to find data for %s' % hostname)
> 
> This will retry the script, correct?

It will.

Sleeping was a silly idea; there's now a timeout of 12 seconds for open sockets. I addressed the other changes as well.

Any failure outside of calling halt() should result in a retry, so leaving most of the exceptions unhandled is fine. Problems will show up in the logs and hit max retries if they're not something intermittent (like a single timeout).
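Roughly, the addressed points come down to this (Python 2 sketch; the names follow the patch, but these bodies are illustrative rather than the landed code, and 12 seconds is the timeout mentioned above):

    import errno
    import json
    import os
    import re
    import urllib2

    SOCKET_TIMEOUT = 12  # seconds

    def get_recent_builds(url):
        # An explicit timeout keeps a wedged buildapi from hanging the task forever;
        # a ValueError from bad JSON simply propagates, so runner retries the task.
        results = urllib2.urlopen(urllib2.Request(url), timeout=SOCKET_TIMEOUT)
        return json.loads(results.read())

    def remove_state_files(paths=('/var/tmp/cleanslate',)):
        # EAFP: remove and ignore "no such file" instead of racing exists()/remove().
        for path in paths:
            try:
                os.remove(path)
            except OSError as e:
                if e.errno != errno.ENOENT:
                    raise

    def matches_blacklist(entry, blacklist=('^tst-.*', '^t-.*', '^talos-.*')):
        # Patterns from the review; a match forces a reboot.
        return any(re.match(expr, entry) for expr in blacklist)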
"defensive programming" / "let-it-crash"
 ++!
(In reply to Dustin J. Mitchell [:dustin] from comment #23)
> "defensive programming" / "let-it-crash"
>  ++!

Defensive programming is always good; but in this case let it crash makes the most since. The idea is that, if anything seems even slightly awry we should consider forcing a reboot. Letting exceptions raise here will trigger a retry, then, eventually a reboot thanks to the fact that it's a process being managed by runner. The only cases worth handling are cases where we want to reboot immediately instead of retrying.
(In reply to Dustin J. Mitchell [:dustin] from comment #23)
> "defensive programming" / "let-it-crash"
>  ++!

*sense (& any other grammer derps) :( :( :(
Attached patch postflight.diff — Splinter Review
Attachment #8533302 - Attachment is obsolete: true
Attachment #8533386 - Flags: review?(rail)
Comment on attachment 8533386 [details] [diff] [review]
postflight.diff


                     ,@;@,
        ,@;@;@;@;@;@/ )@;@;
      ,;@;@;@;@;@;@|_/@' e\
     (|@;@:@\@;@;@;@:@(    \
       '@;@;@|@;@;@;@;'`"--'
        '@;@;/;@;/;@;' 
         ) //   | ||
         \ \\   | ||
          \ \\  ) \\  it!
           `"`  `"``
Attachment #8533386 - Flags: review?(rail) → review+
Attachment #8534378 - Flags: review?(rail)
Attachment #8534378 - Flags: review?(rail) → review+
Attachment #8534397 - Flags: review?(rail)
Attachment #8534397 - Flags: review?(rail) → review+
This will stop rebooting talos machines (should be applied on a Monday)
Attachment #8535844 - Flags: review?(catlee)
Attachment #8533296 - Flags: checked-in+
Attachment #8533386 - Flags: checked-in+
Comment on attachment 8535844 [details] [diff] [review]
runner.noreboot.talos.diff

Review of attachment 8535844 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm, as long as joel is ok with it!
Attachment #8535844 - Flags: review?(catlee) → review+
Comment on attachment 8533811 [details] [review]
https://github.com/mozilla/build-runner/pull/19

merged on github a while ago
Attachment #8533811 - Flags: review?(catlee) → review+
Attachment #8534378 - Flags: checked-in+
Attachment #8534397 - Flags: checked-in+
Attachment #8535844 - Flags: checked-in+
Attached patch runner.service_restart.diff (obsolete) — Splinter Review
Attachment #8538746 - Flags: review?(rail)
Comment on attachment 8538746 [details] [diff] [review]
runner.service_restart.diff

Review of attachment 8538746 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/runner/templates/tasks/restart_services.erb
@@ +4,5 @@
> +
> +RESTART_SERVICES="<%= @service_names %>"
> +
> +for service_name in ${RESTART_SERVICES}; do
> +  if [[ $(service ${service_name} status) != '' ]]; then

Can you check the exit code instead? The approach above may not work on Ubuntu since the command always prints something to stdout.
Attachment #8538746 - Flags: review?(rail) → review-
Attachment #8538746 - Attachment is obsolete: true
Attachment #8539342 - Flags: review?(rail)
(In reply to Rail Aliiev [:rail] from comment #35)
> Comment on attachment 8538746 [details] [diff] [review]
> runner.service_restart.diff
> 
> Review of attachment 8538746 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: modules/runner/templates/tasks/restart_services.erb
> @@ +4,5 @@
> > +
> > +RESTART_SERVICES="<%= @service_names %>"
> > +
> > +for service_name in ${RESTART_SERVICES}; do
> > +  if [[ $(service ${service_name} status) != '' ]]; then
> 
> Can you check the exit code instead? The approach above may not work on
> Ubuntu since the command always prints something to stdout.

Good point, I swapped it out with something more sensible. Thanks! :)
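The landed fix is in the shell/ERB template, but the principle (trust the exit status rather than stdout, since Ubuntu's 'service <name> status' always prints something) can be sketched in Python like so; purely illustrative, not the actual task:

    import os
    import subprocess

    def service_running(service_name):
        # Check 'service <name> status' by exit code instead of parsing stdout,
        # whose contents differ between CentOS and Ubuntu.
        with open(os.devnull, 'w') as devnull:
            return subprocess.call(['service', service_name, 'status'],
                                   stdout=devnull, stderr=devnull) == 0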
Attachment #8539342 - Flags: review?(rail) → review+
Assignee: nobody → winter2718
Attachment #8539342 - Flags: checked-in+
So, now with service restarts in place, I'm going to stop reboots on a large swath of testers before proceeding. Once the 't-' and 'tst-' machines are removed from the post_flight hostname blacklist, this task will be _finished_! \m/
To turn off tester rebooting again.
Attachment #8542648 - Flags: review?(bugspam.Callek)
Comment on attachment 8542648 [details] [diff] [review]
test-noreboot.diff

Review of attachment 8542648 [details] [diff] [review]:
-----------------------------------------------------------------

lets see what burns
Attachment #8542648 - Flags: review?(bugspam.Callek) → review+
Good luck telling what burns - between bug 989048 and bug 1109932 exploding and the bizarre "anything can fail and probably will" bug 1114541 and the month-long rash of bug 1073442 consistently happening once a slave started hitting it until the instance was terminated, which seems to have suddenly stopped at 7am Monday, I had myself convinced that we already were doing no-reboot on tst-linux*, and that it was the reason why absolutely no result on those slaves could be believed. So if this does wind up having any effect on the running of tests other than complete and permanent failure, you absolutely will not be able to tell that from sheriff reports, because we currently can't tell the difference between tst-linux* results and random() results.
Attachment #8542648 - Flags: checked-in+
Philor, I have this patch ready and am watching the trees. I added you for info just to call your attention to the way that reboots are working now. The buildbot logs say "reboot skipped" but that's a result of the configs not being changed to reflect our current situation.

Now, after every job, several checks are performed to decide whether a machine should be rebooted. If its hostname matches a blacklist regex, it will always be rebooted. An overall view of how many machines are rebooting at any time can be found here: https://stats.taskcluster.net/grafana/#/dashboard/db/runner
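For reference, the per-job decision reduces to something like this (helper names, the blacklist contents, and the halt mechanism are assumptions, not the exact landed script):

    import re
    import socket
    import subprocess
    import sys

    BLACKLIST = ('^tst-.*', '^t-.*', '^talos-.*')  # blacklisted hosts always reboot

    def post_flight(build_data):
        # build_data is whatever the buildapi query returned for this slave.
        hostname = socket.gethostname().split('.')[0]

        if any(re.match(expr, hostname) for expr in BLACKLIST):
            subprocess.call(['sudo', 'halt'])  # assumed halt mechanism
            return

        if not build_data:
            # Let it crash: a non-zero exit makes runner retry, and repeated
            # failures end in a reboot anyway.
            raise Exception('Failed to find data for %s' % hostname)

        sys.exit(0)  # looks healthy: skip the reboot and take another job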
Attachment #8542746 - Flags: feedback?(philringnalda)
Comment on attachment 8542746 [details] [diff] [review]
enable-test-reboots.diff

Got it, thanks.

(Other than an inability to see anything on taskcluster, since "login with your @mozilla.com address" would require that I first quit my current job, and then get hired by MoCo.)
Attachment #8542746 - Flags: feedback?(philringnalda) → feedback+
Status: REOPENED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
QA Contact: pmoore → mshal