Closed
Bug 1103123
Opened 11 years ago
Closed 11 years ago
Stop rebooting Linux slaves after buildbot jobs
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mrrrgn, Assigned: mrrrgn)
References
Details
Attachments
(16 files, 3 obsolete files)
|
42 bytes,
text/plain
|
Details | |
|
4.29 KB,
patch
|
rail
:
review+
mrrrgn
:
checked-in+
|
Details | Diff | Splinter Review |
|
47 bytes,
text/x-github-pull-request
|
rail
:
review+
mrrrgn
:
checked-in+
|
Details | Review |
|
2.75 KB,
patch
|
rail
:
review+
dustin
:
review-
mrrrgn
:
checked-in+
|
Details | Diff | Splinter Review |
|
7.19 KB,
patch
|
dustin
:
review+
mrrrgn
:
checked-in+
|
Details | Diff | Splinter Review |
|
2.19 KB,
patch
|
mrrrgn
:
checked-in+
|
Details | Diff | Splinter Review |
|
1.95 KB,
patch
|
coop
:
review+
mrrrgn
:
checked-in+
|
Details | Diff | Splinter Review |
|
1.54 KB,
patch
|
rail
:
review+
mrrrgn
:
checked-in+
|
Details | Diff | Splinter Review |
|
4.45 KB,
patch
|
rail
:
review+
mrrrgn
:
checked-in+
|
Details | Diff | Splinter Review |
|
47 bytes,
text/x-github-pull-request
|
catlee
:
review+
|
Details | Review |
|
982 bytes,
patch
|
rail
:
review+
mrrrgn
:
checked-in+
|
Details | Diff | Splinter Review |
|
1.70 KB,
patch
|
rail
:
review+
mrrrgn
:
checked-in+
|
Details | Diff | Splinter Review |
|
1006 bytes,
patch
|
catlee
:
review+
mrrrgn
:
checked-in+
|
Details | Diff | Splinter Review |
|
2.34 KB,
patch
|
rail
:
review+
mrrrgn
:
checked-in+
|
Details | Diff | Splinter Review |
|
1.18 KB,
patch
|
Callek
:
review+
mrrrgn
:
checked-in+
|
Details | Diff | Splinter Review |
|
1.28 KB,
patch
|
philor
:
feedback+
|
Details | Diff | Splinter Review |
To primary things need to happen:
- Modify the cleanup task so that it does not run before every job.
- Create a runner task which kills processes which were not present just after boot.
| Assignee | ||
Comment 1•11 years ago
|
||
Attachment #8527733 -
Flags: review?
Attachment #8527733 -
Flags: feedback?(rail)
| Assignee | ||
Comment 2•11 years ago
|
||
I built a tool called 'cleanslate' for killing all processes which weren't live just after system boot. This seems like a good start at cleaning up after jobs w/o rebooting.
Comment 3•11 years ago
|
||
Comment on attachment 8527733 [details]
https://github.com/mrrrgn/build-cleanslate
I created a branch with some comments: https://github.com/rail/build-cleanslate/compare/comments?expand=1
Attachment #8527733 -
Flags: review?
Attachment #8527733 -
Flags: feedback?(rail)
| Assignee | ||
Comment 4•11 years ago
|
||
Deploy cleanslate to CentOS machines and create a cleanslate puppet module.
Attachment #8528563 -
Flags: review?(rail)
| Assignee | ||
Comment 5•11 years ago
|
||
Attachment #8528565 -
Flags: review?(catlee)
Comment 6•11 years ago
|
||
Comment on attachment 8528563 [details] [diff] [review]
cleanslate.puppet.diff
lgtm
Attachment #8528563 -
Flags: review?(rail) → review+
| Assignee | ||
Updated•11 years ago
|
Attachment #8528565 -
Attachment is obsolete: true
Attachment #8528565 -
Flags: review?(catlee)
| Assignee | ||
Updated•11 years ago
|
Attachment #8528563 -
Flags: checked-in+
| Assignee | ||
Comment 7•11 years ago
|
||
Sorry to keep hounding you with reviews! >.<
Attachment #8529395 -
Flags: review?(rail)
Comment 8•11 years ago
|
||
Comment on attachment 8529395 [details] [review]
https://github.com/mozilla/build-runner/pull/16
Merged
Attachment #8529395 -
Flags: review?(rail) → review+
Comment 9•11 years ago
|
||
(In reply to Morgan Phillips [:mrrrgn] from comment #7)
> Created attachment 8529395 [details] [review]
> https://github.com/mozilla/build-runner/pull/16
>
> Sorry to keep hounding you with reviews! >.<
No worries!
| Assignee | ||
Comment 10•11 years ago
|
||
This deploys the halt task to all runner platforms. It's turning out to be important in my local tests, since I've noticed some weird errors popping up after 4-5 runs in some cases.
Attachment #8529435 -
Flags: review?(rail)
Updated•11 years ago
|
Attachment #8529435 -
Flags: review?(rail) → review+
| Assignee | ||
Comment 11•11 years ago
|
||
Attachment #8530877 -
Flags: review?(dustin)
Comment 12•11 years ago
|
||
Comment on attachment 8529435 [details] [diff] [review]
runner.halt.diff
Review of attachment 8529435 [details] [diff] [review]:
-----------------------------------------------------------------
::: modules/buildslave/manifests/startup/runner.pp
@@ +33,5 @@
> }
> }
> }
> include runner::tasks::buildbot
> + include runner::tasks::halt
Is this class defined in a different, as-yet-unapplied, patch?
Attachment #8529435 -
Flags: review-
Comment 13•11 years ago
|
||
Comment on attachment 8530877 [details] [diff] [review]
idlelizer.diff
Review of attachment 8530877 [details] [diff] [review]:
-----------------------------------------------------------------
::: slave/buildslave/idleizer.py
@@ +76,5 @@
> + if err.check(AttributeError):
> + log.msg("Master does not support slave initiated shutdown. Upgrade master to 0.8.3 or later to use this feature.")
> + else:
> + log.msg('callRemote("shutdown") failed')
> + log.err(err)
Our masters are running 0.8.2 .. does this even work? If it does work (because we backported support, as I seem to recall we did), we could just remove this conditional; if it doesn't, then we have other problems!
@@ +100,5 @@
> self._setTimer(self.max_disconnected_time)
>
> def registerActivity(self):
> if self._state == CONNECTED:
> + if sys.platform == 'linux2' and not self._remote_shutdown_requested:
Why just on Linuxes? Is that temporary?
Attachment #8530877 -
Flags: review?(dustin) → review+
| Assignee | ||
Comment 14•11 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
> Comment on attachment 8530877 [details] [diff] [review]
> idlelizer.diff
>
> Review of attachment 8530877 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: slave/buildslave/idleizer.py
> @@ +76,5 @@
> > + if err.check(AttributeError):
> > + log.msg("Master does not support slave initiated shutdown. Upgrade master to 0.8.3 or later to use this feature.")
> > + else:
> > + log.msg('callRemote("shutdown") failed')
> > + log.err(err)
>
> Our masters are running 0.8.2 .. does this even work? If it does work
> (because we backported support, as I seem to recall we did), we could just
> remove this conditional; if it doesn't, then we have other problems!
>
> @@ +100,5 @@
> > self._setTimer(self.max_disconnected_time)
> >
> > def registerActivity(self):
> > if self._state == CONNECTED:
> > + if sys.platform == 'linux2' and not self._remote_shutdown_requested:
>
> Why just on Linuxes? Is that temporary?
It's temporary. It will also work on OSX currently; but I didn't want to roll this out all at once.
| Assignee | ||
Comment 15•11 years ago
|
||
Attachment #8530916 -
Flags: checked-in+
| Assignee | ||
Updated•11 years ago
|
Attachment #8529435 -
Flags: checked-in+
| Assignee | ||
Comment 16•11 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #13)
> Comment on attachment 8530877 [details] [diff] [review]
> idlelizer.diff
>
> Review of attachment 8530877 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: slave/buildslave/idleizer.py
> @@ +76,5 @@
> > + if err.check(AttributeError):
> > + log.msg("Master does not support slave initiated shutdown. Upgrade master to 0.8.3 or later to use this feature.")
> > + else:
> > + log.msg('callRemote("shutdown") failed')
> > + log.err(err)
>
> Our masters are running 0.8.2 .. does this even work? If it does work
> (because we backported support, as I seem to recall we did), we could just
> remove this conditional; if it doesn't, then we have other problems!
>
> @@ +100,5 @@
> > self._setTimer(self.max_disconnected_time)
> >
> > def registerActivity(self):
> > if self._state == CONNECTED:
> > + if sys.platform == 'linux2' and not self._remote_shutdown_requested:
>
> Why just on Linuxes? Is that temporary?
I'll remove the un-necessary conditional in another commit.
| Assignee | ||
Updated•11 years ago
|
Attachment #8529395 -
Flags: checked-in+
| Assignee | ||
Updated•11 years ago
|
Attachment #8530877 -
Flags: checked-in+
| Assignee | ||
Comment 17•11 years ago
|
||
Attachment #8531303 -
Flags: review?(dustin)
| Assignee | ||
Updated•11 years ago
|
Attachment #8531303 -
Flags: review?(dustin) → review?(coop)
Updated•11 years ago
|
Attachment #8531303 -
Flags: review?(coop) → review+
| Assignee | ||
Updated•11 years ago
|
Attachment #8531303 -
Flags: checked-in+
| Assignee | ||
Comment 18•11 years ago
|
||
I accidentally submitted the 'update buildslave' patch with the wrong reviewer in the commit message. The reviewer was coop not dustin.
| Assignee | ||
Updated•11 years ago
|
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
| Assignee | ||
Updated•11 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
| Assignee | ||
Comment 19•11 years ago
|
||
Attachment #8533296 -
Flags: review?(rail)
Updated•11 years ago
|
Attachment #8533296 -
Flags: review?(rail) → review+
| Assignee | ||
Comment 20•11 years ago
|
||
This will allow us to turn on off no-reboots selectively. In this version only build machines will go into no reboot mode.
Attachment #8533302 -
Flags: review?(rail)
Comment 21•11 years ago
|
||
Comment on attachment 8533302 [details] [diff] [review]
postflight.diff
Review of attachment 8533302 [details] [diff] [review]:
-----------------------------------------------------------------
::: modules/runner/files/post_flight.py
@@ +16,5 @@
> +
> +
> +def get_hostname():
> + sysname, nodename, release, version, machine = os.uname()
> + return nodename.split('.')[0]
I think we use socket.gethostname() for this.
@@ +22,5 @@
> +
> +def get_recent_builds(slavename):
> + url = '%s%s%s' % (BUILD_API_URL_BASE, slavename, BUILD_API_URL_SUFFIX)
> + request = urllib2.Request(url)
> + results = urllib2.urlopen(request)
Please set some reasonable timeout here. Otherwise this may hang forever.
@@ +23,5 @@
> +def get_recent_builds(slavename):
> + url = '%s%s%s' % (BUILD_API_URL_BASE, slavename, BUILD_API_URL_SUFFIX)
> + request = urllib2.Request(url)
> + results = urllib2.urlopen(request)
> + processed_results = json.loads(results.read())
This may raise at least ValueError and the script would exit(1) in this case. This means RETRY for runner. Not sure if this is what you want here.
@@ +37,5 @@
> + rm_files = ['/var/tmp/cleanslate']
> + for f in rm_files:
> + if os.path.exists(f):
> + print('removing: %s' % f)
> + os.remove(f)
Possible race condition and uncaught exception (IOError and OSError IIRc). Can you wrap os.remove() with try/except and don't call os.path.exists?
@@ +45,5 @@
> +def get_hostname_blacklist():
> + '''
> + A list of hostname expressions which will coerce a halt.
> + '''
> + return ['^t.*'] # halt all test machines
this would also blacklist try-linux64.
["^tst-.*", "^t-.*", "^talos-.*"] should match per http://hg.mozilla.org/build/buildbot-configs/file/f2255d82ce6d/mozilla-tests/production_config.py
@@ +56,5 @@
> + '''
> + for expression in blacklist:
> + if re.match(expression, entry):
> + return True
> + return False
A nit, the same can be done as
return any(re.match(e, entry) for e in blacklist)
@@ +71,5 @@
> + halt()
> +
> + # stagger web requests in case of retries
> + time.sleep(random.randint(1, 5))
> + build_data = get_recent_builds(hostname)
I'm not sure how sleeping without retrying may help here... Also 5 seconds won't stop slaves from hammering if you add retrying.
@@ +75,5 @@
> + build_data = get_recent_builds(hostname)
> +
> + if not build_data:
> + # The exception should coerce a retry
> + raise Exception('Failed to find data for %s' % hostname)
This will retry the script, correct?
Attachment #8533302 -
Flags: review?(rail) → review-
| Assignee | ||
Comment 22•11 years ago
|
||
(In reply to Rail Aliiev [:rail] from comment #21)
> Comment on attachment 8533302 [details] [diff] [review]
> postflight.diff
>
> Review of attachment 8533302 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: modules/runner/files/post_flight.py
> @@ +16,5 @@
> > +
> > +
> > +def get_hostname():
> > + sysname, nodename, release, version, machine = os.uname()
> > + return nodename.split('.')[0]
>
> I think we use socket.gethostname() for this.
>
> @@ +22,5 @@
> > +
> > +def get_recent_builds(slavename):
> > + url = '%s%s%s' % (BUILD_API_URL_BASE, slavename, BUILD_API_URL_SUFFIX)
> > + request = urllib2.Request(url)
> > + results = urllib2.urlopen(request)
>
> Please set some reasonable timeout here. Otherwise this may hang forever.
>
> @@ +23,5 @@
> > +def get_recent_builds(slavename):
> > + url = '%s%s%s' % (BUILD_API_URL_BASE, slavename, BUILD_API_URL_SUFFIX)
> > + request = urllib2.Request(url)
> > + results = urllib2.urlopen(request)
> > + processed_results = json.loads(results.read())
>
> This may raise at least ValueError and the script would exit(1) in this
> case. This means RETRY for runner. Not sure if this is what you want here.
>
> @@ +37,5 @@
> > + rm_files = ['/var/tmp/cleanslate']
> > + for f in rm_files:
> > + if os.path.exists(f):
> > + print('removing: %s' % f)
> > + os.remove(f)
>
> Possible race condition and uncaught exception (IOError and OSError IIRc).
> Can you wrap os.remove() with try/except and don't call os.path.exists?
>
> @@ +45,5 @@
> > +def get_hostname_blacklist():
> > + '''
> > + A list of hostname expressions which will coerce a halt.
> > + '''
> > + return ['^t.*'] # halt all test machines
>
> this would also blacklist try-linux64.
> ["^tst-.*", "^t-.*", "^talos-.*"] should match per
> http://hg.mozilla.org/build/buildbot-configs/file/f2255d82ce6d/mozilla-tests/
> production_config.py
>
> @@ +56,5 @@
> > + '''
> > + for expression in blacklist:
> > + if re.match(expression, entry):
> > + return True
> > + return False
>
> A nit, the same can be done as
>
> return any(re.match(e, entry) for e in blacklist)
>
> @@ +71,5 @@
> > + halt()
> > +
> > + # stagger web requests in case of retries
> > + time.sleep(random.randint(1, 5))
> > + build_data = get_recent_builds(hostname)
>
> I'm not sure how sleeping without retrying may help here... Also 5 seconds
> won't stop slaves from hammering if you add retrying.
>
> @@ +75,5 @@
> > + build_data = get_recent_builds(hostname)
> > +
> > + if not build_data:
> > + # The exception should coerce a retry
> > + raise Exception('Failed to find data for %s' % hostname)
>
> This will retry the script, correct? It will.
Sleeping was a silly idea; there's now a timeout of 12 seconds for open sockets. I addressed the other changes as well.
Anything that happens outside of calling halt() should result in a retry, so not handling most of the exceptions will be fine. Problems will show up in the logs and hit max-retries if it's not something intermittent (like a single timeout).
Comment 23•11 years ago
|
||
"defensive programming" / "let-it-crash"
++!
| Assignee | ||
Comment 24•11 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #23)
> "defensive programming" / "let-it-crash"
> ++!
Defensive programming is always good; but in this case let it crash makes the most since. The idea is that, if anything seems even slightly awry we should consider forcing a reboot. Letting exceptions raise here will trigger a retry, then, eventually a reboot thanks to the fact that it's a process being managed by runner. The only cases worth handling are cases where we want to reboot immediately instead of retrying.
| Assignee | ||
Comment 25•11 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #23)
> "defensive programming" / "let-it-crash"
> ++!
*sense (& any other grammer derps) :( :( :(
| Assignee | ||
Comment 26•11 years ago
|
||
Attachment #8533302 -
Attachment is obsolete: true
Attachment #8533386 -
Flags: review?(rail)
Comment 27•11 years ago
|
||
Comment on attachment 8533386 [details] [diff] [review]
postflight.diff
,@;@,
,@;@;@;@;@;@/ )@;@;
,;@;@;@;@;@;@|_/@' e\
(|@;@:@\@;@;@;@:@( \
'@;@;@|@;@;@;@;'`"--'
'@;@;/;@;/;@;'
) // | ||
\ \\ | ||
\ \\ ) \\ it!
`"` `"``
Attachment #8533386 -
Flags: review?(rail) → review+
| Assignee | ||
Comment 28•11 years ago
|
||
Attachment #8533811 -
Flags: review?(catlee)
| Assignee | ||
Comment 29•11 years ago
|
||
Attachment #8534378 -
Flags: review?(rail)
Updated•11 years ago
|
Attachment #8534378 -
Flags: review?(rail) → review+
| Assignee | ||
Comment 30•11 years ago
|
||
Attachment #8534397 -
Flags: review?(rail)
Updated•11 years ago
|
Attachment #8534397 -
Flags: review?(rail) → review+
| Assignee | ||
Comment 31•11 years ago
|
||
This will stop rebooting talos machines (should be applied on a Monday)
Attachment #8535844 -
Flags: review?(catlee)
| Assignee | ||
Updated•11 years ago
|
Attachment #8533296 -
Flags: checked-in+
| Assignee | ||
Updated•11 years ago
|
Attachment #8533386 -
Flags: checked-in+
Comment 32•11 years ago
|
||
Comment on attachment 8535844 [details] [diff] [review]
runner.noreboot.talos.diff
Review of attachment 8535844 [details] [diff] [review]:
-----------------------------------------------------------------
lgtm, as long is joel is ok with it!
Attachment #8535844 -
Flags: review?(catlee) → review+
Comment 33•11 years ago
|
||
Comment on attachment 8533811 [details] [review]
https://github.com/mozilla/build-runner/pull/19
merged on github a while ago
Attachment #8533811 -
Flags: review?(catlee) → review+
| Assignee | ||
Updated•11 years ago
|
Attachment #8534378 -
Flags: checked-in+
| Assignee | ||
Updated•11 years ago
|
Attachment #8534397 -
Flags: checked-in+
| Assignee | ||
Updated•11 years ago
|
Attachment #8535844 -
Flags: checked-in+
| Assignee | ||
Comment 34•11 years ago
|
||
Attachment #8538746 -
Flags: review?(rail)
Comment 35•11 years ago
|
||
Comment on attachment 8538746 [details] [diff] [review]
runner.service_restart.diff
Review of attachment 8538746 [details] [diff] [review]:
-----------------------------------------------------------------
::: modules/runner/templates/tasks/restart_services.erb
@@ +4,5 @@
> +
> +RESTART_SERVICES="<%= @service_names %>"
> +
> +for service_name in ${RESTART_SERVICES}; do
> + if [[ $(service ${service_name} status) != '' ]]; then
Can you check the exit code instead? The approach above may not work on Ubuntu since the command always prints something to stdout.
Attachment #8538746 -
Flags: review?(rail) → review-
| Assignee | ||
Comment 36•11 years ago
|
||
Attachment #8538746 -
Attachment is obsolete: true
Attachment #8539342 -
Flags: review?(rail)
| Assignee | ||
Comment 37•11 years ago
|
||
(In reply to Rail Aliiev [:rail] from comment #35)
> Comment on attachment 8538746 [details] [diff] [review]
> runner.service_restart.diff
>
> Review of attachment 8538746 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: modules/runner/templates/tasks/restart_services.erb
> @@ +4,5 @@
> > +
> > +RESTART_SERVICES="<%= @service_names %>"
> > +
> > +for service_name in ${RESTART_SERVICES}; do
> > + if [[ $(service ${service_name} status) != '' ]]; then
>
> Can you check the exit code instead? The approach above may not work on
> Ubuntu since the command always prints something to stdout.
Good point, I swapped it out with something more sensible. Thanks! :)
Updated•11 years ago
|
Attachment #8539342 -
Flags: review?(rail) → review+
| Assignee | ||
Updated•11 years ago
|
Assignee: nobody → winter2718
| Assignee | ||
Updated•11 years ago
|
Attachment #8539342 -
Flags: checked-in+
| Assignee | ||
Comment 38•11 years ago
|
||
So, now with service restarts in place I'm going to stop reboots on a large swath of testers before proceeding. Once the 't-' and 'tst-' machines are removed from the post_flight hostname blacklist this task will be _finished_! \m/
| Assignee | ||
Comment 39•11 years ago
|
||
To turn off tester rebooting again.
Attachment #8542648 -
Flags: review?(bugspam.Callek)
Comment 40•11 years ago
|
||
Comment on attachment 8542648 [details] [diff] [review]
test-noreboot.diff
Review of attachment 8542648 [details] [diff] [review]:
-----------------------------------------------------------------
lets see what burns
Attachment #8542648 -
Flags: review?(bugspam.Callek) → review+
Comment 41•11 years ago
|
||
Good luck telling what burns - between bug 989048 and bug 1109932 exploding and the bizarre "anything can fail and probably will" bug 1114541 and the month-long rash of bug 1073442 consistently happening once a slave started hitting it until the instance was terminated, which seems to have suddenly stopped at 7am Monday, I had myself convinced that we already were doing no-reboot on tst-linux*, and that it was the reason why absolutely no result on those slaves could be believed. So if this does wind up having any effect on the running of tests other than complete and permanent failure, you absolutely will not be able to tell that from sheriff reports, because we currently can't tell the difference between tst-linux* results and random() results.
| Assignee | ||
Updated•11 years ago
|
Attachment #8542648 -
Flags: checked-in+
| Assignee | ||
Comment 42•11 years ago
|
||
Philor, I have this patch ready and am watching the trees. I added you for info just to call your attention to the way that reboots are working now. The buildbot logs say "reboot skipped" but that's a result of the configs not being changed to reflect our current situation.
Now, after every job, several checks are performed to decide if a machine should be rebooted. If its hostname matches a blacklist regex it will always be rebooted. An overall view of how many machines are rebooting at any time can be found here: https://stats.taskcluster.net/grafana/#/dashboard/db/runner
Attachment #8542746 -
Flags: feedback?(philringnalda)
Comment 43•11 years ago
|
||
Comment on attachment 8542746 [details] [diff] [review]
enable-test-reboots.diff
Got it, thanks.
(Other than an inability to see anything on taskcluster, since "login with your @mozilla.com address" would require that I first quit my current job, and then get hired by MoCo.)
Attachment #8542746 -
Flags: feedback?(philringnalda) → feedback+
| Assignee | ||
Updated•11 years ago
|
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
Updated•10 years ago
|
QA Contact: pmoore → mshal
You need to log in
before you can comment on or make changes to this bug.
Description
•