Closed Bug 1028191 Opened 6 years ago Closed 5 years ago

Stop rebooting after every job


(Release Engineering :: General, defect)

Not set


(Not tracked)



(Reporter: catlee, Assigned: mrrrgn)




(2 files)

Currently we reboot after almost every job on buildbot in order to do things like:

- Make sure we re-puppetize
- Clean up temporary files
- Make sure no old processes are running
- Clean up memory fragmentation

However, by rebooting, we cause some problems:
- We lose the filesystem cache between every job. In AWS this turns into lots of extra IO to read back the same files over and over after each reboot
- We waste 2-5 minutes per job doing a reboot
- Extra load on puppet masters

We can address nearly all of the issues we reboot for in pre-flight checks:

- Check if there are puppet (or AMI) changes that need to be applied
- We can still clean up temporary files
- We can kill stray processes
- I don't think memory fragmentation is an issue any more. We used to have problems on 32-bit linux machines that were up for a long time. Eventually they weren't able to link large libraries. All our build machines are 64-bit now I believe.

This will require that 'runner' be in charge of starting and stopping buildbot. I imagine we'd do something like this:
- Run through pre-flight checks
- Start buildbot
- Watch twistd.log, make sure buildbot actually starts and connects to a master
- Initiate graceful shutdown of buildbot after X minutes (30?). There are ways to do this locally (e.g. by touching a shutdown.stamp file) instead of poking the buildbot master.
- Run any post-flight tasks
- Go back to beginning
Duplicate of this bug: 660080
Assignee: nobody → iconnolly
CentOS only for the moment.

As per meeting with :dustin and :catlee, Ubuntu is next - followed by Mac and Windows.
Attachment #8460425 - Flags: review?(dustin)
Comment on attachment 8460425 [details] [diff] [review]

I didn't look too deeply at this patch yet, but I worry that it forces the runner-startup model on all buildslave's, including servo and possibly qa, etc. Even if they don't use the releng::slave toplevel, which is where runner is setup/used.

I also see modules/buildslave/manifests/startup/runner.pp doesn't have any requirements on runner itself being present... 

nor does it actually install any runner tasks, (which I would expect it to, via an include or class {} )  its like it merely removes the init.d style of buildbot startup.  (in which case its misnamed, and should be "remove initd startup)
Attachment #8460425 - Flags: feedback-
>>> (in which case its misnamed, and should be "remove initd startup)

That's a fair point, let me pull this bit of this more overarching bug into its own specific bug and I'll address the rest of your comment there.
Depends on: 1042340
Attachment #8460425 - Flags: review?(dustin)
Depends on: 1042358
I just came across this bug... Is this for builders only, or test machines as well? If this includes test machines (especially talos), I wonder how much impact longer running (possibly hung/errored/memory leaking) system processes might have on performance numbers. The nice thing about rebooting between tests is that you're (in theory) always starting with the same baseline system statistics.  In reality that isn't quite true since I've seen some tests scribble tmp files all over the filesystem instead of just in designated places like /tmp, but it IS true for processes.
I'd like to do this for test machines as well. We'll have to see what impact these long running services have. Perhaps we can restart the services between test runs. Or perhaps we can reboot every 5th job, or after X hours of uptime.

In any case, I think there's lots of room for improvement here.
Depends on: 1052581
Depends on: 1103123
Blocks: 1103126
Depends on: 1118125
Attachment #8554848 - Flags: review?(bugspam.Callek)
Comment on attachment 8554848 [details] [diff] [review]

I don't see any justification here (in this bug) about the test types we're now turning into no-reboot.

Please provide that for archaeology if nothing else.
Attachment #8554848 - Flags: feedback-
(In reply to Justin Wood (:Callek) from comment #9)
> Comment on attachment 8554848 [details] [diff] [review]
> clear.reboot.settings.diff
> I don't see any justification here (in this bug) about the test types we're
> now turning into no-reboot.
> Please provide that for archaeology if nothing else.

This patch was about housekeeping. The android/emulator entries in the blacklist are rather redundant because the problem tests are also mochitests and reftests. It also seems like a good idea to remove the restart_services task, since it does nothing at this point.
Comment on attachment 8554848 [details] [diff] [review]

Review of attachment 8554848 [details] [diff] [review]:
----------------------------------------------------------------- change is fine, but restart_services.pp needs to be addressed ;-)

::: modules/runner/manifests/tasks/restart_services.pp
@@ -17,5 @@
> -    }
> -    runner::task {
> -        "${runlevel}-restart_services":
> -            content  => template("${module_name}/tasks/restart_services.erb");
> -    }

note if the goal is to remove "/opt/runner/tasks.d/99-restart_services" this fails to do so. Please add a file { } with absent here for that, all-else can go though.

We can then rip that out completely in 48 hours or so (48 hours to account for still-prod machines needing to reboot)
Attachment #8554848 - Flags: review?(bugspam.Callek) → review-
I will add that. Though, it's sort of a null operation in my mind. Especially since this change only affects aws machines. :}
Isn't the /opt/runner/tasks.d/ directory purged?
We regen amis every day, so and it's created anew.
^^ I can't English lately. :o s/and//
So, technically we have stopped rebooting after every job. There is a very long tail for this however, because various jobs interact with one another in nefarious ways. We have a minefield of > <num_job_types>^2 possible job combinations.

Right now I'm considering creating another tracking task and adding specific bugs to it as blockers.
The scope of this ticket has been achieved, as in, we do not reboot after _every_ job now. 

That said, in lieu of the work that has been done, there is still some optimizing left which should be addressed in another ticket. That ticket is:
Closed: 5 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.