Closed Bug 1028191 Opened 11 years ago Closed 10 years ago

Stop rebooting after every job

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: catlee, Assigned: mrrrgn)

References

Details

Attachments

(2 files)

manage_buildbot.diff 11 years ago Ian Connolly [:ianconnolly] 8.32 KB, patch	Callek : feedback-	Details \| Diff \| Splinter Review
clear.reboot.settings.diff 11 years ago Morgan Phillips [:mrrrgn] 3.56 KB, patch	Callek : review- Callek : feedback-	Details \| Diff \| Splinter Review

Chris AtLee [:catlee]

Reporter

Description

•

11 years ago

Currently we reboot after almost every job on buildbot in order to do things like: - Make sure we re-puppetize - Clean up temporary files - Make sure no old processes are running - Clean up memory fragmentation However, by rebooting, we cause some problems: - We lose the filesystem cache between every job. In AWS this turns into lots of extra IO to read back the same files over and over after each reboot - We waste 2-5 minutes per job doing a reboot - Extra load on puppet masters We can address nearly all of the issues we reboot for in pre-flight checks: - Check if there are puppet (or AMI) changes that need to be applied - We can still clean up temporary files - We can kill stray processes - I don't think memory fragmentation is an issue any more. We used to have problems on 32-bit linux machines that were up for a long time. Eventually they weren't able to link large libraries. All our build machines are 64-bit now I believe. This will require that 'runner' be in charge of starting and stopping buildbot. I imagine we'd do something like this: - Run through pre-flight checks - Start buildbot - Watch twistd.log, make sure buildbot actually starts and connects to a master - Initiate graceful shutdown of buildbot after X minutes (30?). There are ways to do this locally (e.g. by touching a shutdown.stamp file) instead of poking the buildbot master. - Run any post-flight tasks - Go back to beginning

Ian Connolly [:ianconnolly]

Updated

•

11 years ago

Assignee: nobody → iconnolly

Ian Connolly [:ianconnolly]

Comment 2

•

11 years ago

Attached patch manage_buildbot.diff — Details — Splinter Review

CentOS only for the moment. As per meeting with :dustin and :catlee, Ubuntu is next - followed by Mac and Windows.

Attachment #8460425 - Flags: review?(dustin)

Justin Wood (:Callek)

Comment 3

•

11 years ago

Comment on attachment 8460425 [details] [diff] [review] manage_buildbot.diff I didn't look too deeply at this patch yet, but I worry that it forces the runner-startup model on all buildslave's, including servo and possibly qa, etc. Even if they don't use the releng::slave toplevel, which is where runner is setup/used. I also see modules/buildslave/manifests/startup/runner.pp doesn't have any requirements on runner itself being present... nor does it actually install any runner tasks, (which I would expect it to, via an include or class {} ) its like it merely removes the init.d style of buildbot startup. (in which case its misnamed, and should be "remove initd startup)

Attachment #8460425 - Flags: feedback-

Ian Connolly [:ianconnolly]

Comment 4

•

11 years ago

>>> (in which case its misnamed, and should be "remove initd startup) That's a fair point, let me pull this bit of this more overarching bug into its own specific bug and I'll address the rest of your comment there.

Ian Connolly [:ianconnolly]

Updated

•

11 years ago

Depends on: 1042340

Ian Connolly [:ianconnolly]

Updated

•

11 years ago

Attachment #8460425 - Flags: review?(dustin)

Ian Connolly [:ianconnolly]

Updated

•

11 years ago

Depends on: 1042358

Pete Moore [:pmoore][:pete]

Updated

•

11 years ago

Blocks: 1043839

Amy Rich [:arr] [:arich]

Comment 5

•

11 years ago

I just came across this bug... Is this for builders only, or test machines as well? If this includes test machines (especially talos), I wonder how much impact longer running (possibly hung/errored/memory leaking) system processes might have on performance numbers. The nice thing about rebooting between tests is that you're (in theory) always starting with the same baseline system statistics. In reality that isn't quite true since I've seen some tests scribble tmp files all over the filesystem instead of just in designated places like /tmp, but it IS true for processes.

Chris AtLee [:catlee]

Reporter

Comment 6

•

11 years ago

I'd like to do this for test machines as well. We'll have to see what impact these long running services have. Perhaps we can restart the services between test runs. Or perhaps we can reboot every 5th job, or after X hours of uptime. In any case, I think there's lots of room for improvement here.

Ian Connolly [:ianconnolly]

Updated

•

11 years ago

Depends on: 1052581

Morgan Phillips [:mrrrgn]

Assignee

Comment 7

•

11 years ago

http://i0.kym-cdn.com/photos/images/newsfeed/000/043/786/1266464746097.jpg

Assignee: ian → winter2718

Morgan Phillips [:mrrrgn]

Assignee

Updated

•

11 years ago

Depends on: 1103123

Morgan Phillips [:mrrrgn]

Assignee

Updated

•

11 years ago

Blocks: 1103126

Morgan Phillips [:mrrrgn]

Assignee

Updated

•

11 years ago

Depends on: 1118125

Morgan Phillips [:mrrrgn]

Assignee

Comment 8

•

11 years ago

Attached patch clear.reboot.settings.diff — Details — Splinter Review

Attachment #8554848 - Flags: review?(bugspam.Callek)

Justin Wood (:Callek)

Comment 9

•

11 years ago

Comment on attachment 8554848 [details] [diff] [review] clear.reboot.settings.diff I don't see any justification here (in this bug) about the test types we're now turning into no-reboot. Please provide that for archaeology if nothing else.

Attachment #8554848 - Flags: feedback-

Morgan Phillips [:mrrrgn]

Assignee

Comment 10

•

10 years ago

(In reply to Justin Wood (:Callek) from comment #9) > Comment on attachment 8554848 [details] [diff] [review] > clear.reboot.settings.diff > > I don't see any justification here (in this bug) about the test types we're > now turning into no-reboot. > > Please provide that for archaeology if nothing else. This patch was about housekeeping. The android/emulator entries in the blacklist are rather redundant because the problem tests are also mochitests and reftests. It also seems like a good idea to remove the restart_services task, since it does nothing at this point.

Justin Wood (:Callek)

Comment 11

•

10 years ago

Comment on attachment 8554848 [details] [diff] [review] clear.reboot.settings.diff Review of attachment 8554848 [details] [diff] [review]: ----------------------------------------------------------------- postflight.py change is fine, but restart_services.pp needs to be addressed ;-) ::: modules/runner/manifests/tasks/restart_services.pp @@ -17,5 @@ > - } > - runner::task { > - "${runlevel}-restart_services": > - content => template("${module_name}/tasks/restart_services.erb"); > - } note if the goal is to remove "/opt/runner/tasks.d/99-restart_services" this fails to do so. Please add a file { } with absent here for that, all-else can go though. We can then rip that out completely in 48 hours or so (48 hours to account for still-prod machines needing to reboot)

Attachment #8554848 - Flags: review?(bugspam.Callek) → review-

Morgan Phillips [:mrrrgn]

Assignee

Comment 12

•

10 years ago

I will add that. Though, it's sort of a null operation in my mind. Especially since this change only affects aws machines. :}

Dustin J. Mitchell [:dustin] (he/him)

Comment 13

•

10 years ago

Isn't the /opt/runner/tasks.d/ directory purged?

Morgan Phillips [:mrrrgn]

Assignee

Comment 14

•

10 years ago

We regen amis every day, so and it's created anew.

Morgan Phillips [:mrrrgn]

Assignee

Comment 15

•

10 years ago

^^ I can't English lately. :o s/and//

Morgan Phillips [:mrrrgn]

Assignee

Comment 16

•

10 years ago

So, technically we have stopped rebooting after every job. There is a very long tail for this however, because various jobs interact with one another in nefarious ways. We have a minefield of > <num_job_types>^2 possible job combinations. Right now I'm considering creating another tracking task and adding specific bugs to it as blockers.

Morgan Phillips [:mrrrgn]

Assignee

Comment 17

•

10 years ago

The scope of this ticket has been achieved, as in, we do not reboot after _every_ job now. That said, in lieu of the work that has been done, there is still some optimizing left which should be addressed in another ticket. That ticket is: https://bugzilla.mozilla.org/show_bug.cgi?id=1142088

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Ian Connolly [:ianconnolly]

Comment 18

•

10 years ago

http://rack.2.mshcdn.com/media/ZgkyMDEzLzA4LzA1LzYyL2FuY2hvcm1hbi42NjJkYS5naWYKcAl0aHVtYgk4NTB4ODUwPgplCWpwZw/e36d14bd/1c0/anchorman.jpg \o/

Nobody; OK to take it and work on it

Updated

•

7 years ago

Component: General Automation → General

You need to log in before you can comment on or make changes to this bug.