Bug 970738 (Closed): Opened 10 years ago, Closed 10 years ago

Build some jacuzzis

Categories

(Release Engineering :: General, defect)

Platform: x86_64 Linux
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Unassigned)

References

Details

Attachments

(2 files, 2 obsolete files)

We can speed up build times and reduce disk space usage by restricting certain build types to a small set of machines.

(small pools of machines == jacuzzis)

Catlee, Rail and I talked about this this morning. We want to start trying different allocation methods as soon as possible. Because of this, we're going to use static files on a server immediately. These will be replaced by a proper service later.

The server will have a few different endpoints:
* /$builder. A GET to this returns the list of slaves that should be used for this builder.
* /fallback/$pool. A GET to this will return a list of slaves that can be used for any builder. It is intended to be used as a fallback mechanism.
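For concreteness, a client round-trip against these two endpoints might look like the sketch below. The host name and the response format (a JSON list of slave names) are assumptions for illustration, not the final service contract:

import json
import urllib2

def get_slaves_for(builder, pool):
    try:
        # 2xx: use exactly the slaves the service returns for this builder
        return json.load(urllib2.urlopen(
            "http://jacuzzi-allocator/%s" % builder))
    except urllib2.HTTPError:
        # anything else: fall back to the shared pool for this slave class
        return json.load(urllib2.urlopen(
            "http://jacuzzi-allocator/fallback/%s" % pool))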

I will work on generating static content for these endpoints for some builders. We will adjust allocations throughout this week to try out different builder/slave combinations. To start with, we'll try 1 on-demand and 2 spot instances for some builders on Cedar.

Each Buildbot master will talk to the server as part of nextSlave. It will make a GET request to /$builder. If the response is 2xx, it will only consider the slaves returned in the body for the build. If the response is anything else, it will make a GET request to /fallback/$pool to obtain the list of fallback slaves. Responses from all unique endpoints will be cached for some period of time (5-10 min?). If neither /$builder nor /fallback/$pool returns a useful response, the master will consider all slaves associated with that builder. Catlee is going to work on this part.
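A minimal sketch of that flow, reusing the hypothetical host name above and assuming a 5-minute cache TTL (the real implementation is the nextSlave patch attached below):

import json
import time
import urllib2

CACHE_MAXAGE = 5 * 60   # within the 5-10 min suggested above
_cache = {}             # url -> (expiry time, slave list)

def _get_cached(url):
    entry = _cache.get(url)
    if entry and entry[0] > time.time():
        return entry[1]
    slaves = json.load(urllib2.urlopen(url))
    _cache[url] = (time.time() + CACHE_MAXAGE, slaves)
    return slaves

def filter_slaves(buildername, pool, available_slaves):
    for url in ("http://jacuzzi-allocator/%s" % buildername,
                "http://jacuzzi-allocator/fallback/%s" % pool):
        try:
            allocated = _get_cached(url)
            return [s for s in available_slaves if s in allocated]
        except Exception:
            continue  # non-2xx or unreachable: try the next endpoint
    # neither endpoint returned a useful response: consider all slaves
    return available_slaves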

watch_pending.py will also be updated to talk to this service to figure out which slaves should be started for a pending job. It will use the same logic as the Buildbot master for fallback. Rail will be working on this part.
Depends on: 971463
So the jacuzzi client-side code is going to interact in fun ways with this:
http://hg.mozilla.org/build/buildbot-configs/file/default/mozilla/master_common.py#l188

I was seeing this situation in staging:
- builder A had a relatively new request, compared to builder B.
- builder A had slave S1 allocated to it in the jacuzzi service
- builders A,B have same sets of slaves configured
- prioritizeBuilders therefore dropped A from the list of builders. Because builders were dropped, the builder loop is triggered again, which happens after the processing below is done
- builder B tries to find slaves, can't find any since S1 is the only one connected and it's allocated to builder A
- prioritizeBuilders runs again; this repeats ad nauseam, generating lots of log data in the meantime.
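A toy simulation of the loop (not Buildbot code; the priority rule and names are simplified for illustration):

allocations = {"A": ["S1"]}           # jacuzzi service: S1 belongs to A
connected = ["S1"]                    # S1 is the only slave connected
request_times = {"A": 100, "B": 50}   # B's pending request is older

for i in range(3):                    # each pass is one builder-loop run
    chosen = min(request_times, key=request_times.get)   # always "B"
    usable = [s for s in connected
              if chosen == "A" or s not in allocations["A"]]
    if not usable:
        print("pass %d: %s found no slaves; loop triggers again" % (i, chosen))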
Attached patch jacuzzi client for nextSlave (obsolete) — Splinter Review
Attachment #8378637 - Flags: feedback?(nthomas)
Attachment #8378637 - Flags: feedback?(jhopkins)
Attachment #8378637 - Flags: feedback?(bhearsum)
Attachment #8378638 - Flags: feedback?(nthomas)
Attachment #8378638 - Flags: feedback?(jhopkins)
Attachment #8378638 - Flags: feedback?(bhearsum)
Comment on attachment 8378637 [details] [diff] [review]
jacuzzi client for nextSlave

Looks good.  As discussed, a couple of possible improvements:

- fetch a single 'cache control' file to indicate that all caches need to be updated (fetched).
- store all data in a single file (compressed) and poll that for changes

Either of these changes would eliminate the need to poll many files.
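The first option might look something like this (endpoint name and semantics assumed, purely to illustrate the suggestion):

import urllib2

def caches_stale(server, last_seen_stamp):
    # poll one tiny file; only refetch the per-builder files when it changes
    stamp = urllib2.urlopen("%s/cache-control" % server).read().strip()
    return stamp != last_seen_stamp, stamp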
Attachment #8378637 - Flags: feedback?(jhopkins) → feedback+
Comment on attachment 8378638 [details] [diff] [review]
jacuzzi integration for prioritizeBuilders

Looks good.  I might suggest adding a couple of comments indicating the intention of this code.
Attachment #8378638 - Flags: feedback?(jhopkins) → feedback+
(In reply to John Hopkins (:jhopkins) from comment #6)
> - store all data in a single file (compressed) and poll that for changes

This might be something we can add in addition to the existing endpoints, but I think it's important to keep at least the /builders/* and /machines/* endpoints around so that other tools can make use of them. I suspect this is something that we'll tackle when we write the real server.
Comment on attachment 8378637 [details] [diff] [review]
jacuzzi client for nextSlave

Review of attachment 8378637 [details] [diff] [review]:
-----------------------------------------------------------------

::: misc.py
@@ +331,5 @@
> +        if exc_info:
> +            log.err()
> +            log.msg("Jacuzzi %i: %s" % (id(self), msg))
> +        else:
> +            log.msg("Jacuzzi %i: %s" % (id(self), msg))

These messages might be a little confusing as is. If I saw "Jacuzzi 123456: blah blah" in a log I would think that 123456 is the id of a combination of builders/slaves. AFAICT, the same id will be used for a bunch of different jacuzzis. Maybe s/Jacuzzi/JacuzziAllocator/?

@@ +343,5 @@
> +        if self.allocated_cache:
> +            cache_time, slaves = self.allocated_cache
> +            if cache_time > time.time():
> +                # TODO: This could get spammy
> +                self.log("fresh cache: %s" % slaves)

I'm OK with spammy to start with. It's better than being too quiet.

@@ +373,5 @@
> +        # Check the cache for this builder
> +        self.log("checking cache for builder %s" % str(buildername))
> +        c = self.cache.get(buildername)
> +        if c:
> +            self.log("cache hit")

Is it still a cache hit if the cache has expired?

@@ +407,5 @@
> +                self.cache[buildername] = (time.time() + self.CACHE_MAXAGE, slaves)
> +                # Filter the list of available slaves by the set the service
> +                # returned to us
> +                return [s for s in available_slaves if s.slave.slavename in slaves]
> +            except urllib2.HTTPError, e:

You might be able to make this block simpler by passing available_slaves to get_allocated_slaves, and letting it filter. If you do that, I think you can get rid of the fake 404 stuff and just call it in the "if missing_cache" block above, and in here.
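One way that simplification could be shaped (the JSON key and the self.server attribute are assumptions; this is not the actual patch):

import json
import time
import urllib2

def get_allocated_slaves(self, buildername, available_slaves):
    """Return available_slaves filtered by the jacuzzi allocation, or
    None when this builder has no allocation (HTTP 404), so callers no
    longer need to synthesize a fake 404."""
    try:
        url = "%s/builders/%s" % (self.server, buildername)
        slaves = json.load(urllib2.urlopen(url))["machines"]
    except urllib2.HTTPError, e:
        if e.code == 404:
            return None
        raise
    self.cache[buildername] = (time.time() + self.CACHE_MAXAGE, slaves)
    return [s for s in available_slaves if s.slave.slavename in slaves]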

@@ +414,5 @@
> +                # of our list of available slaves
> +                if e.code == 404:
> +                    try:
> +                        if not fake_404:
> +                            self.log("got 404; falling back to looking for allocated slaves")

Unless I'm misreading, this isn't true. The only difference between a fake 404 and a real 404 is whether or not you cache anything in self.missing_cache.

@@ +628,5 @@
>  
> +_nextAWSSlave_wait_sort = safeNextSlave(J(_nextAWSSlave(aws_wait=60, recentSort=True)))
> +_nextAWSSlave_wait_sort_skip_spot = safeNextSlave(J(_nextAWSSlave(aws_wait=60, recentSort=True,
> +                                                    skip_spot=True)))
> +_nextAWSSlave_nowait = safeNextSlave(_nextAWSSlave())

Wow, such wraps.
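(For anyone following along: each of these wrappers takes a nextSlave-style callable and returns a new one, so the stack composes roughly like the hedged sketch below; the real signatures live in misc.py.)

class _FakeAllocator(object):
    # stand-in for the patch's JacuzziAllocator, just for the sketch
    def get_slaves(self, buildername, available_slaves):
        return available_slaves

jacuzzi_allocator = _FakeAllocator()

def J(next_slave_func):
    """Filter the available slaves through the jacuzzi allocation,
    then delegate to the wrapped nextSlave function."""
    def _jacuzzied(builder, available_slaves):
        filtered = jacuzzi_allocator.get_slaves(builder.name, available_slaves)
        return next_slave_func(builder, filtered)
    return _jacuzzied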
Attachment #8378637 - Flags: feedback?(bhearsum) → feedback+
Attachment #8378638 - Flags: feedback?(bhearsum) → feedback+
Comment on attachment 8378637 [details] [diff] [review]
jacuzzi client for nextSlave

Review of attachment 8378637 [details] [diff] [review]:
-----------------------------------------------------------------

+1 to the comments Ben made leading to simplification. Helper functions for common code blocks may help too.

::: misc.py
@@ +342,5 @@
> +        self.log("checking cache allocated slaves")
> +        if self.allocated_cache:
> +            cache_time, slaves = self.allocated_cache
> +            if cache_time > time.time():
> +                # TODO: This could get spammy

cache_time implies creation time to me, so maybe cache_expiry_time.
Attachment #8378637 - Flags: feedback?(nthomas) → feedback+
Comment on attachment 8378638 [details] [diff] [review]
jacuzzi integration for prioritizeBuilders

Review of attachment 8378638 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla/master_common.py
@@ +187,5 @@
> +            try:
> +                # Filter the available slaves through the jacuzzi bubbles..
> +                slaves = J.get_slaves(b[1].name, slaves)
> +            except Exception:
> +                twlog.err("handled exception talking to jacuzzi; trying to carry on")

Deliberately twlog instead of log here?
Attachment #8378638 - Flags: feedback?(nthomas) → feedback+
Comment on attachment 8378638 [details] [diff] [review]
jacuzzi integration for prioritizeBuilders

Review of attachment 8378638 [details] [diff] [review]:
-----------------------------------------------------------------

::: mozilla/master_common.py
@@ +187,5 @@
> +            try:
> +                # Filter the available slaves through the jacuzzi bubbles..
> +                slaves = J.get_slaves(b[1].name, slaves)
> +            except Exception:
> +                twlog.err("handled exception talking to jacuzzi; trying to carry on")

Yes. The log() method here doesn't handle exceptions.
Depends on: 976055
Attachment #8378637 - Attachment is obsolete: true
Attachment #8380643 - Flags: review?(bhearsum)
Attachment #8378638 - Attachment is obsolete: true
Attachment #8380645 - Flags: review?(bhearsum)
Comment on attachment 8380645 [details] [diff] [review]
jacuzzi integration for prioritizeBuilders

Review of attachment 8380645 [details] [diff] [review]:
-----------------------------------------------------------------

Looks fine with your pastebin fix.
Attachment #8380645 - Flags: review?(bhearsum) → review+
Comment on attachment 8380643 [details] [diff] [review]
jacuzzi client for nextSlave

Review of attachment 8380643 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me. Turn on the bubbles!
Attachment #8380643 - Flags: review?(bhearsum) → review+
Comment on attachment 8380645 [details] [diff] [review]
jacuzzi integration for prioritizeBuilders

https://hg.mozilla.org/build/buildbot-configs/rev/743f092b924b
Attachment #8380645 - Flags: checked-in+
In production.
URL-encoding of builder names went into production.
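For reference, the fix quotes the builder name before it goes into the request path, along these lines (host name and builder name are illustrative):

import urllib

buildername = "Linux x86-64 b2g-inbound build"  # example only
url = "http://jacuzzi-allocator/builders/%s" % urllib.quote(buildername, safe="")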
Blocks: 977611
Catlee mentioned yesterday that we should put b2g_b2g-inbound_hamachi_eng_dep in a jacuzzi and remove the non-eng one. I did that this morning.
Depends on: 980470
Depends on: 984870
Depends on: 1013568
I think we're all done here.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: General Automation → General