Closed Bug 1515415 Opened 6 years ago Closed 5 years ago

Consider a separate pool of macos machines for PGO profile runs

Categories

(Infrastructure & Operations :: RelOps: Hardware, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mshal, Unassigned)

To enable PGO for macOS, we're moving to a 3-task model. This allows us to cross-compile a PGO-instrumented build (1st task), run the build on a macOS machine to generate the profile information (2nd task), and then use that profile data for a final PGO-optimized cross-compiled build (3rd task).

For macOS this presents a problem not found in our Linux or Windows environments, since the only machines that we can run the build on are actual mac hardware. Currently this means test machines, which are level-1. So with the 3-task model, the taskgraph would do:

build (level-3 builder) -> run build (level-1 tester) -> optimized build (level-3 builder) -> talos / tests (level-1 testers)

Moving data from level-1 to level-3 presents potential security concerns, so we are wondering what it would take to have a separate pool of machines that could be level-3 dedicated to generating PGO profiles. That way we'd have this model:

build (level-3 builder) -> run build (level-3 PGO runner) -> optimized build (level-3 builder) -> talos / tests (level-1 testers)

We still need to determine how to size the pool if we go with this route. chmanchester is working on getting the profileserver running on mac so we can have an idea of how long it takes (on our Linux builds the task is roughly 5-10 minutes, but this could be different on mac hardware).
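
To make the concern concrete, here is a minimal sketch that models the two chains described in this comment and flags any step where output moves from a lower level to a higher one. The variable names and the `upward_flows` helper are hypothetical and purely illustrative; this is not taskgraph code.

```python
# Illustrative model of the two task chains described above.
# Each entry is (task name, trust level). All names here are hypothetical.

CURRENT_CHAIN = [
    ("build", 3),            # level-3 builder
    ("run build", 1),        # level-1 tester
    ("optimized build", 3),  # level-3 builder
    ("talos / tests", 1),    # level-1 testers
]

PROPOSED_CHAIN = [
    ("build", 3),            # level-3 builder
    ("run build", 3),        # level-3 PGO runner
    ("optimized build", 3),  # level-3 builder
    ("talos / tests", 1),    # level-1 testers
]


def upward_flows(chain):
    """Return (producer, consumer) pairs where data moves to a higher level."""
    return [
        (src, dst)
        for (src, src_level), (dst, dst_level) in zip(chain, chain[1:])
        if src_level < dst_level
    ]


print(upward_flows(CURRENT_CHAIN))   # the level-1 -> level-3 hand-off
print(upward_flows(PROPOSED_CHAIN))  # no upward hand-offs
```

In this model the only flagged hand-off is the profile data going from the level-1 "run build" task into the level-3 optimized build, which is the step a dedicated level-3 pool would remove.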
I ran a quick ActiveData query the other day to find out how many PGO builds we do on a daily basis and it looks like if we limit it to a single platform (linux64) we usually max out somewhere around 250 PGO builds/day: https://activedata.allizom.org/tools/query.html#query_id=L8PhmK1e
(In reply to Michael Shal [:mshal] from comment #0)
> we are wondering what it would take to have a separate pool of machines that
> could be level-3 dedicated to generating PGO profiles

Shouldn't be too difficult; we'll want to make sure we re-image them when we move the minis to the new pool, JIC. In times past, we would have a separate VLAN for builders vs testers, but we've since implemented default-deny host firewalls to prevent cross talk so I think we should be ok; throwing a NI at ulfr for his opinion.
Flags: needinfo?(jvehent)
I think the default deny host policy is good enough to isolate worker types, as long as there's minimum risk to have them oscillate between types via puppet misconfiguration.
Flags: needinfo?(jvehent)

(In reply to Ted Mielczarek [:ted] [:ted.mielczarek] from comment #1)
> I ran a quick ActiveData query the other day to find out how many PGO builds
> we do on a daily basis and it looks like if we limit it to a single platform
> (linux64) we usually max out somewhere around 250 PGO builds/day:
> https://activedata.allizom.org/tools/query.html#query_id=L8PhmK1e

I was able to run the profile server on try, and the profile-run job takes 25-30 minutes. About 20 minutes of this is checking out m-c, and while we may be able to optimize this, sparse-checkout wasn't as helpful here as I thought it might be (see bug 1517831).

It seems like we should just package up the files necessary for the profile-run and upload them as an artifact of the build job. Then downloading and extracting that from the profile-run job would be fast.

From chmanchester yesterday in #build:

I was able to test a bit more and the repo cache is indeed working on the macOS workers... getting a cache hit was tricky but once I do the checkout is about 3 minutes :)

I'll leave it to him to give more info about the full task runtime, but that should make things pretty tractable.

With an appropriate sparse profile defined and a cache hit, the checkout time is negligible and the task runtime is 5 minutes. Re-running the query from comment 1 shows that the job count has gone up and now routinely hovers around 500 per day.
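
As a rough starting point for sizing, the numbers above can be turned into a machine count. This is a back-of-the-envelope sketch only: the `machines_needed` helper and the `peak_factor` parameter are hypothetical, and the 4x peak used below is an assumption, not a measurement from this bug. It also ignores per-task overhead such as cache misses.

```python
import math

MINUTES_PER_DAY = 24 * 60


def machines_needed(jobs_per_day, minutes_per_job, peak_factor=1.0):
    """Estimate the machines required to keep up with the given load.

    peak_factor scales the daily average up to a busy period; it is an
    assumed knob for illustration, not a value taken from this bug.
    """
    busy_minutes = jobs_per_day * minutes_per_job * peak_factor
    return math.ceil(busy_minutes / MINUTES_PER_DAY)


# ~500 jobs/day at 5 minutes each, spread evenly over the day
print(machines_needed(500, 5))

# Same load, assuming jobs arrive at 4x the average rate during busy periods
print(machines_needed(500, 5, peak_factor=4.0))

# Worst case from the earlier measurement: 30-minute runs on a cache miss
print(machines_needed(500, 30))
```

Because jobs are unlikely to arrive evenly through the day, the evenly spread figure is a lower bound; the real pool size depends on how bursty the load is and how much pending time is acceptable.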

:fubar, is this enough to go on? Can you find someone to implement this?

Flags: needinfo?(klibby)

Jake, can you work with the build team and Joel to get a PGO pool set up? We've had some spikes in OSX pending counts recently, so we want to make sure we don't take too many from the regular pool. And make sure they get re-imaged as part of it!

It's worth noting that these PGO builds should give us a pretty good speed up on OSX.

Flags: needinfo?(klibby) → needinfo?(jwatkins)
Depends on: 1530732
Flags: needinfo?(jwatkins)

This was done in bug 1530732.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED