Open Bug 1646868 Opened 5 years ago Updated 5 years ago

parallel builds in cc-rs and webrender shader compilation end up starved for jobserver tokens

Categories

(Firefox Build System :: General, defect, P3)

defect

Tracking

(Not tracked)

People

(Reporter: froydnj, Unassigned)

Details

The cc-rs crate supports a parallel feature for executing compilations of multiple files in parallel, using a jobserver. The build-parallel crate, used by our webrender shader compilation, does much the same thing (basically copy-and-pasting cc-rs code), but generalizing it for more-arbitrary computations.

But both of these crates do something locally good and globally bad: they drop the jobserver token they're using before they start to compile:

https://github.com/alexcrichton/cc-rs/blob/77c94412c68bddd29d3ea86aa00c0aaecaaa4904/src/lib.rs#L991-L997
https://github.com/jrmuizel/build-parallel/blob/fc61f40243364689b093e7eea785f3f903e33115/src/lib.rs#L37-L43

This means that other things get the chance to step in and grab the jobserver token before the build.rs scripts using cc-rs and build-parallel get to continue. And for whatever reason, in practice, this means that there's a really long wait before those scripts reacquire the necessary tokens; I can watch the glslopt-rs build script reliably sit around for minutes during a build waiting on a jobserver token. Which means that the long pole of the build process -- i.e. compiling Rust code -- gets even longer. (I think -- obviously the glslopt-rs build script will complete faster once there are jobserver slots available, and it's not like other parts of the build aren't making progress, but I think on the whole the process will take longer we drag out Rust compilation longer than necessary.)

This issue is likely less of a problem for webrender, because by the time webrender gets around to being compiled (in a full build), there's hardly anything competing for jobserver tokens.

I don't know what the right solution is. I guess we could fix the cc-rs and build-parallel crates to not release the first jobserver token, so they could at least make progress when the jobserver is very busy. But it's hard to see how they could be modified to take advantage of many available jobserver slots without blocking unnecessarily in the case that those slots are not easily available.

Bah, Bugzilla reloaded and lost my CC list of people who might be interested or offer commentary on this bug.

What does make do?

Flags: needinfo?(nfroyd)

Note it's also a problem in cargo itself. See https://github.com/rust-lang/cargo/issues/7786

Thanks for the cc! I think this should probably be pretty easy to fix in cc, I was basically just lazy at the time and didn't feel like writing everything to juggle the implicit token.

Otherwise this is one of the major drawbacks of the make jobserver, there's just no way to have any sort of priority.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #2)

What does make do?

I'm not sure what you want to know here. "How does make handle starting N new jobs and obtaining jobserver tokens for all of them"?

Flags: needinfo?(nfroyd)

Does make hold job server token until a compilation is complete? Does it reuse a token that it has or give it up before starting a new job?

Severity: -- → S3
Priority: -- → P3

(In reply to Jeff Muizelaar [:jrmuizel] from comment #6)

Does make hold job server token until a compilation is complete? Does it reuse a token that it has or give it up before starting a new job?

If I understand the code at:

http://git.savannah.gnu.org/cgit/make.git/tree/src/job.c#n251
http://git.savannah.gnu.org/cgit/make.git/tree/src/job.c#n1114
http://git.savannah.gnu.org/cgit/make.git/tree/src/job.c#n1825

(no line ranges on savannah, sadface)

correctly, I believe make will only reuse the implicit token that it starts out with; otherwise, it will relinquish the tokens that it obtains.

I don't know exactly how the kernel operates, but I could see:

http://git.savannah.gnu.org/cgit/make.git/tree/src/posixos.c#n277

constantly bumping make to the front of the queue waiting on the jobserver acquisition fd, as it constantly resubmits a request to know when the fd is available; compare with jobserver-rs, which just waits forever, both on the poll and the subsequent read:

https://github.com/alexcrichton/jobserver-rs/blob/master/src/unix.rs#L115-L169

I don't know if it'd be possible to make jobserver-rs aware of which flavor of make we're dealing with and use a different strategy depending on the blocking-ness of the jobserver fds, or if that would make much of a difference, or if just using ppoll would help any.

I think this is how we fix the problem: jobserver-rs needs to grow another API, Client::acquire_with_timeout(timeout: Duration) -> io::Result<Status> where Status is enum Status { TimedOut, Success(Acquired) }. (I think we can roll being interrupted into TimedOut, because I don't think clients would actually handle those cases any differently.)

Once we have that, the core of cc-rs becomes something like:

N = number of jobs we have to do
stuff jobs into some sort of work queue to share between us and threads
done = false (share between us and threads)
spawn N-1 threads whose work will be described later

# we have an implicit jobserver token here
loop {
  get work
  if there's no work {
    break;
  }

  execute work
}

# signal that threads should stop
done = true;
join all threads

Each thread does:

loop {
  if done {
    break;
  }

  acquire jobserver token with short (1sec) timeout
  if timed out {
    continue;
  }

  # now we have a jobserver token
  get work
  if there's no work {
    break;
  }

  execute work

  # drop jobserver token here
}

The timeout is just so we don't have to handle threads being forever blocked while trying to get a jobserver token. I'm kind of concerned that if cc-rs is handling a lot of files that we have a lot of threads constantly waking up and trying to do this, which steals some CPU time from other things trying to get useful work done. Another way to do it would have the core loop look like:

loop {
  check for work
  if no work {
    break;
  }

  if one task left {
    execute work
    break;
  }

  try to acquire jobserver token (with timeout)
  if timed out {
    execute one task
    continue;
  }

  # do two tasks
  spawn thread and send above token + work to it
  execute one task
}

wait on outstanding threads

and then individual threads would just do the work they were given, drop the jobserver token, and exit.

The worst-case performance here is pretty bad if we wind up timing out for every jobserver token acquisition attempt and the amount of work in tasks is relatively short.

Continuing to think out loud, maybe the best approach is a hybrid model: we have a central workqueue and the "main" script grabs jobs from that. But we also spawn a "jobserver acquisition" thread who tries to acquire a jobserver token continually, tries to grab work from the central workqueue, and then sends the token + any work to a new thread to do the actual work. Then we're (ideally) not tanking performance and we don't have a thundering herd of threads waking up every second while waiting for a jobserver token. I guess the "jobserver acquisition" thread is just the jobserver::HelperThread abstraction, so we wouldn't even need a new API for jobserver-rs! (One wrinkle is that the jobserver-rs docs warn about HelperThread because its cleanup mechanism is pretty gross on Unix, and I don't know that we want to introduce that grossness into arbitrary build scripts via cc-rs. Maybe we could add timeouts under the hood and make it less gross somehow?)

Ideally we can convince the borrow checker that all of this is reasonable. :)

(The alternative to timeouts is to have another "jobserver-like" pipe fd that child threads can wait on and then we write to that to wake everybody up (something similar can be done on Windows, I think; I don't know if it's easily accomplished in wasm). But that would introduce a bunch of goo into the jobserver crate itself and doesn't feel very general.)

Alex, do you have thoughts on the proposed hybrid model?

Flags: needinfo?(acrichton)

Yeah I'm down for whatever works in the cc crate. What's implemented there currently was just the easiest to land and had the smallest impact on code-size and was something I was relatively sure was correct.

I do think that we want to avoid thundering herd issues and spawning threads before a jobserver token is acquired. In cases where the jobserver is used to limit memory use it's important to not take extra memory before the token is acquired.

Otherwise though one wrinkle is that in theory it's not possible to implement a timeout right now. Windows works just fine and on Unix if the jobserver fds are nonblocking it's also fine, but the problem is that if the jobserver fds are blocking then there's no way to time out a read. I'm not sure how often make is used with blocking fds still (IIRC it's only older versions doing that).

FWIW the compiler correctly manages the jobserver token, ensuring that it doesn't spawn work until it acquires an extra token and additionally makes use of the implicit token without trying to release it. It's unfortunately some pretty gnarly logic. I'm fine adding that logic to cc if it seems worth it though, but I unfortunately don't have a ton of time to invest in it right now :(

Flags: needinfo?(acrichton)
You need to log in before you can comment on or make changes to this bug.