Closed Bug 1398355 Opened 4 years ago Closed 3 years ago

Nerf filesystem consistency defaults

Categories

(Taskcluster :: Workers, enhancement, P5)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gps, Unassigned)

References

(Blocks 1 open bug)

Details

eatmydata (https://github.com/stewartsmith/libeatmydata) is an LD_PRELOAD library and helper command that essentially turns expensive I/O primitives like fsync() into no-ops.

My testing shows that it can drastically speed up various operations.

For example, when building the desktop1604-test Docker image, the non-download part of an `apt-get install` which pulls in ~1745 packages goes from ~585s to ~180s (31% of original)! (dpkg will fsync things as part of and in-between package installs as I understand it.) My machine has a reasonably fast SSD and I don't perceive I/O wait to be a problem. So I'm guessing the savings in automation will be even greater.
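For scale, 585s down to 180s is roughly a 3.25x speedup. A minimal sketch of how one might get a feel for per-file fsync() cost on a given machine is below. It is not a model of dpkg: the file count, file size, and function names are assumptions made for illustration, and results vary a lot with hardware and filesystem (on tmpfs, for instance, fsync() is nearly free).

```python
import os
import tempfile
import time


def write_files(directory, count, sync):
    """Write `count` small files into `directory`; fsync each one if `sync` is true.

    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"file-{i}.dat")
        with open(path, "wb") as fh:
            fh.write(b"x" * 4096)
            if sync:
                fh.flush()             # push Python's buffer to the kernel
                os.fsync(fh.fileno())  # ask the kernel to flush to stable storage
    return time.perf_counter() - start


def compare(count=200):
    """Return (seconds_with_fsync, seconds_without_fsync) for `count` files each."""
    with tempfile.TemporaryDirectory() as synced, tempfile.TemporaryDirectory() as unsynced:
        return write_files(synced, count, True), write_files(unsynced, count, False)


if __name__ == "__main__":
    with_sync, without_sync = compare()
    print(f"with fsync:    {with_sync:.3f}s")
    print(f"without fsync: {without_sync:.3f}s")
```

Running the same script under `eatmydata` should make the two timings converge, since the fsync() calls then become no-ops.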

We should consider using eatmydata where I/O correctness isn't required and we don't care about potential for data loss. Candidates for eatmydata include:

* Docker image building (the entire task)
* Toolchain tasks
* VCS checkouts (if repo corruption occurs robustcheckout should be able to recover)
* Task setup and teardown. Tooltool downloads. Archive extraction. etc.
* Parts of the Firefox build (compiling, test archive generation, symbols generation, but not `make check`)

Where we shouldn't use eatmydata:

* In tests (unless we're absolutely sure misbehaving I/O patterns won't interfere with test accuracy - and even then)

I don't think this will matter much for Mercurial clones or checkouts because Mercurial doesn't abuse fsync().

Jonas and I were discussing this IRL. We think adjusting the ext4 mount options in docker-worker to nerf filesystem safety is a better approach because it is global. "nobarrier" might be sufficient. We may also want to use data=writeback to make journaling faster.
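To check whether a worker is actually running with the relaxed options, one could parse `/proc/mounts`. The sketch below is illustrative only: the device names, the `/mnt` mount point, and the helper names are assumptions, and it treats `barrier=0` as equivalent to `nobarrier`.

```python
SAMPLE = """\
/dev/xvda1 / ext4 rw,relatime,data=ordered 0 0
/dev/xvdb /mnt ext4 rw,noatime,nobarrier,data=writeback 0 0
"""


def mount_options(mounts_text, mount_point):
    """Return the set of mount options for `mount_point`, or None if not mounted.

    `mounts_text` uses the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            found = set(fields[3].split(","))  # a later entry shadows an earlier one
    return found


def is_relaxed(options):
    """True if barriers are off and the journal runs in writeback mode."""
    barriers_off = "nobarrier" in options or "barrier=0" in options
    return barriers_off and "data=writeback" in options


if __name__ == "__main__":
    for point in ("/", "/mnt"):
        print(point, is_relaxed(mount_options(SAMPLE, point)))
```

On a real worker, the text would come from reading `/proc/mounts` rather than from the `SAMPLE` string.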

And we may want to tune the page cache settings so Linux doesn't wait on cached data to flush before allowing more I/O.
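The comment does not say which page cache settings are meant. A plausible reading is the `vm.dirty_*` writeback sysctls, which control how much dirty data may accumulate before writers are throttled. Under that assumption, a small sketch for inspecting the current values on a worker could look like this (the function name and the choice of knobs are illustrative):

```python
import os

# Writeback-related sysctls found under /proc/sys/vm on Linux.
DIRTY_KNOBS = (
    "dirty_background_ratio",
    "dirty_ratio",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
)


def read_dirty_settings(base="/proc/sys/vm"):
    """Return {knob: int value} for each knob that is readable under `base`."""
    settings = {}
    for name in DIRTY_KNOBS:
        try:
            with open(os.path.join(base, name)) as fh:
                settings[name] = int(fh.read().strip())
        except (OSError, ValueError):
            continue  # knob missing or unreadable (e.g. not on Linux)
    return settings


if __name__ == "__main__":
    for knob, value in sorted(read_dirty_settings().items()):
        print(f"vm.{knob} = {value}")
```

Changing these values requires root, so any tuning would have to happen in the worker image or worker startup rather than inside a task.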
This now appears to be a docker-worker bug.

I just touched this docker-worker code for bug 1415725. Once those PRs are accepted, I could look at this.

Another wrinkle here is invalidating test results. But unless we're testing performance or filesystem robustness, I'm not sure how tweaking things would invalidate test results. From the perspective of everything in userland, the kernel preserves POSIX semantics around filesystem state regardless of what various buffers and caches are doing under the hood. Worst case, we may need a per-worker or per-task setting to control behavior. And per-task is difficult, since multiple tasks could be running simultaneously. Probably better to make it per-worker.

Component: Task Configuration → Docker-Worker
Summary: Use eatmydata when I/O correctness isn't required → Nerf filesystem consistency defaults

Thinking about this more, we can implement this as a per-task flag in TaskGraph. `run-task` can manage the use of eatmydata by looking at an environment variable, etc.

Agreed, eatmydata (which works via LD_PRELOAD) should probably be done in-tree rather than at the worker level.

But we could still tweak filesystem parameters to disable journaling, etc.
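The per-task, in-tree idea could be sketched roughly as follows. This is not how `run-task` is implemented: the `USE_EATMYDATA` variable name and the `wrap_command` helper are hypothetical, chosen only to show the shape of an environment-variable opt-in that falls back cleanly when the helper is not installed.

```python
import shutil


def wrap_command(argv, env, which=shutil.which):
    """Prefix `argv` with `eatmydata` when the task opts in and the tool is installed.

    `USE_EATMYDATA` is a hypothetical variable name chosen for this sketch.
    `which` is injectable so the lookup can be stubbed out in tests.
    """
    if env.get("USE_EATMYDATA") != "1":
        return list(argv)
    if which("eatmydata") is None:
        return list(argv)  # helper not installed: fall back to normal I/O
    return ["eatmydata"] + list(argv)


if __name__ == "__main__":
    import os

    # Print what would be executed; a real caller would hand this to subprocess.run().
    print(wrap_command(["hg", "update", "default"], os.environ))
```

Keeping the decision in one helper means tests can simply leave the variable unset and run with normal fsync() behavior.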
Priority: -- → P5
Component: Docker-Worker → Workers
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX