Consider flushing I/O as part of running Talos tests

Status: NEW, Unassigned
Type: enhancement
Opened 2 years ago; updated 2 years ago
Reporter: gps
Firefox Tracking Flags: Not tracked

Details

I performed a Try push that effectively disables fsync in SQLite databases used by Firefox. The results were interesting.

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=8577f7af1caf16525ab9c8db608c2da62c491cb9&framework=1&showOnlyImportant=1&selectedTimeRange=172800

We observe a significant (~50 MB / 17%) reduction in I/O during tp5n nonmain_normal_fileio. This tells me that disabling fsync() results in fewer I/O operations reaching the filesystem. That's to be expected.

What's more interesting, I think, is that we see significant responsiveness regressions in, e.g., tp6_facebook pgo 1_thread e10s.

What I think is happening is that when we disable fsync() in SQLite, a bunch of writes buffer in the operating system's filesystem cache. Then, some other process (likely in Firefox) triggers an fsync(). This fsync() forces a flush of pending writes. Since there is more data that must be flushed, this fsync() takes longer than it would if SQLite were issuing its own fsync() calls. Something somewhere is waiting on an I/O operation to complete, and this extra waiting is causing the responsiveness Talos numbers to increase.

Since writes don't reach disk immediately unless an fsync() is in play (and Firefox happens to issue a lot of them during normal operation), what we could be seeing is I/O from previous tests "bleeding over" into subsequent tests. For example, if we have Talos tests A and B running in the same process, A could incur a lot of write I/O. Test B then performs an fsync(), and this flushes data left over from test A. In other words, test B is measuring remnants of test A, and the measurements from B may be "contaminated."
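To illustrate the deferred cost described above, here is a minimal Python sketch (not Talos code; the function name and the 8 MB default are assumptions made for illustration). It times a buffered write and the following fsync() separately. On many systems the write returns quickly because the data only lands in the filesystem cache, and the later fsync() pays for getting it to disk. Actual timings depend heavily on the OS, filesystem, and hardware.

```python
import os
import tempfile
import time


def timed_write_then_fsync(num_bytes=8 * 1024 * 1024):
    """Return (write_seconds, fsync_seconds) for one buffered write plus fsync().

    Hypothetical helper for illustration; not part of Talos or Firefox.
    """
    payload = b"\0" * num_bytes
    fd, path = tempfile.mkstemp()
    try:
        # The write normally completes once the data is in the OS cache.
        start = time.perf_counter()
        view = memoryview(payload)
        while len(view) > 0:
            written = os.write(fd, view)
            view = view[written:]
        write_seconds = time.perf_counter() - start

        # fsync() blocks until the pending data for this file is flushed.
        start = time.perf_counter()
        os.fsync(fd)
        fsync_seconds = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return write_seconds, fsync_seconds


if __name__ == "__main__":
    w, f = timed_write_then_fsync()
    print("write: %.4fs, fsync: %.4fs" % (w, f))
```

If the write phase belonged to test A and the fsync() to test B, the flush time would be charged to B even though A produced the data.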

I think we should look into:

* Forcing an I/O flush between Talos tests/subtests
* Measuring the I/O that occurs while flushing after a measured test, in addition to the I/O during the test itself. This will help us isolate incurred versus deferred I/O and paint a better picture of overall I/O patterns

Forcing an I/O flush between tests/subtests could be difficult. If there is process separation, a system-wide flush from the harness (e.g. sync() on POSIX systems) would work; fflush() alone only flushes the calling process's own stdio buffers. However, if Firefox is still running, doing this right would require some kind of mechanism within Firefox itself to force a flush of any pending writes (which may be queued behind timers, etc.).
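A rough sketch of the harness side of this idea, in Python, might look like the following. All names here (`run_tests_with_flush`, `default_flush`, the result keys) are hypothetical and not existing Talos APIs. It assumes each test is a plain callable and that `os.sync()` is available (POSIX, Python 3.3+); where it is not, the default flush does nothing. It flushes once up front so the first test starts clean, then times each test and the flush that follows it separately, which gives a crude incurred-versus-deferred split.

```python
import os
import time


def default_flush():
    """Ask the OS to write out cached data, where os.sync() exists."""
    sync = getattr(os, "sync", None)
    if sync is not None:
        sync()


def run_tests_with_flush(tests, flush=default_flush):
    """Run (name, callable) pairs, flushing before the first and after each.

    Hypothetical harness loop for illustration only.
    """
    flush()  # avoid charging earlier, unrelated I/O to the first test
    results = []
    for name, test in tests:
        start = time.perf_counter()
        test()
        test_seconds = time.perf_counter() - start

        # Time spent here is I/O the test deferred rather than incurred.
        start = time.perf_counter()
        flush()
        flush_seconds = time.perf_counter() - start

        results.append({
            "name": name,
            "test_seconds": test_seconds,
            "flush_seconds": flush_seconds,
        })
    return results
```

This only covers data that has already reached the operating system. Writes still queued inside a running Firefox would not be flushed by it, which is the harder part noted above.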