Open Bug 1627071 Opened 6 months ago Updated 13 days ago

[meta] Startup Task DAG Phase 1: Disk IO Orchestration

Categories

(Firefox :: General, enhancement)

People

(Reporter: dthayer, Unassigned)

References

(Depends on 19 open bugs, Blocks 1 open bug)

Details

(Keywords: meta, Whiteboard: [fxperf])

Attachments

(1 file)

Attached file ref-hw-startup.zip

View the attached file at https://procmon-analyze.github.io/ to see a visualization of startup IO on reference hardware. tl;dr: there is almost no time during startup when we are not doing reads, and for much of it we are requesting multiple simultaneous reads, with windows where seven different reads are in flight at once. This likely causes unnecessary seeking, which hurts IO throughput (see here, though take it with a grain of salt, as it was posted 5 years ago).

Overall, during startup we read about 200MB off disk. Half of that is libxul, and the other half is everything else. Of the "everything else", a good bit is reads that we're actually serving from the cache. Thus, if our startup reads were optimally laid out, we should expect startup to take less than twice the time it takes to read libxul alone, and yet it takes about four times as long (on average, on my 2017 reference hardware).
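A back-of-envelope check of the arithmetic above; the sequential throughput figure is an assumption for illustration, not a measurement from the attached profile:

```python
# Rough arithmetic for the claim above. The 50 MB/s sequential HDD
# throughput is an assumed figure for illustration, not a measurement.
total_mb = 200        # total bytes read off disk during startup
libxul_mb = 100       # roughly half of that is libxul
throughput_mb_s = 50  # assumed sequential read throughput

libxul_s = libxul_mb / throughput_mb_s  # time to read libxul alone
ideal_s = total_mb / throughput_mb_s    # everything laid out sequentially

# With an optimal layout, total startup IO costs at most ~2x the libxul
# read (less, once cache hits are subtracted), yet the observed time is
# ~4x, suggesting a large fraction is lost to seeking.
print(libxul_s, ideal_s)
```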

So what is the plan? Conceptually, we want to extend and strengthen the basic idea behind the existing URLPreloader: read things in advance, in an organized fashion. Specifically, we want to expand upon it in five ways:

  1. We need to expand the coverage. We want to ensure that as many files as possible are loaded sequentially by themselves (not random access), and sequentially with other files (not concurrently). Notable offenders today are DLLs and sqlite databases, though the latter are quite a bit more complicated to fix.

  2. We need to add safeguards to ensure that we don't add reads in the future that go untracked.

  3. We need to identify as many places as possible where files can be merged together into one file which can be read all at once, sequentially off disk.

  4. We need to aggressively compress what we can, so that we read fewer bytes overall. The CPU trade-off should be minimal with a compression algorithm like lz4 or zlib.

  5. While we're here and analyzing our disk IO, we need to simply purge as much of it as possible out of the startup path, regardless of whether it is on the main thread or not.
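As a sketch of item 4 in the list above: text-heavy cache data compresses well, so reading the compressed form roughly halves the bytes pulled off disk for a small CPU cost. zlib is used here as a stdlib stand-in for the lz4-or-zlib choice mentioned above:

```python
import zlib

# Hypothetical sketch: compress cache-like data before writing it, so
# startup reads fewer bytes. zlib stands in for lz4 here; both trade a
# small amount of CPU for a large reduction in disk IO on this kind of data.
def write_compressed(payload: bytes) -> bytes:
    return zlib.compress(payload, level=6)

def read_compressed(blob: bytes) -> bytes:
    return zlib.decompress(blob)

# Script-like text is highly redundant, so it compresses very well.
data = b"function init() { /* startup glue */ }\n" * 1000
blob = write_compressed(data)
print(f"{len(data)} -> {len(blob)} bytes")
```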

I just saw some chatter in the #dom channel about the script preloader, but I'm posting here for posterity. Kris, the idea has been floating in my head for a little bit to unify the storage of the script preloader and the startup cache. I.e., have the script preloader just consume the startup cache for its on-disk storage. This would make it likely to be one contiguous chunk on disk which we could prefetch, and we could get the startup cache's compression for free (the script cache looks fairly compressible), saving a handful of megabytes of startup IO. It wouldn't shake the world, but the script preloader's reads consistently show up taking a visible chunk of time in profiles such as that attached in comment 0. Does this seem reasonable? This would depend on bug 1627142, allowing the startup cache to be used from more threads / processes.

Flags: needinfo?(kmaglione+bmo)
Blocks: 1621535
Summary: [meta] Minimize total IO across all threads (not just main) during startup → [meta] Startup Task DAG Phase 1: Disk IO Orchestration

That won't work, for a lot of reasons. One is that we need separate cache files in different processes. Another is that we intend in the future to use the memory mapped XDR data in the preloader as the actual memory backing for bytecode of decoded scripts. And for security reasons, we'll never be able to allow child processes to store things in the ordinary startup cache.

If anything, I'd like to move things the other way, and move the scripts that are currently stored in the startup cache into the preloader's storage.

Either way, I wouldn't expect unifying the two to improve performance. The IO ordering of the preloader cache is already carefully optimized, and the file is already spread over multiple filesystem blocks. If there's anything that we could do to increase its IO efficiency, it would likely be changing the flags on the pre-loaded region of the mapped file so the OS does more aggressive ordered pre-fetching.

Flags: needinfo?(kmaglione+bmo)

As a side note, it's probably best not to assume that merging data into fewer files (especially when it spans multiple filesystem blocks) and reading it sequentially will have a positive effect. If anything, I've been worried that reading data sequentially does more harm than good, since it means that the OS only knows about one pending read operation at a time, and can't plan its hard drive access to minimize unnecessary seeks.

I think that ideally what we want is to have as much information as possible available about all of the files/parts of files that we need to read at startup, and at what order, and at any given time have some number of read requests pending for the ones we know that we'll need soonest, so that the OS can try to fulfill them in the order that's most efficient, rather than the order we happened to initiate them.
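The "some number of pending read requests" idea above can be sketched as a small read pipeline: submit a window of reads up front so the OS can reorder the physical access, while the consumer still takes results in the order it needs them. The file contents, offsets, and window size here are made up for the demo:

```python
import concurrent.futures
import os
import tempfile

# Sketch of the idea above: keep a window of read requests pending so the
# OS can schedule the underlying disk access, while results are consumed
# in the order we need them. Offsets and sizes are made up for the demo.
def pipelined_reads(path, offsets, length, window=3):
    def read_at(off):
        with open(path, "rb") as f:
            f.seek(off)
            return f.read(length)

    with concurrent.futures.ThreadPoolExecutor(max_workers=window) as pool:
        # Up to `window` reads are outstanding at any time.
        futures = [pool.submit(read_at, off) for off in offsets]
        # Consume in the required order, regardless of completion order.
        return [fut.result() for fut in futures]

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 16)  # 4096-byte scratch file

chunks = pipelined_reads(tmp.name, [0, 1024, 2048], 4)
print(chunks)
os.unlink(tmp.name)
```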

(In reply to Kris Maglione [:kmag] from comment #2)

That won't work, for a lot of reasons. One is that we need separate cache files in different processes.

Could you clarify why? It's not especially critical, since we could split out the startup cache files just as easily as we can split out the script cache files, but the writes are brokered by the parent process anyway, so it should not be possible for a content process to write into a buffer which the parent process will load as a script. Am I misunderstanding something here?

Another is that we intend in the future to use the memory mapped XDR data in the preloader as the actual memory backing for bytecode of decoded scripts.

Is this just to avoid a copy? On a spinny disk on a system with enough memory that it won't have to page out a bunch of things to make room, the time savings from halving the number of bytes read off disk by compressing it should far outweigh the malloc and memcpy, no?

Either way, I wouldn't expect unifying the two to improve performance. The IO ordering of the preloader cache is already carefully optimized, and the file is already spread over multiple filesystem blocks. If there's anything that we could do to increase its IO efficiency, it would likely be changing the flags on the pre-loaded region of the mapped file so the OS does more aggressive ordered pre-fetching.

From a performance perspective, my only real concern is ensuring that we're not fetching from both at the same time on different threads, causing unnecessary seeks. However, from a maintenance perspective it feels like much of the startup cache and script preloader code could be unified, because at a storage level (which is the only level for the startup cache) they're both trying to do the same thing, in very slightly different ways.

(In reply to Kris Maglione [:kmag] from comment #3)

As a side note, it's probably best not to assume that merging data into fewer files (especially when it spans multiple filesystem blocks) and reading it sequentially will have a positive effect. If anything, I've been worried that reading data sequentially does more harm than good, since it means that the OS only knows about one pending read operation at a time, and can't plan its hard drive access to minimize unnecessary seeks.

I think that ideally what we want is to have as much information as possible available about all of the files/parts of files that we need to read at startup, and at what order, and at any given time have some number of read requests pending for the ones we know that we'll need soonest, so that the OS can try to fulfill them in the order that's most efficient, rather than the order we happened to initiate them.

I think that could be true with a better OS IO scheduler, but it really doesn't seem to be the case in practice. More measurement on different hardware would certainly be prudent here, and I shouldn't draw strong conclusions yet, so I will do more and post back with findings. But so far it has very much looked like we get higher throughput on Windows 10 when we read things sequentially.

(Another legitimate reason why this might be the case even with a better scheduler is that the OS really doesn't know whether we care more about the latency or the throughput of the reads, and in our case, during startup on spinny disks, we generally care about throughput.)

(In reply to Doug Thayer [:dthayer] from comment #4)

(In reply to Kris Maglione [:kmag] from comment #2)

That won't work, for a lot of reasons. One is that we need separate cache files in different processes.

Could you clarify why? It's not especially critical, since we could split out the startup cache files just as easily as we can split out the script cache files, but the writes are brokered by the parent process anyway, so it should not be possible for a content process to write into a buffer which the parent process will load as a script. Am I misunderstanding something here?

For a lot of reasons. The scripts that are used by child processes are in a separate mmapped region that's shared across all child processes. They don't see the data that belongs to the parent process, for a number of reasons. And, for efficiency reasons, the data that they actually need to access is all ordered and contiguous.

Having the child send the data to store in the cache to the parent is... complicated. For security reasons, we can only accept data from the child before any untrusted code has run in that process, since data sent from a compromised process and stored in the cache would wind up running in unrelated processes, which it would then compromise. The preloader cache currently handles that. We could in theory make the startup cache handle it too, but either way we'd need data segregation.

Another is that we intend in the future to use the memory mapped XDR data in the preloader as the actual memory backing for bytecode of decoded scripts.

Is this just to avoid a copy? On a spinny disk on a system with enough memory that it won't have to page out a bunch of things to make room, the time savings from halving the number of bytes read off disk by compressing it should far outweigh the malloc and memcpy, no?

It's to avoid duplicating the bytecode of those scripts in every content process. If we're on a system with a spinning disk, then the last thing we want is to start swapping because we're low on memory.

Either way, I wouldn't expect unifying the two to improve performance. The IO ordering of the preloader cache is already carefully optimized, and the file is already spread over multiple filesystem blocks. If there's anything that we could do to increase its IO efficiency, it would likely be changing the flags on the pre-loaded region of the mapped file so the OS does more aggressive ordered pre-fetching.

From a performance perspective, my only real concern is ensuring that we're not fetching from both at the same time on different threads, causing unnecessary seeks. However, from a maintenance perspective it feels like much of the startup cache and script preloader code could be unified, because at a storage level (which is the only level for the startup cache) they're both trying to do the same thing, in very slightly different ways.

From a performance perspective, I'm more interested in making sure that we do fetch it from multiple threads at the same time, because that gives the OS more leeway to optimize the seeks based on the locations of the data for all outstanding reads. The data we're talking about is going to be spread over multiple filesystem blocks, so there's no guarantee that it will be contiguous on disk even if it's in the same file. And it's needed across timespans far longer than a seek time. The OS has a much better ability to optimize that data access than we do, so I'd rather we second-guess it as little as possible.

That said, there are flags that we can set on those mmapped regions to let the OS know that we expect to read it quickly, and in order, so that it will prefetch any unavailable chunks adjacent to the last read as soon as it gets the chance, which it might be worth looking into. We do something like this for omnijar already.
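A minimal sketch of such a prefetch hint, assuming a POSIX system: madvise(MADV_SEQUENTIAL) on the mapped region tells the kernel we will read it front to back, so it can do more aggressive ordered read-ahead. (On Windows the analogous call would be PrefetchVirtualMemory, which Python's stdlib doesn't expose; the file contents here are made up.)

```python
import mmap
import os
import tempfile

# Sketch of the "prefetch hint" idea above. On POSIX systems,
# madvise(MADV_SEQUENTIAL) tells the kernel the mapping will be read in
# order, enabling more aggressive read-ahead. The hint is guarded so the
# sketch still runs where the constant isn't available.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"cache-entry\n" * 512)
path = tmp.name

with open(path, "rb") as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    if hasattr(mmap, "MADV_SEQUENTIAL"):  # POSIX-only hint
        m.madvise(mmap.MADV_SEQUENTIAL)
    first = m[:12]  # subsequent reads benefit from kernel read-ahead
    m.close()
os.unlink(path)
print(first)
```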

(In reply to Doug Thayer [:dthayer] from comment #5)

I think that could be true with a better OS IO scheduler, but it really doesn't seem to be the case in practice. More measurement on different hardware would certainly be prudent here, and I shouldn't draw strong conclusions yet, so I will do more and post back with findings. But so far it has very much looked like we get higher throughput on Windows 10 when we read things sequentially.

The difference is that we currently do a bunch of blocking read operations on multiple threads, and don't prioritize them. But if we're talking about optimizing IO and pre-fetching things, it's a completely different matter. We can absolutely ask the OS for several of the reads we need most at the same time, and the OS and the hard drive firmware will absolutely try to order them to minimize seeks.

Minimizing the number of files we read will definitely help, at least in part because so many of those files are spread across multiple filesystem blocks, whereas when they're merged they may take up only one. And their data is probably somewhat more likely to be contiguous, depending on disk fragmentation. But in the case of large files like the preloader and startup caches, I'm less convinced.

(In reply to Kris Maglione [:kmag] from comment #6)

It's to avoid duplicating the bytecode of those scripts in every content process. If we're on a system with a spinning disk, then the last thing we want is to start swapping because we're low on memory.

Sorry, what I had in mind was a shmem similar to the shared preferences stuff, which we decompress into and pass along to child processes via the same command line mechanism. It seems like that would be having our cake and eating it too?

Depends on: 1628903

(In reply to Doug Thayer [:dthayer] from comment #8)

(In reply to Kris Maglione [:kmag] from comment #6)

It's to avoid duplicating the bytecode of those scripts in every content process. If we're on a system with a spinning disk, then the last thing we want is to start swapping because we're low on memory.

Sorry, what I had in mind was a shmem similar to the shared preferences stuff, which we decompress into and pass along to child processes via the same command line mechanism. It seems like that would be having our cake and eating it too?

Responding to myself: a potential downside here is that if the OS wanted to page out the memory backing this, it would have to actually write it to disk, rather than just discarding it, knowing the original backing file is still there. Personally, that doesn't strike me as much of a problem: the size of this file is largeish relative to the amount we read off disk during startup, but much smaller relative to how much memory we generally use.
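A minimal sketch of the decompress-once-into-shmem idea, using Python's multiprocessing.shared_memory as a stand-in for the actual shmem mechanism; all names and data here are made up:

```python
import zlib
from multiprocessing import shared_memory

# Hypothetical sketch of the idea above: the parent decompresses the
# compressed script cache exactly once into a named shared-memory segment;
# each child process attaches by name (received e.g. over the command line)
# and reads the same physical pages, so nothing is duplicated per child.
scripts = b"// cached bytecode stand-in\n" * 200
blob = zlib.compress(scripts)  # the on-disk, compressed form

plain = zlib.decompress(blob)  # parent decompresses once
shm = shared_memory.SharedMemory(create=True, size=len(plain))
shm.buf[: len(plain)] = plain

# A child would do SharedMemory(name=shm.name) and read without copying.
child = shared_memory.SharedMemory(name=shm.name)
roundtrip = bytes(child.buf[: len(plain)])

child.close()
shm.close()
shm.unlink()
```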

(In reply to Doug Thayer [:dthayer] from comment #8)

(In reply to Kris Maglione [:kmag] from comment #6)

It's to avoid duplicating the bytecode of those scripts in every content process. If we're on a system with a spinning disk, then the last thing we want is to start swapping because we're low on memory.

Sorry, what I had in mind was a shmem similar to the shared preferences stuff, which we decompress into and pass along to child processes via the same command line mechanism. It seems like that would be having our cake and eating it too?

Kris, could we do this? I'm looking at this having had to poke at some other startupcache stuff, and I've noticed:

  1. the scriptcache stops caching very early (browser-delayed-startup-finished) and the startupcache effectively "takes over" for scripts loaded later
  2. we duplicate caching on first run, and then chuck out half the cached scripts from the startupcache on the second run, as they're now fetched from the script cache instead.

I'd like to solve these by leaning more heavily on the script cache (ie keep it caching things until later in startup, and stop caching those scripts in the startupcache) but I'm worried about disk size (and thus IO) impact as the startupcache is compressed and the script cache is not.

Flags: needinfo?(kmaglione+bmo)
Depends on: 1631884
Depends on: 1631949
Depends on: 1631954
Depends on: 1631964
See Also: → 1635575
Depends on: 1635620
Depends on: 1637714
Depends on: 1595994
Depends on: 1364091
Depends on: 1548590
Depends on: 1638421
Depends on: 1640087
Depends on: 1648259

No answer, and there's other work around the startupcache, so clearing needinfo.

Flags: needinfo?(kmaglione+bmo)