Open Bug 1548033 Opened 2 years ago Updated 2 months ago

Only prefetch the parts of XUL.dll that we actually will need

Categories

(Toolkit :: Startup and Profile System, enhancement, P3)


People

(Reporter: dthayer, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: main-thread-io, perf, Whiteboard: [fxperf:p2])

Attachments

(5 files, 4 obsolete files)

If we disable our prefetch code entirely, then on Windows we end up with something like 50% of XUL.dll in the system file cache after Firefox has completely loaded and we've browsed through a few tabs. This means that when we do prefetch the whole file, we're unnecessarily reading the other ~50% of XUL.dll, which we may never use.

Performance still seems to be better when we do prefetch XUL.dll, since we can issue one large IO request rather than many small IO requests as we page in the missing parts of XUL.dll that we need. However, we should be able to have our cake and eat it too by assembling the list of XUL.dll chunks in automation that we are certain we'll need, and only prefetching those. This information could just be stuffed into dependentlibs.list after the corresponding dll entry.

This could save us something like 60MB of startup IO if we apply it across all dependentlibs.list entries.

Alternatively we could try disabling prefetch on Windows entirely - effectively saying "Superfetch take the wheel". We could potentially see wins that way; I haven't tested enough to be sure. But we certainly wouldn't see the wins on the first startup after install, so it's probably worth it to do the prefetch logic ourselves.
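To illustrate the chunk-list idea, here's a rough sketch (Python, hypothetical helper, not actual Firefox code) of coalescing the set of pages we know we'll need into a few large prefetch ranges of the kind that could be stored after the corresponding dll entry:

```python
PAGE_SIZE = 4096

def pages_to_chunks(pages, max_gap_pages=4):
    """Coalesce a set of needed page indices into (offset, length) byte
    ranges, bridging gaps of up to max_gap_pages so that a few large IO
    requests replace many small ones."""
    runs = []
    for page in sorted(set(pages)):
        if runs and page - runs[-1][1] <= max_gap_pages:
            runs[-1][1] = page + 1          # extend the current run
        else:
            runs.append([page, page + 1])   # start a new run
    return [(start * PAGE_SIZE, (end - start) * PAGE_SIZE)
            for start, end in runs]

# e.g. pages 0-2 and 40-41 become two chunks rather than five page reads
print(pages_to_chunks([0, 1, 2, 40, 41]))
# → [(0, 12288), (163840, 8192)]
```

The max_gap_pages knob reflects the trade-off discussed later in this bug: bridging small gaps wastes a few reads but avoids extra IO requests.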

You probably already know this, but throwing it out anyway: We already optimize the omnijar based on data gathered during PGO profiling, so that's probably a good place to gather the XUL data too.

(In reply to Aaron Klotz [:aklotz] from comment #1)

You probably already know this, but throwing it out anyway: We already optimize the omnijar based on data gathered during PGO profiling, so that's probably a good place to gather the XUL data too.

I had heard that - but if you have a link on hand to the source for this that would save some digging!

Attached image xulloads.PNG (obsolete) —

Aaron, in addition to the above question - assuming we do this, it would be nice if we could optimize the parts of xul.dll that we use during startup to be contiguous, so effectively the back half of the dll is just the half that we don't prefetch. Do you know if that kind of thing is feasible to do based on PGO profiling?

I've made a visualization of the parts of xul.dll that we actually load right now and it's not remarkably organized. See attached.

Flags: needinfo?(aklotz)

(This is from a local build, not a PGO build - if we already do this for some reason(?), then I guess ignore me)

Whiteboard: [fxperf] → [fxperf:p2]

(In reply to Doug Thayer [:dthayer] from comment #2)

(In reply to Aaron Klotz [:aklotz] from comment #1)

You probably already know this, but throwing it out anyway: We already optimize the omnijar based on data gathered during PGO profiling, so that's probably a good place to gather the XUL data too.

I had heard that - but if you have a link on hand to the source for this that would save some digging!

I don't have it off hand, sorry.

(In reply to Doug Thayer [:dthayer] from comment #3)

Created attachment 9062367 [details]
Aaron, in addition to the above question - assuming we do this, it would be nice if we could optimize the parts of xul.dll that we use during startup to be contiguous, so effectively the back half of the dll is just the half that we don't prefetch. Do you know if that kind of thing is feasible to do based on PGO profiling?

dmajor is probably the person to ask.

Flags: needinfo?(aklotz)


dmajor, do you have any thoughts on this? Would we be able to control the linker output based on profiling to ensure that regions of xul.dll which are used during startup are laid out contiguously? Or is this a silly idea?

Flags: needinfo?(dmajor)

dmajor, do you have any thoughts on this? Would we be able to control the linker output based on profiling to ensure that regions of xul.dll which are used during startup are laid out contiguously? Or is this a silly idea?

We already do precisely this. :-) Bug 1444171.

Although, there was some talk recently about maybe needing to ditch the order files for the sake of upcoming work on IR-level PGO. I haven't been paying super close attention though; froydnj would know the latest details better.

Flags: needinfo?(dmajor)
Attached image xulreads.PNG (obsolete) —

Nathan, could you clarify what we can expect from the work in bug 1444171? I'm attaching a visualization of the parts of xul.dll that we actually load, measured by procmon on a Windows 2012 x64 shippable opt build. Black represents pages of the file that we read, white represents pages we did not read. What I would hope is that it was all organized such that what we need for a typical startup is all contiguous, so we could just prefetch N chunks (where N is small) from xul.dll with PrefetchVirtualMemory. Is that an unrealistic expectation?

Attachment #9062367 - Attachment is obsolete: true
Flags: needinfo?(nfroyd)
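For anyone wanting to reproduce the black/white strip, a minimal sketch (Python, assuming 4 KiB pages; not the actual visualization tool) of turning observed read ranges into a per-page map:

```python
PAGE = 4096

def page_map(read_ranges, file_size):
    """Return one character per 4 KiB page of the file: '#' if any
    observed read touched that page, '.' otherwise -- a text analogue
    of the black/white strips in the attached images."""
    n_pages = (file_size + PAGE - 1) // PAGE
    touched = [False] * n_pages
    for offset, length in read_ranges:
        for p in range(offset // PAGE, (offset + length - 1) // PAGE + 1):
            touched[p] = True
    return "".join("#" if t else "." for t in touched)

# a 32 KiB file with two reads: pages 0-1 and page 5 are touched
print(page_map([(0, 8192), (20480, 100)], 32768))
# → ##...#..
```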

(In reply to Doug Thayer [:dthayer] from comment #8)

Nathan, could you clarify what we can expect from the work in bug 1444171? I'm attaching a visualization of the parts of xul.dll that we actually load, measured by procmon on a Windows 2012 x64 shippable opt build. Black represents pages of the file that we read, white represents pages we did not read. What I would hope is that it was all organized such that what we need for a typical startup is all contiguous, so we could just prefetch N chunks (where N is small) from xul.dll with PrefetchVirtualMemory. Is that an unrealistic expectation?

That certainly seems like a reasonable expectation to me.

We theoretically record the first 25M (not necessarily unique) calls made, and then uniquify those into a ~25K-line file. But the linker complains about ~1/3 of those (possibly eliminated via ICF or aggressive profile-driven inlining?), so we only wind up with ~17K functions. I think libxul contains ~150K symbols, so I'd expect a much denser line than your visualization shows. But it's possible we might not be capturing enough (25M calls to 25K unique symbols sounds pretty bad), or something is going wrong applying the ordering file.

Flags: needinfo?(nfroyd)
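The uniquify step described above amounts to first-occurrence deduplication; a trivial sketch (Python, for illustration only) of collapsing the raw call log into an order file:

```python
def uniquify_calls(call_log):
    """Collapse a (possibly huge) log of function names into an order
    file: each function appears once, in order of first call -- the
    25M-samples-to-~25K-lines step described above."""
    seen = set()
    order = []
    for fn in call_log:
        if fn not in seen:
            seen.add(fn)
            order.append(fn)
    return order

print(uniquify_calls(["main", "init", "main", "paint", "init"]))
# → ['main', 'init', 'paint']
```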

Another thing to keep in mind is that the order file only helps us with the code section, not data or anything else. It might be interesting to see that picture overlaid with a breakdown of the sections.

Attached image xulreads.PNG (obsolete) —

(In reply to David Major [:dmajor] from comment #10)

Another thing to keep in mind is that the order file only helps us with the code section, not data or anything else. It might be interesting to see that picture overlaid with a breakdown of the sections.

Attaching a visualization of the sections, as reported by dumpbin.

Attachment #9064657 - Attachment is obsolete: true

Is it possible to also show the pages (maybe as a separate sparkline) occupied by the functions reported in the order file? Ideally, all of those pages should show up at the start of the graph...maybe that would tell us that we need to record more calls or even make call recording use mutexes (which would be not so great for performance...).

Or this graph is just telling us that we call way too much code in the first place.

(In reply to Nathan Froyd [:froydnj] from comment #12)

Is it possible to also show the pages (maybe as a separate sparkline) occupied by the functions reported in the order file? Ideally, all of those pages should show up at the start of the graph...maybe that would tell us that we need to record more calls or even make call recording use mutexes (which would be not so great for performance...).

Or this graph is just telling us that we call way too much code in the first place.

I've been trying to use a breakpad file to do this, but I can't seem to get the symbol names to line up well just using undname. How can I get a pdb for xul.dll for the PGO build, or can I? I believe that's what we use to generate the breakpad file to begin with, correct?

Flags: needinfo?(nfroyd)

(In reply to Doug Thayer [:dthayer] from comment #13)

(In reply to Nathan Froyd [:froydnj] from comment #12)

Is it possible to also show the pages (maybe as a separate sparkline) occupied by the functions reported in the order file? Ideally, all of those pages should show up at the start of the graph...maybe that would tell us that we need to record more calls or even make call recording use mutexes (which would be not so great for performance...).

Or this graph is just telling us that we call way too much code in the first place.

I've been trying to use a breakpad file to do this, but I can't seem to get the symbol names to line up well just using undname. How can I get a pdb for xul.dll for the PGO build, or can I? I believe that's what we use to generate the breakpad file to begin with, correct?

That's correct.

Builds have a "target-crashreporter-symbols.full.zip" file available to download. For Windows, the files are CAB files (just because), but not named as such, so you'll have to rename and extract the files and whatnot.

Flags: needinfo?(nfroyd)

You can send .pd_ files through expand -r without renaming them to .cab.

Also, if you find two xul.pdb's in the archive, always take the smaller one (the larger is xul-gtest).

Attached image xulreads.PNG (obsolete) —

So, I couldn't sort out a way to efficiently get the information I need from the pdb without writing a custom tool that reads the pdb directly, which is not something I want to spend the time doing right now. So I settled on massaging the symbols until they lined up well enough with the breakpad format.[1] I ended up being able to find about a third of the symbols in the breakpad file, and I manually sampled the symbols that I couldn't find to see why, and as far as I could tell they simply weren't in the breakpad file, so I assume they were optimized out?[2]

Anyway, the results are attached. The top bar is filled in for every page that contains a symbol found in the order file (as far as I could determine by using the breakpad file to get the address). For all[3] of the symbols I could find, they are indeed to be found at the start of the .text section, so that part seems to be working correctly, but it is a very very small part of what we actually use. Also, I think the gap in the second bar, immediately after the symbols we do find, suggests that this is actually the complete list of symbols present in both the breakpad file and the order file.

[1] I'm not sure why I'm seeing these differences. I assume breakpad does some custom transformation of the symbols other than just these undname flags.
[2] As far as I can tell, breakpad doesn't include inlining information. Is this correct? Edit: just saw bug 524410, looks like we do now for Linux, but not Windows.
[3] EDIT: not quite ALL. There are a few stragglers which I don't have a great explanation for.

Attachment #9064908 - Attachment is obsolete: true
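The "top bar" computation described above boils down to mapping order-file symbols to the pages they occupy. A rough sketch (Python, treating breakpad FUNC records as (address, size, name) tuples; toy data, not the real pipeline):

```python
PAGE = 4096

def pages_for_order_symbols(breakpad_funcs, order_names):
    """Given function records as (address, size, name) and the set of
    names found in the order file, return the page indices those
    symbols occupy -- the data behind the filled-in 'top bar'."""
    wanted = set(order_names)
    pages = set()
    for addr, size, name in breakpad_funcs:
        if name in wanted and size > 0:
            pages.update(range(addr // PAGE, (addr + size - 1) // PAGE + 1))
    return pages

# symbol "b" spans a page boundary, so it contributes two pages
print(pages_for_order_symbols([(0, 100, "a"), (8192, 5000, "b")], {"a", "b"}))
```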
Attached image xulreads.PNG

(Attaching a more correct graph, which is less relaxed about the symbols matching up)

Attachment #9065758 - Attachment is obsolete: true

I think this profile shows well the cost of the current preloading: https://perfht.ml/2Z4myCB

I think this profile shows well the cost of the current preloading: https://perfht.ml/2Z4myCB

It can be misleading to talk about these things in isolation, because somebody might look at that graph and be tempted to assume that we can remove 2.4s from startup by investing in this bug. (We can't.)

There's a certain amount of work that's unavoidable even with a perfect binary layout -- I'm sure that's uncontroversial.
But in reality we're not going to be perfect, which means that even some of the "avoidable" work is not in scope to eliminate here. The really interesting questions are: (1) how close can we get to perfect? and (2) how quickly does performance deteriorate with imperfection? If our predictions are off by one page in a thousand, does the small-block overhead that the OS incurs to fill in the missing pieces outweigh the win from reducing our prefetch? What about one page in a hundred? In ten?

I don't think we'll ever know those answers precisely, but there are some experiments we could do to try to get a rough understanding. For example, two limitations of our order file were the 25M buffer limit and the fact that order data was collected on a pre-PGO build, which lets later optimizations change functions around. We could try a custom build with an absurd buffer limit and a separate cygprofile phase after PGO to collect better order data. If that still doesn't make xulreads.png look any less random, then I don't think it's practical to work on this further.

For question (2) we could try not prefetching the .rodata section. By eyeball we use about 20% of it. It would be interesting to know whether at that point it's worth turning off prefetch.

Would it make sense to prefetch only the bare minimum from xul.dll during early startup, and then prefetch a lot more off main thread without blocking the startup critical path?

How do you define bare minimum?

(In reply to David Major [:dmajor] from comment #21)

How do you define bare minimum?

It could be what we need to reach the point where we are showing the early blank first paint, or it could be what we need to reach the first paint of the browser UI.

Any scheme of "prefetch only these specific parts of xul.dll" needs to overcome the problems that I wrote in comment 19: we need to know which parts those are, and our current instrumentation is not doing a great job of capturing that information accurately.

Priority: -- → P3

I had some thoughts on this recently. I'm wondering if, instead of instrumenting xul to record which functions are called, we could simply record all of our reads from xul.dll using procmon (like I did to generate the graphs above), and determine via symbol files which symbols are in those pages. Then we simply make an order file which includes all of those. It's not a proper ordering within that range, but it does get us to the point where everything we do need to load is laid out contiguously, so we can just read that chunk and not have to read (or physically seek over, which also takes time) the remaining ~40MB.
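A sketch of that derivation (Python; hypothetical helper that glosses over the real PE RVA-to-file-offset mapping, which is more involved): take the observed read offsets in the order they occurred, and emit every symbol that lands on a read page, ordered by when its page was first read.

```python
PAGE = 4096

def order_file_from_reads(read_offsets, symbols):
    """read_offsets: byte offsets of observed reads, in order of
    occurrence. symbols: (name, rva, size) tuples. Returns the names of
    symbols falling on any page we actually read, ordered by when their
    page was first touched."""
    first_seen = {}
    for i, off in enumerate(read_offsets):
        first_seen.setdefault(off // PAGE, i)
    result = []
    for name, rva, size in symbols:
        pages = range(rva // PAGE, (rva + size - 1) // PAGE + 1)
        hits = [first_seen[p] for p in pages if p in first_seen]
        if hits:
            result.append((min(hits), name))
    return [name for _, name in sorted(result)]

syms = [("sym_a", 0, 100), ("sym_b", 5000, 100), ("sym_c", 9000, 100)]
# page 2 was read first, then page 0; sym_b's page was never read
print(order_file_from_reads([8192, 0], syms))
# → ['sym_c', 'sym_a']
```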

OK, a try build that adds about 100MB to libxul on my local machine is here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8fb6b41260461f5e43323031972f1e9c02942f9c

The equivalent m-c one is https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=31b56def2a8d&searchStr=windows%2Cship .

Mike suggested he could try profiling this on the reference device to see what difference this makes.

Flags: needinfo?(mconley)

Is comment 25 posted on the right bug? It's not clear to me how that ties into the previous comments (or maybe there's some external context that I am missing).

(In reply to :dmajor from comment #26)

Is comment 25 posted on the right bug? It's not clear to me how that ties into the previous comments (or maybe there's some external context that I am missing).

Yeah - the idea is we want to look into how much impact simply reducing the amount that we read from libxul (without actually changing the work that the loader has to do) will change things. The simplest way we could think of to get some kind of approximation of the impact of removing N megabytes of IO is to see the impact of adding M megabytes.

(Trying to make the code that we use during startup contiguous could have additional benefits beyond simply reducing the amount we need to prefetch from libxul, but we still would like to have a better understanding of how much the sheer quantity of bytes that we read from libxul impacts startup.)

Testing these builds on the 2018 reference device with the frame recording cold startup test* showed that the large build added approximately 3 seconds to both time to first blank paint and time to about:home settled, as compared to the control build.

* The test is typically pretty noisy, but here the signal was quite clear.
Flags: needinfo?(mconley)

Got some base profiles on the 2018 reference device on cold startup comparing the largexul build that Gijs posted in comment 25 to a control build:

Control: https://perfht.ml/39zMhss
Treatment: https://perfht.ml/2IvXAWq

Bah, dthayer pointed out that my sampling interval was different between the base and normal profiler. I'm going to set both to sample every 10ms.

Blocks: 1627071

So, historically the problem with this has been that if we only prefetch the parts of libxul that we need, even though there's around 40MB we no longer read, we still have to physically traverse most of the binary due to how scattered our usage of it is (see comment 17). We could try to better generate an order file based on profiling information, but that only controls the layout of functions anyway.

An alternative would be to observe which pages we actually read off disk from xul.dll during startup during PGO, and use that information to physically rearrange xul.dll on disk via NtFsControlFile with FSCTL_MOVE_FILE such that all needed clusters from xul.dll are physically contiguous. Then in Firefox startup we just prefetch that discontiguous set of clusters from xul.dll, which should result in a contiguous read off disk.

Molly, does this sound insane? Is there somewhere (the updater, or the maintenance service?) which would have the necessary privileges to do this? We would need to open the volume that we installed Firefox on in order to get the bitmap of free clusters via FSCTL_GET_VOLUME_BITMAP, and then use FSCTL_MOVE_FILE to move it to the first open contiguous region available large enough to accommodate it.

I have an application which does something close to this, and it doesn't take an abominably long time on reference hardware, but I figured I should check if it's even in the cards before doing extensive performance testing of the result.

Flags: needinfo?(mhowell)
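The search step in the relocation idea above reduces to scanning the volume allocation bitmap for the first free run of clusters big enough for the file. A sketch of just that logic (Python, with the bitmap as a list of booleans; the actual FSCTL_GET_VOLUME_BITMAP / FSCTL_MOVE_FILE plumbing is omitted):

```python
def first_free_run(bitmap, clusters_needed):
    """Scan a volume allocation bitmap (True = cluster in use, as
    decoded from FSCTL_GET_VOLUME_BITMAP) for the first contiguous run
    of free clusters big enough to hold the file. Returns the starting
    LCN, or None if no such run exists."""
    run_start, run_len = None, 0
    for lcn, in_use in enumerate(bitmap):
        if in_use:
            run_start, run_len = None, 0
        else:
            if run_start is None:
                run_start = lcn
            run_len += 1
            if run_len >= clusters_needed:
                return run_start
    return None

# clusters 1-2 are free but too small; the run starting at LCN 4 fits
print(first_free_run([True, False, False, True, False, False, False], 3))
# → 4
```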

(In reply to Doug Thayer [:dthayer] from comment #32)

We could try to better generate an order file based on profiling information, but that only controls the layout of functions anyway.

Out of curiosity, does bug 1632542 improve the graph at all?

Flags: needinfo?(dothayer)

(In reply to :dmajor from comment #33)

(In reply to Doug Thayer [:dthayer] from comment #32)

We could try to better generate an order file based on profiling information, but that only controls the layout of functions anyway.

Out of curiosity, does bug 1632542 improve the graph at all?

I'll give it a shot! I suspect no, though, as we were pretty quickly hitting kSamplesCapacity in cygprofile.cc when I last tried this, and the samples buffer was mostly full of duplicates. I do have a patch lying around somewhere where I was able to collect many more samples by directly adding them to a concurrent hashset rather than collecting an array and deduplicating that later. So maybe combining that with your changes to the process could get us somewhere? If I gave you a modified cygprofile.cc file would you be able to wire it up with your local workflow and generate an order file from it?

Flags: needinfo?(dothayer) → needinfo?(dmajor)

(I'm not sure why I needed it to be a lockfree hashset. You could probably just wrap a std::unordered_set in a mutex and the expensive part would still be the hashing. In any case I do recall it getting a bit more into the order file... I'm not sure where it fell off my mind after that...)
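The mutex-wrapped-set idea is simple enough to sketch; a Python analogue (illustrative only, the real instrumentation lives in C++ in cygprofile.cpp) of recording each distinct function exactly once at call time instead of deduplicating a giant sample buffer later:

```python
import threading

class CallRecorder:
    """Minimal sketch of 'wrap a set in a mutex': each instrumented
    function entry records its address once; dedup happens at insert
    time rather than in a post-processing pass."""
    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()

    def record(self, fn_addr):
        with self._lock:
            self._seen.add(fn_addr)

    def snapshot(self):
        with self._lock:
            return sorted(self._seen)

rec = CallRecorder()
threads = [threading.Thread(target=lambda: [rec.record(a) for a in (1, 2, 3)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(rec.snapshot())
# → [1, 2, 3]
```

As noted above, the expensive part in practice would be hashing the address, not the lock itself, so the lock-free variant may not have been necessary.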

(In reply to Doug Thayer [:dthayer] from comment #32)

Molly, does this sound insane?

Maybe a bit. ;) It's certainly a little unorthodox, but that's alright. One worry I have is that a disk defragmenter might come along and accidentally undo all this by relocating the file for its own purposes; I don't know how many systems run those things frequently anymore though.

Is there somewhere (the updater, or the maintenance service?) which would have the necessary privileges to do this?

Hmm, that leads to questions about where we put the code that does this work and when and how it gets run, there's gonna be a lot to talk about with that. There's explanations in here of a couple details that I know Doug already knows; just making sure everyone has the background.

For organizational purposes, what I'd like best would be to have the code live in the maintenance service (it runs as SYSTEM) and be invoked by starting it with a specific command. We could call that easily enough from both the installer and during a couple different phases of update (I'll get to that). The only problem there is the maintenance service is optional; you can choose to not install it at all, or to disable it by pref.

If we don't want to live with having to skip all this when the service isn't available, then putting the code in the updater would be an option, but I'd rather avoid adding the complexity there if possible. Other options would include a firefox.exe command-line parameter that only does this, or an entire bespoke binary. That also limits our options for when we run it to things that would already have elevation; I wouldn't want to have a UAC prompt show up just for this.

That does bring us to the question of when to call this code. Now, I'm assuming we can't do that while Firefox is running; I haven't checked, I don't know for sure, but rearranging files that have open handles sounds not good.

We would want to do this during initial installation I'd guess, but that's no problem, the installer can just kick it off as a normal install step.

Updates are an issue though; I'm also assuming we would need to run this on every update, because the parameters that determine what the right ordering actually is (i.e., what comes out of the PGO run) aren't predictable for a given build. If it runs very very fast, then we could always do this during the restart that applies the update, and that would be easy to do, but I'm afraid it won't be fast enough to make that work out.

I think the best option that leaves us is putting this at the end of staging an update; that's when we make a copy of the whole application and apply the new patch to the copy, all in the background while the now-old version is running, before prompting to restart. So we could apply this optimization to that copy right after we're done patching the files and before it gets moved into place to become the updated installation.

Occasionally in the past we have had to disable staging because of some instability around shutdown, but that's uncommon, and in that case we do the entire update process during the restart, all we've done already is download it, so that restart is gonna take a while anyway; maybe that means we could slot this thing in there.

Someday we'll have background updates too, that happen entirely while Firefox isn't running at all, and we can run this during those as well; there's not much to talk about there, I don't think it really adds more complexity.

Also note that not every installation ever gets admin privileges available to it (either because the user doesn't have them, or they just don't allow us to elevate), so this won't always be able to run anyway for that reason regardless of anything else.

Flags: needinfo?(mhowell)

I agree about the when, but regarding the where, I'd rather not put this in the maintenance service. I don't see a clear winner of the other options, but given that the maintsvc's purpose is to be run as System at the command of unprivileged users, every new command it supports is a potential security issue. Mechanically, the elevated updater won't be able to run the service if it is in turn being run from the service (since there is only one instance of a service), so it would have to be kicked off by the unelevated updater after the elevated one has already exited, or I guess the maintenance service could be run without the participation of the service manager if we're already System.

(In reply to Molly Howell (she/her) [:mhowell] from comment #36)

(In reply to Doug Thayer [:dthayer] from comment #32)

Molly, does this sound insane?

Maybe a bit. ;) It's certainly a little unorthodox, but that's alright. One worry I have is that a disk defragmenter might come along and accidentally undo all this by relocating the file for its own purposes; I don't know how many systems run those things frequently anymore though.

If I'm not mistaken, the defragmenter in Windows 10 is enabled and runs periodically by default (while it was manual in older Windows versions).

(In reply to Doug Thayer [:dthayer] from comment #34)

If I gave you a modified cygprofile.cc file would you be able to wire it up with your local workflow and generate an order file from it?

I can try that. I'll also try extracting an order file from the PGO training profile.

Flags: needinfo?(dmajor)
Attached file cygprofile.cpp

Modified cygprofile.cpp which will collect up to 1 << 18 distinct functions.

While I was looking into this it occurred to me that our existing order file machinery doesn't instrument Rust code, so I've fixed that in the try pushes below. Also I noticed that the order file using 25M samples (1419 KB) is really not that much worse than one using 200M samples (1673 KB), which itself is actually slightly larger than the hash table version (1643 KB), although there is some noise between runs.

Baseline commit for all builds below is m-c c9955025d4a5353568a56a1048292c665312fa95.

In theory, that list should be increasingly good quality, though I'm curious to know whether that's actually true.

Attached image XULreads.PNG

Here's a map of XUL reads with dmajor's patches + readahead disabled. Green lines are early reads, blue middle, red late. It's definitely better! This may be actionable - we could break this into a handful of chunks and skip over maybe 10-20 MB worth of mostly contiguous blocks. I'm still wondering if we could get our reads more contiguous, though, as there are still a lot of gaps.

Which of the builds was that for? If you've got time, I'd be curious to see how the different ones stack up against each other.

(In reply to :dmajor from comment #43)

Which of the builds was that for? If you've got time, I'd be curious to see how the different ones stack up against each other.

That was for the bottom one on your list. I'll try to get a picture of the rest of them this afternoon / evening.

Attached image XULreads.PNG

XUL reads, in the order given by dmajor.

dmajor, where is the core PGO run script? What precisely is the window of time that it records? I'm noticing that the bottom section does have the biggest contiguous chunk at the start of the file, but there is still a lot left out, and what is left out generally seems to skew red, meaning it showed up later in startup.

Flags: needinfo?(dbaron)

I think maybe you meant to direct that at me. The PGO training is invoked through mach python build/pgo/profileserver.py; the majority of the work happens in build/pgo/index.html.

There isn't really a notion of window of time, the instrumentation captures essentially any function that runs in the life of the process.

(In reply to :dmajor from comment #47)

I think maybe you meant to direct that at me.

That's embarrassing. Yes.

The PGO training is invoked through mach python build/pgo/profileserver.py, of which the majority of the work is in build/pgo/index.html.

Thank you!

Flags: needinfo?(dbaron)

One thing worth mentioning is that the PGO profile contains the names of functions from the profile-gen phase. They don't always line up to the same functions that we get out at the end of the profile-use phase -- notably, inlining decisions will change based on profile data (and not having all that instrumentation bloating the functions). The try push 38939489b5d5 has a whole lot of warning LNK4037 ("this function in the order file doesn't exist in the binary").

Looking at build/pgo/index.html, maybe it would be worth it to load about:home like a typical startup before this? It's not clear to me whether it does that already or not.

Also, I'm working my way through stacks for the PGO stuff, and I'm seeing a lot of WebRender (both the C++ and the rust parts) and stylo code. Is WebRender enabled for the pgo run? I don't know why stylo wouldn't be making the cut...

(In reply to :dmajor from comment #50)

One thing worth mentioning is that the PGO profile contains the names of functions from the profile-gen phase. They don't always line up to the same functions that we get out at the end of the profile-use phase -- notably, inlining decisions will change based on profile data (and not having all that instrumentation bloating the functions). The try push 38939489b5d5 has a whole lot of warning LNK4037 ("this function in the order file doesn't exist in the binary").

Yeah, I expected that much. Would it be possible to relink a binary just based on a procmon profile like this, instead of instrumentation? Like we have a phase where we just clear the system file cache of xul.dll, then record a startup with procmon, and move all symbols in pages that were read to the front, in the order that we observed their pages being read?

(In reply to Doug Thayer [:dthayer] from comment #51)

Looking at build/pgo/index.html, maybe it would be worth it to load about:home like a typical startup before this? It's not clear to me whether it does that already or not.

I think https://searchfox.org/mozilla-central/rev/dc4560dcaafd79375b9411fdbbaaebb0a59a93ac/build/pgo/profileserver.py#139-145 will load about:home but won't necessarily wait for it to finish loading, and also passes cmdline args which is not typical. We'd probably need some python fu to instead only load about:home like a normal startup, and run without the quitter extension, quitting from python when we detect that about:home has fully finished loading.

(In reply to Doug Thayer [:dthayer] from comment #51)

Looking at build/pgo/index.html, maybe it would be worth it to load about:home like a typical startup before this? It's not clear to me whether it does that already or not.

The easiest bang per buck might be to just add it as an URL that we navigate to in the list. Is that close enough to what happens when you load about:home in the normal way?

Also, I'm working my way through stacks for the PGO stuff, and I'm seeing a lot of WebRender (both the C++ and the rust parts) and stylo code. Is WebRender enabled for the pgo run? I don't know why stylo wouldn't be making the cut...

Hm, not sure I understand. By stacks, do you mean you're looking at functions that aren't in the profile/orderfile? I believe WebRender is disabled in training. One thing that might be affecting stylo is that Rust code generates some mangled symbol names that include a hash at the end, it's possible those aren't matching between profile-gen and profile-use.

Yeah, I expected that much. Would it be possible to relink a binary just based on a procmon profile like this, instead of instrumentation? Like we have a phase where we just clear the system file cache of xul.dll, then record a startup with procmon, and move all symbols in pages that were read to the front, in the order that we observed their pages being read?

Assuming that procmon is automatable, I don't see any technical reason why not, but it would add a lot of additional steps to every shippable build. We might have to do an altogether new build in order to get the linker to give the desired ordering (since we don't have BOLT/Propeller on Windows) and it's not clear that it would be worth the trouble. Might need numbers from a prototype to be convincing.

If procmon isn't sufficiently automatable, xperf definitely is.

(since we don't have BOLT/Propeller on Windows)

Come to think about it, what you're describing is not so far off from "implement BOLT for Windows". AIUI the reason this hasn't been done already is the reliance on Linux's perf, but if you use procmon as a rough approximation, that might be an interesting take.

(In reply to :dmajor from comment #53)

The easiest bang per buck might be to just add it as an URL that we navigate to in the list. Is that close enough to what happens when you load about:home in the normal way?

I would think that would be fine.

Hm, not sure I understand. By stacks, do you mean you're looking at functions that aren't in the profile/orderfile? I believe WebRender is disabled in training. One thing that might be affecting stylo is that Rust code generates some mangled symbol names that include a hash at the end, it's possible those aren't matching between profile-gen and profile-use.

Procmon records stacks of all of the reads that come in. So you can actually see the stack of the code that trips the page fault that causes the read to the dll. We built this tool to better visualize all of these reads and easily inspect the stacks. I'll upload some logs here in case anyone wants to plug them in and play around with it.

Assuming that procmon is automatable, I don't see any technical reason why not, but it would add a lot of additional steps to every shippable build. We might have to do an altogether new build in order to get the linker to give the desired ordering (since we don't have BOLT/Propeller on Windows) and it's not clear that it would be worth the trouble. Might need numbers from a prototype to be convincing.

Well, this is the most robust route I can think of for getting xul contiguous enough to only load the 60% of it that we use. And given that I've seen just prefetching xul.dll take 10 seconds on reference hardware, I'm inclined to believe it will be a big win. I'll try to poke around here at massaging all the symbols together.

To explore these logs, go to https://procmon-analyze.github.io/, and put the json file in the json hole and the xml file in the xml hole. It will think for about 20 seconds, show something to you, and then think for 10s, and then be usable.
