Bug 1548033 (Open) - Opened 4 months ago, Updated 3 months ago

Only prefetch the parts of XUL.dll that we actually will need

Categories

(Toolkit :: Startup and Profile System, enhancement, P3)


People

(Reporter: dthayer, Unassigned)

Details

(Keywords: main-thread-io, perf, Whiteboard: [fxperf:p2])

Attachments

(1 file, 4 obsolete files)

If we disable our prefetch code entirely, then on Windows only something like 50% of XUL.dll ends up in the system file cache after Firefox has completely loaded and we've browsed through a few tabs. This means that when we do prefetch the whole file, we're unnecessarily reading roughly 50% of XUL.dll that we may never use.

Performance still seems to be better when we do prefetch XUL.dll, since we can issue one large IO request rather than many small IO requests as we page in the missing parts of XUL.dll that we need. However, we should be able to have our cake and eat it too by assembling, in automation, the list of XUL.dll chunks that we are certain we'll need, and only prefetching those. This information could just be stuffed into dependentlibs.list after the corresponding dll entry.

This could save us something like 60MB of startup IO if we apply it across all dependentlibs.list entries.

Alternatively we could try disabling prefetch on Windows entirely - effectively saying "Superfetch take the wheel". We could potentially see wins that way; I haven't tested enough to be sure. But we certainly wouldn't see the wins on the first startup after install, so it's probably worth it to do the prefetch logic ourselves.
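To make the chunk idea concrete, here's a rough sketch of what the prefetch side could look like (the struct, function name, and chunk format are invented for illustration; only PrefetchVirtualMemory and WIN32_MEMORY_RANGE_ENTRY are the real Win32 API, available on Windows 8+):

    #include <windows.h>
    #include <vector>

    // Sketch only: prefetch selected ranges of an already-loaded module,
    // assuming we parsed hypothetical (offset, length) pairs out of
    // dependentlibs.list. Offsets here are in-memory offsets (RVAs), not
    // raw file offsets.
    struct Chunk {
      size_t offset;
      size_t length;
    };

    bool PrefetchModuleChunks(HMODULE module, const std::vector<Chunk>& chunks) {
      std::vector<WIN32_MEMORY_RANGE_ENTRY> ranges;
      ranges.reserve(chunks.size());
      for (const Chunk& c : chunks) {
        WIN32_MEMORY_RANGE_ENTRY entry;
        entry.VirtualAddress = reinterpret_cast<char*>(module) + c.offset;
        entry.NumberOfBytes = c.length;
        ranges.push_back(entry);
      }
      // A single call covers all the ranges, so the kernel can still batch the IO.
      return ::PrefetchVirtualMemory(::GetCurrentProcess(), ranges.size(),
                                     ranges.data(), 0) != FALSE;
    }

A hypothetical dependentlibs.list line could then carry the ranges after the dll name, e.g. "xul.dll 0x1000:0x200000 0x400000:0x80000" (that format is invented here purely for illustration).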

You probably already know this, but throwing it out anyway: We already optimize the omnijar based on data gathered during PGO profiling, so that's probably a good place to gather the XUL data too.

(In reply to Aaron Klotz [:aklotz] from comment #1)

You probably already know this, but throwing it out anyway: We already optimize the omnijar based on data gathered during PGO profiling, so that's probably a good place to gather the XUL data too.

I had heard that - but if you have a link on hand to the source for this that would save some digging!

Attached image xulloads.PNG (obsolete) —

Aaron, in addition to the above question - assuming we do this, it would be nice if we could optimize the parts of xul.dll that we use during startup to be contiguous, so effectively the back half of the dll is just the half that we don't prefetch. Do you know if that kind of thing is feasible to do based on PGO profiling?

I've made a visualization of the parts of xul.dll that we actually load right now and it's not remarkably organized. See attached.

Flags: needinfo?(aklotz)

(This is from a local build, not a PGO build - if we already do this for some reason(?), then I guess ignore me)

Whiteboard: [fxperf] → [fxperf:p2]

(In reply to Doug Thayer [:dthayer] from comment #2)

(In reply to Aaron Klotz [:aklotz] from comment #1)

You probably already know this, but throwing it out anyway: We already optimize the omnijar based on data gathered during PGO profiling, so that's probably a good place to gather the XUL data too.

I had heard that - but if you have a link on hand to the source for this that would save some digging!

I don't have it off hand, sorry.

(In reply to Doug Thayer [:dthayer] from comment #3)

Created attachment 9062367 [details]
Aaron, in addition to the above question - assuming we do this, it would be nice if we could optimize the parts of xul.dll that we use during startup to be contiguous, so effectively the back half of the dll is just the half that we don't prefetch. Do you know if that kind of thing is feasible to do based on PGO profiling?

dmajor is probably the person to ask.

Flags: needinfo?(aklotz)

Created attachment 9062367 [details]
Aaron, in addition to the above question - assuming we do this, it would be nice if we could optimize the parts of xul.dll that we use during startup to be contiguous, so effectively the back half of the dll is just the half that we don't prefetch. Do you know if that kind of thing is feasible to do based on PGO profiling?

dmajor is probably the person to ask.

dmajor, do you have any thoughts on this? Would we be able to control the linker output based on profiling to ensure that regions of xul.dll which are used during startup are laid out contiguously? Or is this a silly idea?

Flags: needinfo?(dmajor)

dmajor, do you have any thoughts on this? Would we be able to control the linker output based on profiling to ensure that regions of xul.dll which are used during startup are laid out contiguously? Or is this a silly idea?

We already do precisely this. :-) Bug 1444171.

Although, there was some talk recently about maybe needing to ditch the order files for the sake of upcoming work on IR-level PGO. I haven't been paying super close attention though; froydnj would know the latest details better.

Flags: needinfo?(dmajor)
Attached image xulreads.PNG (obsolete) —

Nathan, could you clarify what we can expect from the work in bug 1444171? I'm attaching a visualization of the parts of xul.dll that we actually load, measured by procmon on a Windows 2012 x64 shippable opt build. Black represents pages of the file that we read, white represents pages we did not read. What I would hope is that it was all organized such that what we need for a typical startup is all contiguous, so we could just prefetch N chunks (where N is small) from xul.dll with PrefetchVirtualMemory. Is that an unrealistic expectation?

Attachment #9062367 - Attachment is obsolete: true
Flags: needinfo?(nfroyd)
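(For reference, the attached visualization is essentially a page bitmap built from the procmon output; conceptually something like the helper below, with invented names, assuming 4 KiB pages and (offset, length) pairs exported from the ReadFile events.)

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Sketch of how the visualization data is derived: mark every 4 KiB page
    // of the file touched by at least one read. Black pixels correspond to true.
    std::vector<bool> PagesRead(const std::vector<std::pair<uint64_t, uint64_t>>& reads,
                                uint64_t fileSize) {
      const uint64_t kPageSize = 4096;
      std::vector<bool> pages((fileSize + kPageSize - 1) / kPageSize, false);
      for (const auto& read : reads) {
        if (read.second == 0) continue;
        uint64_t first = read.first / kPageSize;
        uint64_t last = (read.first + read.second - 1) / kPageSize;
        for (uint64_t page = first; page <= last && page < pages.size(); ++page) {
          pages[page] = true;
        }
      }
      return pages;
    }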

(In reply to Doug Thayer [:dthayer] from comment #8)

Nathan, could you clarify what we can expect from the work in bug 1444171? I'm attaching a visualization of the parts of xul.dll that we actually load, measured by procmon on a Windows 2012 x64 shippable opt build. Black represents pages of the file that we read, white represents pages we did not read. What I would hope is that it was all organized such that what we need for a typical startup is all contiguous, so we could just prefetch N chunks (where N is small) from xul.dll with PrefetchVirtualMemory. Is that an unrealistic expectation?

That certainly seems like a reasonable expectation to me.

We theoretically record the first 25M (not necessarily unique) calls made, and then uniquify those into a ~25K-line file. But the linker complains about ~1/3 of those--possibly eliminated via ICF or aggressive profile-driven inlining?--so we only wind up with ~17K functions. I think libxul contains ~150K symbols, so I'd expect a much denser line than your visualization shows. But it's possible we're not capturing enough (25M calls hitting only 25K unique symbols sounds pretty bad), or something is going wrong when applying the ordering file.

Flags: needinfo?(nfroyd)
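(For anyone following along, the uniquify step is just first-seen deduplication of the recorded call log; conceptually something like this, with invented names:)

    #include <string>
    #include <unordered_set>
    #include <vector>

    // Conceptual sketch of collapsing the ~25M recorded calls into an order
    // file: keep only the first occurrence of each symbol, so the resulting
    // order reflects first use during startup.
    std::vector<std::string> UniquifyCallLog(const std::vector<std::string>& calls) {
      std::unordered_set<std::string> seen;
      std::vector<std::string> order;
      for (const std::string& symbol : calls) {
        if (seen.insert(symbol).second) {
          order.push_back(symbol);
        }
      }
      return order;
    }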

Another thing to keep in mind is that the order file only helps us with the code section, not data or anything else. It might be interesting to see that picture overlaid with a breakdown of the sections.

Attached image xulreads.PNG (obsolete) —

(In reply to David Major [:dmajor] from comment #10)

Another thing to keep in mind is that the order file only helps us with the code section, not data or anything else. It might be interesting to see that picture overlaid with a breakdown of the sections.

Attaching a visualization of the sections, as reported by dumpbin.

Attachment #9064657 - Attachment is obsolete: true
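(For reference, the section breakdown comes from the section headers as printed by something like "dumpbin /headers xul.dll"; each section's virtual address and virtual size give its page range in the mapped image.)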

Is it possible to also show the pages (maybe as a separate sparkline) occupied by the functions reported in the order file? Ideally, all of those pages should show up at the start of the graph...maybe that would tell us that we need to record more calls or even make call recording use mutexes (which would be not so great for performance...).

Or this graph is just telling us that we call way too much code in the first place.

(In reply to Nathan Froyd [:froydnj] from comment #12)

Is it possible to also show the pages (maybe as a separate sparkline) occupied by the functions reported in the order file? Ideally, all of those pages should show up at the start of the graph...maybe that would tell us that we need to record more calls or even make call recording use mutexes (which would be not so great for performance...).

Or this graph is just telling us that we call way too much code in the first place.

I've been trying to use a breakpad file to do this, but I can't seem to get the symbol names to line up well just using undname. How can I get a pdb for xul.dll for the PGO build, or can I? I believe that's what we use to generate the breakpad file to begin with, correct?

Flags: needinfo?(nfroyd)

(In reply to Doug Thayer [:dthayer] from comment #13)

(In reply to Nathan Froyd [:froydnj] from comment #12)

Is it possible to also show the pages (maybe as a separate sparkline) occupied by the functions reported in the order file? Ideally, all of those pages should show up at the start of the graph...maybe that would tell us that we need to record more calls or even make call recording use mutexes (which would be not so great for performance...).

Or this graph is just telling us that we call way too much code in the first place.

I've been trying to use a breakpad file to do this, but I can't seem to get the symbol names to line up well just using undname. How can I get a pdb for xul.dll for the PGO build, or can I? I believe that's what we use to generate the breakpad file to begin with, correct?

That's correct.

Builds have a "target-crashreporter-symbols.full.zip" file available to download. For Windows, the files are CAB files (just because), but not named as such, so you'll have to rename and extract the files and whatnot.

Flags: needinfo?(nfroyd)

You can send .pd_ files through expand -r without renaming them to .cab first.

Also, if you find two copies of xul.pdb in the archive, always take the smaller one (the larger one belongs to xul-gtest).
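(For example, assuming the extracted file is named xul.pd_, something like "expand -r xul.pd_ ." should drop a xul.pdb into the current directory.)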

Attached image xulreads.PNG (obsolete) —

I couldn't find a way to efficiently get the information I need from the pdb without writing a custom tool that reads the pdb directly, which isn't something I want to spend time on right now. So I settled for massaging the symbols until they lined up well enough with the breakpad format.[1] I was able to find about a third of the symbols in the breakpad file; I manually sampled the symbols I couldn't find to see why, and as far as I could tell they simply weren't in the breakpad file, so I assume they were optimized out?[2]

Anyway, the results are attached. The top bar is filled in for every page that contains a symbol found in the order file (as far as I could determine by using the breakpad file to get the addresses). All[3] of the symbols I could find are indeed at the start of the .text section, so that part seems to be working correctly, but they cover a very, very small part of what we actually use. Also, I think the gap in the second bar, immediately after the symbols we do find, suggests that this is actually the complete list of symbols present in both the breakpad file and the order file.

[1] I'm not sure why I'm seeing these differences. I assume breakpad does some custom transformation of the symbols other than just these undname flags.
[2] As far as I can tell, breakpad doesn't include inlining information. Is this correct? Edit: just saw bug 524410, looks like we do now for Linux, but not Windows.
[3] EDIT: not quite ALL. There are a few stragglers which I don't have a great explanation for.

Attachment #9064908 - Attachment is obsolete: true
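(Roughly, the matching works like the sketch below: read the FUNC records out of the breakpad .sym file, then mark the pages covered by every order-file symbol that has a match. Names are invented, and the real process also needs the undname massaging mentioned in [1].)

    #include <cstdint>
    #include <fstream>
    #include <set>
    #include <sstream>
    #include <string>
    #include <unordered_map>

    // Sketch: map order-file symbols to 4 KiB pages via breakpad FUNC records.
    std::set<uint64_t> PagesCoveredByOrderFile(const std::string& symPath,
                                               const std::set<std::string>& orderedSymbols) {
      // Breakpad FUNC records look like: FUNC <address> <size> <param_size> <name>
      std::unordered_map<std::string, std::pair<uint64_t, uint64_t>> funcs;
      std::ifstream sym(symPath);
      std::string line;
      while (std::getline(sym, line)) {
        if (line.compare(0, 5, "FUNC ") != 0) continue;
        std::istringstream in(line.substr(5));
        uint64_t address = 0, size = 0, paramSize = 0;
        in >> std::hex >> address >> size >> paramSize;
        std::string name;
        std::getline(in >> std::ws, name);
        funcs[name] = {address, size};
      }

      const uint64_t kPageSize = 4096;
      std::set<uint64_t> pages;
      for (const std::string& symbol : orderedSymbols) {
        auto it = funcs.find(symbol);
        if (it == funcs.end()) continue;  // not in the .sym file (optimized out?)
        uint64_t start = it->second.first;
        uint64_t end = start + (it->second.second ? it->second.second - 1 : 0);
        for (uint64_t page = start / kPageSize; page <= end / kPageSize; ++page) {
          pages.insert(page);
        }
      }
      return pages;
    }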
Attached image xulreads.PNG

(Attaching a more correct graph, which is less relaxed about the symbols matching up)

Attachment #9065758 - Attachment is obsolete: true

I think this profile shows well the cost of the current preloading: https://perfht.ml/2Z4myCB

I think this profile shows well the cost of the current preloading: https://perfht.ml/2Z4myCB

It can be misleading to talk about these things in isolation, because somebody might look at that graph and be tempted to assume that we can remove 2.4s from startup by investing in this bug. (We can't.)

There's a certain amount of work that's unavoidable even with a perfect binary layout -- I'm sure that's uncontroversial.
But in reality we're not going to be perfect, which means that even some of the "avoidable" work is not in scope to eliminate here. The really interesting questions are: (1) how close can we get to perfect? and (2) how quickly does performance deteriorate with imperfection? If our predictions are off by one page in a thousand, does the small-block overhead that the OS incurs to fill in the missing pieces outweigh the win from reducing our prefetch? What about one page in a hundred? In ten?

I don't think we'll ever know those answers precisely, but there are some experiments we could do to try to get a rough understanding. For example, two limitations of our order file were the 25M buffer limit and the fact that order data was collected on a pre-PGO build, which lets later optimizations change functions around. We could try a custom build with an absurd buffer limit and a separate cygprofile phase after PGO to collect better order data. If that still doesn't make xulreads.png look any less random, then I don't think it's practical to work on this further.

For question (2) we could try not prefetching the .rodata section. By eyeball we use about 20% of it. It would be interesting to know whether at that point it's worth turning off prefetch.
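For experiment (2), a sketch of skipping one section during prefetch might look like this (function name invented; the PE header structures and IMAGE_FIRST_SECTION are real, and on Windows the read-only data section is typically named ".rdata"):

    #include <windows.h>
    #include <cstring>
    #include <vector>

    // Sketch: build prefetch ranges covering every section of a mapped image
    // except the one whose name matches skipName.
    std::vector<WIN32_MEMORY_RANGE_ENTRY> RangesExcludingSection(HMODULE module,
                                                                 const char* skipName) {
      auto base = reinterpret_cast<BYTE*>(module);
      auto dos = reinterpret_cast<IMAGE_DOS_HEADER*>(base);
      auto nt = reinterpret_cast<IMAGE_NT_HEADERS*>(base + dos->e_lfanew);
      IMAGE_SECTION_HEADER* sections = IMAGE_FIRST_SECTION(nt);

      std::vector<WIN32_MEMORY_RANGE_ENTRY> ranges;
      for (WORD i = 0; i < nt->FileHeader.NumberOfSections; ++i) {
        char name[IMAGE_SIZEOF_SHORT_NAME + 1] = {};
        std::memcpy(name, sections[i].Name, IMAGE_SIZEOF_SHORT_NAME);
        if (std::strcmp(name, skipName) == 0) continue;  // don't prefetch this one
        WIN32_MEMORY_RANGE_ENTRY entry;
        entry.VirtualAddress = base + sections[i].VirtualAddress;
        entry.NumberOfBytes = sections[i].Misc.VirtualSize;
        ranges.push_back(entry);
      }
      return ranges;
    }

The resulting ranges could then be handed to PrefetchVirtualMemory as in the earlier sketch.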

Would it make sense to prefetch only the bare minimum from xul.dll during early startup, and then prefetch a lot more off main thread without blocking the startup critical path?
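A minimal sketch of that two-phase idea, assuming we already have the two sets of ranges (names invented):

    #include <windows.h>
    #include <thread>
    #include <utility>
    #include <vector>

    // Sketch: prefetch a small "critical" set of ranges synchronously on the
    // startup path, then hand the rest to a background thread so it doesn't
    // block first paint.
    void TwoPhasePrefetch(HANDLE process,
                          std::vector<WIN32_MEMORY_RANGE_ENTRY> critical,
                          std::vector<WIN32_MEMORY_RANGE_ENTRY> rest) {
      ::PrefetchVirtualMemory(process, critical.size(), critical.data(), 0);
      std::thread([process, rest = std::move(rest)]() mutable {
        ::PrefetchVirtualMemory(process, rest.size(), rest.data(), 0);
      }).detach();
    }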

How do you define bare minimum?

(In reply to David Major [:dmajor] from comment #21)

How do you define bare minimum?

It could be what we need to reach the point where we are showing the early blank first paint, or it could be what we need to reach the first paint of the browser UI.

Any scheme of "prefetch only these specific parts of xul.dll" needs to overcome the problems that I wrote in comment 19: we need to know which parts those are, and our current instrumentation is not doing a great job of capturing that information accurately.

Priority: -- → P3