Closed Bug 1367416 Opened 7 years ago Closed 7 years ago

WinTaskbar::Initialize() causes a library to be loaded on the main thread during startup

Categories

(Core :: Widget: Win32, enhancement, P1)

x86_64
Windows 10
enhancement

Tracking

()

RESOLVED FIXED
mozilla57
Performance Impact high
Tracking Status
firefox57 --- fixed

People

(Reporter: florian, Assigned: agashlin)

References

Details

(Whiteboard: [tpi:+][fce-active-legacy])

Attachments

(1 file, 3 obsolete files)

See this profile: https://perfht.ml/2rzBbRt where it's blocking the main thread for 88ms.

The relevant code is http://searchfox.org/mozilla-central/rev/2933592c4a01b634ab53315ce2d0e43fccb82181/widget/windows/WinTaskbar.cpp#205

Could we do this off main thread and have an API returning a promise (or taking a callback)?

The calling code would be totally OK with an async API: http://searchfox.org/mozilla-central/rev/2933592c4a01b634ab53315ce2d0e43fccb82181/browser/base/content/browser.js#192
Priority: -- → P3
Whiteboard: [qf] → [qf][tpi:+]
Quantum Flow triage decided 88 ms during startup was qf:p1.
Whiteboard: [qf][tpi:+] → [qf:p1][tpi:+]
Priority: P3 → P1
It looks like this functionality was pref'd off way back in bug 605338, but the pref isn't checked before initializing (which
Assignee: nobody → agashlin
(oops, didn't mean to submit yet)

I'm going to look into adding the check to only load when the pref is on initially or comes up, but I'll also look into async init.
Whiteboard: [qf:p1][tpi:+] → [qf:p1][tpi:+][fce-active]
I've identified two places where WinTaskbar is used early, causing a load of explorerframe.dll: AeroPeek (disabled by default but still loaded) and Jump Lists. This patch delays the initial loading of those with requestIdleCallback.

Does this seem like a reasonable approach?

These were already run subsequent to _delayedStartup and _onFirstWindowLoaded, was that maybe already late enough in the startup process?

I don't know of what I could measure to show whether this is a significant improvement; it still shows up in the profile, as expected.
Flags: needinfo?(florian)
Comment on attachment 8886370 [details] [diff] [review]
Use requestIdleCallback when initializing two modules

Never mind, this is not going to work as is, if nothing else a lot of issues showed up on try https://treeherder.mozilla.org/#/jobs?repo=try&revision=edd60e3a0b77a0d7bdc7dc770bcf508b87e3311a

I'm working on a background load solution: a preload (explicitly LoadLibraryA before hitting CoCreateInstance) seems to work but doesn't move the load off of the main thread yet, that will be the next step.
Attachment #8886370 - Attachment is obsolete: true
Flags: needinfo?(florian)
Attached patch Startup time test (obsolete) — Splinter Review
I've been using this test to measure the startup time, it prints to the browser console the time since process start at of the end of _delayedStartup.

My testing has been run using Sysinternals RamMap to empty the Windows standby list before each run in order to actually load everything from disk, simulating a cold start. On the Acer Aspire E 15 reference hardware I've been getting a mean and median around 6200ms.
Attachment #8888509 - Attachment is patch: true
Attachment #8888509 - Attachment mime type: text/x-patch → text/plain
This is a patch to load explorerframe.dll off of the main thread. As most of this is new to me I need someone to look over it carefully, but my test results so far have not been encouraging, so it may not be worth approaching this way.

This appears to have no significant effect on time to first paint as measured by ts_paint (see my try submission [1] for numbers, I also observe no significant difference on reference hardware, aprox 870ms with or without this patch).

Using the cold startup time test (see comment 6) it has a major negative effect, however. On reference hardware, compared with the average 6200ms startup from unmodified mozilla-central, this takes 7400ms (median and mean are close, data here [2]), 19% slower. It could be that running the load in the background is adversely impacting later loads on the main thread which would otherwise have been run in series.

I'm surprised by how strong this effect seems to be.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=9deec038008b41f996d95ce8267c763b017b7a43
[2] https://docs.google.com/a/mozilla.com/spreadsheets/d/1PvPfGKUL2RE2uyXq7oHoWuwsPkaqgmTf5wOocJjWdpE/edit?usp=sharing
Attachment #8888527 - Flags: feedback?(jmathies)
I didn't expect any dll preloading to be driven by JS code, as I think the hope with off main thread preloading is that the load can happen in the background while the CPU is busy with startup on the main thread; so ideally the preloading would be started before we start running any JS code.

I think Ehsan mentioned DLL preloading in the past so I would ask him if he has ideas here. Also, I think :kmag has successfully done some off main thread preloading of files to speed up startup, so he may have ideas about where to start, and maybe even pointers to reusable code.
Flags: needinfo?(ehsan)
Thanks for the pointers Florian. It makes sense that it would be better to start this somewhere earlier in the load, hopefully someone can point me to where to start. If it is early enough we should be able to avoid even checking whether the preload has finished.
Flags: needinfo?(kmaglione+bmo)
Attachment #8888527 - Flags: feedback?(jmathies)
I'm not sure I understand what question is being asked here, sorry.  Can you please be more specific?  Based on a quick look at the patch I don't understand what is wrong with it, besides it never freeing the loaded explorerframe.dll handle.
Flags: needinfo?(ehsan)
The question is whether it would be better to do this background load earlier (instead of triggered from within _delayedStartup where it is currently). If so, I'd appreciate a pointer to a part of startup where it would make sense to start the load, maybe there is already a thread that it would naturally fit into? I don't understand the startup stages well enough to guess yet.

Thanks for getting back about this so quickly!
Flags: needinfo?(ehsan)
Oh, I see.  Yes, I think it would be nice to kick this off earlier if we can.

Eric, the need for preloading various DLLs on a background thread like this has come up in various startup bugs.  Should we add some XPCOM facility where we spin up on background thread on Windows early on startup to preload a known set of DLLs?
Flags: needinfo?(ehsan) → needinfo?(erahm)
To follow up on my earlier statistics, it looks like the effect is just due to how long it takes to read a given copy of the binaries. I ran the same test with two identical copies and saw a similar 20% difference between them, and a simple tar > /dev/null of those two directories consistently takes 28% longer for the slower of the two. So I think I'll have to trash my cold start testing methodology.
(In reply to Adam Gashlin [:agashlin] from comment #13)
> To follow up on my earlier statistics, it looks like the effect is just due
> to how long it takes to read a given copy of the binaries. I ran the same
> test with two identical copies and saw a similar 20% difference between
> them, and a simple tar > /dev/null of those two directories consistently
> takes 28% longer for the slower of the two. So I think I'll have to trash my
> cold start testing methodology.

Note that LoadLibrary also resolves relocations, runs DllMain(), etc.  So the overhead is more than that.  We had ReadAheadFile() for priming the file cache sequentially before passing control to the system which we do for xul.dll for example, if that is what you mean?

Also, from a private email thread, Jim was also interested in helping coming up with the design.
Good point Ehsan, just reading with ReadAheadFile might be a better idea (I think that's what you're suggesting). We could use LoadLibraryEx with LOAD_LIBRARY_AS_DATAFILE for this, I think.

But what I meant was the test I was running, to measure startup time for a cold boot, was only consistent for a specific installation, e.g. directory A would be slower than directory B even if they both contain identical data. So it isn't useful for comparison. If we want to show that overall startup time isn't impacted by moving the load into the background we'd need another test, or we could use a pref or env var that we flip instead of having two different installs. But startup may be too dependent on random disk activity and filesystem issues for this to ever produce reliable numbers, given that we're talking about tens of ms out of thousands for startup.

I like the idea of a thread responsible for background preloading DLLs we know we'll need, this could be used to resolve bug 1362382 and bug 1360167 as well, I expect. But I don't know where in startup this might belong. I found the dependentlibs.list system doing preload of our own dlls [1], but that seems too early if we want to take advantage of the CPU being busy as Florian suggests. Can someone point me in the right direction?

[1] http://searchfox.org/mozilla-central/rev/3a3af33f513071ea829debdfbc628caebcdf6996/xpcom/glue/standalone/nsXPCOMGlue.cpp#155
Flags: needinfo?(jmathies)
(In reply to Adam Gashlin [:agashlin] from comment #15)
> Good point Ehsan, just reading with ReadAheadFile might be a better idea (I
> think that's what you're suggesting). We could use LoadLibraryEx with
> LOAD_LIBRARY_AS_DATAFILE for this, I think.

Just to clarify, I wasn't necessarily suggesting doing anything fancier than just a plain LoadLibrary call, just on a background thread.  :-)  Anything on top of that is definitely nice improvements to have, but it's also extra work to measure, verify, etc.

> But what I meant was the test I was running, to measure startup time for a
> cold boot, was only consistent for a specific installation, e.g. directory A
> would be slower than directory B even if they both contain identical data.
> So it isn't useful for comparison. If we want to show that overall startup
> time isn't impacted by moving the load into the background we'd need another
> test, or we could use a pref or env var that we flip instead of having two
> different installs. But startup may be too dependent on random disk activity
> and filesystem issues for this to ever produce reliable numbers, given that
> we're talking about tens of ms out of thousands for startup.

Oh I see what you mean now, apologies for my misunderstanding before!

I guess I haven't really thought that much about this aspect before.  Over the past few months I've seen a lot of profiles where we were spending hundreds of milliseconds loading various DLLs, so I guess in a way I'm already quite biased towards the solution of moving the loads off the main thread and am sold on the advantage.  But measuring this is actually not that hard, you can do it more directly, by profiling cold startup before and after the fix, looking in the profile for the name of the DLL in question and comparing the amount of time on the main thread that LoadLibrary took.  If our optimizations works successfully, the tens or hundreds of milliseconds in the former case should drop down to almost nothing.  :-)  And if we want to gather telemetry around it, we can also do that in our patched_ldrLoadDll function <https://searchfox.org/mozilla-central/rev/3a3af33f513071ea829debdfbc628caebcdf6996/mozglue/build/WindowsDllBlocklist.cpp#590>, so there are options there also.

I'll leave the hard parts to Jim (thanks Jim!) :-)
With respect to ReadAheadFile, we also have a ReadAheadLib variant specifically for this use case (though it still runs on the calling thread).

(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #16)
> I guess I haven't really thought that much about this aspect before.  Over
> the past few months I've seen a lot of profiles where we were spending
> hundreds of milliseconds loading various DLLs, so I guess in a way I'm
> already quite biased towards the solution of moving the loads off the main
> thread and am sold on the advantage.

I have discussed with jimm the possibility of writing a generic mechanism to deal with this situation instead of manually writing the code to do this every time we encounter it.
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #12)
> Oh, I see.  Yes, I think it would be nice to kick this off earlier if we can.
> 
> Eric, the need for preloading various DLLs on a background thread like this
> has come up in various startup bugs.  Should we add some XPCOM facility
> where we spin up on background thread on Windows early on startup to preload
> a known set of DLLs?

It's not totally clear to me what the issue is. Are we currently delay loading explorerframe.dll? Have we tested just linking normally and letting the OS do it's thing? Does ReadAhead(Lib) actually have a perf win? TBH I'm surprised ReadAheadLib doesn't actually do a LoadLibrary on Windows.

All that said I'm not against a more generic system for this, I'd just like to see some actual numbers showing improvements with various approaches. Our current approach to this type of thing seems to be to punt runnables at the stream transport service, I'm not sure if we need a more elaborate system.
Flags: needinfo?(erahm)
We don't directly link to anything in explorerframe.dll, it comes up when we do CoCreateInstance which ends up loading the DLL somewhere deep in combase.

Thanks for the pointer to the stream transport service, that might be the easiest way to handle this for now.
Flags: needinfo?(jmathies)
Here's a couple similar bugs:

Bug 1360167 - WindowsUIUtils::GetInTabletMode blocks the UI thread during startup
Bug 1362382 - ::RegisterDragDrop is expensive during startup

Kirk has a patch up in bug 1360167. If we go for a generic solution it'd be nice if whatever we come up with would work in that situation as well.
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #18)
> It's not totally clear to me what the issue is. Are we currently delay
> loading explorerframe.dll?

explorerframe.dll is LoadLibrary()d by a system DLL.

> Have we tested just linking normally and letting the OS do it's thing?

explorerframe.dll doesn't directly expose public APIs and therefore is not guaranteed to be available on all Windows releases. LoadLibrary() is really the only option here.

> Does ReadAhead(Lib) actually have a perf win?

Yes. It's what we use for priming the libraries in dependentlibs.list, including xul.dll itself. We had a significant perf win from that and we have achieved perf wins in other areas with it. Back when I was on the perf team, demonstrating a perf win was required in order to get an r+ for using it.

> I'm surprised ReadAheadLib doesn't actually do a LoadLibrary on Windows.

I rewrote it a few years ago so I am qualified to answer this.

LoadLibrary opens the dll as a memory-mapped file. We wanted ReadAheadLib to use FILE_FLAG_SEQUENTIAL_SCAN to enable sequential prefetch during the I/O of the DLL, so we needed more direct control of the file handle. That's why we use CreateFile instead of LoadLibrary. As our perf tests at the time indicated, it is good enough for the purposes of priming the page cache.

Starting with Windows 8, we can actually do a nicer implementation: We could reimplement ReadAheadLib as a LoadLibrary + PrefetchVirtualMemory. This hasn't been an urgent priority for me since the existing scheme already works.
Attachment #8888509 - Attachment is obsolete: true
Attachment #8888527 - Attachment is obsolete: true
Attachment #8891532 - Flags: review?(jmathies)
I've submitted a new patch in attachment 8891532 [details] that I think solves the problem simply. This version uses a thread started from XRE_mainRun to kick off the load of explorerframe.dll indirectly by creating an ITaskbarList4 which is then immediately released. This is early enough that by the win WinTaskbar::Initialize rolls around the DLL has already been read into the process.

I've confirmed that this essentially eliminates time spent in WinTaskbar::Initialize on cold start, and it doesn't seem to measurably impact startup time as measured ts_paint (see my try submission at https://treeherder.mozilla.org/#/jobs?repo=try&revision=6950b38ca6a9b88a3337211192a211099dda6f67 , I think the build failures are due to some automation misconfiguration).

Sorry if I've submitted this incorrectly, this is my first time trying mozreview.
(In reply to Adam Gashlin [:agashlin] from comment #23)
> I've submitted a new patch in attachment 8891532 [details] that I think
> solves the problem simply. This version uses a thread started from
> XRE_mainRun to kick off the load of explorerframe.dll indirectly by creating
> an ITaskbarList4 which is then immediately released. This is early enough
> that by the win WinTaskbar::Initialize rolls around the DLL has already been
> read into the process.
> 
> I've confirmed that this essentially eliminates time spent in
> WinTaskbar::Initialize on cold start, and it doesn't seem to measurably
> impact startup time as measured ts_paint (see my try submission at
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=6950b38ca6a9b88a3337211192a211099dda6f67 , I think
> the build failures are due to some automation misconfiguration).
> 
> Sorry if I've submitted this incorrectly, this is my first time trying
> mozreview.

Do we need to create the com instance or can we just load LoadLibraryA("explorerframe.dll") to address this?

Also I'm curious, from my understanding the jump list code isn't supposed to execute until long after startup. Are we catching up to that somehow? We still get jank on the main thread when it executes so what we do here helps avoid that. But I'm surprised this code showed up in startup profiles.
(In reply to Jim Mathies [:jimm] from comment #24)
> Do we need to create the com instance or can we just load
> LoadLibraryA("explorerframe.dll") to address this?

In my tests just using LoadLibrary still lets WinTaskbar::Initialize() profile tens of ms on occasion (4 of 10 test runs showed 10ms or more), while if we create a dummy COM instance early it always profiles as 2ms or less. Also with the instance the duration of the first hit of Win7Features [1] drops to 20ms instead of 50ms (from an original 150ms). I think this is because LoadLibrary doesn't necessarily read the whole thing during initial linking, while creating an instance reads the relevant parts of the library.

There is also the advantage that I don't have the decide whether or how to do FreeLibrary after LoadLibrary; I don't know whether letting the handle leak from LoadLibrary is acceptable but it will be tricky to figure out where to free it, this way Windows gets to deal with all that.

Still, for simplicity we might want to just do LoadLibrary (and FreeLibrary?), and it is less likely to adversely impact other threads. I don't think we want to do ReadAheadLib, though, as I doubt we use most of this 4MB DLL, but I haven't tried it yet.

> Also I'm curious, from my understanding the jump list code isn't supposed to
> execute until long after startup. Are we catching up to that somehow? We
> still get jank on the main thread when it executes so what we do here helps
> avoid that. But I'm surprised this code showed up in startup profiles.

There are two places that cause WinTaskbar::Initialize to show up early: AeroPeek setup (mentioned in comment 1, run from _delayedStartup) and Jump List setup (in _onFirstWindowLoaded [2], which happens immediately after _delayedStartup finishes). In both cases this is caused by a check for WinTaskbar availability, which can only be done the first time by creating the COM object.

I'm not sure what you mean specifically by "long after startup", though, I'm also confused by what is "catching up" with what.

[1] http://searchfox.org/mozilla-central/rev/09c065976fd4f18d4ad764d7cb4bbc684bf56714/browser/base/content/browser.js#1671
[2] http://searchfox.org/mozilla-central/rev/09c065976fd4f18d4ad764d7cb4bbc684bf56714/browser/components/nsBrowserGlue.js#899
Attachment #8891532 - Flags: review?(jmathies)
Comment on attachment 8891532 [details]
Bug 1367416 - Read ahead DLLs on a background thread

On further thought an experimentation I'm going to replace the COM load with ReadAheadLib, and I will probably also add the read ahead for bug 1362382's dataexchange.dll at the same time as the two can use the same mechanism.
Attachment #8891532 - Attachment is obsolete: true
Attachment #8891532 - Flags: review?(jmathies)
My tests [1] show that this new patch in attachment 8891532 [details] largely eliminates the profile for WinTaskbar::Initialize (current bug) and UpdateTabletModeState (bug 1360167), and there is a smaller effect for EnableDragDrop (bug 1362382). I haven't been able to measure any impact on overall startup time with local Talos runs for ts_paint, a try run is in progress [2].

[1] https://docs.google.com/a/mozilla.com/spreadsheets/d/1Lmq0i7NsGcKXp_dPjz1FKO0SNNtiAj5AiTpvuo7NLMM/edit?usp=sharing
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=30a9971258fe611745066ba1e8ab2cfccca70d88
Comment on attachment 8891532 [details]
Bug 1367416 - Read ahead DLLs on a background thread

https://reviewboard.mozilla.org/r/162648/#review169364

I wonder if there's some way to craft a test that checks to make sure this is working? Easy to have this break and we'd never know it since we don't see these perf hits in talos (is this correct?).
Attachment #8891532 - Flags: review?(jmathies) → review+
Thanks Jim!

As far as I can tell there's nothing in talos that is measurably affected by these, and even on cold start it doesn't show up over noise in overall startup time.

The tests I've run involve manually using RAM Map's "Empty standby list" and then looking in the gecko profiler. I don't know if we have any infrastructure in place for profiler-based testing.

Alternately we could zero in on the known first uses of these libs and explicitly time them, but this still relies on simulating a cold start with RAM Map, and it seems particularly fragile in case order changes.

Including RAM Map in tests, or a utility with the same function, or writing our own, is probably going to be an ordeal, and would likely throw off the timing of other tests running later on the same machine.

So I don't see a good route to automated testing.
Flags: needinfo?(kmaglione+bmo)
Here is a startup profile giving more ideas of dll files that could do with preloading off main thread: https://perfht.ml/2uCKuNT Probably for follow-ups though.
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/cb33a188655b
Read ahead DLLs on a background thread r=jimm
Keywords: checkin-needed
Do we know for sure that setting that background thread to low priority is still effective?

I have seen cases where the background thread is starved to the point that the main thread effectively has to load the DLLs anyway because it won the race.
It has been effective in my cold start tests on the reference hardware, at least. It is possible that it will lose the race, I don't have any interlocking in this patch to prevent that. Are you suggesting PR_PRIORITY_NORMAL will make this much less likely?

I'm curious about the starvation cases that you've seen, are there other instances of this pattern (I tried to locate one rather than create a new thread), or are you profiling the current patch?
Depends on: 1388645
https://hg.mozilla.org/mozilla-central/rev/cb33a188655b
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
I share Aaron's concern about the thread priority here, I have seen in many profiles that threads can easily get starved, especially on lower spec'ed hardware when the system is under load (including the Acer reference hardware machine.)

Given that the background thread created here doesn't do a ton of work on the CPU, I think we should probably create it with PR_PRIORITY_NORMAL, unless if we have some evidence that doing so has some harmful side effect I'm not thinking about right now...
Ok, I agree. Given that this has already landed I opened new bug 1389171 for what should be a minimal change.
Depends on: 1389171
(In reply to Adam Gashlin [:agashlin] from comment #37)
> Ok, I agree. Given that this has already landed I opened new bug 1389171 for
> what should be a minimal change.

Thanks a lot.  :-)
Blocks: 1397706
Whiteboard: [qf:p1][tpi:+][fce-active] → [qf:p1][tpi:+][fce-active-legacy]
Performance Impact: --- → P1
Whiteboard: [qf:p1][tpi:+][fce-active-legacy] → [tpi:+][fce-active-legacy]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: