Ensure we're properly prefetching Windows DLLs with ReadAheadLib
Categories
(Toolkit :: Startup and Profile System, enhancement, P2)
Tracking
()
Tracking | Status | |
---|---|---|
firefox68 | --- | fixed |
People
(Reporter: alexical, Assigned: alexical)
References
(Blocks 1 open bug)
Details
(Whiteboard: [fxperf:p1])
Attachments
(6 files)
I did a quick test on the 2017 reference hardware with a build with/without the DLL read-ahead here. See the attached graph for details. The red is with the read-ahead (our current situation), and the blue is without. These are all cold startups on the machine, with a script that sleeps for 30 seconds, starts Firefox, waits for 120 seconds, collects the ping, and restarts the machine. The values are the time until blank window shown (via the blankWindowShown property in the main ping's simpleMeasurements.)
Median with read-ahead:
8729 ms
Median without read-ahead:
6885 ms
This is roughly a 21% improvement.
Other startup indicators like firstPaint are a little more complicated and require more testing, but I don't see any evidence of the situation regressing.
I'm going to continue testing this, but I wanted to get the bug posted first in case anyone has any ideas/suggestions.
Comment 1•6 years ago
|
||
Bug 1376994 comment 4 and bug 1529558 might be interesting to you.
Comment 2•6 years ago
|
||
(In reply to Doug Thayer [:dthayer] from comment #0)
a script that sleeps for 30 seconds, starts Firefox
Waiting for some CPU and disk idleness might give you more stable results, https://bitbucket.org/fqueze/frame-recording/src/3e356441e7d3d3737817e1bb7dec960977cb815e/fakeInput/fakeinput.py?at=default&fileviewer=file-view-default#fakeinput.py-109 is what I was doing for my frame recording measurements last year.
Assignee | ||
Comment 3•6 years ago
|
||
(In reply to Florian Quèze [:florian] from comment #1)
Bug 1376994 comment 4 and bug 1529558 might be interesting to you.
Yeah - my Windows machine is not available this second, but I wanted to take a look at what the file system cache looks like for xul.dll when we preload it vs not. I remember seeing interesting things when looking at it in RAMMap for the warmup service (i.e., significant differences in the standby column for xul.dll when preloading vs just loading and using the dll as normal). I'll toy around with PrefetchVirtualMemory, though I remember not seeing much difference when I was playing with PrefetchVirtualMemory vs explicit reads for the warmup service, but that was kind of a different case.
(In reply to Florian Quèze [:florian] from comment #2)
Waiting for some CPU and disk idleness might give you more stable results, https://bitbucket.org/fqueze/frame-recording/src/3e356441e7d3d3737817e1bb7dec960977cb815e/fakeInput/fakeinput.py?at=default&fileviewer=file-view-default#fakeinput.py-109 is what I was doing for my frame recording measurements last year.
Ah, yeah that's a good idea, thanks!
Assignee | ||
Comment 4•6 years ago
•
|
||
(In reply to Doug Thayer [:dthayer] from comment #3)
Yeah - my Windows machine is not available this second, but I wanted to take a look at what the file system cache looks like for xul.dll when we preload it vs not. I remember seeing interesting things when looking at it in RAMMap for the warmup service (i.e., significant differences in the standby column for xul.dll when preloading vs just loading and using the dll as normal).
Update on this:
After clearing the cache, opening Firefox and opening a few web pages (for the cases where we actually load xul.dll), RAMMap's numbers look like this:
Total Standby Active
ReadAhead + Load .../xul.dll 148300 K 103008 K 45292 K
Just Load .../xul.dll 68424 K 19488 K 48936 K
Just ReadAhead .../xul.dll 96868 K 96868 K 0 K
ReadAhead + Load
is what we currently do. Just Load
is where we load the DLL with no ReadAhead, and Just ReadAhead
is where we just do the ReadAhead on the dll file and then exit the program.
This suggests that we are unnecessarily reading about 80 MB, as that's the excess Standby file cache we're left with.
Assignee | ||
Comment 5•6 years ago
|
||
Also, it should be noted that when I was experimenting with the warmup service, I saw no measurable improvement when simply reading xul.dll on startup, and yet saw significant improvement when actually loading xul.dll on startup, which makes me think dll loading takes a special path through the filesystem cache that effectively bypasses the cache used by simple data reads.
Comment 6•6 years ago
|
||
(In reply to Doug Thayer [:dthayer] from comment #5)
Also, it should be noted that when I was experimenting with the warmup service, I saw no measurable improvement when simply reading xul.dll on startup, and yet saw significant improvement when actually loading xul.dll on startup, which makes me think dll loading takes a special path through the filesystem cache that effectively bypasses the cache used by simple data reads.
We have a mailing list with Microsoft, where you could ask whether this is the case.
Updated•6 years ago
|
Updated•6 years ago
|
Assignee | ||
Comment 7•6 years ago
|
||
I'm attaching a chart with the results of my performance testing of this.
Key results:
With ReadAhead disabled (referred to as "test"), the blankWindowShown metric immediately improves. This makes sense, as we show the blank window very early, so we haven't had to do very much IO on xul.dll to get to that point.
However, immediately after switching, firstPaint and delayedStartupFinished regress a small amount. This I suspect is due to the cost of doing many small reads rather than one large read. (Even though we read far less overall.) However, it's unclear how the one large read actually impacts the many small reads, given that DLL loads and file reads seem to be somewhat disjoint in their caching(?). I have a thread open in the mozilla/microsoft mailing list trying to answer this.
But, after an hour or two, firstPaint and delayedStartupFinished dramatically improve, bringing them well below the control build. I suspect that this coincides with the Windows prefetch file being updated, but this is all a bit labor intensive so I haven't done a test to catch it in the act. However, the prefetch files are substantially different for the test and the control build - xul.dll is not even present in the prefetch file for the control build.
So it seems at present what we're talking about is an unambiguous performance win in the long term, but a performance tradeoff for the first run after an install (trading overall startup time for an earlier blank paint). I'm not so happy about that tradeoff, so I'd like to investigate what it would take to mimic what the prefetch file does using startup data collected during automation for official builds.
Assignee | ||
Updated•6 years ago
|
Comment 8•6 years ago
|
||
(This is partially a reply to your post on the Microsoft list)
You may want to use (Measure-Command { ./load.exe }).TotalMilliseconds
to avoid confusion if the time goes over 1 second.
I wonder if the standby list ends up caching some pre-processed structures related to the library besides the raw data, which actually takes the bulk of the load time, or if the RamMap "Empty Standby List" just happens to clobber these as well?
Are you using SEC_IMAGE
when mapping the file for prefetch? In my tests of PrefetchVirtualMemory
, using PAGE_READONLY | SEC_IMAGE
with CreateFileMapping
causes a later LoadLibraryEx
to only add 16KB more standby, vs. 2MB for PAGE_EXECUTE_READ
or PAGE_READONLY
. I think this lets the mapping work the same way as module loading.
Assignee | ||
Comment 9•6 years ago
|
||
(In reply to Adam Gashlin (he/him) [:agashlin] from comment #8)
(This is partially a reply to your post on the Microsoft list)
You may want to use
(Measure-Command { ./load.exe }).TotalMilliseconds
to avoid confusion if the time goes over 1 second.I wonder if the standby list ends up caching some pre-processed structures related to the library besides the raw data, which actually takes the bulk of the load time, or if the RamMap "Empty Standby List" just happens to clobber these as well?
Are you using
SEC_IMAGE
when mapping the file for prefetch? In my tests ofPrefetchVirtualMemory
, usingPAGE_READONLY | SEC_IMAGE
withCreateFileMapping
causes a laterLoadLibraryEx
to only add 16KB more standby, vs. 2MB forPAGE_EXECUTE_READ
orPAGE_READONLY
. I think this lets the mapping work the same way as module loading.
Ah, yeah that resolved it. I'm going to do a test now on the reference hardware with CreateFileMapping / PrefetchVirtualMemory instead of the existing solution here.
Assignee | ||
Comment 10•6 years ago
|
||
This seems a little too good to be true, so I hope I tested everything properly... The data on the left is the control build, the data on the right is with my new ReadAheadLib implementation applied. There's actually a delay between when I switched to the new implementation and when we see the performance numbers improve. I'm fairly confident that this is, again, just due to the prefetch file updating.
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Comment 11•6 years ago
|
||
This is just to make it simpler to use PrefetchVirtualMemory in subsequent patches.
Assignee | ||
Comment 12•6 years ago
|
||
There shouldn't be any need to do this for content processes as
the DLL should already be in the system file cache.
Depends on D26016
Assignee | ||
Comment 13•6 years ago
|
||
Our old approach was reading this into a region of unexecutable memory,
meaning when we actually loaded the library it needed to be loaded into
executable memory. This fixes that. I suspect that any gains we were
seeing previously were due to hardware effects.
Depends on D26017
Comment 14•6 years ago
|
||
(In reply to Doug Thayer [:dthayer] from comment #13)
I suspect that any gains we were seeing previously were due to hardware effects.
One thing to keep in mind with respect to the ReadAhead*
stuff on Windows is that, the last time that we made significant changes there and actually tested this, our supported Windows versions ranged from Windows XP to Windows 7. A lot has changed under the hood in Windows since then, particularly with respect to image loading.
Since newer Windows versions support kernel-based process mitigations involving DLL search path, digital signature checks, and so forth, it would not surprise me at all to see that what might have worked well on the versions of Windows that were in use during 2012-2013, is no longer as effective.
Assignee | ||
Comment 15•6 years ago
|
||
(In reply to Aaron Klotz [:aklotz] from comment #14)
One thing to keep in mind with respect to the
ReadAhead*
stuff on Windows is that, the last time that we made significant changes there and actually tested this, our supported Windows versions ranged from Windows XP to Windows 7. A lot has changed under the hood in Windows since then, particularly with respect to image loading.Since newer Windows versions support kernel-based process mitigations involving DLL search path, digital signature checks, and so forth, it would not surprise me at all to see that what might have worked well on the versions of Windows that were in use during 2012-2013, is no longer as effective.
Yeah, I suspect you're right about that. Especially given that Chromium seems to have the same problem: https://cs.chromium.org/chromium/src/chrome/app/main_dll_loader_win.cc?type=cs&q=PreReadFile&sq=package:chromium&g=0&l=60 (This function is only used before loading DLLs).
Updated•6 years ago
|
Updated•6 years ago
|
Comment 16•6 years ago
|
||
Pushed by dothayer@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/6f2e7c819c11 Extract nsZipArchive's madvise and similar to a helper r=glandium,froydnj https://hg.mozilla.org/integration/autoland/rev/508ee4cf9ea2 Only readahead DLLs in parent process r=glandium https://hg.mozilla.org/integration/autoland/rev/af07f58d18cc Use CreateFileMapping with SEC_IMAGE to prefetch Windows DLLs r=aklotz
Comment 17•6 years ago
|
||
Failure log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=239843797&repo=autoland&lineNumber=1068
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=239844974&repo=autoland&lineNumber=970
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=239837504&repo=autoland&lineNumber=1095
Backout link: https://hg.mozilla.org/integration/autoland/rev/d1b15e1a37f0f7098c0800ac89239540712210a6
03:38:08 INFO - TEST-SKIP | dom/canvas/test/webgl-mochitest/test_webgl_color_buffer_float.html | took 0ms
03:38:08 INFO - Running manifest: dom\canvas\test\crash\mochitest.ini
03:38:09 INFO - Z:\task_1555032975\build\tests\bin\pk12util.exe: PKCS12 IMPORT SUCCESSFUL
03:38:09 INFO - MochitestServer : launching [u'Z:\task_1555032975\build\tests\bin\xpcshell.exe', '-g', 'Z:\task_1555032975\build\application\firefox', '-f', 'Z:\task_1555032975\build\tests\bin\components\httpd.js', '-e', "const _PROFILE_PATH = 'c:\\users\\task_1555032975\\appdata\\local\\temp\\tmpal5qlf.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', 'Z:\task_1555032975\build\tests\mochitest\server.js']
03:38:09 INFO - runtests.py | Server pid: 5556
03:38:09 INFO - runtests.py | Websocket server pid: 4336
03:38:09 INFO - runtests.py | SSL tunnel pid: 396
03:38:09 INFO - [5556, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, NS_ERROR_UNEXPECTED) failed with result 0x80004005: file z:/build/build/src/extensions/cookie/nsPermissionManager.cpp, line 1026
03:38:09 INFO - Couldn't convert chrome URL: chrome://branding/locale/brand.properties
03:38:09 INFO - [5556, Main Thread] WARNING: Failed to get directory to cache.: file z:/build/build/src/security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp, line 81
03:38:09 INFO - [5556, Main Thread] WARNING: Failed to get directory to cache.: file z:/build/build/src/security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp, line 81
03:38:09 INFO - [5556, Main Thread] WARNING: Failed to get directory to cache.: file z:/build/build/src/security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp, line 81
03:38:09 INFO - [5556, Main Thread] WARNING: Failed to get directory to cache.: file z:/build/build/src/security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp, line 81
03:38:09 INFO - [5556, Main Thread] WARNING: Failed to get directory to cache.: file z:/build/build/src/security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp, line 81
03:38:09 INFO - [5556, Main Thread] WARNING: Could not get the program name for a cubeb stream.: 'NS_SUCCEEDED(rv)', file z:/build/build/src/dom/media/CubebUtils.cpp, line 359
03:38:09 INFO - runtests.py | Running with scheme: http
03:38:09 INFO - runtests.py | Running with e10s: True
03:38:09 INFO - runtests.py | Running with serviceworker_e10s: False
03:38:09 INFO - runtests.py | Running with socketprocess_e10s: False
03:38:09 INFO - runtests.py | Running tests: start.
03:38:09 INFO -
03:38:09 INFO - Application command: Z:\task_1555032975\build\application\firefox\firefox.exe -marionette --wait-for-browser -foreground -profile c:\users\task_1555032975\appdata\local\temp\tmpal5qlf.mozrunner
03:38:09 INFO - runtests.py | Application pid: 1792
03:38:09 INFO - TEST-INFO | started process GECKO(1792)
03:41:09 INFO - runtests.py | Waiting for browser...
03:41:09 INFO - TEST-INFO | Main app process: exit c0000005
03:41:09 INFO - Buffered messages finished
03:41:09 ERROR - TEST-UNEXPECTED-FAIL | automation.py | application terminated with exit code 3221225477
03:41:09 INFO - runtests.py | Application ran for: 0:03:00.033000
03:41:09 INFO - zombiecheck | Reading PID log: c:\users\task_1555032975\appdata\local\temp\tmplxgwwdpidlog
03:41:09 INFO - Traceback (most recent call last):
03:41:09 INFO - File "Z:\task_1555032975\build\tests\mochitest\runtests.py", line 2830, in doTests
03:41:09 INFO - e10s=options.e10s
03:41:09 INFO - File "Z:\task_1555032975\build\tests\mochitest\runtests.py", line 2261, in runApp
03:41:09 INFO - self.marionette.start_session()
03:41:09 INFO - File "Z:\task_1555032975\build\venv\lib\site-packages\marionette_driver\decorators.py", line 36, in _
03:41:09 INFO - m._handle_socket_failure()
03:41:09 INFO - File "Z:\task_1555032975\build\venv\lib\site-packages\marionette_driver\marionette.py", line 650, in _handle_socket_failure
03:41:09 INFO - reraise(exc, val, tb)
03:41:09 INFO - File "Z:\task_1555032975\build\venv\lib\site-packages\marionette_driver\decorators.py", line 26, in _
03:41:09 INFO - return func(*args, **kwargs)
03:41:09 INFO - File "Z:\task_1555032975\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1108, in start_session
03:41:09 INFO - self.raise_for_port(timeout=timeout)
03:41:09 INFO - File "Z:\task_1555032975\build\venv\lib\site-packages\marionette_driver\marionette.py", line 569, in raise_for_port
03:41:09 INFO - self.host, self.port))
03:41:09 INFO - timeout: Timed out waiting for connection on 127.0.0.1:2828!
03:41:09 ERROR - Automation Error: Received unexpected exception while running application
Assignee | ||
Comment 18•6 years ago
|
||
Comment 19•6 years ago
|
||
Pushed by dothayer@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9f73549db05e Extract nsZipArchive's madvise and similar to a helper r=glandium,froydnj https://hg.mozilla.org/integration/autoland/rev/05fe1e422455 Only readahead DLLs in parent process r=glandium https://hg.mozilla.org/integration/autoland/rev/2c3837b46068 Use CreateFileMapping with SEC_IMAGE to prefetch Windows DLLs r=aklotz
Comment 20•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/9f73549db05e
https://hg.mozilla.org/mozilla-central/rev/05fe1e422455
https://hg.mozilla.org/mozilla-central/rev/2c3837b46068
Comment 21•6 years ago
|
||
== Change summary for alert #20459 (as of Sun, 14 Apr 2019 18:02:30 GMT) ==
Improvements:
99% tp5n main_startup_fileio windows10-64-shippable-qr opt e10s stylo 108,569,977.67 -> 830,355.08
99% tp5n main_startup_fileio windows7-32-shippable opt e10s stylo 99,014,800.83 -> 773,859.33
99% tp5n main_startup_fileio windows10-64-shippable opt e10s stylo 99,513,661.75 -> 912,079.00
87% tp5n nonmain_startup_fileio windows10-64-shippable-qr opt e10s stylo 8,873,559.17 -> 1,160,403.08
85% tp5n nonmain_startup_fileio windows10-64-shippable opt e10s stylo 8,132,712.08 -> 1,241,729.58
56% tp5n nonmain_startup_fileio windows7-32-shippable opt e10s stylo 2,698,315.92 -> 1,186,592.17
6% startup_about_home_paint windows10-64-shippable-qr opt e10s stylo 690.79 -> 647.42
For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=20459
Assignee | ||
Comment 22•6 years ago
|
||
Summary of what I've observed after landing this (other than the above):
- Talos shows about a 16% improvement on cpstartup on Windows.
- Telemetry numbers show significant reduction in the 95th percentile for startup measures, and little change in the median. For example, the 95th percentile for SIMPLE_MEASURES_FIRSTPAINT shows a dip from 22s to about 17s, which is about a 23% improvement, while the median shows no observable change yet. The mean does visibly dip, however.
- Telemetry numbers seem to show more significant improvements for later startup measures. The effect on SIMPLE_MEASURES_DELAYEDSTARTUPFINISHED seems more pronounced than the effect on SIMPLE_MEASURES_BLANKWINDOWSHOWN, for example. I think this does make some amount of sense, as the change means that we correctly load all of xul.dll into the correct region of memory, meaning pages that were previously being accessed later in startup when the read head may have been somewhere else are now being cached early with PrefetchVirtualMemory, theoretically allowing the read head to take the optimal path.
Given the large impact this has on the 95th percentile, and a conjecture that the release population generally has worse hardware than the nightly population, I would expect to see bigger improvements on release.
Description
•