Ensure we're properly prefetching Windows DLLs with ReadAheadLib

RESOLVED FIXED in Firefox 68

Status

()

enhancement
P2
normal
RESOLVED FIXED
2 months ago
25 days ago

People

(Reporter: dthayer, Assigned: dthayer)

Tracking

(Depends on 1 bug, Blocks 2 bugs)

unspecified
mozilla68
Unspecified
Windows
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox68 fixed)

Details

(Whiteboard: [fxperf:p1])

Attachments

(6 attachments)

Assignee

Description

2 months ago

I did a quick test on the 2017 reference hardware with a build with/without the DLL read-ahead here. See the attached graph for details. The red is with the read-ahead (our current situation), and the blue is without. These are all cold startups on the machine, with a script that sleeps for 30 seconds, starts Firefox, waits for 120 seconds, collects the ping, and restarts the machine. The values are the time until blank window shown (via the blankWindowShown property in the main ping's simpleMeasurements.)

Median with read-ahead:
8729 ms

Median without read-ahead:
6885 ms

This is roughly a 21% improvement.

Other startup indicators like firstPaint are a little more complicated and require more testing, but I don't see any evidence of the situation regressing.

I'm going to continue testing this, but I wanted to get the bug posted first in case anyone has any ideas/suggestions.

Bug 1376994 comment 4 and bug 1529558 might be interesting to you.

(In reply to Doug Thayer [:dthayer] from comment #0)

a script that sleeps for 30 seconds, starts Firefox

Waiting for some CPU and disk idleness might give you more stable results, https://bitbucket.org/fqueze/frame-recording/src/3e356441e7d3d3737817e1bb7dec960977cb815e/fakeInput/fakeinput.py?at=default&fileviewer=file-view-default#fakeinput.py-109 is what I was doing for my frame recording measurements last year.

Assignee

Comment 3

2 months ago

(In reply to Florian Quèze [:florian] from comment #1)

Bug 1376994 comment 4 and bug 1529558 might be interesting to you.

Yeah - my Windows machine is not available this second, but I wanted to take a look at what the file system cache looks like for xul.dll when we preload it vs not. I remember seeing interesting things when looking at it in RAMMap for the warmup service (i.e., significant differences in the standby column for xul.dll when preloading vs just loading and using the dll as normal). I'll toy around with PrefetchVirtualMemory, though I remember not seeing much difference when I was playing with PrefetchVirtualMemory vs explicit reads for the warmup service, but that was kind of a different case.

(In reply to Florian Quèze [:florian] from comment #2)

Waiting for some CPU and disk idleness might give you more stable results, https://bitbucket.org/fqueze/frame-recording/src/3e356441e7d3d3737817e1bb7dec960977cb815e/fakeInput/fakeinput.py?at=default&fileviewer=file-view-default#fakeinput.py-109 is what I was doing for my frame recording measurements last year.

Ah, yeah that's a good idea, thanks!

Assignee

Comment 4

2 months ago

(In reply to Doug Thayer [:dthayer] from comment #3)

Yeah - my Windows machine is not available this second, but I wanted to take a look at what the file system cache looks like for xul.dll when we preload it vs not. I remember seeing interesting things when looking at it in RAMMap for the warmup service (i.e., significant differences in the standby column for xul.dll when preloading vs just loading and using the dll as normal).

Update on this:

After clearing the cache, opening Firefox and opening a few web pages (for the cases where we actually load xul.dll), RAMMap's numbers look like this:

                                         Total        Standby      Active
ReadAhead + Load       .../xul.dll       148300 K     103008 K     45292 K
Just Load              .../xul.dll       68424 K      19488 K      48936 K
Just ReadAhead         .../xul.dll       96868 K      96868 K      0 K

ReadAhead + Load is what we currently do. Just Load is where we load the DLL with no ReadAhead, and Just ReadAhead is where we just do the ReadAhead on the dll file and then exit the program.

This suggests that we are unnecessarily reading about 80 MB, as that's the excess Standby file cache we're left with.

Assignee

Comment 5

2 months ago

Also, it should be noted that when I was experimenting with the warmup service, I saw no measurable improvement when simply reading xul.dll on startup, and yet saw significant improvement when actually loading xul.dll on startup, which makes me think dll loading takes a special path through the filesystem cache that effectively bypasses the cache used by simple data reads.

(In reply to Doug Thayer [:dthayer] from comment #5)

Also, it should be noted that when I was experimenting with the warmup service, I saw no measurable improvement when simply reading xul.dll on startup, and yet saw significant improvement when actually loading xul.dll on startup, which makes me think dll loading takes a special path through the filesystem cache that effectively bypasses the cache used by simple data reads.

We have a mailing list with Microsoft, where you could ask whether this is the case.

Priority: -- → P2
Whiteboard: [fxperf] → [fxperf:p2]
Assignee

Comment 7

2 months ago

I'm attaching a chart with the results of my performance testing of this.

Key results:

With ReadAhead disabled (referred to as "test"), the blankWindowShown metric immediately improves. This makes sense, as we show the blank window very early, so we haven't had to do very much IO on xul.dll to get to that point.

However, immediately after switching, firstPaint and delayedStartupFinished regress a small amount. This I suspect is due to the cost of doing many small reads rather than one large read. (Even though we read far less overall.) However, it's unclear how the one large read actually impacts the many small reads, given that DLL loads and file reads seem to be somewhat disjoint in their caching(?). I have a thread open in the mozilla/microsoft mailing list trying to answer this.

But, after an hour or two, firstPaint and delayedStartupFinished dramatically improve, bringing them well below the control build. I suspect that this coincides with the Windows prefetch file being updated, but this is all a bit labor intensive so I haven't done a test to catch it in the act. However, the prefetch files are substantially different for the test and the control build - xul.dll is not even present in the prefetch file for the control build.

So it seems at present what we're talking about is an unambiguous performance win in the long term, but a performance tradeoff for the first run after an install (trading overall startup time for an earlier blank paint). I'm not so happy about that tradeoff, so I'd like to investigate what it would take to mimic what the prefetch file does using startup data collected during automation for official builds.

Assignee

Updated

2 months ago
Assignee: nobody → dothayer
Status: NEW → ASSIGNED
Whiteboard: [fxperf:p2] → [fxperf:p1]

(This is partially a reply to your post on the Microsoft list)

You may want to use (Measure-Command { ./load.exe }).TotalMilliseconds to avoid confusion if the time goes over 1 second.

I wonder if the standby list ends up caching some pre-processed structures related to the library besides the raw data, which actually takes the bulk of the load time, or if the RamMap "Empty Standby List" just happens to clobber these as well?

Are you using SEC_IMAGE when mapping the file for prefetch? In my tests of PrefetchVirtualMemory, using PAGE_READONLY | SEC_IMAGE with CreateFileMapping causes a later LoadLibraryEx to only add 16KB more standby, vs. 2MB for PAGE_EXECUTE_READ or PAGE_READONLY. I think this lets the mapping work the same way as module loading.

Assignee

Comment 9

2 months ago

(In reply to Adam Gashlin (he/him) [:agashlin] from comment #8)

(This is partially a reply to your post on the Microsoft list)

You may want to use (Measure-Command { ./load.exe }).TotalMilliseconds to avoid confusion if the time goes over 1 second.

I wonder if the standby list ends up caching some pre-processed structures related to the library besides the raw data, which actually takes the bulk of the load time, or if the RamMap "Empty Standby List" just happens to clobber these as well?

Are you using SEC_IMAGE when mapping the file for prefetch? In my tests of PrefetchVirtualMemory, using PAGE_READONLY | SEC_IMAGE with CreateFileMapping causes a later LoadLibraryEx to only add 16KB more standby, vs. 2MB for PAGE_EXECUTE_READ or PAGE_READONLY. I think this lets the mapping work the same way as module loading.

Ah, yeah that resolved it. I'm going to do a test now on the reference hardware with CreateFileMapping / PrefetchVirtualMemory instead of the existing solution here.

Assignee

Comment 10

2 months ago

This seems a little too good to be true, so I hope I tested everything properly... The data on the left is the control build, the data on the right is with my new ReadAheadLib implementation applied. There's actually a delay between when I switched to the new implementation and when we see the performance numbers improve. I'm fairly confident that this is, again, just due to the prefetch file updating.

Assignee

Updated

2 months ago
Summary: Investigate whether doing a read-ahead on Windows DLLs is a good idea → Ensure we're properly prefetching Windows DLLs with ReadAheadLib
Assignee

Updated

2 months ago
See Also: → 1541575
Assignee

Comment 11

2 months ago

This is just to make it simpler to use PrefetchVirtualMemory in subsequent patches.

Assignee

Comment 12

2 months ago

There shouldn't be any need to do this for content processes as
the DLL should already be in the system file cache.

Depends on D26016

Assignee

Comment 13

2 months ago

Our old approach was reading this into a region of unexecutable memory,
meaning when we actually loaded the library it needed to be loaded into
executable memory. This fixes that. I suspect that any gains we were
seeing previously were due to hardware effects.

Depends on D26017

(In reply to Doug Thayer [:dthayer] from comment #13)

I suspect that any gains we were seeing previously were due to hardware effects.

One thing to keep in mind with respect to the ReadAhead* stuff on Windows is that, the last time that we made significant changes there and actually tested this, our supported Windows versions ranged from Windows XP to Windows 7. A lot has changed under the hood in Windows since then, particularly with respect to image loading.

Since newer Windows versions support kernel-based process mitigations involving DLL search path, digital signature checks, and so forth, it would not surprise me at all to see that what might have worked well on the versions of Windows that were in use during 2012-2013, is no longer as effective.

Assignee

Comment 15

2 months ago

(In reply to Aaron Klotz [:aklotz] from comment #14)

One thing to keep in mind with respect to the ReadAhead* stuff on Windows is that, the last time that we made significant changes there and actually tested this, our supported Windows versions ranged from Windows XP to Windows 7. A lot has changed under the hood in Windows since then, particularly with respect to image loading.

Since newer Windows versions support kernel-based process mitigations involving DLL search path, digital signature checks, and so forth, it would not surprise me at all to see that what might have worked well on the versions of Windows that were in use during 2012-2013, is no longer as effective.

Yeah, I suspect you're right about that. Especially given that Chromium seems to have the same problem: https://cs.chromium.org/chromium/src/chrome/app/main_dll_loader_win.cc?type=cs&q=PreReadFile&sq=package:chromium&g=0&l=60 (This function is only used before loading DLLs).

OS: Unspecified → Windows
Blocks: 1543096
No longer blocks: 1522877

Comment 16

a month ago
Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6f2e7c819c11
Extract nsZipArchive's madvise and similar to a helper r=glandium,froydnj
https://hg.mozilla.org/integration/autoland/rev/508ee4cf9ea2
Only readahead DLLs in parent process r=glandium
https://hg.mozilla.org/integration/autoland/rev/af07f58d18cc
Use CreateFileMapping with SEC_IMAGE to prefetch Windows DLLs r=aklotz

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception&fromchange=af07f58d18cc43677c03e801baa4f81292f314ae&tochange=d1b15e1a37f0f7098c0800ac89239540712210a6&selectedJob=239844974

Failure log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=239843797&repo=autoland&lineNumber=1068
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=239844974&repo=autoland&lineNumber=970
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=239837504&repo=autoland&lineNumber=1095

Backout link: https://hg.mozilla.org/integration/autoland/rev/d1b15e1a37f0f7098c0800ac89239540712210a6

03:38:08 INFO - TEST-SKIP | dom/canvas/test/webgl-mochitest/test_webgl_color_buffer_float.html | took 0ms
03:38:08 INFO - Running manifest: dom\canvas\test\crash\mochitest.ini
03:38:09 INFO - Z:\task_1555032975\build\tests\bin\pk12util.exe: PKCS12 IMPORT SUCCESSFUL
03:38:09 INFO - MochitestServer : launching [u'Z:\task_1555032975\build\tests\bin\xpcshell.exe', '-g', 'Z:\task_1555032975\build\application\firefox', '-f', 'Z:\task_1555032975\build\tests\bin\components\httpd.js', '-e', "const _PROFILE_PATH = 'c:\\users\\task_1555032975\\appdata\\local\\temp\\tmpal5qlf.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', 'Z:\task_1555032975\build\tests\mochitest\server.js']
03:38:09 INFO - runtests.py | Server pid: 5556
03:38:09 INFO - runtests.py | Websocket server pid: 4336
03:38:09 INFO - runtests.py | SSL tunnel pid: 396
03:38:09 INFO - [5556, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, NS_ERROR_UNEXPECTED) failed with result 0x80004005: file z:/build/build/src/extensions/cookie/nsPermissionManager.cpp, line 1026
03:38:09 INFO - Couldn't convert chrome URL: chrome://branding/locale/brand.properties
03:38:09 INFO - [5556, Main Thread] WARNING: Failed to get directory to cache.: file z:/build/build/src/security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp, line 81
03:38:09 INFO - [5556, Main Thread] WARNING: Failed to get directory to cache.: file z:/build/build/src/security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp, line 81
03:38:09 INFO - [5556, Main Thread] WARNING: Failed to get directory to cache.: file z:/build/build/src/security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp, line 81
03:38:09 INFO - [5556, Main Thread] WARNING: Failed to get directory to cache.: file z:/build/build/src/security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp, line 81
03:38:09 INFO - [5556, Main Thread] WARNING: Failed to get directory to cache.: file z:/build/build/src/security/sandbox/win/src/sandboxbroker/sandboxBroker.cpp, line 81
03:38:09 INFO - [5556, Main Thread] WARNING: Could not get the program name for a cubeb stream.: 'NS_SUCCEEDED(rv)', file z:/build/build/src/dom/media/CubebUtils.cpp, line 359
03:38:09 INFO - runtests.py | Running with scheme: http
03:38:09 INFO - runtests.py | Running with e10s: True
03:38:09 INFO - runtests.py | Running with serviceworker_e10s: False
03:38:09 INFO - runtests.py | Running with socketprocess_e10s: False
03:38:09 INFO - runtests.py | Running tests: start.
03:38:09 INFO -
03:38:09 INFO - Application command: Z:\task_1555032975\build\application\firefox\firefox.exe -marionette --wait-for-browser -foreground -profile c:\users\task_1555032975\appdata\local\temp\tmpal5qlf.mozrunner
03:38:09 INFO - runtests.py | Application pid: 1792
03:38:09 INFO - TEST-INFO | started process GECKO(1792)
03:41:09 INFO - runtests.py | Waiting for browser...
03:41:09 INFO - TEST-INFO | Main app process: exit c0000005
03:41:09 INFO - Buffered messages finished
03:41:09 ERROR - TEST-UNEXPECTED-FAIL | automation.py | application terminated with exit code 3221225477
03:41:09 INFO - runtests.py | Application ran for: 0:03:00.033000
03:41:09 INFO - zombiecheck | Reading PID log: c:\users\task_1555032975\appdata\local\temp\tmplxgwwdpidlog
03:41:09 INFO - Traceback (most recent call last):
03:41:09 INFO - File "Z:\task_1555032975\build\tests\mochitest\runtests.py", line 2830, in doTests
03:41:09 INFO - e10s=options.e10s
03:41:09 INFO - File "Z:\task_1555032975\build\tests\mochitest\runtests.py", line 2261, in runApp
03:41:09 INFO - self.marionette.start_session()
03:41:09 INFO - File "Z:\task_1555032975\build\venv\lib\site-packages\marionette_driver\decorators.py", line 36, in _
03:41:09 INFO - m._handle_socket_failure()
03:41:09 INFO - File "Z:\task_1555032975\build\venv\lib\site-packages\marionette_driver\marionette.py", line 650, in _handle_socket_failure
03:41:09 INFO - reraise(exc, val, tb)
03:41:09 INFO - File "Z:\task_1555032975\build\venv\lib\site-packages\marionette_driver\decorators.py", line 26, in _
03:41:09 INFO - return func(*args, **kwargs)
03:41:09 INFO - File "Z:\task_1555032975\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1108, in start_session
03:41:09 INFO - self.raise_for_port(timeout=timeout)
03:41:09 INFO - File "Z:\task_1555032975\build\venv\lib\site-packages\marionette_driver\marionette.py", line 569, in raise_for_port
03:41:09 INFO - self.host, self.port))
03:41:09 INFO - timeout: Timed out waiting for connection on 127.0.0.1:2828!
03:41:09 ERROR - Automation Error: Received unexpected exception while running application

Flags: needinfo?(dothayer)

Comment 19

a month ago
Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9f73549db05e
Extract nsZipArchive's madvise and similar to a helper r=glandium,froydnj
https://hg.mozilla.org/integration/autoland/rev/05fe1e422455
Only readahead DLLs in parent process r=glandium
https://hg.mozilla.org/integration/autoland/rev/2c3837b46068
Use CreateFileMapping with SEC_IMAGE to prefetch Windows DLLs r=aklotz

Comment 20

a month ago
bugherder
Status: ASSIGNED → RESOLVED
Last Resolved: a month ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68

== Change summary for alert #20459 (as of Sun, 14 Apr 2019 18:02:30 GMT) ==

Improvements:

99% tp5n main_startup_fileio windows10-64-shippable-qr opt e10s stylo 108,569,977.67 -> 830,355.08
99% tp5n main_startup_fileio windows7-32-shippable opt e10s stylo 99,014,800.83 -> 773,859.33
99% tp5n main_startup_fileio windows10-64-shippable opt e10s stylo 99,513,661.75 -> 912,079.00
87% tp5n nonmain_startup_fileio windows10-64-shippable-qr opt e10s stylo 8,873,559.17 -> 1,160,403.08
85% tp5n nonmain_startup_fileio windows10-64-shippable opt e10s stylo 8,132,712.08 -> 1,241,729.58
56% tp5n nonmain_startup_fileio windows7-32-shippable opt e10s stylo 2,698,315.92 -> 1,186,592.17
6% startup_about_home_paint windows10-64-shippable-qr opt e10s stylo 690.79 -> 647.42

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=20459

Assignee

Comment 22

a month ago

Summary of what I've observed after landing this (other than the above):

  • Talos shows about a 16% improvement on cpstartup on Windows.
  • Telemetry numbers show significant reduction in the 95th percentile for startup measures, and little change in the median. For example, the 95th percentile for SIMPLE_MEASURES_FIRSTPAINT shows a dip from 22s to about 17s, which is about a 23% improvement, while the median shows no observable change yet. The mean does visibly dip, however.
  • Telemetry numbers seem to show more significant improvements for later startup measures. The effect on SIMPLE_MEASURES_DELAYEDSTARTUPFINISHED seems more pronounced than the effect on SIMPLE_MEASURES_BLANKWINDOWSHOWN, for example. I think this does make some amount of sense, as the change means that we correctly load all of xul.dll into the correct region of memory, meaning pages that were previously being accessed later in startup when the read head may have been somewhere else are now being cached early with PrefetchVirtualMemory, theoretically allowing the read head to take the optimal path.

Given the large impact this has on the 95th percentile, and a conjecture that the release population generally has worse hardware than the nightly population, I would expect to see bigger improvements on release.

Regressions: 1546498
Assignee

Updated

25 days ago
Blocks: 1548033
Assignee

Updated

25 days ago
Depends on: 1548048
You need to log in before you can comment on or make changes to this bug.