Closed Bug 1557564 Opened 4 years ago Closed 4 years ago

Complete implementation of Base Profiler on Windows

Categories

(Core :: Gecko Profiler, task, P2)

Unspecified
Windows
task

Tracking

()

RESOLVED FIXED
mozilla71
Tracking Status
firefox69 --- wontfix
firefox70 --- wontfix
firefox71 --- fixed

People

(Reporter: mozbugz, Assigned: toshi)

References

Details

Attachments

(1 file, 1 obsolete file)

Base Profiler cannot be enabled on package builds of Windows yet.

When present and enabled, but not necessarily started, It seems to somehow prevent (or delay?) stdout and stderr in child processes.
Then when Gecko Profiler starts with the main-thread io feature, it initializes PoisonIOInterposer, which blindly tries to register stdout&err, but because they're not available, it triggers an assert -- see details in bug 1556993.

Adding a simple check for errors in MozillaRegisterDebugFD removes the assert and allows for full use of Base Profiler and Gecko Profiler together on Windows.
But I'm unsure that it's the right way to procede, it feels like wallpapering to prevent an assert, but doesn't do anything about the strange absence of stdout&err because of Base Profiler.

Bug 1556993 is focused on MozillaRegisterDebugFD and friends, to determine if they should have this kind of error-checking anyway.
This bug here will focus on the cause of stdout&err disappearance. (And/or whatever else is preventing us from enabling Base Profiler on Windows.)

Assignee: nobody → gsquelart

Bug 1559379 will fix TestBaseProfiler on Windows.

Depends on: 1559379

Bug 1556993 fixed the crash in PoisonIOInterposer (due to missing stdout&stderr
fd's), and bug 1559379 fixed the mozglue memory issue that the unit test
experienced.
So now Base Profiler can be enabled by default on Windows, still using
MOZ_BASE_PROFILER_... env-vars for now (upcoming bug will merge these with
non-BASE env-vars soon).

Pushed by gsquelart@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1ccaad832c1d
Enable Base Profiler in Windows - r=florian
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70
Regressions: 1531789

Backed out for causing bug 1531789

R/b on two bc chunks:

https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=windows%2C7%2Cshippable%2Copt%2Cmochitests%2Ctest-windows7-32-shippable%2Fopt-mochitest-browser-chrome-e10s-6%2Cm%28bc6%29&tochange=505173c26e1af4afcb7ee13a567a5ae4d8149dc7&fromchange=58fced48cd5dcf0f87c1d8345e011f6961af78cf

https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=windows%2C7%2Cdebug%2Cmochitests%2Ctest-windows7-32%2Fdebug-mochitest-browser-chrome-e10s-7%2Cm%28bc7%29&tochange=2799cfb271ca258e9087a16601dd16561bd4a3f3&fromchange=ff23369d071869072c56646034cd0411cb346a85

Failure log https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=255475447&repo=autoland&lineNumber=8630

Backout: https://hg.mozilla.org/integration/autoland/rev/07df7588275399c0008f811069f23f2c97c9b084

Status: RESOLVED → REOPENED
Flags: needinfo?(gsquelart)
Resolution: FIXED → ---
Target Milestone: mozilla70 → ---

This is either interfering with our detours setup or triggering a load of user32 too early.

https://searchfox.org/mozilla-central/source/mozglue/build/WindowsDllBlocklist.cpp#810

(In reply to Jim Mathies [:jimm] from comment #6)

This is either interfering with our detours setup or triggering a load of user32 too early.

https://searchfox.org/mozilla-central/source/mozglue/build/WindowsDllBlocklist.cpp#810

I tried the test locally, and it's failing even without my patch!
I see "DLL blocklist was unable to intercept AppInit DLLs." in the logs, which points to the user32 issue.

Can you help me with this, please? (Or know someone else who could?)

My totally-uneducated guess is that it could be due to the Windows-specific code run during BaseProfiler init:
https://searchfox.org/mozilla-central/rev/f372470e10c8cb0691681603a1d6324dee5b3b8a/mozglue/baseprofiler/core/platform-win32.cpp#326-331
(Or somewhere else in that file?)

If you want&can try on your side, you can enable BaseProfiler with the attached patch, or by just un-commenting this #define:
https://searchfox.org/mozilla-central/rev/e78f06cdc7e8a54adbb1422b5505162b5258d04d/mozglue/baseprofiler/public/BaseProfiler.h#33

Flags: needinfo?(jmathies)

Hey Aaron, any thoughts here? I tried this in a local build but couldn't reproduce.

Flags: needinfo?(jmathies) → needinfo?(aklotz)

Gerald, I suggest that you start by running Firefox under a debugger and breaking when user32.dll loads. Once you know the call stack that is triggering the load, we can work from there to figure out how to mitigate it.

I'm not sure about the Visual Studio Debugger, but the way to trigger this in WinDbg is to enter the command sxe ld:user32.dll before proceeding with startup.

Flags: needinfo?(aklotz)

Sorry, I'm getting lost in the woods.

A local build just fails the browser_checkdllblockliststate.js test even without my BaseProfiler-enabling patch.
I don't know how to run the test on a packaged build, so I don't know if the problem is happening there; I tried to debug with WinDBG, but I didn't see an issue or I just don't understand what's happening.

I also tried with the one Nightly that included my patch: https://archive.mozilla.org/pub/firefox/nightly/2019/07/2019-07-09-15-37-42-mozilla-central/ , same non-results.
E.g., here's the stack on the first break after sxe ld:user32.dll:

[0x0] ntdll!NtMapViewOfSection + 0x14
[0x1] ntdll!LdrpMinimalMapModule + 0x101
[0x2] ntdll!LdrpMapDllWithSectionHandle + 0x1b
[0x3] ntdll!LdrpLoadKnownDll + 0xe6
[0x4] ntdll!LdrpFindOrPrepareLoadingModule + 0xbd
[0x5] ntdll!LdrpLoadDllInternal + 0x11c
[0x6] ntdll!LdrpLoadDll + 0xa8
[0x7] ntdll!LdrLoadDll + 0xe4
[0x8] mozglue!DllBlocklist_Initialize + 0x10f5
[0x9] KERNELBASE!LoadLibraryExW + 0x161
[0xa] mozglue!DllBlocklist_Initialize + 0x244
[0xb] firefox!Ordinal0 + 0x1402
[0xc] firefox!Ordinal0 + 0x122d
[0xd] firefox!TargetNtUnmapViewOfSection + 0x9338
[0xe] KERNEL32!BaseThreadInitThunk + 0x14
[0xf] ntdll!RtlUserThreadStart + 0x21
which seems "normal"(?)

So I'd really need help from someone who could reproduce the Try failure with my patch, and then could debug the cause. I'd be willing to get training, but I think that would be a waste of resources compared to a Windows expert just doing their thing; I'm remote in Australia, can video-chat or even travel if that was necessary.

Aaron, could you please help, or recommend someone?

Flags: needinfo?(gsquelart) → needinfo?(aklotz)

(In reply to Gerald Squelart [:gerald] from comment #10)

Sorry, I'm getting lost in the woods.

A local build just fails the browser_checkdllblockliststate.js test even without my BaseProfiler-enabling patch.
I don't know how to run the test on a packaged build, so I don't know if the problem is happening there; I tried to debug with WinDBG, but I didn't see an issue or I just don't understand what's happening.

I also tried with the one Nightly that included my patch: https://archive.mozilla.org/pub/firefox/nightly/2019/07/2019-07-09-15-37-42-mozilla-central/ , same non-results.
E.g., here's the stack on the first break after sxe ld:user32.dll:

[0x0] ntdll!NtMapViewOfSection + 0x14
[0x1] ntdll!LdrpMinimalMapModule + 0x101
[0x2] ntdll!LdrpMapDllWithSectionHandle + 0x1b
[0x3] ntdll!LdrpLoadKnownDll + 0xe6
[0x4] ntdll!LdrpFindOrPrepareLoadingModule + 0xbd
[0x5] ntdll!LdrpLoadDllInternal + 0x11c
[0x6] ntdll!LdrpLoadDll + 0xa8
[0x7] ntdll!LdrLoadDll + 0xe4
[0x8] mozglue!DllBlocklist_Initialize + 0x10f5
[0x9] KERNELBASE!LoadLibraryExW + 0x161
[0xa] mozglue!DllBlocklist_Initialize + 0x244
[0xb] firefox!Ordinal0 + 0x1402
[0xc] firefox!Ordinal0 + 0x122d
[0xd] firefox!TargetNtUnmapViewOfSection + 0x9338
[0xe] KERNEL32!BaseThreadInitThunk + 0x14
[0xf] ntdll!RtlUserThreadStart + 0x21
which seems "normal"(?)

So I'd really need help from someone who could reproduce the Try failure with my patch, and then could debug the cause. I'd be willing to get training, but I think that would be a waste of resources compared to a Windows expert just doing their thing; I'm remote in Australia, can video-chat or even travel if that was necessary.

Aaron, could you please help, or recommend someone?

Hey Gerald,

Aaron's pretty busy right now with other work so he really doesn't have the cycles for this. I will try to find some help from someone on the team (no ideas but will poke around).

Flags: needinfo?(aklotz)

Hey Adam, maybe you can help Gerald find a solution here? Something weird with early dll loads after we introduced some profiling code.

Flags: needinfo?(agashlin)

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:gerald, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(gsquelart)

Sorry for letting this sit for so long, but I'm not going to be able to give it the time I need to get up to speed. Jim, could you redirect?

I'm also seeing browser_checkdllblockliststate.js fail in unmodified m-c (dba2c8019074) on a local debug build, but not on release. I wasn't able to work out why.

I was confused for a while trying to check this on a release build. I was hitting the assert when I changed it to MOZ_RELEASE_ASSERT and backed out bug 1556993, even with no profiler stuff enabled. I think that this is just because there is no console, unlike in debug. The docs say _get_osf_handle returns -2 "for example, in a Windows application without a console window" . Running with -console doesn't change that, I'm not sure why (maybe it doesn't get attached soon enough).

But I think this is another good argument for the fix in bug 1556993: when there's nothing on stdout/err that issue would routinely be hit, and I think that's the case on release, so it makes sense to handle the error.

Flags: needinfo?(agashlin) → needinfo?(jmathies)

Hey Toshihito, when you have some free cycles could you take a look at this issue and see if you can figure out what's going on? We have a check for a dll blocklist related to user32 getting loaded too early that triggers with some of these profiler changes. The profiling team is in need of some Windows expertise here.

Flags: needinfo?(tkikuchi)
Flags: needinfo?(jmathies)
Flags: needinfo?(gsquelart)

(In reply to Gerald Squelart [:gerald] from comment #10)

So I'd really need help from someone who could reproduce the Try failure with my patch, and then could debug the cause. I'd be willing to get training, but I think that would be a waste of resources compared to a Windows expert just doing their thing; I'm remote in Australia, can video-chat or even travel if that was necessary.

I submitted an empty Try with your patch on top of the latest codebase. Could you tell me which test I should select?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=99a832e16efd33ebdbbffe14ee695e4b684ee07e

I ran mach test browser_checkdllblockliststate locally, but it just passed. Please let me know if I'm missing something.

 0:05.63 TEST_START: toolkit/xre/test/browser_checkdllblockliststate.js
 0:05.63 INFO Entering test bound test
 0:05.96 PASS Windows dll blocklist status should be true, indicating it is running properly. A failure in this test is considered a release blocker. -
 0:05.97 INFO Leaving test bound test
Flags: needinfo?(tkikuchi) → needinfo?(gsquelart)

I found browser_checkdllblockliststate in mochitest-browser-chrome. I hit the same failure, but interestingly this failure happens only on Win7. No failure from Win10. Upon quick debugging, user32.dll was loaded even before firefox!wmain on Win7, while the exact same binary did not load user32.dll on Win10. I'll spend more time on this..

Flags: needinfo?(gsquelart)

I think I got the root cause. When BaseProfiler is enabled on Windows, mozglue.dll static-links winmm.dll because SamplerThread uses timeBeginPeriod/timeEndPeriod. On Win7, winmm.dll static-links user32.dll while win10's winmm.dll does not. This is why user32.dll is loaded at startup on Win7, causing the test failure.

We can add winmm.dll to the delayload list of mozglue, but it does not mitigate the issue when MOZ_BASE_PROFILER_STARTUP=1 is set because the base profiler is initialized and started earlier than the DllBlockList in such a case.

Now that we have the launcher process, the user32 check is really only relevant for the fallback case where the launcher process failed and we're starting the "legacy" blocklist.

We should also add a check to skip that check when sInitFlags & eDllBlocklistInitFlagWasBootstrapped.

Honestly, maybe the most sensible thing to do is just to also skip that check when MOZ_BASE_PROFILER_STARTUP=1 anyway.

Now we know why user32 is being brought in, I think it's safe to say that startup profiling is outside the typical use cases for blocking dlls anyway.

That of course is assuming that we don't plan on setting this env var by default on Nightly or something. Gerald, is MOZ_BASE_PROFILER_STARTUP going to continue to be set only by individual users, or are we planning on setting it as the default somewhere?

Flags: needinfo?(gsquelart)

First, thank you Toshihito for looking into this. Sorry I didn't respond quickly enough to your comment 16, but it's great to see you've found the issue!

(In reply to Aaron Klotz [:aklotz] from comment #20)

Gerald, is MOZ_BASE_PROFILER_STARTUP going to continue to be set only by individual users, or are we planning on setting it as the default somewhere?

That's right, MOZ_BASE_PROFILER_STARTUP should only be set by users (and only a small subset of profiler users, mostly Mozilla devs interested in start-up perf).
But like MOZ_PROFILER_STARTUP (not "BASE"), the parent process will use it to start the profiler in new child processes, when the profiler has been enabled (through MOZ_{,BASE_}PROFILER_STARTUP or through the add-on). But that's still due to user action.

(In reply to Toshihito Kikuchi [:toshi] from comment #18)

When BaseProfiler is enabled on Windows, mozglue.dll static-links winmm.dll because SamplerThread uses timeBeginPeriod/timeEndPeriod. On Win7, winmm.dll static-links user32.dll while win10's winmm.dll does not. This is why user32.dll is loaded at startup on Win7, causing the test failure.

I'm a little bit concerned that the profiler "moves" that load, so its real impact on our code won't be visible. But hopefully it's only a tiny influence compared to all the other things that happen at process start-up. And it's only on win7, so a more accurate profile can be done on win10 anyway.

So, going forward:
As per comment 19 and comment 20, we shouldn't do the check on win < 10, and when MOZ_BASE_PROFILER_STARTUP is set, right?
I'll see if I can prepare a patch myself...

Thank you again everybody for helping me.

Flags: needinfo?(gsquelart)

In summary, I think we need three things:

  • Add winmm.dll to mozglue's delayload list;
  • Skip the user32 check when sInitFlags & eDllBlocklistInitFlagWasBootstrapped;
  • Skip the user32 check when win < 10 and MOZ_BASE_PROFILER_STARTUP is set.

As a note, I submitted a Try with the change to add winmm.dll to the delayload list, and the test passed.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=76d75d8cf3c7eb738afb62ab6df053f639c5aaf1
(browser_checkdllblockliststate.js was run as a part of bc4 in Windows 7 opt)

Let me prepare a patch tomorrow. The most part of a fix would be in the sandbox area :).

Bug 1556993 fixed the crash in PoisonIOInterposer (due to missing stdout&stderr
fd's), bug 1559379 fixed the mozglue memory issue that the unit test
experienced, and this patch also includes a fix for the test failure in
browser_checkdllblockliststate.js.

So now Base Profiler can be enabled by default on Windows, still using
MOZ_BASE_PROFILER_... env-vars for now (upcoming bug will merge these with
non-BASE env-vars soon).

Pushed by aklotz@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/299dc828295a
Enable Base Profiler in Windows. r=aklotz,gerald
Attachment #9076708 - Attachment is obsolete: true
Assignee: gsquelart → tkikuchi
Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71

Since the status are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.

Blocks: 1634784
You need to log in before you can comment on or make changes to this bug.