Pine pgo builds are still crashing at startup when workshop is enabled
Categories
(Core :: DOM: Content Processes, defect, P1)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox101 | --- | affected |
People
(Reporter: calixte, Assigned: cmartin)
References
Details
Attachments
(2 files, 1 obsolete file)
In bug 1764045, win32k stuff is now initialized in XRE_mainRun but likely not early enough.
since some process creation are running before the win32k initialization.
DoStartup triggers some notifications which lead to have some process creations:
https://searchfox.org/mozilla-central/rev/d34f9713ae128a3138c2b70d8041a535f1049d19/toolkit/xre/nsAppRunner.cpp#5487
I guess the fix is to move the win32 stuff just before the DoStartup.
For information, we've few crashes with a similar stack trace:
https://crash-stats.mozilla.org/search/?signature=%3DnsObserverService%3A%3AAddObserver&proto_signature=EnsureWin32kInitialized&release_channel=release&product=Firefox&date=%3E%3D2022-04-06T11%3A39%3A00.000Z&date=%3C2022-04-13T11%3A39%3A00.000Z&_facets=signature&_facets=release_channel&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-release_channel
| Reporter | ||
Updated•3 years ago
|
| Reporter | ||
Comment 1•3 years ago
|
||
It'll avoid any potential process creation which could lead to try to initialize
win32k stuff on a thread other than the main one.
| Reporter | ||
Comment 2•3 years ago
|
||
A try:
https://treeherder.mozilla.org/jobs?repo=try&revision=56e17a41ec6f990c308c6fc9cd9987f81ec267b3
The same patch in pine:
https://treeherder.mozilla.org/jobs?repo=try&revision=c733493c2c02ee0e757c6ae67d9631adc028e0d9&selectedTaskRun=fgfFjqlgQ7-oIcLk1BgrYA.0
the pgo builds don't crash anymore at startup.
Comment 4•3 years ago
|
||
Backed out for causing marionette failures on test_win32k_enrollment.py.
[task 2022-04-13T18:40:03.744Z] 18:40:03 INFO - TEST-START | toolkit/xre/test/marionette/test_win32k_enrollment.py TestWin32kAutostart.test_1
[task 2022-04-13T18:40:03.747Z] 18:40:03 INFO - 1649875203748 Marionette DEBUG Accepted connection 3 from 127.0.0.1:53816
[task 2022-04-13T18:40:03.771Z] 18:40:03 INFO - 1649875203772 Marionette DEBUG 3 -> [0,1,"WebDriver:NewSession",{"strictFileInteractability":true}]
[task 2022-04-13T18:40:03.774Z] 18:40:03 INFO - 1649875203774 RemoteAgent TRACE [24] Document already finished loading: about:blank
[task 2022-04-13T18:40:03.774Z] 18:40:03 INFO - 1649875203774 Marionette DEBUG 3 <- [1,1,null,{"sessionId":"74770416-5451-4b26-abc4-1f41ae751345","capabilities":{"browserName":"firefox","browserVersion":"101.0 ... l.mozrunner","moz:shutdownTimeout":60000,"moz:useNonSpecCompliantPointerOrigin":false,"moz:webdriverClick":true,"proxy":{}}}]
[task 2022-04-13T18:40:03.775Z] 18:40:03 INFO - 1649875203776 Marionette DEBUG 3 -> [0,2,"WebDriver:SetTimeouts",{"script":30000}]
[task 2022-04-13T18:40:03.775Z] 18:40:03 INFO - 1649875203776 Marionette DEBUG 3 <- [1,2,null,{"value":null}]
[task 2022-04-13T18:40:03.776Z] 18:40:03 INFO - 1649875203777 Marionette DEBUG 3 -> [0,3,"WebDriver:SetTimeouts",{"pageLoad":300000}]
[task 2022-04-13T18:40:03.776Z] 18:40:03 INFO - 1649875203777 Marionette DEBUG 3 <- [1,3,null,{"value":null}]
[task 2022-04-13T18:40:03.777Z] 18:40:03 INFO - 1649875203778 Marionette DEBUG 3 -> [0,4,"WebDriver:SetTimeouts",{"implicit":0}]
[task 2022-04-13T18:40:03.778Z] 18:40:03 INFO - 1649875203778 Marionette DEBUG 3 <- [1,4,null,{"value":null}]
[task 2022-04-13T18:40:03.779Z] 18:40:03 INFO - 1649875203779 Marionette DEBUG 3 -> [0,5,"Marionette:SetContext",{"value":"chrome"}]
[task 2022-04-13T18:40:03.779Z] 18:40:03 INFO - 1649875203779 Marionette DEBUG 3 <- [1,5,null,{"value":null}]
[task 2022-04-13T18:40:03.780Z] 18:40:03 INFO - 1649875203780 Marionette DEBUG 3 -> [0,6,"Marionette:GetContext",{}]
[task 2022-04-13T18:40:03.780Z] 18:40:03 INFO - 1649875203780 Marionette DEBUG 3 <- [1,6,null,{"value":"chrome"}]
[task 2022-04-13T18:40:03.780Z] 18:40:03 INFO - 1649875203781 Marionette DEBUG 3 -> [0,7,"Marionette:SetContext",{"value":"chrome"}]
[task 2022-04-13T18:40:03.781Z] 18:40:03 INFO - 1649875203781 Marionette DEBUG 3 <- [1,7,null,{"value":null}]
[task 2022-04-13T18:40:03.783Z] 18:40:03 INFO - 1649875203783 Marionette DEBUG 3 -> [0,8,"WebDriver:ExecuteScript",{"script":"// We're running in a function, in a sandbox, that inherits from an\n // X ... ame":"Z:\\task_164986957190594\\build\\tests\\marionette\\tests\\toolkit\\xre\\test\\marionette\\test_win32k_enrollment.py"}]
[task 2022-04-13T18:40:03.784Z] 18:40:03 INFO - 1649875203785 Marionette TRACE [7] MarionetteCommands actor created for window id 2
[task 2022-04-13T18:40:03.786Z] 18:40:03 INFO - 1649875203786 Marionette DEBUG 3 <- [1,8,null,{"value":null}]
[task 2022-04-13T18:40:03.786Z] 18:40:03 INFO - 1649875203787 Marionette DEBUG 3 -> [0,9,"Marionette:SetContext",{"value":"chrome"}]
[task 2022-04-13T18:40:03.787Z] 18:40:03 INFO - 1649875203787 Marionette DEBUG 3 <- [1,9,null,{"value":null}]
[task 2022-04-13T18:40:03.787Z] 18:40:03 INFO - 1649875203788 Marionette DEBUG 3 -> [0,10,"Marionette:GetContext",{}]
[task 2022-04-13T18:40:03.787Z] 18:40:03 INFO - 1649875203788 Marionette DEBUG 3 <- [1,10,null,{"value":"chrome"}]
[task 2022-04-13T18:40:03.788Z] 18:40:03 INFO - 1649875203788 Marionette DEBUG 3 -> [0,11,"Marionette:SetContext",{"value":"chrome"}]
[task 2022-04-13T18:40:03.788Z] 18:40:03 INFO - 1649875203789 Marionette DEBUG 3 <- [1,11,null,{"value":null}]
[task 2022-04-13T18:40:03.789Z] 18:40:03 INFO - 1649875203790 Marionette DEBUG 3 -> [0,12,"WebDriver:ExecuteScript",{"script":"return Services.prefs.getBoolPref(\"security.sandbox.content.win32k-disable\");"," ... ame":"Z:\\task_164986957190594\\build\\tests\\marionette\\tests\\toolkit\\xre\\test\\marionette\\test_win32k_enrollment.py"}]
[task 2022-04-13T18:40:03.790Z] 18:40:03 INFO - 1649875203791 Marionette DEBUG 3 <- [1,12,null,{"value":true}]
[task 2022-04-13T18:40:03.791Z] 18:40:03 INFO - 1649875203791 Marionette DEBUG 3 -> [0,13,"Marionette:SetContext",{"value":"chrome"}]
[task 2022-04-13T18:40:03.791Z] 18:40:03 INFO - 1649875203791 Marionette DEBUG 3 <- [1,13,null,{"value":null}]
[task 2022-04-13T18:40:03.791Z] 18:40:03 INFO - 1649875203792 Marionette DEBUG 3 -> [0,14,"Marionette:GetContext",{}]
[task 2022-04-13T18:40:03.792Z] 18:40:03 INFO - 1649875203792 Marionette DEBUG 3 <- [1,14,null,{"value":"chrome"}]
[task 2022-04-13T18:40:03.792Z] 18:40:03 INFO - 1649875203793 Marionette DEBUG 3 -> [0,15,"Marionette:SetContext",{"value":"chrome"}]
[task 2022-04-13T18:40:03.792Z] 18:40:03 INFO - 1649875203793 Marionette DEBUG 3 <- [1,15,null,{"value":null}]
[task 2022-04-13T18:40:03.793Z] 18:40:03 INFO - 1649875203794 Marionette DEBUG 3 -> [0,16,"WebDriver:ExecuteScript",{"script":"let win = Services.wm.getMostRecentWindow(\"navigator:browser\");\n let s ... ame":"Z:\\task_164986957190594\\build\\tests\\marionette\\tests\\toolkit\\xre\\test\\marionette\\test_win32k_enrollment.py"}]
[task 2022-04-13T18:40:03.794Z] 18:40:03 INFO - 1649875203795 Marionette DEBUG 3 <- [1,16,null,{"value":{"win32kSessionStatus":2,"win32kStatus":15,"win32kExperimentStatus":0,"win32kPref":true,"win32kStartupEnrollmentStatusPref":0}}]
[task 2022-04-13T18:40:03.795Z] 18:40:03 INFO - 1649875203796 Marionette DEBUG 3 -> [0,17,"Marionette:SetContext",{"value":"chrome"}]
[task 2022-04-13T18:40:03.795Z] 18:40:03 INFO - 1649875203796 Marionette DEBUG 3 <- [1,17,null,{"value":null}]
[task 2022-04-13T18:40:03.796Z] 18:40:03 INFO - 1649875203797 Marionette DEBUG 3 -> [0,18,"Marionette:GetContext",{}]
[task 2022-04-13T18:40:03.796Z] 18:40:03 INFO - 1649875203797 Marionette DEBUG 3 <- [1,18,null,{"value":"chrome"}]
[task 2022-04-13T18:40:03.797Z] 18:40:03 INFO - 1649875203797 Marionette DEBUG 3 -> [0,19,"Marionette:SetContext",{"value":"chrome"}]
[task 2022-04-13T18:40:03.797Z] 18:40:03 INFO - 1649875203797 Marionette DEBUG 3 <- [1,19,null,{"value":null}]
[task 2022-04-13T18:40:03.797Z] 18:40:03 INFO - 1649875203798 Marionette DEBUG 3 -> [0,20,"WebDriver:TakeScreenshot",{"id":null,"full":true,"hash":false,"scroll":true}]
[task 2022-04-13T18:40:03.826Z] 18:40:03 INFO - 1649875203833 Marionette DEBUG 3 <- [1,20,null,{"value":"iVBORw0KGgoAAAANSUhEUgAABQAAAAP/CAYAAACf8GEBAAAgAElEQVR4XuzdC5RkdX0n8H/DzMDA8FDQoAyIyghkAREQiKjxuYeYs4ob ... ECBAgQIECAAAECBAgQIPALOAB/O0kCBAgQIECAAAECBAgQIECAAAECeQEHYH4iBQkQIECAAAECBAgQIECAAAECBAj8AgOOed4ADWcJvgAAAABJRU5ErkJggg=="}]
[task 2022-04-13T18:40:03.835Z] 18:40:03 INFO - 1649875203835 Marionette DEBUG 3 -> [0,21,"Marionette:SetContext",{"value":"chrome"}]
[task 2022-04-13T18:40:03.835Z] 18:40:03 INFO - 1649875203835 Marionette DEBUG 3 <- [1,21,null,{"value":null}]
[task 2022-04-13T18:40:03.835Z] 18:40:03 INFO - 1649875203836 Marionette DEBUG 3 -> [0,22,"Marionette:GetContext",{}]
[task 2022-04-13T18:40:03.836Z] 18:40:03 INFO - 1649875203836 Marionette DEBUG 3 <- [1,22,null,{"value":"chrome"}]
[task 2022-04-13T18:40:03.836Z] 18:40:03 INFO - 1649875203836 Marionette DEBUG 3 -> [0,23,"Marionette:SetContext",{"value":"content"}]
[task 2022-04-13T18:40:03.836Z] 18:40:03 INFO - 1649875203837 Marionette DEBUG 3 <- [1,23,null,{"value":null}]
[task 2022-04-13T18:40:03.837Z] 18:40:03 INFO - 1649875203837 Marionette DEBUG 3 -> [0,24,"WebDriver:GetPageSource",{}]
[task 2022-04-13T18:40:03.840Z] 18:40:03 INFO - 1649875203840 Marionette TRACE [24] MarionetteCommands actor created for window id 4294967297
[task 2022-04-13T18:40:03.841Z] 18:40:03 INFO - 1649875203841 Marionette DEBUG 3 <- [1,24,null,{"value":"<html><head></head><body></body></html>"}]
[task 2022-04-13T18:40:03.841Z] 18:40:03 INFO - 1649875203842 Marionette DEBUG 3 -> [0,25,"Marionette:SetContext",{"value":"chrome"}]
[task 2022-04-13T18:40:03.841Z] 18:40:03 INFO - 1649875203842 Marionette DEBUG 3 <- [1,25,null,{"value":null}]
[task 2022-04-13T18:40:03.846Z] 18:40:03 ERROR - TEST-UNEXPECTED-ERROR | toolkit/xre/test/marionette/test_win32k_enrollment.py TestWin32kAutostart.test_1 | AssertionError: 2 != 15 : win32kSessionStatus should have the value `15`, but has `2`
[task 2022-04-13T18:40:03.846Z] 18:40:03 INFO - Traceback (most recent call last):
[task 2022-04-13T18:40:03.846Z] 18:40:03 INFO - File "Z:\task_164986957190594\build\venv\lib\site-packages\marionette_harness\marionette_test\testcases.py", line 183, in run
[task 2022-04-13T18:40:03.847Z] 18:40:03 INFO - self.setUp()
[task 2022-04-13T18:40:03.847Z] 18:40:03 INFO - File "Z:\task_164986957190594\build\tests\marionette\tests\toolkit\xre\test\marionette\test_win32k_enrollment.py", line 194, in setUp
[task 2022-04-13T18:40:03.847Z] 18:40:03 INFO - self.check_win32k_status(
[task 2022-04-13T18:40:03.848Z] 18:40:03 INFO - File "Z:\task_164986957190594\build\tests\marionette\tests\toolkit\xre\test\marionette\test_win32k_enrollment.py", line 86, in check_win32k_status
[task 2022-04-13T18:40:03.848Z] 18:40:03 INFO - self.assertEqual(
[task 2022-04-13T18:40:03.848Z] 18:40:03 INFO - TEST-INFO took 97ms
[task 2022-04-13T18:40:03.848Z] 18:40:03 INFO - 1649875203847 Marionette DEBUG 3 -> [0,26,"WebDriver:DeleteSession",{}]
[task 2022-04-13T18:40:03.849Z] 18:40:03 INFO - 1649875203848 Marionette DEBUG 3 <- [1,26,null,{"value":null}]
[task 2022-04-13T18:40:03.849Z] 18:40:03 INFO - 1649875203849 Marionette DEBUG Closed connection 3
[task 2022-04-13T18:40:03.849Z] 18:40:03 INFO - TEST-START | toolkit/xre/test/marionette/test_win32k_enrollment.py TestWin32kAutostart.test_10
Comment 5•3 years ago
|
||
Can either of y'all offer advice on this failure?
| Reporter | ||
Comment 6•3 years ago
|
||
The problem is that we're returning a MissingWebRender here:
https://searchfox.org/mozilla-central/rev/d34f9713ae128a3138c2b70d8041a535f1049d19/toolkit/xre/nsAppRunner.cpp#730
when the test is running which could make sense in the Marionette context.
Then this stuff is called:
https://searchfox.org/mozilla-central/rev/d34f9713ae128a3138c2b70d8041a535f1049d19/toolkit/xre/nsXREDirProvider.cpp#936
and at some point:
https://searchfox.org/mozilla-central/rev/d34f9713ae128a3138c2b70d8041a535f1049d19/toolkit/components/telemetry/app/TelemetryEnvironment.jsm#1623
will trigger a call to EnsureWin32KInitialized.
So because of my patch the win32k stuff has the values when EnsureWin32KInitialized has been called the first time.
If I remove my patch and I add a false &&:
https://searchfox.org/mozilla-central/rev/d34f9713ae128a3138c2b70d8041a535f1049d19/toolkit/components/telemetry/TelemetryStartup.jsm#33
the test is broken too.
Comment 7•3 years ago
|
||
(In reply to Mike Kaply [:mkaply] from comment #5)
Can either of y'all offer advice on this failure?
I'll leave the deciphering of the test failure to tjr (unless he can't look at it for some reason).
Comment 8•3 years ago
|
||
Some testing has shown that the earliest we can call EnsureWin32kInitialized (right now) is right after the call to NS_CreateServicesFromCategory("profile-after-change" here.
I'm trying to figure out what it is that gets created from that that we need....
| Reporter | ||
Comment 9•3 years ago
|
||
Yes the call to NS_CreateServicesFromCategory... will induce initialization of telemetry stuff which itself will trigger the EnsureWin32k....
My understanding is once we've the correct value for gfxVars::UseWebRender() we can call ÈnsureWin32k....
Comment 10•3 years ago
|
||
Right; there's got to be something in between right before NS_CreateServicesFromCategory gets called (where EnsureWin32k... will fail) and the Telemetry Stuff that gets created from NS_CreateServicesFromCategory (where ÈnsureWin32k... works). That thing is presumably doing some WebRender/gfx related stuff....
Comment 11•3 years ago
|
||
Do you know what it is that's causing process creation for 'workshop'? (Not sure what that is...)
| Reporter | ||
Comment 12•3 years ago
|
||
I thought that maybe we could listen on gfxVars changes:
https://searchfox.org/mozilla-central/rev/d34f9713ae128a3138c2b70d8041a535f1049d19/gfx/config/gfxVars.cpp#144
and then once we've something about webrender we can call EnsureWin32k...., wdyt ?
| Reporter | ||
Comment 13•3 years ago
|
||
Workshop is a backend for calendar stuff in Pro-Client.
It lives in a SharedWorker and we create some hidden windows to have some notifications:
https://searchfox.org/mozilla-pine/rev/5728b2243863d5072e1e1f5355ed39aecd03e38d/browser/components/BrowserGlue.jsm#1423
I don't know if it's because of the shared worker or because of the hidden window... maybe it can be something else.
But I guess that something is launching a process before we run the Telemetry thing and this process will trigger an EnsureWin32k... but not on the main thread, it's why I moved it before any jsm is executed.
| Reporter | ||
Comment 14•3 years ago
|
||
And unfortunately, I'm unable to reproduce the issue locally: we've the startup crash only with pgo builds... so it's really a matter of unfortunate scheduling somewhere.
Comment 15•3 years ago
|
||
(In reply to Calixte Denizet (:calixte) from comment #12)
I thought that maybe we could listen on gfxVars changes:
https://searchfox.org/mozilla-central/rev/d34f9713ae128a3138c2b70d8041a535f1049d19/gfx/config/gfxVars.cpp#144
and then once we've something about webrender we can callEnsureWin32k...., wdyt ?
That seems worth trying, I'm not sure if it will work (the stack traces I'm seeing kind of make me think it might not, but it might be an artifact of optimization also.) I probably won't be able to try to implement this until late Monday or Tuesday though.
Updated•3 years ago
|
Comment 17•3 years ago
|
||
I understand there to be 2 general issues going on here[1]:
mozilla::ipc::BaseProcessLauncher::PerformAsyncLaunchhas a dependency on EnsureWin32kInitialized having been called previously on the main thread but doesn't quite take steps to make this happen.
- There's logic that seems like it's trying to do sandbox initialization steps where GeckoChildProcessHost::AsyncLaunch calls GeckoChildProcessHost::PrepareLaunch does call GetEffectiveContentSandboxLevel but that doesn't call GetContentWin32kLockdownState or EnsureWin32kInitialized. It seems like if this was done here, that would solve things, except...
- It sounds like EnsureWin32kInitialized has a data-dependency on webrender initializing, but there's nothing in code that currently enforces this because it probably wasn't previously an issue.
- This might need to necessitate having PerformAsyncLaunch or some of its callers wait on a MozPromise for this initialization to resolve.
I understand that our initial attempt to make the problem go away was adding an explicit initialization in apprunner startup in a location that should precede the workshop consumer's attempt to bootstrap itself on the "final-ui-startup" observer notification which should be strictly after where the code was added. However, it seems like maybe our assumption about that observer notification being the trigger is incorrect and there's some other path triggering a process launch earlier. The second attempt at a fix here that bounced failed because it was trying to move the initialization prior to profile-after-change which meant the webrender data was not available.
:tjr, does the above match your understanding of the problem? Are there other data dependencies for the win32k initialization (or can you needinfo someone else appropriate?)
:nika, does it seem reasonable to potentially be delaying process launching via some MozPromise-y thing for all the dependencies? Should the RemoteWorkerManager instead be trying to wait until some explicit startup phase before it starts asking ContentParent to launch things? Please also feel free to propagate the needinfo.
1: Beyond the general issue that Gecko startup is a complicated emergent thing that could benefit from a significant effort at documentation[2] and improved infrastructure tooling, and so it's not surprising that we can run into complicated emergent behaviors, especially as we do more and more with dynamic configuration stuff, all of which depends on the profile directory being available.
2: Like I think the docs about the various observer notifications including "profile-after-change" were removed from MDN and never moved to firefox-source-docs.mozilla.org and so these things are less documented than ever.
Comment 18•3 years ago
|
||
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #17)
I understand there to be 2 general issues going on here[1]:
mozilla::ipc::BaseProcessLauncher::PerformAsyncLaunchhas a dependency on EnsureWin32kInitialized having been called previously on the main thread but doesn't quite take steps to make this happen.
- There's logic that seems like it's trying to do sandbox initialization steps where GeckoChildProcessHost::AsyncLaunch calls GeckoChildProcessHost::PrepareLaunch does call GetEffectiveContentSandboxLevel but that doesn't call GetContentWin32kLockdownState or EnsureWin32kInitialized. It seems like if this was done here, that would solve things, except...
The first paragraphs is right, but I don't think the second is. Win32k is independent of sandbox level because of how fragile/unusual it is. We need to know its state in SetSecurityLevelForProcess which is down-call from DoSetup() (the first line in PerformAsyncLaunch).
- It sounds like EnsureWin32kInitialized has a data-dependency on webrender initializing, but there's nothing in code that currently enforces this because it probably wasn't previously an issue.
Sounds right.
- This might need to necessitate having PerformAsyncLaunch or some of its callers wait on a MozPromise for this initialization to resolve.
I understand that our initial attempt to make the problem go away was adding an explicit initialization in apprunner startup in a location that should precede the workshop consumer's attempt to bootstrap itself on the "final-ui-startup" observer notification which should be strictly after where the code was added. However, it seems like maybe our assumption about that observer notification being the trigger is incorrect and there's some other path triggering a process launch earlier. The second attempt at a fix here that bounced failed because it was trying to move the initialization prior to profile-after-change which meant the webrender data was not available.
:tjr, does the above match your understanding of the problem? Are there other data dependencies for the win32k initialization (or can you needinfo someone else appropriate?)
Yes; it matches. Only AllowWebGLOop[0] which is initialized in the same call to gfxPlatform::Init.
:nika, does it seem reasonable to potentially be delaying process launching via some MozPromise-y thing for all the dependencies? Should the RemoteWorkerManager instead be trying to wait until some explicit startup phase before it starts asking ContentParent to launch things? Please also feel free to propagate the needinfo.
1: Beyond the general issue that Gecko startup is a complicated emergent thing that could benefit from a significant effort at documentation[2] and improved infrastructure tooling, and so it's not surprising that we can run into complicated emergent behaviors, especially as we do more and more with dynamic configuration stuff, all of which depends on the profile directory being available.
2: Like I think the docs about the various observer notifications including "profile-after-change" were removed from MDN and never moved to firefox-source-docs.mozilla.org and so these things are less documented than ever.
[0] Can I just throw out I love the variable 'allow web gloop'?
Comment 19•3 years ago
|
||
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #17)
I understand there to be 2 general issues going on here[1]:
mozilla::ipc::BaseProcessLauncher::PerformAsyncLaunchhas a dependency on EnsureWin32kInitialized having been called previously on the main thread but doesn't quite take steps to make this happen.
- There's logic that seems like it's trying to do sandbox initialization steps where GeckoChildProcessHost::AsyncLaunch calls GeckoChildProcessHost::PrepareLaunch does call GetEffectiveContentSandboxLevel but that doesn't call GetContentWin32kLockdownState or EnsureWin32kInitialized. It seems like if this was done here, that would solve things, except...
It's probably worth checking in with someone who's more familiar with win32k lockdown to know if this is a practical approach. You'd probably need to call EnsureWin32kInitialized in GeckoChildProcessHost::PrepareLaunch as you expected.
- It sounds like EnsureWin32kInitialized has a data-dependency on webrender initializing, but there's nothing in code that currently enforces this because it probably wasn't previously an issue.
From a quick look, it looks like calling gfxPlatform::GetPlatform() might work better than gfx::gfxVars::Initialize() here for actually making sure that the variables are initialized: https://searchfox.org/mozilla-central/rev/6f6cf2810742d32454251afdb8e632c27999bea7/toolkit/xre/nsAppRunner.cpp#701. I think there's a chance that the fact that gfxVars::Initialize() doesn't actually initialize the variables, just creates the storage for them, was missed as it wasn't an issue (though I'd double-check with :tjr).
- This might need to necessitate having PerformAsyncLaunch or some of its callers wait on a MozPromise for this initialization to resolve.
FWIW in many places the process launcher is actually waited on with the main thread synchronously blocked (e.g. with GeckoChildProcessHost::{SyncLaunch,LaunchAndWaitForProcessHandle,WaitForProcessHande,WaitUntilConnected}, so we can't have it attempt to asynchronously wait on any main-thread driven initialization. All work must occur off-main-thread.
Because of that, we'll need to make sure that webrender is initialized before ever deciding to launch a child process, as otherwise we'll deadlock during startup. I'm guessing that some of the earliest processes to launch are GPU and Socket processes, both of which block the main thread waiting for the launch: (https://searchfox.org/mozilla-central/rev/6f6cf2810742d32454251afdb8e632c27999bea7/gfx/ipc/GPUProcessManager.cpp#302, https://searchfox.org/mozilla-central/rev/6f6cf2810742d32454251afdb8e632c27999bea7/netwerk/ipc/SocketProcessHost.cpp#79).
Content processes are also often launched synchronously when created for a new tab (though they are also launched async when created during navigation): https://searchfox.org/mozilla-central/rev/6f6cf2810742d32454251afdb8e632c27999bea7/dom/ipc/ContentParent.cpp#1112.
I understand that our initial attempt to make the problem go away was adding an explicit initialization in apprunner startup in a location that should precede the workshop consumer's attempt to bootstrap itself on the "final-ui-startup" observer notification which should be strictly after where the code was added. However, it seems like maybe our assumption about that observer notification being the trigger is incorrect and there's some other path triggering a process launch earlier. The second attempt at a fix here that bounced failed because it was trying to move the initialization prior to profile-after-change which meant the webrender data was not available.
Unfortunately browser startup is super messy, and the work around win32k lockdown is even messier than that. IIRC we also ran into another situation recently (though I can't remember the details) where we can try to start processes before we've even selected a profile to show an alert dialog, which was causing :jld some issues lately.
Right now the logic for what needs to happen before we start content processes seems a bit wibbly unfortunately. It looks like at least nsXREDirProvider::DoStartup() should've been run due to it setting up some directories needed for content processes (https://searchfox.org/mozilla-central/rev/6f6cf2810742d32454251afdb8e632c27999bea7/toolkit/xre/nsXREDirProvider.cpp#957-964). "final-ui-startup" definitely seems like it ought to be safe, but I wouldn't be surprised if we're hitting another process launch earlier (e.g. the socket or gpu process).
It might be the case that long term we need to re-arrange some of these initialization steps to make sure they happen before e.g. "app-startup" is notified and the first JS is allowed to run.
:nika, does it seem reasonable to potentially be delaying process launching via some MozPromise-y thing for all the dependencies? Should the RemoteWorkerManager instead be trying to wait until some explicit startup phase before it starts asking ContentParent to launch things? Please also feel free to propagate the needinfo.
Replied above.
Updated•3 years ago
|
Comment 20•3 years ago
|
||
Comment 21•3 years ago
|
||
This diagram required an enhancement to a naive redundant-use suppression that was insufficiently tupled. There are some double edges in the graph that are probably an expected outgrowth of the searchfox index reflecting the union of 4 different platforms and there being a lot of ifdefs in this area of code, but could maybe be a bug.
The diagram now captures that GetLiveWin32kLockdownState() calls EnsureWin32kInitialized() which then calls GetLiveWin32kLockdownState() which then calls EnsureWin32kInitialized() but that early returns because it sets a flag on entry.
Comment 22•3 years ago
|
||
Chris, Bob, please see comment 19. We might need to advise how to ensure the startup sequence stays correct.
Comment 23•3 years ago
|
||
Is it possible to sidestep this issue by deferring the initialization until later?
| Reporter | ||
Comment 24•3 years ago
|
||
On the pine's side, it's possible to initialize the win32k stuff in the main thread before creating the worker in just calling a telemetry js function.
I don't know if it'll work but it's worth trying.
:asuth, did you try that ?
:bgrins, I don't know, :bobowen or :cmartin can probably answer.
Comment 25•3 years ago
|
||
(In reply to Calixte Denizet (:calixte) from comment #24)
On the pine's side, it's possible to initialize the win32k stuff in the main thread before creating the worker in just calling a telemetry js function.
I don't know if it'll work but it's worth trying.
:asuth, did you try that ?:bgrins, I don't know, :bobowen or :cmartin can probably answer.
I’m asking something different. Can we delay creating this stuff in the frontend until later on - at a more normal time when the rest of the dom that consumes it is created?
Comment 26•3 years ago
|
||
(In reply to Nika Layzell [:nika] (ni? for response) from comment #19)
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #17)
I understand there to be 2 general issues going on here[1]:
mozilla::ipc::BaseProcessLauncher::PerformAsyncLaunchhas a dependency on EnsureWin32kInitialized having been called previously on the main thread but doesn't quite take steps to make this happen.
- There's logic that seems like it's trying to do sandbox initialization steps where GeckoChildProcessHost::AsyncLaunch calls GeckoChildProcessHost::PrepareLaunch does call GetEffectiveContentSandboxLevel but that doesn't call GetContentWin32kLockdownState or EnsureWin32kInitialized. It seems like if this was done here, that would solve things, except...
It's probably worth checking in with someone who's more familiar with win32k lockdown to know if this is a practical approach. You'd probably need to call
EnsureWin32kInitializedinGeckoChildProcessHost::PrepareLaunchas you expected.
I'm not really clear on how all the startup dependencies work, but if processes can be launched before EnsureWin32kInitialized is currently called then adding a call intoGeckoChildProcessHost::PrepareLaunch seems to make sense.
| Reporter | ||
Comment 27•3 years ago
•
|
||
I tried this:
diff --git a/toolkit/xre/nsAppRunner.cpp b/toolkit/xre/nsAppRunner.cpp
--- a/toolkit/xre/nsAppRunner.cpp
+++ b/toolkit/xre/nsAppRunner.cpp
@@ -780,16 +780,20 @@ nsIXULRuntime::ContentWin32kLockdownStat
return nsIXULRuntime::ContentWin32kLockdownState::OperatingSystemNotSupported;
#endif
}
namespace mozilla {
void EnsureWin32kInitialized() {
+ if (!NS_IsMainThread()) {
+ return;
+ }
+
if (gWin32kInitialized) {
return;
}
gWin32kInitialized = true;
#ifdef XP_WIN
// Initialize the Win32k experiment, configuring win32k's
and tests are still broken (I forgot to move the EnsureWin32kInitialized before to run the tests, it's why I edited this comment).
Updated•3 years ago
|
| Assignee | ||
Comment 28•3 years ago
•
|
||
Hi all,
I've tried quite a few things to see if I can make some progress here, but so far no luck:
-
Introducing sleeps into a few places to see if I can change the timing on things: Didn't reproduce -- It seems it's not a race condition
-
Getting a stack trace the first time the Main Thread attempts to launch a content process -- Main thread doesn't even attempt to queue up a content process launch before
EnsureWin32kInitialized()is called on my machine -
I attempted to implement Nika's advice above and have
PrepareLaunch()callEnsureWin32kInitialized(), which itself needs to callgfxPlatform::GetPlatform()before attempting to read WebRender vars.Nika's comment above is correct that
gfx::gfxVars::Initialize()just creates the vars with their default initial values, andgfxPlatform::Init()is where the values actually get populated (Example forUseWebRender()here)So here is the patch I used:
diff --git a/ipc/glue/GeckoChildProcessHost.cpp b/ipc/glue/GeckoChildProcessHost.cpp --- a/ipc/glue/GeckoChildProcessHost.cpp +++ b/ipc/glue/GeckoChildProcessHost.cpp @@ -594,6 +594,7 @@ void GeckoChildProcessHost::PrepareLaunc # if defined(MOZ_SANDBOX) // We need to get the pref here as the process is launched off main thread. if (mProcessType == GeckoProcessType_Content) { + GetWin32kLockdownState(); mSandboxLevel = GetEffectiveContentSandboxLevel(); mEnableSandboxLogging = Preferences::GetBool("security.sandbox.logging.enabled"); diff --git a/toolkit/xre/nsAppRunner.cpp b/toolkit/xre/nsAppRunner.cpp --- a/toolkit/xre/nsAppRunner.cpp +++ b/toolkit/xre/nsAppRunner.cpp @@ -698,7 +698,7 @@ nsIXULRuntime::ContentWin32kLockdownStat // HasUserValue The Pref functions can only be called on main thread MOZ_ASSERT(NS_IsMainThread()); mozilla::EnsureWin32kInitialized(); - gfx::gfxVars::Initialize(); + gfxPlatform::GetPlatform(); if (gSafeMode) { return nsIXULRuntime::ContentWin32kLockdownState::DisabledBySafeMode; @@ -791,7 +791,6 @@ void EnsureWin32kInitialized() { gWin32kInitialized = true; #ifdef XP_WIN - // Initialize the Win32k experiment, configuring win32k's // default, before checking other overrides. This allows opting-out of a // Win32k experiment through about:preferences or about:config from aUnfortunately, it looks from this build that this didn't work either. There are tons of GFX-related issues, so I'm guessing that this initializes
gfxPlatformtoo early and it ends up with some weird values for things.
Note that this crash also seems to be happening on the FF 100 release (Bug 1769807), so it seems that it's not unique to Pine. Since I'm fairly certain that FF release is also built with PGO, that part of it seems to line up.
I'm going to try the same patch in Central to see if that works. I don't know if there is something different about Pine compared to Central WRT these tests.
| Assignee | ||
Comment 29•3 years ago
•
|
||
I'm going to try the same patch in Central to see if that works. I don't know if there is something different about Pine compared to Central WRT these tests.
I asked around, and it seems that the testing for Pine is known to be different from the testing for Central, so it appears that the failures in the Try build above may not be very meaningful.
Here is a Try build with the same patch above, but against Central. We'll have to see if it works better.
| Assignee | ||
Comment 30•3 years ago
•
|
||
Well... this is looking much more promising. At least that confirms that doing this doesn't seem to break the build.
It also looks like the Pine build may be working now too with this patch: https://treeherder.mozilla.org/jobs?repo=try&revision=5eb026a671f0cbc11e9ebb42eca37843ede9c0d4
:bgrins -- I think someone mentioned that QA has a machine that is able to repro this issue? If so, are they able to verify this patch? I think it's fair to assume that if it fixes your crash in Pine, it probably also fixes the crash in Bug 1769807
Comment 31•3 years ago
|
||
(In reply to Chris Martin [:cmartin] from comment #30)
:bgrins -- I think someone mentioned that QA has a machine that is able to repro this issue? If so, are they able to verify this patch? I think it's fair to assume that if it fixes your crash in Pine, it probably also fixes the crash in Bug 1769807
Yep - I just confirmed that your patch is fixing the crash on pine with the problematic machine
Comment 32•3 years ago
|
||
(In reply to Brian Grinstead [:bgrins] from comment #31)
Yep - I just confirmed that your patch is fixing the crash on pine with the problematic machine
Do we think the patch can land then?
Comment 33•3 years ago
|
||
Okay, the fix landed in bug 1769807 so maybe we can dupe to that.
Updated•3 years ago
|
Updated•2 years ago
|
Description
•