Closed Bug 1515826 Opened 11 months ago Closed 11 months ago

Long delay after startup with white document not showing tabs or content

Categories

(Core :: Security: Process Sandboxing, defect, P1)

ARM64
Windows 10
defect

Tracking

()

VERIFIED FIXED
mozilla66
Tracking Status
firefox66 --- verified

People

(Reporter: asa, Assigned: bobowen)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

System: Windows on Snapdragon
Build: Today's Nightly build, 2018-12-20 a.m. update
Problem: The window is blank after startup for about 4 seconds before tabs are painted and content starts loading.

This wasn't as bad a week or two ago, or I simply overlooked it (hard to believe).
It feels like it hasn't changed for me (and yes, startup time is way too long).
I wonder if this is related to bug 1513748, though it's hard to say given the very little information currently in that bug.
Actually, now that I'm starting to restart, it does feel longer than it used to... Here's a profile: https://perfht.ml/2QLP1xx
It looks like waiting for the start of the first web content process is what takes forever.
(In reply to Mike Hommey [:glandium] from comment #4)
> It looks like waiting for the start of the first web content process is what
> takes forever.

Maybe jld's async process launch work will really help here?
According to Windows Performance Analyzer, we're not doing any work, just waiting, but for what? The main thread is completely switched out for 4+ seconds, and there is nearly zero activity from any process named firefox.exe in that time. I'm having trouble getting "CPU Usage (Precise)" to tell me what we were waiting on (the Readying Process is "Idle"?), I'll look again next week.
Thanks to a tip from MS, the stall is due to timing out waiting for the RDD process to launch. Maybe it failed? And it appears to be related to sandboxing, since setting MOZ_DISABLE_RDD_SANDBOX=1 is enough to avoid the issue. Bob (IIRC you have an arm device) could you take a look?
Blocks: 1498624
Flags: needinfo?(bobowencode)
(In reply to David Major [:dmajor] from comment #7)
> Thanks to a tip from MS, the stall is due to timing out waiting for the RDD
> process to launch. Maybe it failed? And it appears to be related to
> sandboxing, since setting MOZ_DISABLE_RDD_SANDBOX=1 is enough to avoid the
> issue. Bob (IIRC you have an arm device) could you take a look?

At first I thought this might be a similar thing to bug 1481518 comment 17, but this seems to happen when installed in Program Files as well.

The root cause looks like the RDD sandbox causing a failure on ARM64, so I'll look into that.

This is compounded by the RDD process being started at start-up, instead of when needed.
The launch is in fact async, but it waits for the launch of the process here, which is called from ContentParent::InitInternal.
So, my guess is that is why we see the delay. The timeout is 5000ms, so that fits.

[1] https://searchfox.org/mozilla-central/rev/f8de61826903996f6bdf41b11a2844dd59ac144f/dom/media/ipc/RDDProcessManager.cpp#192
Assignee: nobody → bobowencode
Status: NEW → ASSIGNED
Flags: needinfo?(bobowencode)
Priority: -- → P1
If you find that this is going to be thorny/lengthy to fix, maybe we should temporarily disable the RDD process on arm64 for the sake of the self-host experience?

And regardless of whatever arm64-specific thing is happening here, it doesn't seem great that the RDD process is allowed to hold up the browser launch in general. Is there a bug to move the init off the startup path?
(In reply to David Major [:dmajor] from comment #9)
> If you find that this is going to be thorny/lengthy to fix, maybe we should
> temporarily disable the RDD process on arm64 for the sake of the self-host
> experience?

Not quite sure what you mean by self-host, it fails if installed in Program Files or Users (if that is self-host).
But I agree that disabling the RDD process on arm64 might be a good idea.

> And regardless of whatever arm64-specific thing is happening here, it
> doesn't seem great that the RDD process is allowed to hold up the browser
> launch in general. Is there a bug to move the init off the startup path?

There is, I found bug 1514874.

(In reply to Nathan Froyd [:froydnj] from comment #5)

Maybe jld's async process launch work will really help here?

Somewhat. RDDProcessManager::CreateContentBridge would need to be changed to return a promise, and then the SendInitRemoteDecoder could be chained off of that and happen asynchronously, and then in the content process RecvInitRemoveDecoder does an async dispatch, so that's already unordered relative to PContent IPC and in theory shouldn't need further changes.

As for the “changed to return a promise“ part, chaining from WhenProcessHandleReady should be enough: the process handle is all that's needed to construct endpoints, and RDDProcessHost::InitAfterConnect (and thus RDDChild::Open) already doesn't happen-before anything blocking on WaitUntilConnected, so making the latter run earlier shouldn't make a difference.

OK not fully confirmed it yet, but I think this is actually the same problem as bug 1481518 comment 17.
In the case of the RDD process, the new thread will be running at USER_LOCKDOWN, so won't have access to Program Files* either.

Interestingly, this would also cause a problem for any child process launch for chromium (apart from maybe their GPU process), so I guess they'll need to solve this problem as well.
I'll investigate further, but maybe turning off the RDD process for arm64 should be the work-around at the moment.

Not sure who would make a final decision on that ... mjf?
However, it's never going to launch successfully with the sandbox at that moment.

Flags: needinfo?(mfroman)

We should disable it on Arm64 for now and file a follow-up to fix the underlying issue. The artificial delay is causing a pretty negative impression with early testers.

(In reply to Bob Owen (:bobowen) from comment #12)

I'll investigate further, but maybe turning off the RDD process for arm64 should be the work-around at the moment.

Not sure who would make a final decision on that ... mjf?
Turning off RDD almost certainly means turning off AV1 also, so that will not be my decision. Jean-Yves?

Flags: needinfo?(mfroman) → needinfo?(jyavenard)

(In reply to Michael Froman [:mjf] from comment #14)

Turning off RDD almost certainly means turning off AV1 also, so that will not be my decision. Jean-Yves?

Just to be clear: we only want to disable this on Windows on Arm64. Regardless, if AV1 needs RDD it's certainly not working now as the process just crashes.

(In reply to Eric Rahm [:erahm] from comment #15)

(In reply to Michael Froman [:mjf] from comment #14)

Turning off RDD almost certainly means turning off AV1 also, so that will not be my decision. Jean-Yves?

Just to be clear: we only want to disable this on Windows on Arm64. Regardless, if AV1 needs RDD it's certainly not working now as the process just crashes.
AV1 can be decoded without using a separate process but it was considered a security risk to do so. The purpose of the RDD process is to sandbox AV1 to reduce that risk. That risk should be considered when deciding on whether to leave AV1 default enabled for Arm64.

I've confirmed that this is the same issue with parallel DLL loading.
We've already had a reply from Microsoft over one way of turning this off in the registry by creating:

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\firefox.exe]
"MaxLoaderThreads"=dword:00000001

It looks like they have a way of setting this by some other means automatically, I'm not totally clear as to whether that sets the registry or uses a different mechanism, but that doesn't really matter.

(In reply to Michael Froman [:mjf] from comment #16)

(In reply to Eric Rahm [:erahm] from comment #15)

(In reply to Michael Froman [:mjf] from comment #14)

Turning off RDD almost certainly means turning off AV1 also, so that will not be my decision. Jean-Yves?

Just to be clear: we only want to disable this on Windows on Arm64. Regardless, if AV1 needs RDD it's certainly not working now as the process just crashes.

AV1 can be decoded without using a separate process but it was considered a security risk to do so. The purpose of the RDD process is to sandbox AV1 to reduce that risk. That risk should be considered when deciding on whether to leave AV1 default enabled for Arm64.

Right, but disabling the RDD process on arm64 until Microsoft can roll out this change should be uncontroversial, as it just crashes now anyway.

Will the patch at bug 1514874 fix this issue for us?

(In reply to Asa Dotzler [:asa] from comment #18)

Will the patch at bug 1514874 fix this issue for us?

It will remove the 5 second delay from startup, but fix the functionality of the RDD process (av1 videos).

*but it won't fix the functionality

I've spoken to mjf on IRC, I'm going to disable the RDD process and AV1 on arm64 until we can fix the RDD process:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=65708e024a80b82297a60d86291c0061ba73b2fd

Attachment #9035112 - Flags: review?(jyavenard)

I've tested the opt try build on x86 and rdd process and av1 are still enabled and working.
I've also confirmed they are disabled on the aarch64 build and that the delay has gone.

Comment on attachment 9035112 [details] [diff] [review]
Disable RDD process and AV1 on Windows arm64

Review of attachment 9035112 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #9035112 - Flags: review?(jyavenard) → review+

disabling AV1 on windows ARM64 seem an acceptable compromise, I doubt it would work well anyway

Flags: needinfo?(jyavenard)
Pushed by bobowencode@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3951a4f90911
Disable RDD process and AV1 on Windows arm64. r=jya
Status: ASSIGNED → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 66

I'm looking at adding the registry entries to the installer to set the MaxLoaderThreads and fix this issue (for normal installs) in bug 1519368.

Component: General → Security: Process Sandboxing
Product: Firefox → Core
See Also: → 1519368
Target Milestone: Firefox 66 → mozilla66
Duplicate of this bug: 1513748

Reproduced the issue on affected Nightly 66 and 67 (2019-01-01) on laptop Lenovo Yoga with Windows 10 x64.
Verified - Fixed on Firefox Beta 66.0 (2019-03-11) on the OS mentioned above.

Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.