Closed Bug 1690877 Opened 3 years ago Closed 3 years ago

New long-ish delay between first paint and presenting initial about:home

Categories

(Toolkit :: Startup and Profile System, defect)

Firefox 87
defect

Tracking

()

RESOLVED FIXED
87 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox85 --- unaffected
firefox86 --- unaffected
firefox87 + fixed

People

(Reporter: mconley, Assigned: jfkthame)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression, Whiteboard: [fxperf:p1])

Attachments

(3 files, 1 obsolete file)

I just started noticing this not long ago: when starting Firefox Nightly on my MBP, I'm seeing a pretty sizeable delay between presenting the toolbars, and presenting about:home. This is even with the about:home startup cache enabled.

It's pretty easy to detect visually - I'll post some links to some videos shortly to describe what I'm seeing. What's kind of disturbing is that none of our automated performance tests seem to have picked anything like this up, which is exactly what they're supposed to do.

Using mozregression, I've narrowed it down to this range:

https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=a2da9ad7af5174f244a830f53df91833f384af65&tochange=75042b68adbbcedd3881695bd94b002cb5b5f329

Is a narrower range possible?

Flags: needinfo?(mconley)

Yeah, doing some local builds to try to narrow it down. I'll do my best. Leaving needinfos until I can hopefully narrow it down.

For what it's worth, on my Windows 10 machine I see the expected behavior, i.e. no long delay between first paint and display of about:home. I've tested it on Nightly 87.0a1 (2021-02-03) (64-bit) as well as a local build based on https://hg.mozilla.org/mozilla-central/rev/203b0106b320b510974d704cd7f28ae507b48e19.

Sebastian

Local regression hunting points at bug 1687622

No longer blocks: 1687622, 1689316, 1689899, 1689940, 1690043
Flags: needinfo?(mconley)
Regressed by: 1687622
Has Regression Range: --- → yes

That's a relief. I was hoping it was that one and not bug 1689316.

Hmm, that suggests that there may be content on your about:home (it might depend what's present among your "top sites" etc) that's hitting the font fallback codepath, and in particular causing additional fonts to be loaded that weren't previously included in the "common" fallbacks.

Could you get a profile, please?

Sure thing:

https://share.firefox.dev/3aCuZML

Looks like ~2 seconds blocking the privileged about content process main thread running CreatePlatformFontList. I'm happy to provide other debugging information.

Set release status flags based on info from the regressing bug 1687622

Oh, I see.... it's spending all this time in the Cocoa method [NSFont systemFontOfSize:]. So this isn't about the change to GetCommonFallbackFonts, it's a result of the second patch in bug 1687622 (https://phabricator.services.mozilla.com/D102821), which activates a bunch of fonts for additional language support that we were previously missing.

We deliberately added code to activate these "language support" fonts in the past, back in bug 1392659. That broke when Apple moved them to a new location; bug 1687622 brought them back to life again. But in the meantime we've been enjoying the faster startup that results from a reduced font collection.

The delay here seems awfully long, though; we should see if there's some way we can reduce it.

What exact macOS version are you on? Launching Nightly on my MBPro, I see 150ms under gfxMacPlatformFontList::InitSystemFontNames(), which is far more reasonable than the 1931ms in your profile! I'm wondering why it's so different.

Flags: needinfo?(mconley)

I'm still rocking Catalina - 10.15.7 on a 16-inch 2019 MBP.

Flags: needinfo?(mconley)

I'm happy to give you a dump of all of the names of the fonts installed on my system, if that helps (but I'm not sure I'd know how to do that).

I doubt it'd tell us anything we can't already guess. Do you have a directory /System/Library/Fonts/Supplemental, containing a couple hundred-plus font files, including lots of Noto faces for assorted languages?

Flags: needinfo?(mconley)
```
Flags: needinfo?(mconley)

OK, so not quite as many as I have on Big Sur, but it's the same idea. Yet here, I'm not getting the 2-second delay, although it sounds like that's pretty consistently reproducible for you.

Presumably if you go back to a build from before bug 1687622 landed, you're likely to see hexboxes in place of characters for some of the letters like 𐬔𐬖𐬘𐬚ꛊꛎꛒꛕᯂᯈᯐ𑀃𑀆𑀞𑀢ᨂᨅᨈᨋᝀᝄᝇᝊ𐊠𐊡𐊢𐊣𑄄𑄇𑄋𑄎ꨓꨖꨝꨡⲘⲒⲔⲈ𒀂𒁀𒀶𒁇𐠂𐠛𐠟𐠴𓀪𓀮𓁔𓁙ⰗⰛⰎⱇ𐌼𐌹𐌺𐌴ᜤᜡᜩᜱ𐡂𐡚𐡈𐡎𐭽𐭪𐭧𐭱𐭇𐭌𐭏 (... and more, these are just a sampling of the scripts I'd expect to be affected). We activate these fonts in order to get the extended Unicode support they provide.

I guess we can try to use different system font APIs during this initialization, though I fear they may all end up hitting the same slow path internally; given that I don't see the issue here on 11.1, it seems like this may be something within Apple's font management subsystem that's hard for us to work around.

It's interesting that the long delays only appear in the content processes. In the parent process InitSystemFontNames is also called but it runs much more quickly.
Do we still need to call these functions in content processes, now that we have the shared font list?

(In reply to Markus Stange [:mstange] from comment #18)

It's interesting that the long delays only appear in the content processes. In the parent process InitSystemFontNames is also called but it runs much more quickly.

Yes, that's strange.

From the profile, it looks like in the content processes, when we call [NSFont systemFontOfSize:], code down in libFontRegistry.dylib ends up making XPC calls to collect font information, whereas in the parent process, it works directly in-process (and far faster).

My guess is that this happens because of our sandboxing; libFontRegistry must know that it the process is sandboxed, and hence it proxies its requests over to another service, whereas in the main process it doesn't have to do that.

Mike, if you set security.sandbox.content.level to zero, does that make any difference to the behavior here?

Do we still need to call these functions in content processes, now that we have the shared font list?

Unfortunately, the answer seems to be yes (according to bug 1672842), because the system font doesn't appear at all in the exposed font list.

We might be able to avoid all this by migrating to newer APIs such as CTFontCreateUIFontForLanguage instead of treating -apple-system as if it's an alias for a "normal" font in our font list, but that'd involve some pretty substantial reworking of our font management to support an entirely separate kind of font with its own methods of lookup/instantiation.

A less invasive, though also less comprehensive, workaround might be to have the parent process add hidden system fonts to the shared list, and pass the associated names over to content processes. I'll see if I can find some cycles to look into this possibility further.

Flags: needinfo?(mconley)
Attachment #9201465 - Attachment is obsolete: true
Flags: needinfo?(mconley)

Whoops - accidentally cleared needinfo. Yes, will try with that pref flipped - sec.

Mike, if you set security.sandbox.content.level to zero, does that make any difference to the behavior here?

Yes, absolutely - the performance issue seems to disappear entirely.

Yay, that's great..... kinda, I guess. Except that we want people to have sandboxing enabled. Sigh.

Maybe we can poke just enough holes in the sandbox to make the font loader happy. Haik, can you advise?

Flags: needinfo?(haftandilian)

(In reply to Markus Stange [:mstange] from comment #23)

Maybe we can poke just enough holes in the sandbox to make the font loader happy. Haik, can you advise?

I'll take a look, but probably not. If getting these system fonts is dependent on an additional service, we may be able to allow it. But we are already allowing certain font services in the sandbox to facilitate font access so I don't anticipate there is anything additional we could do. If the fonts in question are all in safe directories (i.e., directories we would allow content processes to access), we could try allowing filesystem access to those locations to see if that bypasses any of the slow code paths. We already allow full read access to /Library/Fonts and /System.

For fonts not in those directories, font access uses the sandbox exception mechanism where our content processes are able to be granted direct filesystem access to individual font files by the font server even though the content process otherwise doesn't have access to those directories.

@Mike, are all these fonts already under /System/Library? FYI, I found this command useful system_profiler SPFontsDataType. It prints all the fonts out and you can grep with "Location:" to get the paths. It includes fonts that have been disabled by the font manager.

I'll try to reproduce and check if there are other font-related services being blocked that might help.

Flags: needinfo?(haftandilian) → needinfo?(mconley)
Flags: needinfo?(haftandilian)

I haven't done an exhaustive check to see if the items in the first list have entries in the second, but I did see some common ones at a glance.

Flags: needinfo?(mconley)

Could you rule out if the fonts in /Users/mikeconley/Library/Fonts are causing the problem? I could also provide a build with that directory on the allow list.

Flags: needinfo?(mconley)

(In reply to Haik Aftandilian [:haik] from comment #27)

Could you rule out if the fonts in /Users/mikeconley/Library/Fonts are causing the problem? I could also provide a build with that directory on the allow list.

Sure, but how? Do I temporarily delete those files? What's the best way to rule those out?

(In reply to Mike Conley (:mconley) (:⚙️) (Catching up on needinfos) from comment #28)

(In reply to Haik Aftandilian [:haik] from comment #27)

Could you rule out if the fonts in /Users/mikeconley/Library/Fonts are causing the problem? I could also provide a build with that directory on the allow list.

Sure, but how? Do I temporarily delete those files? What's the best way to rule those out?

Disabling them in the Font Book application should do the trick. They should show up in the "User" section allowing you to right-click and select disable.

I had the following fonts listed in the "User" section of the Font Book application:

  • Ahem
  • ChopinScript
  • Kingthings Extortion
  • Mademoiselle Camille
  • Mademoiselle Camille L
  • MrsEavesSmallCaps

Disabling all of these fonts in Font Book had no perceivable impact on the startup performance problem.

Flags: needinfo?(mconley)

(In reply to Haik Aftandilian [:haik] from comment #24)

(In reply to Markus Stange [:mstange] from comment #23)

Maybe we can poke just enough holes in the sandbox to make the font loader happy. Haik, can you advise?

I'll take a look, but probably not. If getting these system fonts is dependent on an additional service, we may be able to allow it. But we are already allowing certain font services in the sandbox to facilitate font access so I don't anticipate there is anything additional we could do. If the fonts in question are all in safe directories (i.e., directories we would allow content processes to access), we could try allowing filesystem access to those locations to see if that bypasses any of the slow code paths. We already allow full read access to /Library/Fonts and /System.

The font files involved here are all found under /System (specifically, /System/Library/Fonts/Supplemental), so there should be read access to them already; it must be something other than read access to the files that's causing the font service to decide it needs to remote its work.

In Activity Monitor, I see there's a process fontd, which I suspect may be what manages the "font registry"; and it has open file handles to several files under /private/var/ (in my case, under /private/var/folders/cn/nm20gjl96q7g0dqf00y7bh5m0000gn/C/com.apple.FontRegistry/, but the exact path there will presumably vary) that I suspect may be font metadata caches.

So I wonder if the Core Text code tries to use these caches directly, but in the content process it finds it can't and so resorts to slow XPC instead.

Do the sandbox rules allow us to grant access to a path with wildcards such as /private/var/folders/*/*/*/com.apple.FontRegistry (assuming the exact path I see may not be universal)? Or can we try simply allowing all subpaths of /private/var/folders? I'd be curious if that would solve Mike's issue.

BTW, in profiles I'm seeing the same Long Delay - ~2.1s for initializing fonts. See https://share.firefox.dev/2LuSXB9 - I've seen this in multiple profiles. This might be causing major delays when loading some sites in fission, since we're starting new content processes in parallel to loading the page (to replenish the preallocated process cache)

Could you confirm which macOS version you're seeing this on -- also 10.15.x like Mike, or....?

Flags: needinfo?(rjesup)

Oh, never mind - I just realized the profiler reports that: 10.15 indeed.

Flags: needinfo?(rjesup)

I'd guess there's a strong possibility this is the reason for the Mac-specific visual metrics issues that Neha mentioned in the Fission breakout at last week's Firefox-Weekly meeting. (See question 7 at https://docs.google.com/document/d/1xNaDuF8grV3QEMwukW6HGJzwFdZmQAfH3stA3uC4wzY/edit#.) Might be interesting to try disabling the content sandbox and seeing what impact that has on the metrics.

Blocks: 1690855

I haven't had a chance to debug this yet, but to test Jonathan's theory, here's a patch that allows content processes access to the cache directory containing the com.apple.FontRegistry directory. If this does avoid the delays, we would debug further to see exactly which directories are needed and just allow those.

https://hg.mozilla.org/try/rev/9370a14f1b6821539210811eb1413034b0ca3444

(In reply to Jonathan Kew (:jfkthame) from comment #31)

Do the sandbox rules allow us to grant access to a path with wildcards such as /private/var/folders/*/*/*/com.apple.FontRegistry (assuming the exact path I see may not be universal)? Or can we try simply allowing all subpaths of /private/var/folders? I'd be curious if that would solve Mike's issue.

Yes, we can allow paths using regex, but in this case we could build the path using confstr(_CS_DARWIN_USER_CACHE_DIR, ...). From the terminal, you can run getconf DARWIN_USER_CACHE_DIR to check what that path is.

Flags: needinfo?(haftandilian)

(In reply to Jonathan Kew (:jfkthame) from comment #35)

I'd guess there's a strong possibility this is the reason for the Mac-specific visual metrics issues that Neha mentioned in the Fission breakout at last week's Firefox-Weekly meeting.

The problem is that the visual metrics issue on mac predates this landing, and automation runs don't show any regression when this lands: https://vismet.arewefastyet.com/mac/cold-page-load/overview?numDays=60&series=Firefox-Fission,Firefox-WebRender

I ran into this while doing perf runs and profiles to track that down; I saw that youtube had jumped to the top of the list of regressions, while in automation it wasn't that bad.

Note that automation runs (dual-core Mac minis) might not be affected by this bug while a laptop with more cores (i7 quad is what I used) might; you can get odd timing interactions. We might never see 'idle' time on the Minis while loading, while my machine would see idle and start a new process. (just a guess as to why we see this in local runs but not in automation)

Given how the sandboxing adjustment doesn't sound very straightforward, this workaround from comment 19 sounds more attractive now:

(In reply to Jonathan Kew (:jfkthame) from comment #19)

A less invasive, though also less comprehensive, workaround might be to have the parent process add hidden system fonts to the shared list, and pass the associated names over to content processes. I'll see if I can find some cycles to look into this possibility further.

Mike, Randell: have either of you had a chance to try Haik's build from comment 36 and see how it behaves? If it resolves the issue, I think that -- potentially refined to narrow its scope to just the com.apple.FontRegistry subdirectory -- would be the best way forward, as I suspect this could be causing a wider performance drag on content processes if Core Text has to go via an XPC route for font access. It seems plausible this could hurt us a little bit every time Core Text needs to instantiate a font, if it doesn't have direct access to the system's cache. The effect would be small on any given instance, unlike here where we're querying every available font, but still it'd be better to avoid it if possible.

ni'ing myself to try haik's build from comment 36 tomorrow.

Flags: needinfo?(mconley)

So far I haven't been able to reproduce this on 10.16. I tried with Fission enabled and browsing to sites and then looking for the delays during content process startup time. I also tried opening several about:newtab pages. I will test on 10.15 too.

Yeah, it doesn't reproduce for me on 10.16 either. I believe there must have been changes within CoreText (or lower, e.g. libFontRegistry) that mean it's no longer an issue on the latest OS version.

(I'm not sure about earlier versions, either; it'd be mildly interesting to know whether this is an issue on 10.14. It may be purely Catalina-specific.)

I haven't looked at a profile with the comment 36 patch yet, but I did a browsertime performance run with it, and the youtube.com visual metrics regression for fission (which starts processes to replenish the preallocated-process-cache during load) dropped from ~35% to ~7%, which is in-line with regressions we see on other sites with mac.

I note we don't see a regression in automation when this landed: https://treeherder.mozilla.org/perfherder/graphs?series=mozilla-central,2893742,1,13&timerange=5184000&series=mozilla-central,2893768,1,13
That's run on Mac Minis; I'm checking on the OS version.

The mac minis are running 10.14, so perhaps this isn't a bug there either

Yeah, after applying the patch from the comment 36, things were much much better - the regression was effectively gone.

Flags: needinfo?(mconley)

On dupe bug 1691719, Henrik is hitting this on 10.16 so it is not limited to 10.15 and earlier.

[Tracking Requested - why for this release]: I'm going to ask again for tracking the 87 release given that this causes a 550ms delay during each and every startup of Firefox.

Component: New Tab Page → Startup and Profile System
Product: Firefox → Toolkit
Version: unspecified → Firefox 87

I can now reproduce this on macOS 11.2 (aka 10.16.2) using MOZ_PROFILER_STARTUP=1 with Fission enabled with/without about:home startup cache.

It looks like we can resolve this, at least for 10.16 aka 11.x, by limiting the activation of the supplemental language support fonts to the parent process; even if child processes don't explicitly activate those fonts, the fact that the parent did so is sufficient to make them available.

I'm not sure what impact this will have on the 10.15 case where libFontRegistry appears to be hitting XPC; it may not help as much there.

We'll also need to test that it doesn't impact the availability of fonts on earlier systems.

Try build with patch; testing/profiling on various systems welcome:
https://treeherder.mozilla.org/jobs?repo=try&revision=99347722b1f8d4036fb543509010739365baeb57

Assignee: nobody → jfkthame
Status: NEW → ASSIGNED

(In reply to Jonathan Kew (:jfkthame) from comment #50)

Try build with patch; testing/profiling on various systems welcome:
https://treeherder.mozilla.org/jobs?repo=try&revision=99347722b1f8d4036fb543509010739365baeb57

On macOS 11.2, with the patch I no longer see the large delays in the profiler and switching through my restored tabs is noticeably faster.

Pushed by jkew@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b96c864bbc34
Only activate additional fonts in the parent process on macOS. r=lsalzman
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 87 Branch

I'm seeing better numbers in local browertime runs with the patch, and the regression for youtube has gone down from ~27-38% to ~16-25%.

However, with the patch from comment 36, the regression dropped to ~7%, so it looks like something here (probably with the fonts, but not verified) is still hurting us. I'll collect new profiles
NI haik and jfkthame for their awareness

Flags: needinfo?(jfkthame)
Flags: needinfo?(haftandilian)
Blocks: 1692220

Continuing this in bug 1692220.

Flags: needinfo?(jfkthame)
Flags: needinfo?(haftandilian)
Regressions: 1704273
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: