Closed Bug 1443437 Opened 4 years ago Closed 4 years ago

Slow first startup of 32bit Firefox builds on gecko-t-linux-xlarge (linux 64bit) workers due to missing i386 font cache

Categories

(Firefox Build System :: Task Configuration, task, P1)

task

Tracking

(firefox-esr52 unaffected, firefox59 wontfix, firefox60 fixed)

RESOLVED FIXED
mozilla60
Tracking Status
firefox-esr52 --- unaffected
firefox59 --- wontfix
firefox60 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

On the linux32 platform I can see a slow first startup of Firefox for basically every web-platform related test job. Checking different logs it shows that we talk about roughly 30s here. The webdriver tests seem to be more prone to fail due to the 30s test duration setting, and given that this is the duration Firefox needs for a cold start, we always fail. See:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b2a3f6a13b2230d7ddc5083cc6b82844268d03f1

It means that opt builds need exactly the same time as debug builds to startup! This is not something I would ever expect.
Here some stats:

Debug build:
https://treeherder.mozilla.org/logviewer.html#?job_id=166082214&repo=try&lineNumber=1224-1263

> [task 2018-03-05T22:20:29.934Z] 22:20:29     INFO - PID 858 | 1520288429930	mozrunner::runner	INFO	Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.oW8s8qxIYUEb"
> [task 2018-03-05T22:20:31.890Z] 22:20:31     INFO - PID 858 | [871, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file /builds/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 2909
> [task 2018-03-05T22:20:59.088Z] 22:20:59     INFO - PID 858 | ++DOCSHELL 0xe4384000 == 1 [pid = 871] [id = {a2b6bd56-b0d0-41e1-952c-2eed44999634}]
> [..]
> [task 2018-03-05T22:21:04.299Z] 22:21:04     INFO - PID 858 | 1520288464294	Marionette	INFO	Listening on port 2828

Here we get a log line for the permission manager 2s after we started Firefox. Then it takes 28s for the next log output.

Opt build:
https://treeherder.mozilla.org/logviewer.html#?job_id=166082499&repo=try&lineNumber=1211-1214

> [task 2018-03-05T22:20:45.650Z] 22:20:45     INFO - STDOUT: tests/web-platform/tests/webdriver/tests/actions/key.py::test_lone_keyup_sends_no_events 
> [task 2018-03-05T22:20:45.650Z] 22:20:45     INFO - PID 853 | 1520288445644	mozrunner::runner	INFO	Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.P3FNtNslzjsF"
> [task 2018-03-05T22:21:14.666Z] 22:21:14     INFO - PID 853 | GLib-GIO-Message: Using the 'memory' GSettings backend.  Your settings will not be saved or shared with other applications.
> [task 2018-03-05T22:21:15.354Z] 22:21:15     INFO - TEST-UNEXPECTED-TIMEOUT | /webdriver/tests/actions/key.py | expected OK
> [task 2018-03-05T22:21:15.355Z] 22:21:15     INFO - TEST-INFO took 30040ms

With starting the test Firefox gets started immediately, but then it takes 30s for a first log line by Firefox.

I think that submitting a patched wdspec job which uses `--webdriver-args=-vv` for trace output might give us more details.
Here the log output from a job of an opt build and trace logs enabled:

https://treeherder.mozilla.org/logviewer.html#?job_id=166176579&repo=try&lineNumber=1208-1484

Timings of important steps:

> [task 2018-03-06T10:50:42.275Z] 10:50:42     INFO - PID 850 | 1520333442265	mozrunner::runner	INFO	Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.I2QMtuUyKgTC"
> [..]
> [task 2018-03-06T10:50:42.752Z] 10:50:42     INFO - PID 850 | 1520333442742	Marionette	DEBUG	Received observer notification profile-after-change
> [..]
> [task 2018-03-06T10:51:07.448Z] 10:51:07     INFO - PID 850 | 1520333467442	Marionette	DEBUG	Received observer notification command-line-startup
> [..]
> [task 2018-03-06T10:51:08.611Z] 10:51:08     INFO - PID 850 | 1520333468600	Marionette	DEBUG	Received observer notification sessionstore-windows-restored
> [..]
> [task 2018-03-06T10:51:08.898Z] 10:51:08     INFO - PID 850 | 1520333468872	Marionette	INFO	Listening on port 2828

So the `profile-after-change` observer notification is received pretty quickly, but then there is a huge delay of 25 seconds until `command-line-startup` is received.

Karl, do you have an idea what could delay the observer notification that drastically on our gecko-t-linux-xlarge workers in automation? Are there any specific MOZ_LOG settings which could be used to get more internal details of the delay?
Flags: needinfo?(karlt)
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Priority: -- → P1
This problem is actually not only affecting web-platform tests but every test job as run in automation on Linux 32. For each of those I can see a massive delay until the `command-line-startup` observer notification is sent. Here some examples:

Mochitest: https://treeherder.mozilla.org/logviewer.html#?job_id=166172237&repo=autoland&lineNumber=2309-2317 

[task 2018-03-06T10:10:29.646Z] 10:10:29     INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmppoVstB.mozrunner
[task 2018-03-06T10:10:29.662Z] 10:10:29     INFO - runtests.py | Application pid: 1036
[task 2018-03-06T10:10:29.662Z] 10:10:29     INFO - TEST-INFO | started process GECKO(1036)
[task 2018-03-06T10:10:30.551Z] 10:10:30     INFO - GECKO(1036) | 1520331030544	Marionette	DEBUG	Received observer notification profile-after-change
[task 2018-03-06T10:10:58.119Z] 10:10:58     INFO - GECKO(1036) | 1520331058114	Marionette	DEBUG	Received observer notification command-line-startup
[task 2018-03-06T10:10:58.120Z] 10:10:58     INFO - GECKO(1036) | 1520331058114	Marionette	DEBUG	Received observer notification nsPref:changed
[task 2018-03-06T10:10:59.411Z] 10:10:59     INFO - GECKO(1036) | 1520331059400	Marionette	DEBUG	Received observer notification sessionstore-windows-restored
[task 2018-03-06T10:10:59.734Z] 10:10:59     INFO - GECKO(1036) | 1520331059720	Marionette	DEBUG	New connections are accepted
[task 2018-03-06T10:10:59.735Z] 10:10:59     INFO - GECKO(1036) | 1520331059720	Marionette	INFO	Listening on port 2828

Browser Chrome: https://treeherder.mozilla.org/logviewer.html#?job_id=166172222&repo=autoland&lineNumber=2012-2020
Ref test: https://treeherder.mozilla.org/logviewer.html#?job_id=166172264&repo=autoland&lineNumber=1772-1779
Component: web-platform-tests → General
Summary: Investigate first slow startup of Firefox opt builds for web-platform-tests on linux32 → Very slow cold startup time of Firefox opt builds on gecko-t-linux-xlarge (linux32) workers
Maybe it is somewhat related to bug 1414495 comment 24, which has only seen / investigated on Windows yet. To hopefully see where the hang actually occurs I pushed a try build with a couple of debug print statements:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d6759a7767cbe072da39dd115fe07fe3739417ef
Can we just run the profiler as part of this setup and look at the profiler output via http://perf-html.io/ ?
Flags: needinfo?(hskupin)
Most of the time between profile-after-change and command-line-startup is usually in the add-on manager startup. I think our test jobs sideload some test add-ons; that requires main thread I/O, and that triggers loading the blocklist. The machine would need to have some pretty slow (or busy) disks to have this take 30s, but it's possible.
wpt jobs don't use any addons; they just use marionette. Of course marionette could be doing something slow, or there could be some other test-specific problem.
I got a profile for startup: https://perfht.ml/2Hcn0G5

As it looks like we spent 99% of the time in TelemetryStartup for annotating the environment, and specifically building up the font list?
Flags: needinfo?(karlt)
Flags: needinfo?(hskupin)
Flags: needinfo?(gfritzsche)
Whereby the affected code seems to be located in gfxFcPlatformFontList::gfxFcPlatformFontList. So it looks more like a GFX as Telemetry issue.
Component: General → Graphics
Flags: needinfo?(gfritzsche) → needinfo?(milan)
Product: Testing → Core
Version: Version 3 → unspecified
Assignee: hskupin → nobody
Status: ASSIGNED → NEW
Priority: P1 → --
An observation I just made is that the hang is actually only existent for non Nightly/Release builds, so any CI build we run tests for.

Here some steps to reproduce it:

1) Request a one-click loaner based on https://treeherder.mozilla.org/#/jobs?repo=try&revision=1544ec8719409ce0e0f2ad2f58e4e4898f15154b&selectedJob=166224867

2) Download any CI build of Firefox like https://queue.taskcluster.net/v1/task/PEWCZZrfQN2bFlZYFrvJnA/runs/0/artifacts/public/build/target.tar.bz2 and extract it

3) Run Firefox

Please note that using a Nightly or Release doesn't show this problem! Also if you hit Ctrl-C during the hang, the next time Firefox is started it hangs again. But once Firefox has been completed the startup, the hang will not appear again for any other startup for that one-click loaner and a new one has to be requested.
I tried to use strace to log the file system access but sadly after a couple of seconds a dialog pops-up which tells that Firefox has unexpectedly closed and offers safe mode or to refresh nightly. Then it seems to hang in those SIGVTALRM lines.

open("/usr/lib/i386-linux-gnu/libgconf-2.so.4", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/lib/libgconf-2.so.4", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/libgconf-2.so.4", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/dev/shm/org.chromium.6taIq6", O_RDWR|O_CREAT|O_EXCL, 0600) = 50
open("/dev/shm/org.chromium.0sTRM0", O_RDWR|O_CREAT|O_EXCL, 0600) = 52
open("/usr/share/themes/Ambiance/gtk-3.0/assets/button-focused.png", O_RDONLY|O_LARGEFILE) = 52
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=3781, si_uid=0} ---
--- SIGVTALRM {si_signo=SIGVTALRM, si_code=SI_TKILL, si_pid=3781, si_uid=0} ---
[..]

Looks like this is caused by a broken profile and maybe due to hitting ctrl-c during startup.

When I tried with another fresh profile it worked. So a new loaner is needed which currently takes forever. I will continue tomorrow.
So when I compare the two strace logs between a working Nightly build and the broken CI build already the second line shows a difference.

CI build:
> execve("firefox/firefox", ["firefox/firefox", "-profile", "/builds/worker/_profile"], [/* 47 vars */]) = 0
> strace: [ Process PID=1115 runs in 32 bit mode. ]
> brk(NULL)                               = 0x9d86000

Nightly build:
> execve("firefox/firefox", ["firefox/firefox", "-profile", "/builds/worker/_profile"], [/* 17 vars */]) = 0
> brk(NULL)                               = 0x979000

Both builds should be 32bit Linux binaries and are run on a 32bit system. Why does strace list "runs in 32 bit mode" for CI builds? What are we building different for CI builds compared to Nightly builds? Could this difference cause such a slow behavior of gfxFcPlatformFontList::gfxFcPlatformFontList as seen in the profile in comment 8?

https://perf-html.io/public/851a746970e990b5e2de74958ab7fef3703c4672/calltree/?hiddenThreads=&range=0.0747_30.0274&thread=0&threadOrder=0-2-3-4-5-6-7-8-9-1&v=2

The part of the strace log which is showing that slowness is between the 4th instance of "ZZZ NS_CreateServicesFromCategory starts loop" and the next "ZZZ NS_CreateServicesFromCategory done".

Jed or Mike does one of you know why we build differently and if this could cause the problem?
Component: Graphics → General
Flags: needinfo?(milan)
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(jld)
Product: Core → Firefox
Here some updated steps in how to reproduce the 30s hang during startup:

1) Grab a one-click loaner based on https://treeherder.mozilla.org/#/jobs?repo=try&revision=1544ec8719409ce0e0f2ad2f58e4e4898f15154b&selectedJob=166224867

2) Exit the wizard

3) Run the following commands:

wget https://queue.taskcluster.net/v1/task/PEWCZZrfQN2bFlZYFrvJnA/runs/0/artifacts/public/build/target.tar.bz2
tar xfv target.tar.bz2
apt install strace
mkdir _profile

4) Run Firefox:

firefox/firefox -profile /builds/worker/_profile

or

strace firefox/firefox -profile /builds/worker/_profile &>strace.log
I just noticed that when requesting a Linux opt one-click loaner I get the following:

# uname -a
Linux 47486cdcf0c8 4.4.0-1014-aws #14taskcluster-Ubuntu SMP Mon Feb 19 11:31:26 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Means we actually run a 64bit system but not a 32bit system. The above CI build is a 32bit binary, whereby the Nightly/Release builds I downloaded via mozdownload were 64bit binaries because I didn't force the 32bit binary.

Actually grabbing a 32bit nightly I can reproduce the same hang, and that indicates that we have those delays when running a 32bit Firefox binary on a 64bit Linux system.
Summary: Very slow cold startup time of Firefox opt builds on gecko-t-linux-xlarge (linux32) workers → Very slow cold startup time of Firefox opt 32bit builds on gecko-t-linux-xlarge (linux 64bit) workers
It actually doesn't seem to be a regression. I went back until Firefox 17.0 which also has this slow first start-up time. I wonder if it's a problem with Firefox, or if there is just some package/setting missing in the docker image for desktop1604-test.
I have the feeling that the problem is related to a missing font cache for the 32bit binaries. In the docker image we add the i386 architecture but newer seem to run fc-cache to build up the cache. As such it will be created the first time a 32bit application gets started.

If that is the case it will be a taskcluster related issue, and I will have a fix soon.
Ok, this is clearly a docker image problem. The font cache for i386 binaries is missing and during the first start of Firefox this cache will be created under `/var/cache/fontconfig/`. Removing this folder, I can always trigger the slowness.

I had a look into the fc-cache man page and as it looks like there is no way to force a rebuild of the font cache which will also include the cache for 32bit binaries because fc-cache itself is a 64bit binary.

I wonder how we can pre-build the cache so that we do not suffer from this 30s delay in each and every test job, which will cause a fair amount of intermittent failures at least for wdspec tests for the first test.
Component: General → Docker Images
Product: Firefox → Taskcluster
Summary: Very slow cold startup time of Firefox opt 32bit builds on gecko-t-linux-xlarge (linux 64bit) workers → Slow first startup of 32bit Firefox builds on gecko-t-linux-xlarge (linux 64bit) workers due to missing i386 font cache
(In reply to Henrik Skupin (:whimboo) from comment #19)
> I had a look into the fc-cache man page and as it looks like there is no way
> to force a rebuild of the font cache which will also include the cache for
> 32bit binaries because fc-cache itself is a 64bit binary.
> 
> I wonder how we can pre-build the cache so that we do not suffer from this
> 30s delay in each and every test job, which will cause a fair amount of
> intermittent failures at least for wdspec tests for the first test.

Write a dummy bit of code that forces the same initialization to happen, compile as 32-bit and run prior to running Firefox? (Maybe this is a dumb idea for some reason...)
An option could be to install `firefox:i386` via apt, which will remove the x86_64 package. But running the system firefox binary afterward will populate the font cache for 32bit binaries, and keeps the cache for 64bit binaries in-tact.

With `firefox --screenshot` we could run Firefox and let it close immediately once the cache has been built. A start of the 32bit binary under test by test jobs later would be fast because the cache exists. But I'm not that happy by using a specific feature of Firefox here. 

Maybe we can check when startup is done and kill the process?
(In reply to :Gijs (under the weather; responses will be slow) from comment #20)
> Write a dummy bit of code that forces the same initialization to happen,
> compile as 32-bit and run prior to running Firefox? (Maybe this is a dumb
> idea for some reason...)

That doesn't work. Each application builds its own cache. At least we would have to figure out how to spoof a Firefox binary.
Why do you  think each application has it's own cache? It doesn't look like that on my machine, and if it were true I'm not sure what the point  of fc-cache would be.

AIUI fc-cache basically is a wrapper program that calls the same cache-generating endpoint as a real application and then exits. So the question is how to run the 32bit variant of that at some appropriate  time (e.g. in the dockerfile so that the initial cache is baked into the image).
Because when I start a 32-bit `nautilus` binary it also takes 30s to start while building its font cache. A second start is fast again. And this always happens whether I build the font cache for Firefox before or not.

We could install `fontconfig:i386` which would generate the 32bit cache. But that removes the x86_64 version, and I don't know which implications that has.
From a conversation with Ted on IRC:

ted> my alternate proposal would be to just fetch the i386 deb and manually extract it somewhere else and run the binary from there
<whimboo> ted: correct. both exists next to each other
<whimboo> if we don't need fc-cache anymore it should be all fine
<whimboo> the deb way is more complicated
<ted> whimboo: it wouldn't be terrible, apparently you can `apt-get download fontcache:i386` which downloads the .deb to the current directory
<ted> and `dpkg-deb --extract whatever.deb path` extracts its contents
<whimboo> oh
<ted> but if installing both packages in series works and doesn't break anything else i say go for it
<whimboo> k, let the reviewer decide :)
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(jld)
Priority: -- → P1
Both patches work, so I would tend to the replacement of fontconfig with fontconfig:i386. If there is a concern with this we could still run what Ted suggested. I checked that manually on a loaner and it works too.

Linux 32:
https://treeherder.mozilla.org/logviewer.html#?job_id=166464003&repo=try&lineNumber=1235-1237

> [task 2018-03-07T12:09:42.767Z] 12:09:42     INFO - PID 852 | 1520424582764	mozrunner::runner	INFO	Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.NU6UpL7fxtoR"
[..]
> [task 2018-03-07T12:09:44.868Z] 12:09:44     INFO - PID 852 | 1520424584858	Marionette	INFO	Listening on port 2828

Linux 64:
https://treeherder.mozilla.org/logviewer.html#?job_id=166464004&repo=try&lineNumber=1248-1273

> [task 2018-03-07T12:08:12.562Z] 12:08:12     INFO - PID 856 | 1520424492556	mozrunner::runner	INFO	Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.gHEClkRqg9tU"
[..]
> [task 2018-03-07T12:08:14.722Z] 12:08:14     INFO - PID 856 | 1520424494718	Marionette	INFO	Listening on port 2828

For both architectures we decreased the first startup time from ~30s to 2s.
Comment on attachment 8956807 [details]
Bug 1443437 - Pre-build font cache for Firefox i386 binaries on x86_64 systems.

https://reviewboard.mozilla.org/r/225782/#review231608

thanks for figuring this out :whimboo!  In reading over this bug, it seems like this is the right solution and your try pushes look promising :)
Attachment #8956807 - Flags: review?(jmaher) → review+
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b8df7526c0f6
Pre-build font cache for Firefox i386 binaries on x86_64 systems. r=jmaher
We also see a decrease of the first startup time for debug builds from ˜35s to 8s across all test suites:

> [task 2018-03-07T15:53:27.952Z] 15:53:27     INFO - PID 856 | 1520438007948	mozrunner::runner	INFO	Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.f9DqKSQQjHjS"
> [..]
> [task 2018-03-07T15:53:35.420Z] 15:53:35     INFO - PID 856 | 1520438015414	Marionette	INFO	Listening on port 2828
Component: Docker Images → Task Configuration
Product: Taskcluster → Firefox Build System
Blocks: fastci
https://hg.mozilla.org/mozilla-central/rev/b8df7526c0f6
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Nice detective work on this, whimboo!
See Also: → 1633101
You need to log in before you can comment on or make changes to this bug.