Intermittent make[1]: *** [profiledbuild] Error 1 after error: XDG_RUNTIME_DIR not set in the environment.

RESOLVED FIXED in Firefox 68

Status

defect
P5
normal
RESOLVED FIXED
7 months ago
2 months ago

People

(Reporter: intermittent-bug-filer, Assigned: mshal)

Tracking

({intermittent-failure})

unspecified
mozilla68

Firefox Tracking Flags

(firefox68 fixed)

Details

(Whiteboard: [stockwell unknown])

Attachments

(1 attachment)

Filed by: apavel [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=218530833&repo=try

https://queue.taskcluster.net/v1/task/HKIZF2-YRZWJlkA790KSuA/runs/0/artifacts/public/logs/live_backing.log

[task 2018-12-22T13:45:40.986Z] 13:45:40     INFO -  error: XDG_RUNTIME_DIR not set in the environment.
[task 2018-12-22T13:45:40.986Z] 13:45:40     INFO -  Unable to init server
[task 2018-12-22T13:45:40.986Z] 13:45:40     INFO -  Error: cannot open display: :2
[task 2018-12-22T13:45:40.986Z] 13:45:40     INFO -  Firefox exited with code 1 during profile initialization
[task 2018-12-22T13:45:40.987Z] 13:45:40     INFO -  Makefile:190: recipe for target 'profiledbuild' failed
[task 2018-12-22T13:45:40.987Z] 13:45:40    ERROR -  make[1]: *** [profiledbuild] Error 1
[task 2018-12-22T13:45:40.987Z] 13:45:40     INFO -  make[1]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
[task 2018-12-22T13:45:40.987Z] 13:45:40     INFO -  client.mk:125: recipe for target 'build' failed
[task 2018-12-22T13:45:40.987Z] 13:45:40     INFO -  make: *** [build] Error 2
[task 2018-12-22T13:45:41.022Z] 13:45:41     INFO -  279 compiler warnings present.
[task 2018-12-22T13:45:41.083Z] 13:45:41     INFO -  Notification center failed: Install notify-send (usually part of the libnotify package) to get a notification when the build finishes.
[task 2018-12-22T13:45:41.136Z] 13:45:41    ERROR - Return code: 2
[task 2018-12-22T13:45:41.136Z] 13:45:41  WARNING - setting return code to 2
[task 2018-12-22T13:45:41.136Z] 13:45:41    FATAL - 'mach build -v' did not run successfully. Please check log for errors.
[task 2018-12-22T13:45:41.136Z] 13:45:41    FATAL - Running post_fatal callback...
[task 2018-12-22T13:45:41.136Z] 13:45:41    FATAL - Exiting -1
[task 2018-12-22T13:45:41.136Z] 13:45:41     INFO - [mozharness: 2018-12-22 13:45:41.136623Z] Finished build step (failed)
[task 2018-12-22T13:45:41.136Z] 13:45:41     INFO - Running post-run listener: _parse_build_tests_ccov
[task 2018-12-22T13:45:41.136Z] 13:45:41     INFO - Running post-run listener: _shutdown_sccache
[task 2018-12-22T13:45:41.136Z] 13:45:41     INFO - Running command: ['/builds/worker/workspace/build/src/sccache2/sccache', '--stop-server'] in /builds/worker/workspace/build/src
[task 2018-12-22T13:45:41.137Z] 13:45:41     INFO - Copy/paste: /builds/worker/workspace/build/src/sccache2/sccache --stop-server
[task 2018-12-22T13:45:41.140Z] 13:45:41     INFO -  Stopping sccache server...
[task 2018-12-22T13:45:41.141Z] 13:45:41     INFO -  error: couldn't connect to server
[task 2018-12-22T13:45:41.141Z] 13:45:41     INFO -  caused by: Connection refused (os error 111)
[task 2018-12-22T13:45:41.141Z] 13:45:41    ERROR - Return code: 2
[task 2018-12-22T13:45:41.141Z] 13:45:41     INFO - Running post-run listener: _summarize
[task 2018-12-22T13:45:41.141Z] 13:45:41    ERROR - # TBPL FAILURE #
[task 2018-12-22T13:45:41.141Z] 13:45:41     INFO - [mozharness: 2018-12-22 13:45:41.141585Z] FxDesktopBuild summary:
[task 2018-12-22T13:45:41.141Z] 13:45:41    ERROR - # TBPL FAILURE #
This intermittent has likely been around for a while, but only shows up now because we are properly reporting errors in profileserver.py since bug 1252556 landed. Before then, if the profileserver failed, we would just end up with no profile data (and presumably an under-optimized build). As such, we shouldn't back out bug 1252556.

The XDG_RUNTIME_DIR message comes from wayland, which is a bit of a red herring because gdk only tries to use wayland in our automation if the X11 backend fails to initialize. So really the problem is from gdk failing to use X, but there is no error message when this happens. With GDK_DEBUG=all in the environment, there are some more clues that this is what is happening:

[task 2018-12-13T03:02:27.795Z] Gdk-Message: Trying x11 backend
[task 2018-12-13T03:02:27.795Z] Gdk-Message: Trying wayland backend
[task 2018-12-13T03:02:27.795Z] error: XDG_RUNTIME_DIR not set in the environment.
[task 2018-12-13T03:02:27.795Z] Gdk-Message: Trying broadway backend
[task 2018-12-13T03:02:27.795Z] Unable to init server

As compared to a successful run, which looks like:

[task 2018-12-13T02:35:28.589Z] Gdk-Message: Trying x11 backend
[task 2018-12-13T02:35:28.604Z] Gdk-Message: visual: true color: 32
[task 2018-12-13T02:35:28.604Z] Gdk-Message: visual: direct color: 24
[task 2018-12-13T02:35:28.604Z] Gdk-Message: visual: true color: 24
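
For triage, the difference between the two logs above can be checked mechanically. The following is a minimal sketch, assuming the task was run with GDK_DEBUG=all so that the "Trying ... backend" messages are present. The helper names (backends_tried, x11_failed) are hypothetical and not part of any Mozilla tooling.

```python
import re

# Matches lines such as "Gdk-Message: Trying x11 backend".
BACKEND_RE = re.compile(r"Gdk-Message: Trying (\w+) backend")

# Excerpts copied from the two task logs quoted above.
FAILED_LOG = """\
[task 2018-12-13T03:02:27.795Z] Gdk-Message: Trying x11 backend
[task 2018-12-13T03:02:27.795Z] Gdk-Message: Trying wayland backend
[task 2018-12-13T03:02:27.795Z] error: XDG_RUNTIME_DIR not set in the environment.
[task 2018-12-13T03:02:27.795Z] Gdk-Message: Trying broadway backend
[task 2018-12-13T03:02:27.795Z] Unable to init server
"""

OK_LOG = """\
[task 2018-12-13T02:35:28.589Z] Gdk-Message: Trying x11 backend
[task 2018-12-13T02:35:28.604Z] Gdk-Message: visual: true color: 32
"""


def backends_tried(log_text):
    """Return the GDK backends attempted, in the order they appear in the log."""
    return BACKEND_RE.findall(log_text)


def x11_failed(log_text):
    """True if GDK tried x11 and then moved on to at least one other backend."""
    tried = backends_tried(log_text)
    return "x11" in tried and tried.index("x11") < len(tried) - 1


if __name__ == "__main__":
    print(backends_tried(FAILED_LOG), x11_failed(FAILED_LOG))
    print(backends_tried(OK_LOG), x11_failed(OK_LOG))
```

A log where x11 is followed by wayland or broadway points at the X11 backend failing to initialize, regardless of which later backend prints the visible error.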

The taskcluster xvfb.sh script is supposed to verify that Xvfb is running correctly by checking the return code of xvinfo, but perhaps that isn't sufficient, or Xvfb is dying sometime after xvinfo runs.
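
One way to test the "Xvfb dies after xvinfo runs" theory would be to probe the display several times instead of once before starting the profile run. This is only a sketch of that idea, not what xvfb.sh does today: the function names, the number of checks, and the delay are all assumptions, and the default probe assumes xvinfo is installed and accepts -display.

```python
import subprocess
import time


def xvinfo_probe(display):
    """Return True if xvinfo exits 0 for the given display."""
    try:
        result = subprocess.run(
            ["xvinfo", "-display", display],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # xvinfo is not installed, so the display cannot be verified.
        return False
    return result.returncode == 0


def display_stays_up(display, probe=xvinfo_probe, checks=3, delay=1.0,
                     sleep=time.sleep):
    """Probe the display `checks` times, `delay` seconds apart.

    Returns False as soon as one probe fails, True if every probe passes.
    """
    for attempt in range(checks):
        if not probe(display):
            return False
        if attempt < checks - 1:
            sleep(delay)
    return True


if __name__ == "__main__":
    # ":2" is the display used in the failing logs in this bug.
    print(display_stays_up(":2"))
```

The probe and sleep functions are parameters so the loop can be exercised without a real X server. A display that passes the first check but fails a later one would support the theory that Xvfb is exiting early.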

It's also possible this is a bug in gdk or Xvfb that is already fixed upstream, so it may be worth experimenting with a Debian 9 image for these builds instead of Debian 7.
Note that since bug 1514288 landed, the "error: XDG_RUNTIME_DIR" message will appear in the profile-run-1.log file rather than in the build output.
See Also: → 1517939

This is ok for now in the Bpgo(run) step, but it does confirm our suspicion that it appears more often in the 3-tier PGO than it does in the single-tier PGO. The fact that this intermittently fails more often in the 3-tier version is the reason we haven't switched over Linux PGO builds yet.

The Bpgo(run) failures can be starred as this bug without requiring a retrigger, but Linux x64 pgo 'B' builds should still be retriggered. If this is causing problems for sheriffing, we can hide the Bpgo(run) builds for now, though we'd like them to still execute while we are implementing 3-tier PGO.

To clarify: Bpgo(run) is new from bug 1507334 and is not needed to ship anything yet (so retriggering is not necessary), though it is the direction we are moving toward in the near future. So we'd prefer not to back it out, but can hide things in the meantime if necessary.

Make them tier 3? or at least tier 2.

Depends on: 1519424

There are 27 failures in the last 7 days, all on linux64 pgo.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=221619269&repo=mozilla-inbound&lineNumber=241

[task 2019-01-13T21:33:53.165Z] New python executable in /builds/worker/checkouts/gecko/obj-x86_64-pc-linux-gnu/_virtualenvs/init/bin/python2.7
[task 2019-01-13T21:33:53.165Z] Also creating executable in /builds/worker/checkouts/gecko/obj-x86_64-pc-linux-gnu/_virtualenvs/init/bin/python
[task 2019-01-13T21:33:54.534Z] Installing setuptools, pip, wheel...done.
[task 2019-01-13T21:33:54.755Z] WARNING: Python.h not found. Install Python development headers.
[task 2019-01-13T21:33:54.755Z] Error processing command. Ignoring because optional. (optional:setup.py:third_party/python/psutil:build_ext:--inplace)
[task 2019-01-13T21:33:54.755Z] Error processing command. Ignoring because optional. (optional:packages.txt:comm/build/virtualenv_packages.txt)
[task 2019-01-13T21:33:55.482Z] Firefox exited with code 1 during profile initialization
[task 2019-01-13T21:33:55.482Z] Firefox output (/builds/worker/artifacts/profile-run-1.log):
[task 2019-01-13T21:33:55.482Z] error: XDG_RUNTIME_DIR not set in the environment.
[task 2019-01-13T21:33:55.482Z] Unable to init server
[task 2019-01-13T21:33:55.482Z] Error: cannot open display: :2
[task 2019-01-13T21:33:55.482Z]
[task 2019-01-13T21:33:55.514Z] cleanup
[task 2019-01-13T21:33:55.514Z] + cleanup
[task 2019-01-13T21:33:55.514Z] + local rv=1
[task 2019-01-13T21:33:55.514Z] + cleanup_xvfb
[task 2019-01-13T21:33:55.514Z] pidof Xvfb
[task 2019-01-13T21:33:55.514Z] ++ pidof Xvfb
[task 2019-01-13T21:33:55.516Z] + local xvfb_pid=37
[task 2019-01-13T21:33:55.516Z] + local vnc=false
[task 2019-01-13T21:33:55.516Z] + local interactive=false
[task 2019-01-13T21:33:55.516Z] + '[' -n 37 ']'
[task 2019-01-13T21:33:55.516Z] + [[ false == false ]]
[task 2019-01-13T21:33:55.516Z] + [[ false == false ]]
[task 2019-01-13T21:33:55.516Z] + kill 37
[task 2019-01-13T21:33:55.516Z] + screen -XS xvfb quit
[task 2019-01-13T21:33:55.640Z] No screen session found.
[task 2019-01-13T21:33:55.640Z] + true
[task 2019-01-13T21:33:55.640Z] + exit 1
[fetches 2019-01-13T21:33:55.641Z] removing /builds/worker/fetches
[fetches 2019-01-13T21:33:55.641Z] finished
[taskcluster 2019-01-13 21:33:56.023Z] === Task Finished ===
[taskcluster 2019-01-13 21:33:56.161Z] Artifact "public/build/profdata.tar.xz" not found at "/builds/worker/artifacts/profdata.tar.xz"
[taskcluster 2019-01-13 21:33:56.360Z] Artifact "public/build/profile-run-2.log" not found at "/builds/worker/artifacts/profile-run-2.log"
[taskcluster 2019-01-13 21:33:57.064Z] Unsuccessful task run with exit code: 1 completed in 109.299 seconds

Whiteboard: [stockwell needswork]

I just pushed bug 1519424. Hopefully that has an impact here.

Resolving as fixed. Michael pushed debian9 (bug 1519424) on January 14 and he says it appears to have fixed the XDG_RUNTIME_DIR errors:

https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-01-01&endday=2019-02-04&tree=trunk&bug=1516114

Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

These failures are all from the 1-tier Linux PGO builds. We held off on switching to the 3-tier PGO builds because they hit the XDG_RUNTIME_DIR failures much more often, but those have been fixed by switching to debian9. So I think we should take this opportunity to fully switch Linux over to 3-tier. We can start with regular B builds, and then after a short trial period convert N builds as well.

Assignee: nobody → mshal
Duplicate of this bug: 1517939

Now that 3-tier PGO uses a debian9 image to generate the profile data
(bug 1519424), we no longer see the XDG_RUNTIME_DIR failures in the run
task. The frequency of those errors was the primary blocker for enabling
3-tier PGO in the first place. Since we still see those errors
occasionally in 1-tier PGO, we should switch to the 3-tier model for
Linux.

Pushed by mshal@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fdbd2c02f391
Enable 3-tier PGO for Linux; r=firefox-build-system-reviewers,Callek,chmanchester
Status: REOPENED → RESOLVED
Closed: 5 months ago → 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68

Seems like the fix from comment 25 reduced some of the build times on Linux:

== Change summary for alert #20479 (as of Mon, 15 Apr 2019 22:10:52 GMT) ==

Improvements:

41% build times linux64-shippable opt nightly taskcluster-m5.4xlarge 4,888.47 -> 2,891.82

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=20479

(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #27)

Unfortunately they aren't directly comparable since that build was changed from doing a 1-tier PGO, where all parts of the PGO build are done in one task, to the 3-tier model, where the build is split into the "instr", "run", and "B" tasks. Looking at a recent m-c push, the 3 tasks were 39 minutes, 4 minutes, and 51 minutes, so a total of 5640s. Overall it's a little slower, but it allows us to enable PGO on more platforms and opens up the possibility for reproducible PGO builds since the profile data is now an artifact in Taskcluster.
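
The arithmetic behind that comparison, as a small sketch. It uses only the numbers quoted in this bug (the alert's before/after build times and the three task durations from one m-c push), so it compares an alert value against a single push and should be read as a rough estimate.

```python
# Durations of the three 3-tier tasks on one recent m-c push, in minutes.
THREE_TIER_MINUTES = [39, 4, 51]

# Build times reported by perf alert #20479, in seconds.
ALERT_BEFORE_SECONDS = 4888.47  # 1-tier PGO, everything in one task
ALERT_AFTER_SECONDS = 2891.82   # same job name after the switch to 3-tier

# Sum of all three tasks, converted to seconds.
three_tier_seconds = sum(THREE_TIER_MINUTES) * 60

# Improvement as reported by the alert (final build task only).
alert_improvement = (ALERT_BEFORE_SECONDS - ALERT_AFTER_SECONDS) / ALERT_BEFORE_SECONDS

# Change when all three tasks are counted against the old 1-tier time.
overall_change = (three_tier_seconds - ALERT_BEFORE_SECONDS) / ALERT_BEFORE_SECONDS

if __name__ == "__main__":
    print(f"3-tier total: {three_tier_seconds}s")
    print(f"alert improvement: {alert_improvement:.0%}")
    print(f"overall change vs 1-tier: {overall_change:+.0%}")
```

Counting all three tasks, the 3-tier build comes out roughly 15% slower end to end on that push, which is the "a little slower" mentioned above.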

This continues to occur in recent central-as-beta simulations:

Early beta: https://treeherder.mozilla.org/#/jobs?repo=try&resultStatus=success%2Ctestfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&revision=95c973c0a332b2f647495ee55e0b20161042100b&selectedJob=242035935&searchStr=linux%2Cx64%2Cdevedition%2Copt%2Cbuild-linux64-devedition-nightly%2Fopt%2C%28n%29

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=242035935&repo=try&lineNumber=35518

[task 2019-04-23T13:42:11.854Z] 13:42:11 INFO - error: XDG_RUNTIME_DIR not set in the environment.
[task 2019-04-23T13:42:11.854Z] 13:42:11 INFO - Unable to init server
[task 2019-04-23T13:42:11.854Z] 13:42:11 INFO - Error: cannot open display: :2
[task 2019-04-23T13:42:11.854Z] 13:42:11 INFO - Makefile:188: recipe for target 'profiledbuild' failed
[task 2019-04-23T13:42:11.855Z] 13:42:11 ERROR - make[1]: *** [profiledbuild] Error 1
[task 2019-04-23T13:42:11.855Z] 13:42:11 INFO - make[1]: Leaving directory '/builds/worker/workspace/build/src/obj-firefox'
[task 2019-04-23T13:42:11.855Z] 13:42:11 INFO - client.mk:125: recipe for target 'build' failed
[task 2019-04-23T13:42:11.855Z] 13:42:11 INFO - make: *** [build] Error 2
[task 2019-04-23T13:42:11.895Z] 13:42:11 INFO - 292 compiler warnings present.
[task 2019-04-23T13:42:12.023Z] 13:42:12 INFO - Notification center failed: Install notify-send (usually part of the libnotify package) to get a notification when the build finishes.
[task 2019-04-23T13:42:12.082Z] 13:42:12 ERROR - Return code: 2
[task 2019-04-23T13:42:12.082Z] 13:42:12 WARNING - setting return code to 2
[task 2019-04-23T13:42:12.082Z] 13:42:12 FATAL - 'mach build -v' did not run successfully. Please check log for errors.
[task 2019-04-23T13:42:12.082Z] 13:42:12 FATAL - Running post_fatal callback...
[task 2019-04-23T13:42:12.082Z] 13:42:12 FATAL - Exiting -1
[task 2019-04-23T13:42:12.083Z] 13:42:12 INFO - [mozharness: 2019-04-23 13:42:12.082974Z] Finished build step (failed)
[task 2019-04-23T13:42:12.083Z] 13:42:12 INFO - Running post-run listener: _parse_build_tests_ccov
[task 2019-04-23T13:42:12.083Z] 13:42:12 INFO - Running post-run listener: _shutdown_sccache
[task 2019-04-23T13:42:12.083Z] 13:42:12 INFO - Running post-run listener: _summarize
[task 2019-04-23T13:42:12.083Z] 13:42:12 ERROR - # TBPL FAILURE #

Flags: needinfo?(mshal)

It looks like those builds are still using the 1-tier PGO. :Callek, should the builds from #c30 be shippable builds now? Or are they still intended to be nightlies? If it's the latter, we'll probably want to convert them to 3-tier PGO.

Flags: needinfo?(mshal) → needinfo?(bugspam.Callek)

The devedition builds are labeled nightlies, but have only ever[1] run on-push on mozilla-beta (and beta-sims). They should be renamed to -shippable from nightly, but doing that is purely cosmetic[2]. That said, independent of what they are named, they should be switched to 3-tier pgo.

[1] at least since the migration to taskcluster and release-promotion
[2] the two main points for shippable were to unify the on-push pgo builds and those we ship, and to use the on-push builds for shipping nightlies; both of those have always been true of the devedition builds.

Flags: needinfo?(bugspam.Callek)

I filed bug 1547395 for enabling 3-tier PGO on the devedition builds.
