Closed Bug 1499382 Opened 7 years ago Closed 1 year ago

mach build hangs near the end of the build in make

Categories

(Firefox Build System :: General, defect, P2)

Unspecified
macOS
defect

Tracking

(firefox136 fixed)

RESOLVED FIXED
136 Branch
Tracking Status
firefox136 --- fixed

People

(Reporter: Alex_Gaynor, Assigned: glandium)

References

(Blocks 1 open bug)

Details

I'm experiencing this hang on a regular, but not consistent!, basis. It seems to happen when doing a mostly-full build (e.g. when I pull and build first thing in the morning), but if I kill the hung process and re-run mach build it'll complete fine. Anyways, mach build hangs at the very end of the build: 52:15.50 adding: install.rdf (deflated 53%) 52:15.52 adding: plugins/Test.plugin/Contents/MacOS/libnptest.dylib (deflated 69%) 52:15.52 adding: plugins/Test.plugin/Contents/Info.plist (deflated 54%) 52:15.54 adding: plugins/SecondTest.plugin/Contents/MacOS/libnpsecondtest.dylib (deflated 69%) 52:15.54 adding: plugins/SecondTest.plugin/Contents/Info.plist (deflated 59%) 52:15.54 adding: plugins/ThirdTest.plugin/Contents/MacOS/libnpthirdtest.dylib (deflated 69%) 52:15.54 adding: plugins/ThirdTest.plugin/Contents/Info.plist (deflated 59%) 52:15.56 adding: plugins/npswftest.plugin/Contents/MacOS/libnpswftest.dylib (deflated 69%) 52:15.56 adding: plugins/npswftest.plugin/Contents/Info.plist (deflated 59%) 52:30.96 Packaging specialpowers@mozilla.org.xpi... 52:31.27 Packaging quitter@mozilla.org.xpi... 52:40.17 Packaging mozscreenshots@mozilla.org.xpi... TIER: configure pre-export export compile misc libs tools The process hierarchy looks like: \-+= 74368 agaynor -fish \-+= 42630 agaynor /usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python ./mac h build |--- 42665 agaynor /usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python ./m ach build \-+- 42670 agaynor /Applications/Xcode.app/Contents/Developer/usr/bin/make -f client.mk MOZ_PARALLEL_BUILD=9 -s \--- 46246 agaynor /Applications/Xcode.app/Contents/Developer/usr/bin/make -j9 -C /Users/agaynor/projects/mozilla-central/obj-x86_64-apple-d arwin18.0.0 So I infer that it's the bottom make process that's hanging. Grabbing that make process's stack, it's hung in a read on... something: 2864 Thread_5287729 DispatchQueue_1: com.apple.main-thread (serial) 2864 start (in libdyld.dylib) + 1 [0x7fff76023085] 2864 ??? (in make) load address 0x107fff000 + 0x10644 [0x10800f644] 2864 ??? (in make) load address 0x107fff000 + 0x11359 [0x108010359] 2864 ??? (in make) load address 0x107fff000 + 0x11ca0 [0x108010ca0] 2864 read (in libsystem_kernel.dylib) + 10 [0x7fff7615d8a6] I haven't figured out how to get the arguments to a current syscall on macOS (which would be handy, to know the file descriptor we're reading from!) but the open files are: ~ ❯❯❯ lsof -p 46246 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME make 46246 agaynor cwd DIR 1,4 2880 8632238682 /Users/agaynor/projects/mozilla-central/obj-x86_64-apple-darwin18.0.0 make 46246 agaynor txt REG 1,4 165168 8630131202 /Applications/Xcode.app/Contents/Developer/usr/bin/make make 46246 agaynor txt REG 1,4 1100896 8632029578 /usr/lib/dyld make 46246 agaynor 0u CHR 16,0 0t13513298 1303 /dev/ttys000 make 46246 agaynor 1 PIPE 0x77fa703ea651a0a7 16384 ->0x77fa703ea6517fa7 make 46246 agaynor 2 PIPE 0x77fa703ea651a0a7 16384 ->0x77fa703ea6517fa7 make 46246 agaynor 3w REG 1,4 394003 8632238684 /Users/agaynor/projects/mozilla-central/obj-x86_64-apple-darwin18.0.0/.mozbuild/last_log.json make 46246 agaynor 4 PIPE 0x77fa703e88c59867 16384 ->0x77fa703e88c588a7 make 46246 agaynor 5r DIR 1,4 832 1100292 /Users/agaynor/projects/mozilla-central/browser make 46246 agaynor 6u unix 0x77fa703eadba2ebf 0t0 ->0x77fa703eadba40b7 make 46246 agaynor 8r DIR 1,4 128 1098212 /Users/agaynor/projects/mozilla-central/.cargo make 46246 agaynor 9 PIPE 0x77fa703e88c57b27 16384 make 46246 agaynor 10r DIR 1,4 160 1098320 /Users/agaynor/projects/mozilla-central/accessible/aom make 46246 agaynor 11r DIR 1,4 2240 1098224 /Users/agaynor/projects/mozilla-central/accessible/base make 46246 agaynor 12r DIR 1,4 1056 1098233 /Users/agaynor/projects/mozilla-central/accessible/generic make 46246 agaynor 13r DIR 1,4 608 1098240 /Users/agaynor/projects/mozilla-central/accessible/html make 46246 agaynor 14r DIR 1,4 1088 1098245 /Users/agaynor/projects/mozilla-central/accessible/interfaces make 46246 agaynor 15r DIR 1,4 416 1098260 /Users/agaynor/projects/mozilla-central/accessible/ipc So it's presumably one of these pipes that we're hanging on a read from. I'm not familiar enough with make's internals to know what it uses the pipes for.
Can you see the environment variables of the hung process? If so there's likely to be a `MAKEFLAGS` env var with some content like `--jobserver-fds=...`. If it's hanging on one of the file descriptors listed there then the problem is with the job server somehow.
Flags: needinfo?(agaynor)
ps eww shows MAKEFLAGS=sr --jobserver-fds=4,7, so I guess it's one of those pipes. I haven't been able to confirm which pipe it's actually hanging on though.
Flags: needinfo?(agaynor)
Huh, I take that back, these make invocations don't appear to have the jobserver: ~ ❯❯❯ pstree -p 74368 -+= 00001 root /sbin/launchd \-+= 74366 agaynor /Applications/iTerm.app/Contents/MacOS/iTerm2 --server /usr/bin/login -fpl agaynor /Applications/iTerm.app/Contents/MacOS/iTerm2 - \-+= 74367 root /usr/bin/login -fpl agaynor /Applications/iTerm.app/Contents/MacOS/iTerm2 --launch_shell \-+= 74368 agaynor -fish \-+= 91972 agaynor /usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python ./ma |--- 92014 agaynor /usr/local/Cellar/python@2/2.7.15_1/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python ./ \-+- 92040 agaynor /Applications/Xcode.app/Contents/Developer/usr/bin/make -f client.mk MOZ_PARALLEL_BUILD=9 -s \--- 92080 agaynor /Applications/Xcode.app/Contents/Developer/usr/bin/make -j9 -C /Users/agaynor/projects/mozilla-central/obj-x86_64-apple- ~ ❯❯❯ ps ewww -p 92080 PID TT STAT TIME COMMAND 92080 s000 S+ 0:00.46 /Applications/Xcode.app/Contents/Developer/usr/bin/make -j9 -C /Users/agaynor/projects/mozilla-central/obj-x86_64-apple-darwin18.0.0 CONFIG_GUESS=x86_64-apple-darwin18.0.0 TOPSRCDIR=/Users/agaynor/projects/mozilla-central SHELL=/usr/local/bin/fish MACH_STDOUT_ISATTY=1 MOZ_SANDBOX_LOGGING=1 MOZ_PARALLEL_BUILD=9 Apple_PubSub_Socket_Render=/private/tmp/com.apple.launchd.gDfn5clqt2/Render PATH=/Users/agaynor/bin:/Users/agaynor/.cargo/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin SCCACHE_CACHE_SIZE=16G SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.VUSurkFrhF/Listeners PWD=/Users/agaynor/projects/mozilla-central FOUND_MOZCONFIG=/Users/agaynor/projects/mozilla-central/mozconfig HOME=/Users/agaynor MACH=1 LOGNAME=agaynor XPC_FLAGS=0x0 COLORTERM=truecolor SHLVL=1 MAKEOVERRIDES=${-*-command-variables-*-} SCCACHE_IDLE_TIMEOUT=0 USER=agaynor TERM_SESSION_ID=w0t0p0:CF965EBA-D4B5-4B47-ACAC-853FE086671A TERM_PROGRAM=iTerm.app TMPDIR=/var/folders/21/3kp8v0nx0ds4xpp12th8vydh0000gn/T/ MAKEFLAGS=s -- MOZ_PARALLEL_BUILD=9 MFLAGS=-s TERM_PROGRAM_VERSION=3.2.2 XPC_SERVICE_NAME=0 ITERM_SESSION_ID=w0t0p0:CF965EBA-D4B5-4B47-ACAC-853FE086671A COLORFGBG=7;0 __CF_USER_TEXT_ENCODING=0x1F5:0:0 ITERM_PROFILE=Default OBJDIR=/Users/agaynor/projects/mozilla-central/obj-x86_64-apple-darwin18.0.0 EDITOR=/Applications/Sublime\ Text.app/Contents/SharedSupport/bin/subl -w LANG=en_US.UTF-8 TERM=xterm-256color MAKELEVEL=1 ~ ❯❯❯ ps ewww -p 92040 PID TT STAT TIME COMMAND 92040 s000 S+ 0:00.10 /Applications/Xcode.app/Contents/Developer/usr/bin/make -f client.mk MOZ_PARALLEL_BUILD=9 -s TOPSRCDIR=/Users/agaynor/projects/mozilla-central TERM_PROGRAM_VERSION=3.2.2 LOGNAME=agaynor USER=agaynor PATH=/Users/agaynor/bin:/Users/agaynor/.cargo/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin HOME=/Users/agaynor SCCACHE_CACHE_SIZE=16G TERM_PROGRAM=iTerm.app LANG=en_US.UTF-8 TERM=xterm-256color Apple_PubSub_Socket_Render=/private/tmp/com.apple.launchd.gDfn5clqt2/Render COLORFGBG=7;0 SHLVL=1 XPC_FLAGS=0x0 ITERM_SESSION_ID=w0t0p0:CF965EBA-D4B5-4B47-ACAC-853FE086671A EDITOR=/Applications/Sublime\ Text.app/Contents/SharedSupport/bin/subl -w OBJDIR=/Users/agaynor/projects/mozilla-central/obj-x86_64-apple-darwin18.0.0 TERM_SESSION_ID=w0t0p0:CF965EBA-D4B5-4B47-ACAC-853FE086671A SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.VUSurkFrhF/Listeners XPC_SERVICE_NAME=0 SHELL=/usr/local/bin/fish ITERM_PROFILE=Default CONFIG_GUESS=x86_64-apple-darwin18.0.0 MACH_STDOUT_ISATTY=1 TMPDIR=/var/folders/21/3kp8v0nx0ds4xpp12th8vydh0000gn/T/ MACH=1 MOZ_SANDBOX_LOGGING=1 __CF_USER_TEXT_ENCODING=0x1F5:0:0 PWD=/Users/agaynor/projects/mozilla-central SCCACHE_IDLE_TIMEOUT=0 COLORTERM=truecolor

./mach build hangs on my machine, Ubuntu 20.04, too, at the end of the build, until cancelled via Ctrl+c:

0:09.58     Finished dev [unoptimized + debuginfo] target(s) in 6.45s
 0:09.76 toolkit/library/build/libxul.so
 0:24.86 ./dependentlibs.list.stub
 0:30.10 Packaging specialpowers@mozilla.org.xpi...
 0:30.19 Packaging quitter@mozilla.org.xpi...
 0:30.28 Packaging mozscreenshots@mozilla.org.xpi...
^CProcess Process-1:ort compile misc libs tools

The next build succeeds.

Severity: normal → S3

I am encountering the same issue. I think it might be related to scccache, running sccache --stop-server while it is hanging makes it continue as usual.

The hang seems to occur when the sccache server is not running before the build. So running sccache --stop-server ; ./mach build makes it occur every time for me.

See Also: → 1841067
Duplicate of this bug: 1841067
See Also: 1841067

Could somebody please work on improving this? I'm hitting it multiple times a day, and it wastes a minute or so of my time whenever it happens. From the many dupes, I can't be the only one. Thanks.

Flags: needinfo?(ahochheiden)
Duplicate of this bug: 1898290
Duplicate of this bug: 1915997

(In reply to Andrew McCreight [:mccr8] from comment #8)

Could somebody please work on improving this? I'm hitting it multiple times a day, and it wastes a minute or so of my time whenever it happens. From the many dupes, I can't be the only one. Thanks.

Same here. If there is anything I can do to help track this down please let me know. It's probably happening on around a third of my builds at this point.

Glandium said he will dig into this.

Flags: needinfo?(ahochheiden) → needinfo?(mh+mozilla)

Might be a red herring, here is the stack I got when I ctrl-c:

Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/julien/travail/git/mozilla-central/testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py", line 137, in _collect
    while not _poll(pipe, poll_interval=sleep_interval):
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/julien/travail/git/mozilla-central/testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py", line 104, in _poll
    return pipe.poll(poll_interval)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 256, in poll
    return self._poll(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 423, in _poll
    r = wait([self], timeout)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 930, in wait
    ready = selector.select(timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt

I'm running into this multiple times a day. At this point it's really slowing me down. Is there a known workaround? Would you consider bumping this in severity / priority?

I'll bump the priority and see if Serge can take a look since Glandium hasn't been able to.

I will be on leave for ~2 months starting next week, otherwise I would do it.

Severity: S3 → S2
Flags: needinfo?(sguelton)
Priority: -- → P2

So... this only happens on Mac. Not Linux. Not Windows.

A workaround: setting SCCACHE_IDLE_TIMEOUT=5 in the environment, or running sccache --start-server before the build.

But I'm also not sure this has anything to do with the original bug anymore.

Flags: needinfo?(sguelton)
Flags: needinfo?(mh+mozilla)
OS: All → macOS

comment 13 also seems to be a different issue.

Assignee: nobody → mh+mozilla

This is caused by https://raw.githubusercontent.com/apple-oss-distributions/gnumake/refs/tags/gnumake-135/patches/PR-5071266.patch but it's also possible to end up in a similar situation in the following situations, on both Linux and Mac:

  • with a verbatim make 4.4.1
  • with a verbatim make 3.81 when adding a + before the sccache command

The reason this is happening is that sccache, invoked from make, inherits its jobserver file descriptors. It's actually not supposed to, because the commands in the Makefile don't have a +. But on mac that happens because of that patch. It also happens with newer versions of make because they use a different method of giving access to the jobserver (using a named fifo), and that's always available, whether + is on the command or not.

(so comment 13 might, in fact, be the same bug ; comment 4 looks like it might be a different issue, but comment 0 might be the same)

In the case of newer make versions, not initializing the jobserver from the environment would be a workaround. sccache would have its own jobserver, and while it would not respect the limits set by the original make jobserver, it would still serve its purpose. The theoretically ideal solution would be for the jobserver pipes to be passed to the sccache server by the sccache client, but that would require Unix sockets, and sccache currently uses TCP sockets.

Unfortunately, in the case of older make versions, the simple fact of daemonizing triggers the problem. That is, the fact that the pipes are still around in the daemonized process blocks make until that process closes.

This also happens on Linux. My reported bug 1841067 was for Linux. (Shows same stack trace comment 13 IIRC). If it is a different issue it might be good to reopen.

Can y'all test these builds of sccache:

Downloading and unpacking those might not be totally convenient, so what you can do is the following:

cd ~/.mozbuild
/path/to/source_dir/mach artifact toolchain --from-task <TASK>:public/build/sccache.tar.zst

Where <TASK> is:

  • XgOwaOFTR1COSG_RQ3pCEw for Linux
  • Pk8K_ZPqSB69NxjPdDu0hw for Intel mac
  • a4a2GB1eSjO58C1wDhyflw for Apple silicon mac

(beware that updating your tree may make the build pull a new sccache, so please try to avoid pulling when testing)

A reliable way I found to reproduce:

./mach configure
find obj* -name \*.o -delete
~/.mozbuild/sccache/sccache --stop-server
./mach build

(In reply to Mike Hommey [:glandium] from comment #20)

Can y'all test these builds of sccache:

This fixes the issue for me!

we upgraded sccache with a fix here:
https://bugzilla.mozilla.org/show_bug.cgi?id=1940229

Mike, can we close this bug? thanks

Flags: needinfo?(mh+mozilla)
Status: NEW → RESOLVED
Closed: 1 year ago
Depends on: 1940229
Flags: needinfo?(mh+mozilla)
Resolution: --- → FIXED
Target Milestone: --- → 136 Branch
See Also: → 1948051
You need to log in before you can comment on or make changes to this bug.