Closed Bug 1841067 Opened 2 years ago Closed 1 year ago

Stall during early `./mach build`: toolkit/library/build/libxul.so sometimes takes exactly 10 minutes to complete

Categories

(Firefox Build System :: General, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1499382

People

(Reporter: manuel, Unassigned)

References

(Blocks 1 open bug)

Details

This is the ~third time it happened to me. First time was probably last week. After pressing ctrl+c and entering ./mach build again the build just works.

System: Arch Linux

 0:28.61 toolkit/library/build/libxul.so
 0:33.13 ./dependentlibs.list.stub
 0:34.82 Packaging specialpowers@mozilla.org.xpi...
 0:34.89 Packaging mozscreenshots@mozilla.org.xpi...
 0:52.72 W 234 compiler warnings present.
Process Process-1:xport compile misc libs tools
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/dev/gecko5/testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py", line 137, in _collect
    while not _poll(pipe, poll_interval=sleep_interval):
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/dev/gecko5/testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py", line 101, in _poll
    return pipe.poll(poll_interval)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 256, in poll
    return self._poll(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 423, in _poll
    r = wait([self], timeout)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 930, in wait
    ready = selector.select(timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt

The severity field is not set for this bug.
:ahochheiden, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(ahochheiden)

I wasn't able to reproduce, but the comments where you're hanging seem to leave a clue: https://searchfox.org/mozilla-central/source/testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py#91-108

    """Wrap multiprocessing.Pipe.poll to hide POLLERR and POLLIN
    exceptions.

    multiprocessing.Pipe is not actually a pipe on at least Linux.
    That has an effect on the expected outcome of reading from it when
    the other end of the pipe dies, leading to possibly hanging on revc()
    below.

So I'm guessing something is killing resource monitor process.

Severity: -- → S4
Flags: needinfo?(ahochheiden)
Priority: -- → P5

Some additional infos: It always stops after mozscreenshots@mozilla.org.xpi

 0:28.61 toolkit/library/build/libxul.so
 0:33.13 ./dependentlibs.list.stub
 0:34.82 Packaging specialpowers@mozilla.org.xpi...
 0:34.89 Packaging mozscreenshots@mozilla.org.xpi...

ctrl-c

 0:52.72 W 234 compiler warnings present.
Process Process-1:xport compile misc libs tools
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/dev/gecko5/testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py", line 137, in _collect
    while not _poll(pipe, poll_interval=sleep_interval):
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/dev/gecko5/testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py", line 101, in _poll
    return pipe.poll(poll_interval)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 256, in poll
    return self._poll(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 423, in _poll
    r = wait([self], timeout)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 930, in wait
    ready = selector.select(timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt

It seems like it eventually completes.

 0:54.58 ./dependentlibs.list.stub
 0:56.45 Packaging specialpowers@mozilla.org.xpi...
 0:56.53 Packaging mozscreenshots@mozilla.org.xpi...
10:53.61 W 232 compiler warnings present.
10:53.76 W Overall system resources - Wall time: 653s; CPU: 5%; Read bytes: 108104774656; Write bytes: 12304355328; Read time: 1102410; Write time: 565229
10:53.76 W Swap in/out (MB): 341.2578125/1153.75390625
10:53.90 /usr/bin/notify-send '--app-name=Mozilla Build System' 'Mozilla Build System' 'Build complete'
To view resource usage of the build, run |mach resource-usage|.
10:53.93 Your build was successful!
To take your build for a test drive, run: |mach run|
For more information on what to do now, see https://firefox-source-docs.mozilla.org/setup/contributing_code.html

Here it took 10 min Packaging mozscreenshots@mozilla.org.xpi..., where I apparently sometimes just cancel early. Here I simply forgot that I still had a build running.

I've been encountering this issue fairly often on my machine. I've got used to Ctrl-C and restart the build.

Full log:

0:00.52 W Clobber not needed.
 0:00.83 W Adding make options from /home/user/dev/gecko4/.mozconfig
    AUTOCLOBBER=1
    MOZ_OBJDIR=/home/user/dev/gecko4/obj-debug-sccache
    OBJDIR=/home/user/dev/gecko4/obj-debug-sccache
    FOUND_MOZCONFIG=/home/user/dev/gecko4/.mozconfig
    export FOUND_MOZCONFIG
  Parallelism determined by memory: using 28 jobs for 28 cores based on 62.5 GiB RAM and estimated job size of 1.0 GiB
 0:00.83 /usr/bin/make -f client.mk -j28 -s
 0:01.14 Elapsed: 0.00s; From dist/public: Kept 0 existing; Added/updated 0; Removed 0 files and 0 directories.
 0:01.15 Elapsed: 0.01s; From dist/xpi-stage: Kept 91 existing; Added/updated 0; Removed 0 files and 0 directories.
 0:01.17 Elapsed: 0.00s; From dist/private: Kept 0 existing; Added/updated 0; Removed 0 files and 0 directories.
 0:01.39 Elapsed: 0.25s; From _tests: Kept 1424 existing; Added/updated 0; Removed 0 files and 0 directories.
 0:01.48 Elapsed: 0.33s; From dist/include: Kept 6959 existing; Added/updated 0; Removed 0 files and 0 directories.
 0:01.57 Elapsed: 0.43s; From dist/bin: Kept 2916 existing; Added/updated 0; Removed 0 files and 0 directories.
 0:02.40 toolkit/components/antitracking/bouncetrackingprotection
 0:02.43 toolkit/library/buildid.cpp.stub
 0:13.54 toolkit/library/build/libxul.so
 0:19.37 ./dependentlibs.list.stub
 0:21.80 Packaging specialpowers@mozilla.org.xpi...
 0:21.99 Packaging mozscreenshots@mozilla.org.xpi...
10:13.59 W 184 compiler warnings present.
10:14.12 W Overall system resources - Wall time: 614s; CPU: 6%; Read bytes: 12693970944; Write bytes: 9846194176; Read time: 164485; Write time: 787700
10:14.12 /usr/bin/notify-send '--app-name=Mozilla Build System' 'Mozilla Build System' 'Build complete'
To view a profile of the build, run |mach resource-usage|.
10:14.21 Your build was successful!
To take your build for a test drive, run: |mach run|
For more information on what to do now, see https://firefox-source-docs.mozilla.org/setup/contributing_code.html

If we assume a 10min timeout then it is 0:13.54 toolkit/library/build/libxul.so taking 10min.

@Alex, I'm also still encountering this fairly often, but don't know how to force this. Do you have any hints on how to debug this?

Flags: needinfo?(ahochheiden)
Summary: Stall during early `./mach build` in multiprocessing/process.py", poll → Stall during early `./mach build`: toolkit/library/build/libxul.so sometimes takes exactly 10 minutes to complete

I think it ends up hanging in mozbuild/action/zip.py

From:
https://searchfox.org/mozilla-central/source/config/rules.mk#908
https://searchfox.org/mozilla-central/source/config/makefiles/functions.mk#36

So debugging Jarrer with breakpoints is where I'd start if I could reproduce it.

Flags: needinfo?(ahochheiden) → needinfo?(manuel)

I doubt the jarrer is involved here. Those log messages are a red herring, they don't tell when things finish. You'd be better off looking at what processes are still running when it's stuck. And chances are it's mach only, which would suggest yet another bug with the resource manager.

When this happens this is the process tree I get:

bash
  `-python3 ./mach build
      |-gmake -f client.mk -j16 -s
      |-(python)
      |-python3 ./mach build
      |-{python3}
      |-{python3}
      |-{python3}
       `-{python3}

And interrupting the command with CTRL-C yields this stack trace:

Process Process-1:xport compile misc libs tools
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/gsvelto/projects/mozilla-central/testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py", line 137, in _collect
    while not _poll(pipe, poll_interval=sleep_interval):
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/gsvelto/projects/mozilla-central/testing/mozbase/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py", line 104, in _poll
    return pipe.poll(poll_interval)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 440, in _poll
    r = wait([self], timeout)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/multiprocessing/connection.py", line 948, in wait
    ready = selector.select(timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
Duplicate of this bug: 1898290
Severity: S4 → S3
Priority: P5 → P3

I see this a great deal, but it's not consistent. I'm confident it's a bug of some sort in the resource monitor, which really means in Python's multiprocessing library.

(In reply to Nick Alexander :nalexander [he/him] from comment #12)

I see this a great deal, but it's not consistent. I'm confident it's a bug of some sort in the resource monitor, which really means in Python's multiprocessing library.

Or somewhere in our resource monitor code, that wouldn't be the first time.

Process tree with ps -ef --forest

/bin/bash /home/user/.local/bin/mach build
 \_ python3 ./mach build
     \_ [python] <defunct>
     \_ python3 ./mach build
     \_ /usr/bin/make -f client.mk -j28 -s

I don't know how to set breakpoints or run mach with a debugger to investigate this further. Is there an easy way to do that?

Flags: needinfo?(manuel)
Duplicate of this bug: 1915997

I've started seeing this fairly often in the last few months.

Blocks: mach-busted

This sounds similar to Bug 1499382, which I have been experiencing for quite some time. A lot more people seem to be experiencing this now though, so this may be a separate bug with similar symptoms.

Still, to the people experiencing this just now: Are you using sccache? And if so, does running sccache --stop-server in a separate terminal while the hang is happening fix the hang? If that is the case, I think these two bugs are indeed the same and we can dupe one of them.

See Also: → 1499382

(In reply to Malte Jürgens [:maltejur] from comment #17)

Still, to the people experiencing this just now: Are you using sccache? And if so, does running sccache --stop-server in a separate terminal while the hang is happening fix the hang?

Running sccache --stop-server seems to fix the hang for me.

Running sccache --stop-server seems to fix the hang for me.

Same for me

$ ps -ef | rg sccache
user       56565    3584  0 11:37 ?        00:00:00 /home/user/.mozbuild/sccache/sccache
$ /home/user/.mozbuild/sccache/sccache --stop-server
Stopping sccache server...
Compile requests                      2
Compile requests executed             1
Cache hits                            0
Cache misses                          1
Cache misses (C/C++)                  1
Cache timeouts                        0
Cache read errors                     0
Forced recaches                       0
Cache write errors                    0
Compilation failures                  0
Cache errors                          0
Non-cacheable compilations            0
Non-cacheable calls                   0
Non-compilation calls                 1
Unsupported compiler calls            0
Average cache write               0.000 s
Average compiler                  5.864 s
Average cache read hit            0.000 s
Failed distributed compilations       0
Cache location                  Local disk: "/home/user/.cache/sccache"
Use direct/preprocessor mode?   yes
Version (client)                0.7.7
Cache size                           10 GiB
Max cache size                       10 GiB

Build completes after calling stop-server.

Status: NEW → RESOLVED
Closed: 1 year ago
Duplicate of bug: 1499382
Resolution: --- → DUPLICATE
See Also: 1499382
No longer duplicate of this bug: 1898290
No longer duplicate of this bug: 1915997
You need to log in before you can comment on or make changes to this bug.