Closed Bug 1500436 Opened 6 years ago Closed 5 years ago

"make[4]: write error: stdout" while building Firefox

Categories

(Firefox Build System :: General, defect)

defect
Not set
normal

Tracking

(firefox-esr68 fixed, firefox69 fixed)

RESOLVED FIXED
mozilla69
Tracking Status
firefox-esr68 --- fixed
firefox69 --- fixed

People

(Reporter: julienw, Assigned: glandium, NeedInfo)

References

Details

Attachments

(2 files)

When building Firefox, I get these errors often:

make[4]: write error: stdout
make[3]: *** [security/sandbox/linux/broker/target] Erreur 1

Then restarting the build works.

I build with `-j 60` because I use a icecc cluster. So this could be because of the high parallelism.
I am also experiencing this with scan-build:
https://relman-ci.moz.tools/job/firefox-scan-build/348/
with -j20 (doesn't always happen)
I know developers in Mozilla offices with icecream clusters have used very high -j values to build Firefox in the past, so I'm surprised that would be an issue. From the error it sounds like this is an issue with make itself, so perhaps the version of make in use matters?
FWIW, make displays this error when it tries to close stdout and either detects that an existing error happened, or fails to close the stdout file:
http://git.savannah.gnu.org/cgit/make.git/tree/output.c?h=4.2.1&id=9ab8af7cd8aaac587787d9df7e34758951d7c470#n495
Keywords: in-triage
btw, the make error could be improved :)
Looks like I'm not the only one: https://lists.gnu.org/archive/html/bug-make/2014-07/msg00035.html

I'm using make 4.1 on Debian stable, so it's a bit old.

But I'm sure this issue is fairly recent for me and wasn't happening some weeks ago.
Given that it sounds like this is a bug in upstream make. If there's anything we can change in the build system to work around it we'd certainly be open to that.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID
Even if this isn't our fault, I still have a ~20/30% failure rate on this in Debian testing...

Reopening, as, even if this isn't our fault, this is a huge pain for developers impacted by this. We have to start the build several time to get a final binary

Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Attached file ff_build_log.txt.bz2

I see the same on openSUSE distribution, particularly on a machine with 160 cores. However one can see it even on a smaller machine.

There's a link to strace of mach command:
https://drive.google.com/file/d/1bt94SWMaYOCnsZLX9PSQewjlNP5dD2fe/view?usp=sharing

and I'm attaching build log as well. I can also confirm the problem is new for me, Firefox 63.* releases were built fine.

And I googled the problem and it looks BSD is affected as well (starting from 12/2018, probably FF 64 release):
https://mail-index.netbsd.org/netbsd-users/2018/12/12/msg021781.html

Feel free to ask for details, I can do debugging on the system I have.

Note that the same happens on aarch64 and armv7l with only
MOZ_MAKE_FLAGS=-j6.

Got it, happens only when one uses '-v' for 'mach build'.

I don't use -v. But maybe it makes the problem more visible by making make output more stuff.

I can confirm that '-v' helps. Though I would like to use it...

Did anybody bisected the commit where it appeared for the first time?

This may be related to https://github.com/nodejs/node/issues/14752 , and may require us wrapping nodejs invocations.

The following patch may have fixed it for me (it's not landable, because it simply hides node.js's stderr, which is not desirable in case errors happen):

--- a/python/mozbuild/mozbuild/action/node.py
+++ b/python/mozbuild/mozbuild/action/node.py
@@ -5,6 +5,7 @@
 from __future__ import absolute_import, unicode_literals, print_function

 import buildconfig
+import os
 import pipes
 import subprocess
 import sys
@@ -52,7 +53,7 @@ def execute_node_cmd(node_cmd_list):
         print('Executing "{}"'.format(printable_cmd), file=sys.stderr)
         sys.stderr.flush()

-        output = subprocess.check_output(node_cmd_list)
+        output = subprocess.check_output(node_cmd_list, stderr=open(os.devnull, 'w'))

         # Process the node script output
         #

I'm not sure, though, because I have never been able to reproduce locally, and I'm not sure whether the environment where I did get this error before (which is, some specific downstream Debian builders) just got lucky this time or if the fix actually worked.

Can any of the people who had the problem can check?

Flags: needinfo?(sledru)
Flags: needinfo?(petr.sumbera)
Flags: needinfo?(mliska)
Flags: needinfo?(felash)

Sorry, I can't try because I can't build with icecc these days because of the toolchain problem I talked about already.

That said it seemed to me that I didn't have the make problem with a recent build of icecc anymore. I think Paul may know more.

Flags: needinfo?(felash) → needinfo?(padenot)

Great! Using node-12.4.0 with that upstream patch (minor conflicts resolved) fixed the problem for me. We were encountering the problem very reliably when using high make-parallelism (12-16+) and now it built without a glitch on 32 real threads.

(In reply to vcunat from comment #19)

Great! Using node-12.4.0 with that upstream patch (minor conflicts resolved) fixed the problem for me. We were encountering the problem very reliably when using high make-parallelism (12-16+) and now it built without a glitch on 32 real threads.

Would you mind trying with an older node and the patch from comment 17?

That also works for me. I tried once with the same set up, except for keeping unpatched node-10.16.0.

Can you try this new version?

--- a/python/mozbuild/mozbuild/action/node.py
+++ b/python/mozbuild/mozbuild/action/node.py
@@ -1,15 +1,16 @@
 # This Source Code Form is subject to the terms of the Mozilla Public
 # License, v. 2.0. If a copy of the MPL was not distributed with this
 # file, You can obtain one at http://mozilla.org/MPL/2.0/.

 from __future__ import absolute_import, unicode_literals, print_function

 import buildconfig
+import os
 import pipes
 import subprocess
 import sys

 SCRIPT_ALLOWLIST = [
         buildconfig.topsrcdir + "/devtools/client/shared/build/build.js"
     ]

@@ -47,17 +48,18 @@ def execute_node_cmd(node_cmd_list):
     printed to stderr instead.
     """

     try:
         printable_cmd = ' '.join(pipes.quote(arg) for arg in node_cmd_list)
         print('Executing "{}"'.format(printable_cmd), file=sys.stderr)
         sys.stderr.flush()

-        output = subprocess.check_output(node_cmd_list)
+        output = subprocess.check_output(
+            node_cmd_list, stderr=os.dup(sys.stderr.fileno()))

         # Process the node script output
         #
         # XXX Starting with an empty list means that node scripts can
         # (intentionally or inadvertently) remove deps.  Do we want this?
         deps = []
         for line in output.splitlines():
             if 'dep:' in line:

This seem to work well on my linux box, but I see it rather often on my new 12 thread macbook.

Flags: needinfo?(padenot)

(In reply to Paul Adenot (:padenot) from comment #23)

This seem to work well on my linux box, but I see it rather often on my new 12 thread macbook.

With either patch?

Flags: needinfo?(padenot)

Without any patch, sorry. I'll try that next. It used to work a few week back.

Flags: needinfo?(padenot)

I'll try to recall testing this on Monday, but dup should not be enough, from what I remember off the top of my head. (The non-blocking flag is set on file description, and dup/fork/... will get you another descriptor for the same/shared description.)

I confirmed experimentally that the dup patch does not solve it.

See Also: → 1561465

This works around https://github.com/nodejs/node/issues/14752, which
causes problems with make.

Assignee: nobody → mh+mozilla
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/562655fedb9d
Redirect node.js's stderr to a pipe. r=froydnj
Status: REOPENED → RESOLVED
Closed: 6 years ago5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69

Seems that it is improving the situation, I haven't seen it recently.
Many thanks!

Flags: needinfo?(sledru)

Can this please be uplifted to the esr branch? The patch applies cleanly and fixes the error.

Flags: needinfo?(mh+mozilla)

Comment on attachment 9074104 [details]
Bug 1500436 - Redirect node.js's stderr to a pipe.

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Build failures under some conditions, due to node twiddling with its output file descriptors.
  • User impact if declined: See above.
  • Fix Landed on Version: 69
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Build system change that has landed a while ago.
  • String or UUID changes made by this patch: N/A
Flags: needinfo?(mh+mozilla)
Attachment #9074104 - Flags: approval-mozilla-esr68?

Comment on attachment 9074104 [details]
Bug 1500436 - Redirect node.js's stderr to a pipe.

Low-risk build system fix to avoid random bustage. Approved for 68.3esr.

Attachment #9074104 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+

Adding here something I just commented on https://github.com/mozilla/gecko/commit/f8e128f0fe15e6e40a6ac8ac07698649777b8fca#r36592216 -- I think the comment in the fix commit is misleading:

We could theoretically remove this later, once we require a fixed node?

I suspect this is not true, because the "fix" relies on the node process exiting successfully; it appears to not work when the node process is killed or crashes (see my comment on https://github.com/nodejs/node/issues/14752#issuecomment-569285807), and I suspect tests must continue to work even if that happens.

Thus I suspect that the added comment, which suggests that the workaround can be removed when the issue is closed, is misleading.

Flags: needinfo?(mliska)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: