Closed Bug 1720342 Opened 3 years ago Closed 3 years ago

Perma Beta codeql - A fatal error occurred: Exit status 255 from command [/builds/worker/checkouts/gecko/taskcluster/scripts/builder/build-linux.sh]

Categories

(Firefox Build System :: Task Configuration, defect, P5)

defect

Tracking

(firefox90 wontfix, firefox91 fixed, firefox92 fixed)

RESOLVED FIXED
92 Branch
Tracking Status
firefox90 --- wontfix
firefox91 --- fixed
firefox92 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: tjr)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 1 obsolete file)

Filed by: csabou [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=345097612&repo=mozilla-beta
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/dqOFy4gJQc-XxxQzegRyaw/runs/0/artifacts/public/logs/live_backing.log


[task 2021-07-13T15:56:56.186Z] [2021-07-13 15:56:56] [build-err] 15:56:56     INFO - Running post-run listener: _shutdown_sccache
[task 2021-07-13T15:56:56.187Z] [2021-07-13 15:56:56] [build-err] 15:56:56     INFO - Running post-run listener: _summarize
[task 2021-07-13T15:56:56.187Z] [2021-07-13 15:56:56] [build-err] 15:56:56    ERROR - # TBPL FAILURE #
[task 2021-07-13T15:56:56.187Z] [2021-07-13 15:56:56] [build-err] 15:56:56     INFO - [mozharness: 2021-07-13 15:56:56.187133Z] FxDesktopBuild summary:
[task 2021-07-13T15:56:56.187Z] [2021-07-13 15:56:56] [build-err] 15:56:56    ERROR - # TBPL FAILURE #
[task 2021-07-13T15:56:56.263Z] [2021-07-13 15:56:56] [build-err] cleanup
[task 2021-07-13T15:56:56.263Z] [2021-07-13 15:56:56] [build-err] + cleanup
[task 2021-07-13T15:56:56.263Z] [2021-07-13 15:56:56] [build-err] + local rv=255
[task 2021-07-13T15:56:56.263Z] [2021-07-13 15:56:56] [build-err] + cleanup_xvfb
[task 2021-07-13T15:56:56.264Z] [2021-07-13 15:56:56] [build-err] ++ pidof Xvfb
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + local xvfb_pid=107
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + local vnc=false
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + local interactive=false
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + '[' -n 107 ']'
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + [[ false == false ]]
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + [[ false == false ]]
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + kill 107
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + screen -XS xvfb quit
[task 2021-07-13T15:56:56.509Z] [2021-07-13 15:56:56] [build-err] + true
[task 2021-07-13T15:56:56.509Z] [2021-07-13 15:56:56] [build] No screen session found.
[task 2021-07-13T15:56:56.509Z] [2021-07-13 15:56:56] [build-err] + exit 255
[task 2021-07-13T15:56:56.513Z] [2021-07-13 15:56:56] [ERROR] Spawned process exited abnormally (code 255; tried to run: [/builds/worker/fetches/codeql/tools/linux64/preload_tracer, /builds/worker/checkouts/gecko/taskcluster/scripts/builder/build-linux.sh])
[task 2021-07-13T15:56:56.514Z] A fatal error occurred: Exit status 255 from command [/builds/worker/checkouts/gecko/taskcluster/scripts/builder/build-linux.sh]
[taskcluster 2021-07-13 15:56:56.978Z] === Task Finished ===
[taskcluster 2021-07-13 15:56:57.239Z] Artifact "public/codeql/codeql-db-cpp.tar.xz" not found at "/builds/worker/codeql-db-cpp.tar.xz"
[taskcluster 2021-07-13 15:57:03.161Z] Unsuccessful task run with exit code: 2 completed in 1234.951 seconds
Flags: needinfo?(tom)
Flags: needinfo?(freddy)
Summary: Intermittent Beta - A fatal error occurred: Exit status 255 from command [/builds/worker/checkouts/gecko/taskcluster/scripts/builder/build-linux.sh] → Perma Beta - A fatal error occurred: Exit status 255 from command [/builds/worker/checkouts/gecko/taskcluster/scripts/builder/build-linux.sh]

Looking at a working and non-working job it seems like logalloc-replay might have started running when it wasn't before. Will need to investigate more.

Relevant error:

[task 2021-07-13T15:56:37.753Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> sed -n '/jemalloc_stats/,$p' test_output.log | /builds/worker/workspace/obj-build/_virtualenvs/common/bin/python /builds/worker/checkouts/gecko/memory/replace/logalloc/replay/logalloc_munge.py | diff -w - expected_output.log
[task 2021-07-13T15:56:37.753Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> sed: can't read test_output.log: No such file or directory
[task 2021-07-13T15:56:37.753Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> 0a1,17
[task 2021-07-13T15:56:37.753Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 jemalloc_stats()
[task 2021-07-13T15:56:37.753Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 malloc(42)=#1
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 malloc(24)=#2
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 free(#1)
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 posix_memalign(4096,1024)=#1
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 calloc(4,42)=#3
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 free(#2)
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 realloc(#3,84)=#2
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 aligned_alloc(256,1024)=#3
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 memalign(512,1024)=#4
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 valloc(1024)=#5
[task 2021-07-13T15:56:37.755Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 jemalloc_stats()
[task 2021-07-13T15:56:37.755Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 free(#5)
[task 2021-07-13T15:56:37.755Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 free(#4)
[task 2021-07-13T15:56:37.755Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 free(#3)
[task 2021-07-13T15:56:37.755Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 free(#2)
[task 2021-07-13T15:56:37.755Z] [2021-07-13 15:56:37] [build-err] 15:56:37     INFO -  check> > 1 1 free(#1)
[task 2021-07-13T15:56:37.755Z] [2021-07-13 15:56:37] [build-err] 15:56:37    ERROR -  check> make[4]: *** [Makefile:51: check] Error 1

The only relevant recent change wrt skipping the check is bug 1715553, and that made it skipped in more cases, not less.

Flags: needinfo?(freddy)

<s>logalloc is now being built because MOZ_REPLACE_MALLOC=1; whereas before it must not have been. It is not set in release presently.</s>

I take that back. logalloc is built on early beta but not late beta. It worked in early beta last release and broke this release. It's definetly something related to logalloc but not the fact that it is/isn't being built

Flags: needinfo?(tom)
Summary: Perma Beta - A fatal error occurred: Exit status 255 from command [/builds/worker/checkouts/gecko/taskcluster/scripts/builder/build-linux.sh] → Perma Beta codeql - A fatal error occurred: Exit status 255 from command [/builds/worker/checkouts/gecko/taskcluster/scripts/builder/build-linux.sh]

More debugging:

  • the logalloc makefile succeeds in an interactive task normally
  • however it fails when run under the codeql wrapper
  • I confirmed the codeql wrapper hasn't changed
  • Searching for relevant strings in the codeql binaries/folder like 'FUZZING', 'MALLOC_LOG', and 'test_output' do not produce hits. (I was wondering if something codeql was doing was clobbering things.)
  • By stracing the codeql wrapper with strace -f -v -s 65535 I confirmed when we execute logalloc-replay the environment var is populated: 26139 execve("./../../../../dist/bin/logalloc-replay", ["./../../../../dist/bin/logalloc-replay"], [... "MALLOC_LOG=test_output.log", ...] <unfinished ...>
  • But the writes are going to stderr: 26139 write(2, "allocated: 81888\n", 24 <unfinished ...>
  • No open() call for test_output.log is found.
  • Minimizing he execution to just strace -f -v -s 65535 -o strace_output.txt ~/fetches/codeql/codeql database create --language=cpp ~/codeql-2 --command='sh -c "MALLOC_LOG=test_output.log ./../../../../dist/bin/logalloc-replay < /builds/worker/checkouts/gecko/memory/replace/logalloc/replay/replay.log"' still produces the failure.
  • I tried doing 27144 execve("./../../../../dist/bin/logalloc-replay", ["./../../../../dist/bin/logalloc-replay"], [... "MALLOC_LOG=5", because 5 is a fd that has been used by codeql before and I wanted to see what would happen. but this produced write(5, "27144 140106744964928 malloc(42)=0x7f6d24a02070\n", 48) = -1 EBADF (Bad file descriptor) and again, no open() call for fd 5.
  • I started looking at the function that does MALLOC_LOG process, replace_init. Added debugging statements to it.
  • Confirmed that outside codeql, the debugging statements and program worked as expected.
  • Confirmed that inside codeql the debugging statements were output as expected
  • Figured out that when running it under codeql that after open() the fd was -1
  • Figured out that the open() call is never happening?!?!?! 27915 write(2, "XXX Opening it\n", 15 <unfinished ...> 27915 <... write resumed> ) = 15 27915 write(2, "XXX Done Opening it, fd: -1\n", 28 <unfinished ...> 27915 <... write resumed> ) = 28
  • On an absolute whim I decided to replace the call to open() with a call to openat(AT_FDCWD, ... because I had seen those in strace.

It worked.

glandium, do you have any idea what could be happening here, or what could have changed to cause this? At this point my best guess is a Docker image change that resulted in some new package doing... something different...?

Flags: needinfo?(mh+mozilla)

When called under the codeql wrapper, this open() call was not happening.
The openat() call does however happen. Why? I have no idea.

Assignee: nobody → tom
Status: NEW → ASSIGNED
Blocks: 1720928

(In reply to Tom Ritter [:tjr] (ni? for response to CVE/sec-approval/advisories/etc) from comment #5)

  • On an absolute whim I decided to replace the call to open() with a call to openat(AT_FDCWD, ... because I had seen those in strace.

It worked.

glibc's implementation of open prefers using openat internally; e.g., here's a program:

#include <unistd.h>
#include <fcntl.h>

int main(int argc, char **argv) {
	open(argv[1], O_RDONLY);
	return 0;
}

And here's strace:

openat(AT_FDCWD, "testfile", O_RDONLY) = 3

I don't know offhand what glibc version started doing that. But if the open isn't working at all: is there something interposing that function at the ELF linkage level, maybe? (i.e., not at the syscall level like seccomp-bpf)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #8)

I don't know offhand what glibc version started doing that. But if the open isn't working at all: is there something interposing that function at the ELF linkage level, maybe? (i.e., not at the syscall level like seccomp-bpf)

There's LD_PRELOAD function hooking going on; yes. I think that's the most likely candidate but again, I don't know what changed on our end to start breaking this /shrug

(In reply to Tom Ritter [:tjr] (ni? for response to CVE/sec-approval/advisories/etc) from comment #9)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #8)

I don't know offhand what glibc version started doing that. But if the open isn't working at all: is there something interposing that function at the ELF linkage level, maybe? (i.e., not at the syscall level like seccomp-bpf)

There's LD_PRELOAD function hooking going on; yes. I think that's the most likely candidate but again, I don't know what changed on our end to start breaking this /shrug

I've tried finding source code of their preload_tracer as well, but failed to do so.

Figured out that when running it under codeql that after open() the fd was -1

That's not what I'm observing. Here, open() returns 3, after having gone into the codeql machinery, and the write calls finish too. But that's when running under rr rather than strace.

Flags: needinfo?(mh+mozilla)

Here's my rr trace on pernosco: https://pernos.co/debug/SnlUwAMizs00ftnZMit3tA/index.html
I'll add some bookmarks to the notebook there for the relevant stuff.

Err, of course, I totally missed the fact that the file is output under rr.

Here's what's going on:

  • libtrace's static initializer calls semmle_init, which calls save_critical_env_vars, which allocates memory (practically speaking, save_critical_env_vars is inlined, so we're still in semmle_init).
  • the memory allocation triggers the allocator initialization, which initializes replace-malloc, which calls into the replace_init from LogAlloc.
  • LogAlloc's replace_init calls open to create the file given in MALLOC_LOG.
  • open is intercepted by libtrace. Remember from above: libtrace is still in the process of initializing itself.
  • libtrace's open calls semmle_init (yes, the same function we came here from).
  • The first thing semmle_init does is check whether initialization has started, which it has, per above, and return if that's the case.
  • now we're back in libtrace's open, which does some jumping around to finally call original_open, which is supposed to be glibc's open.
  • The problem? original_open is initialized in semmle_init, after save_critical_env_vars.

I have no idea how this ever worked. It works in rr because the malloc initialization ends up triggered at a different time.

The patch using openat only works because openat is not diverted by libtrace. It's only a matter of time before it does.

I'd just skip the check when one of the codeql environment variables is set, like CODEQL_DIST.

The codeql initialization process causes a loop in allocator
initialization that results in the inability to issue open()
calls due to function interception having begun set-up but
not completing it. (More details in bug.)

Attachment #9231614 - Attachment is obsolete: true

Comment on attachment 9232466 [details]
Bug 1720342: Do not run logalloc tests for the CodeQL build r?glandium

Beta/Release Uplift Approval Request

  • User impact if declined: codeql jobs on Beta releases will fail
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Only affects a codeql build
  • String changes made/needed:
Attachment #9232466 - Flags: approval-mozilla-beta?
Pushed by tritter@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0018c1e45b41
Do not run logalloc tests for the CodeQL build r=glandium
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 92 Branch

Comment on attachment 9232466 [details]
Bug 1720342: Do not run logalloc tests for the CodeQL build r?glandium

Approved for 91 beta 7, thanks.

Attachment #9232466 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: