Perma Beta codeql - A fatal error occurred: Exit status 255 from command [/builds/worker/checkouts/gecko/taskcluster/scripts/builder/build-linux.sh]
Categories
(Firefox Build System :: Task Configuration, defect, P5)
Tracking
(firefox90 wontfix, firefox91 fixed, firefox92 fixed)
People
(Reporter: intermittent-bug-filer, Assigned: tjr)
References
Details
(Keywords: intermittent-failure)
Attachments
(1 file, 1 obsolete file)
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
|
Details | Review |
Filed by: csabou [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=345097612&repo=mozilla-beta
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/dqOFy4gJQc-XxxQzegRyaw/runs/0/artifacts/public/logs/live_backing.log
[task 2021-07-13T15:56:56.186Z] [2021-07-13 15:56:56] [build-err] 15:56:56 INFO - Running post-run listener: _shutdown_sccache
[task 2021-07-13T15:56:56.187Z] [2021-07-13 15:56:56] [build-err] 15:56:56 INFO - Running post-run listener: _summarize
[task 2021-07-13T15:56:56.187Z] [2021-07-13 15:56:56] [build-err] 15:56:56 ERROR - # TBPL FAILURE #
[task 2021-07-13T15:56:56.187Z] [2021-07-13 15:56:56] [build-err] 15:56:56 INFO - [mozharness: 2021-07-13 15:56:56.187133Z] FxDesktopBuild summary:
[task 2021-07-13T15:56:56.187Z] [2021-07-13 15:56:56] [build-err] 15:56:56 ERROR - # TBPL FAILURE #
[task 2021-07-13T15:56:56.263Z] [2021-07-13 15:56:56] [build-err] cleanup
[task 2021-07-13T15:56:56.263Z] [2021-07-13 15:56:56] [build-err] + cleanup
[task 2021-07-13T15:56:56.263Z] [2021-07-13 15:56:56] [build-err] + local rv=255
[task 2021-07-13T15:56:56.263Z] [2021-07-13 15:56:56] [build-err] + cleanup_xvfb
[task 2021-07-13T15:56:56.264Z] [2021-07-13 15:56:56] [build-err] ++ pidof Xvfb
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + local xvfb_pid=107
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + local vnc=false
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + local interactive=false
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + '[' -n 107 ']'
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + [[ false == false ]]
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + [[ false == false ]]
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + kill 107
[task 2021-07-13T15:56:56.265Z] [2021-07-13 15:56:56] [build-err] + screen -XS xvfb quit
[task 2021-07-13T15:56:56.509Z] [2021-07-13 15:56:56] [build-err] + true
[task 2021-07-13T15:56:56.509Z] [2021-07-13 15:56:56] [build] No screen session found.
[task 2021-07-13T15:56:56.509Z] [2021-07-13 15:56:56] [build-err] + exit 255
[task 2021-07-13T15:56:56.513Z] [2021-07-13 15:56:56] [ERROR] Spawned process exited abnormally (code 255; tried to run: [/builds/worker/fetches/codeql/tools/linux64/preload_tracer, /builds/worker/checkouts/gecko/taskcluster/scripts/builder/build-linux.sh])
[task 2021-07-13T15:56:56.514Z] A fatal error occurred: Exit status 255 from command [/builds/worker/checkouts/gecko/taskcluster/scripts/builder/build-linux.sh]
[taskcluster 2021-07-13 15:56:56.978Z] === Task Finished ===
[taskcluster 2021-07-13 15:56:57.239Z] Artifact "public/codeql/codeql-db-cpp.tar.xz" not found at "/builds/worker/codeql-db-cpp.tar.xz"
[taskcluster 2021-07-13 15:57:03.161Z] Unsuccessful task run with exit code: 2 completed in 1234.951 seconds
Updated•3 years ago
|
Updated•3 years ago
|
Assignee | ||
Comment 1•3 years ago
|
||
Looking at a working and non-working job it seems like logalloc-replay might have started running when it wasn't before. Will need to investigate more.
Comment 2•3 years ago
|
||
Relevant error:
[task 2021-07-13T15:56:37.753Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> sed -n '/jemalloc_stats/,$p' test_output.log | /builds/worker/workspace/obj-build/_virtualenvs/common/bin/python /builds/worker/checkouts/gecko/memory/replace/logalloc/replay/logalloc_munge.py | diff -w - expected_output.log
[task 2021-07-13T15:56:37.753Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> sed: can't read test_output.log: No such file or directory
[task 2021-07-13T15:56:37.753Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> 0a1,17
[task 2021-07-13T15:56:37.753Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 jemalloc_stats()
[task 2021-07-13T15:56:37.753Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 malloc(42)=#1
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 malloc(24)=#2
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 free(#1)
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 posix_memalign(4096,1024)=#1
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 calloc(4,42)=#3
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 free(#2)
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 realloc(#3,84)=#2
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 aligned_alloc(256,1024)=#3
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 memalign(512,1024)=#4
[task 2021-07-13T15:56:37.754Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 valloc(1024)=#5
[task 2021-07-13T15:56:37.755Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 jemalloc_stats()
[task 2021-07-13T15:56:37.755Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 free(#5)
[task 2021-07-13T15:56:37.755Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 free(#4)
[task 2021-07-13T15:56:37.755Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 free(#3)
[task 2021-07-13T15:56:37.755Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 free(#2)
[task 2021-07-13T15:56:37.755Z] [2021-07-13 15:56:37] [build-err] 15:56:37 INFO - check> > 1 1 free(#1)
[task 2021-07-13T15:56:37.755Z] [2021-07-13 15:56:37] [build-err] 15:56:37 ERROR - check> make[4]: *** [Makefile:51: check] Error 1
Comment 3•3 years ago
|
||
The only relevant recent change wrt skipping the check is bug 1715553, and that made it skipped in more cases, not less.
Assignee | ||
Comment 4•3 years ago
•
|
||
<s>logalloc is now being built because MOZ_REPLACE_MALLOC=1; whereas before it must not have been. It is not set in release presently.</s>
I take that back. logalloc is built on early beta but not late beta. It worked in early beta last release and broke this release. It's definetly something related to logalloc but not the fact that it is/isn't being built
Updated•3 years ago
|
Assignee | ||
Comment 5•3 years ago
|
||
More debugging:
- the logalloc makefile succeeds in an interactive task normally
- however it fails when run under the codeql wrapper
- I confirmed the codeql wrapper hasn't changed
- Searching for relevant strings in the codeql binaries/folder like 'FUZZING', 'MALLOC_LOG', and 'test_output' do not produce hits. (I was wondering if something codeql was doing was clobbering things.)
- By stracing the codeql wrapper with
strace -f -v -s 65535
I confirmed when we execute logalloc-replay the environment var is populated:26139 execve("./../../../../dist/bin/logalloc-replay", ["./../../../../dist/bin/logalloc-replay"], [... "MALLOC_LOG=test_output.log", ...] <unfinished ...>
- But the writes are going to stderr:
26139 write(2, "allocated: 81888\n", 24 <unfinished ...>
- No open() call for test_output.log is found.
- Minimizing he execution to just
strace -f -v -s 65535 -o strace_output.txt ~/fetches/codeql/codeql database create --language=cpp ~/codeql-2 --command='sh -c "MALLOC_LOG=test_output.log ./../../../../dist/bin/logalloc-replay < /builds/worker/checkouts/gecko/memory/replace/logalloc/replay/replay.log"'
still produces the failure. - I tried doing
27144 execve("./../../../../dist/bin/logalloc-replay", ["./../../../../dist/bin/logalloc-replay"], [... "MALLOC_LOG=5",
because 5 is a fd that has been used by codeql before and I wanted to see what would happen. but this producedwrite(5, "27144 140106744964928 malloc(42)=0x7f6d24a02070\n", 48) = -1 EBADF (Bad file descriptor)
and again, no open() call for fd 5. - I started looking at the function that does MALLOC_LOG process, replace_init. Added debugging statements to it.
- Confirmed that outside codeql, the debugging statements and program worked as expected.
- Confirmed that inside codeql the debugging statements were output as expected
- Figured out that when running it under codeql that after open() the fd was -1
- Figured out that the open() call is never happening?!?!?!
27915 write(2, "XXX Opening it\n", 15 <unfinished ...> 27915 <... write resumed> ) = 15 27915 write(2, "XXX Done Opening it, fd: -1\n", 28 <unfinished ...> 27915 <... write resumed> ) = 28
- On an absolute whim I decided to replace the call to open() with a call to
openat(AT_FDCWD, ...
because I had seen those in strace.
It worked.
glandium, do you have any idea what could be happening here, or what could have changed to cause this? At this point my best guess is a Docker image change that resulted in some new package doing... something different...?
Assignee | ||
Comment 6•3 years ago
|
||
When called under the codeql wrapper, this open() call was not happening.
The openat() call does however happen. Why? I have no idea.
Updated•3 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment 8•3 years ago
|
||
(In reply to Tom Ritter [:tjr] (ni? for response to CVE/sec-approval/advisories/etc) from comment #5)
- On an absolute whim I decided to replace the call to open() with a call to
openat(AT_FDCWD, ...
because I had seen those in strace.It worked.
glibc's implementation of open
prefers using openat
internally; e.g., here's a program:
#include <unistd.h>
#include <fcntl.h>
int main(int argc, char **argv) {
open(argv[1], O_RDONLY);
return 0;
}
And here's strace:
openat(AT_FDCWD, "testfile", O_RDONLY) = 3
I don't know offhand what glibc version started doing that. But if the open
isn't working at all: is there something interposing that function at the ELF linkage level, maybe? (i.e., not at the syscall level like seccomp-bpf)
Assignee | ||
Comment 9•3 years ago
|
||
(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #8)
I don't know offhand what glibc version started doing that. But if the
open
isn't working at all: is there something interposing that function at the ELF linkage level, maybe? (i.e., not at the syscall level like seccomp-bpf)
There's LD_PRELOAD function hooking going on; yes. I think that's the most likely candidate but again, I don't know what changed on our end to start breaking this /shrug
Comment 10•3 years ago
|
||
(In reply to Tom Ritter [:tjr] (ni? for response to CVE/sec-approval/advisories/etc) from comment #9)
(In reply to Jed Davis [:jld] ⟨⏰|UTC-6⟩ ⟦he/him⟧ from comment #8)
I don't know offhand what glibc version started doing that. But if the
open
isn't working at all: is there something interposing that function at the ELF linkage level, maybe? (i.e., not at the syscall level like seccomp-bpf)There's LD_PRELOAD function hooking going on; yes. I think that's the most likely candidate but again, I don't know what changed on our end to start breaking this /shrug
I've tried finding source code of their preload_tracer
as well, but failed to do so.
Comment 11•3 years ago
|
||
Figured out that when running it under codeql that after open() the fd was -1
That's not what I'm observing. Here, open() returns 3, after having gone into the codeql machinery, and the write calls finish too. But that's when running under rr rather than strace.
Comment 12•3 years ago
|
||
Here's my rr trace on pernosco: https://pernos.co/debug/SnlUwAMizs00ftnZMit3tA/index.html
I'll add some bookmarks to the notebook there for the relevant stuff.
Comment 13•3 years ago
|
||
Err, of course, I totally missed the fact that the file is output under rr.
Comment 14•3 years ago
|
||
Here's what's going on:
- libtrace's static initializer calls
semmle_init
, which callssave_critical_env_vars
, which allocates memory (practically speaking,save_critical_env_vars
is inlined, so we're still insemmle_init
). - the memory allocation triggers the allocator initialization, which initializes replace-malloc, which calls into the
replace_init
from LogAlloc. - LogAlloc's
replace_init
callsopen
to create the file given inMALLOC_LOG
. open
is intercepted by libtrace. Remember from above: libtrace is still in the process of initializing itself.- libtrace's
open
callssemmle_init
(yes, the same function we came here from). - The first thing
semmle_init
does is check whether initialization has started, which it has, per above, and return if that's the case. - now we're back in libtrace's
open
, which does some jumping around to finally calloriginal_open
, which is supposed to be glibc'sopen
. - The problem?
original_open
is initialized insemmle_init
, aftersave_critical_env_vars
.
I have no idea how this ever worked. It works in rr because the malloc initialization ends up triggered at a different time.
The patch using openat only works because openat is not diverted by libtrace. It's only a matter of time before it does.
I'd just skip the check when one of the codeql environment variables is set, like CODEQL_DIST
.
Assignee | ||
Comment 15•3 years ago
|
||
The codeql initialization process causes a loop in allocator
initialization that results in the inability to issue open()
calls due to function interception having begun set-up but
not completing it. (More details in bug.)
Updated•3 years ago
|
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 16•3 years ago
|
||
Comment on attachment 9232466 [details]
Bug 1720342: Do not run logalloc tests for the CodeQL build r?glandium
Beta/Release Uplift Approval Request
- User impact if declined: codeql jobs on Beta releases will fail
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Only affects a codeql build
- String changes made/needed:
Comment 17•3 years ago
|
||
Pushed by tritter@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0018c1e45b41 Do not run logalloc tests for the CodeQL build r=glandium
Comment 18•3 years ago
|
||
bugherder |
Comment 19•3 years ago
|
||
Comment on attachment 9232466 [details]
Bug 1720342: Do not run logalloc tests for the CodeQL build r?glandium
Approved for 91 beta 7, thanks.
Comment 20•3 years ago
|
||
bugherder uplift |
Comment hidden (Intermittent Failures Robot) |
Updated•3 years ago
|
Description
•