Intermittent valgrind-test | non-zero exit codefrom Valgrind
Categories
(Core :: General, defect, P5)
People
(Reporter: intermittent-bug-filer, Unassigned)
Details
(Keywords: intermittent-failure, regression)
Filed by: nbeleuzu [at] mozilla.com
https://treeherder.mozilla.org/logviewer.html#?job_id=240933940&repo=autoland
The issue was caused by https://github.com/taskcluster/docker-worker/releases/tag/v201904171218
Log snippet:
[task 2019-04-17T12:41:30.436Z] 12:41:30 INFO - 0:03.10 --2368-- Reading syms from /lib/x86_64-linux-gnu/libdbus-1.so.3.7.2
[task 2019-04-17T12:41:30.437Z] 12:41:30 INFO - 0:03.10 --2368-- Considering /usr/lib/debug/.build-id/df/dc5a2010a3e0c72da4c5105d9e2cacaa77ae90.debug ..
[task 2019-04-17T12:41:30.437Z] 12:41:30 INFO - 0:03.10 --2368-- .. build-id is valid
[task 2019-04-17T12:41:30.458Z] 12:41:30 INFO - 0:03.12 --2368-- Reading syms from /usr/lib/x86_64-linux-gnu/libharfbuzz.so.0.909.0
[task 2019-04-17T12:41:30.458Z] 12:41:30 INFO - 0:03.12 --2368-- object doesn't have a symbol table
[task 2019-04-17T12:41:30.714Z] 12:41:30 INFO - 0:03.38 TEST-PASS | valgrind-test | valgrind found no errors
[task 2019-04-17T12:41:30.714Z] 12:41:30 INFO - 0:03.38 TEST-UNEXPECTED-FAIL | valgrind-test | non-zero exit codefrom Valgrind
[task 2019-04-17T12:41:30.860Z] 12:41:30 ERROR - Return code: 2
[task 2019-04-17T12:41:30.860Z] 12:41:30 WARNING - setting return code to 2
[task 2019-04-17T12:41:30.860Z] 12:41:30 FATAL - 'mach valgrind-test' did not run successfully. Please check log for errors.
[task 2019-04-17T12:41:30.860Z] 12:41:30 FATAL - Running post_fatal callback...
[task 2019-04-17T12:41:30.860Z] 12:41:30 FATAL - Exiting -1
[task 2019-04-17T12:41:30.860Z] 12:41:30 INFO - [mozharness: 2019-04-17 12:41:30.860217Z] Finished valgrind-test step (failed)
Comment hidden (Intermittent Failures Robot)
Comment 2•6 years ago
I investigated this a bit more. The debug symbols for libharfbuzz are not available in Debian, but have been in Ubuntu since Xenial. :tomprince, any ideas here?
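For reference, on an Ubuntu-based image the harfbuzz debug symbols could be pulled from the ddebs archive roughly as follows; this is only a sketch (the exact package name and repository setup are assumptions) and does not apply to a Debian-based image as-is:
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" > /etc/apt/sources.list.d/ddebs.list
apt-get install -y ubuntu-dbgsym-keyring   # signing key for the ddebs archive
apt-get update
apt-get install -y libharfbuzz0b-dbgsym    # assumed dbgsym package name for libharfbuzz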
Comment 3•6 years ago
I'm not sure if this is related to the missing symbols, and I'm not familiar with this test at all.
Comment 4•6 years ago
It's probably unrelated, but it's also probably worth looking at adding debug symbol packages for the libraries that are missing them. I will file a separate bug for that. Edit: Filed bug 1554039
As for the error itself, it would seem like Firefox itself is exiting with error code 2 without saying anything. But I might be wrong. Julian, what do you think?
Comment 5•6 years ago
The issue was caused by https://github.com/taskcluster/docker-worker/releases/tag/v201904171218
Practically speaking, what do those changes mean for the docker environment? Could it be e.g. the Firefox sandbox failing during startup because of some subtle container differences?
Comment 6•6 years ago
Come to think of it, this log doesn't look like valgrind even terminates normally. Maybe it was killed? OOM killer?
Comment 7•6 years ago
(In reply to Mike Hommey [:glandium] from comment #4)
As for the error itself, it would seem like Firefox itself is exiting with error code 2
The code in build/valgrind/mach_commands.py says the error code 2 comes from the script itself; the error code returned by valgrind is eaten. I wrote a patch to display the error code that valgrind returns, which might be useful, but this problem hasn't occurred since April 17 (I suppose because the docker-worker change was backed out), so I can't push the patch to try to see if it would give more useful insight...
Edit: filed bug 1554043
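As a side note, one way to see the status Valgrind itself exits with, rather than the harness's hard-coded 2, is to run the same command by hand and echo the shell status; --error-exitcode additionally makes memcheck errors distinguishable from crashes. This is just an illustration, with the arguments being whatever the harness uses:
valgrind --error-exitcode=77 <same arguments and binary as mach valgrind-test uses>; echo "valgrind exit status: $?"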
Comment 8•6 years ago
(In reply to Mike Hommey [:glandium] from comment #5)
The issue was caused by https://github.com/taskcluster/docker-worker/releases/tag/v201904171218
Practically speaking, what do those changes mean for the docker environment? Could it be e.g. the Firefox sandbox failing during startup because of some subtle container differences?
:wcosta is the one who knows about docker-worker.
Comment 9•6 years ago
(In reply to Mike Hommey [:glandium] from comment #6)
Come to think of it, this log doesn't look like valgrind even terminates normally. Maybe it was killed? OOM killer?
Seems plausible to me. If V runs out of memory (i.e., one of its mmap-zero requests fails), then it prints a very obvious message before quitting. But here, it appears to have simply disappeared without a trace, which sounds like the OOM killer.
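If the OOM killer is the suspect, a quick check is to look for its traces in the kernel log on the worker, assuming the task or host gives access to it:
dmesg -T | grep -iE 'out of memory|oom-killer|killed process'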
Comment 10•6 years ago
(In reply to Mike Hommey [:glandium] from comment #6)
Come to think of it, this log doesn't look like valgrind even terminates normally. Maybe it was killed? OOM killer?
An OOM would make the instance reboot and the task finish with "exception" status.
Comment 11•6 years ago
(In reply to Mike Hommey [:glandium] from comment #7)
(In reply to Mike Hommey [:glandium] from comment #4)
As for the error itself, it would seem like Firefox itself is exiting with error code 2
The code in build/valgrind/mach_commands.py says the error code 2 comes from the script itself; the error code returned by valgrind is eaten. I wrote a patch to display the error code that valgrind returns, which might be useful, but this problem hasn't occurred since April 17 (I suppose because the docker-worker change was backed out), so I can't push the patch to try to see if it would give more useful insight... Edit: filed bug 1554043
Note: this docker-worker release updated AMI to Ubuntu 18.04 and docker to 18.09.5.
The ami-test worker type contains the faulty AMIs; you can push there if you want to.
Comment 12•6 years ago
Tried the ami-test worker type, and this bug didn't reproduce:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=86c8bd074a909c8033e4a3e07ad9529e571b0bfb
Comment 13•6 years ago
Investigating a little further, I found out that valgrind is crashing. Running it inside an interactive task:
root@d5b6a2187a27:~/workspace/build/src/obj-firefox# valgrind.bin --sym-offsets=yes --smc-check=all-non-file --vex-iropt-register-updates=allregs-at-mem-access --gen-suppressions=all --num-callers=36 --leak-check=full --show-possibly-lost=no --track-origins=yes --trace-children=yes -v --fair-sched=yes --keep-debuginfo=yes --expensive-definedness-checks=yes --suppressions=/builds/worker/workspace/build/src/build/valgrind/cross-architecture.sup --suppressions=/builds/worker/workspace/build/src/build/valgrind/x86_64-pc-linux-gnu.sup /builds/worker/workspace/build/src/obj-firefox/dist/bin/firefox-bin http://127.0.0.1:42897/ -profile /tmp/tmpWSAM_c 2> log.txt
Segmentation fault (core dumped)
The command line was extracted from the ps output when running "mach valgrind-test"
Comment 14•6 years ago
Update: it is firefox-bin and not valgrind that is crashing. Stack trace:
(gdb) bt
#0 0x0000001009a03eb3 in ?? ()
#1 0x0000001008d8df30 in ?? ()
#2 0x0000000000001c10 in ?? ()
#3 0x0000001002008660 in ?? ()
#4 0x0000001002008660 in ?? ()
#5 0x0000001008d8df40 in ?? ()
#6 0x0000000000017a18 in LZ4_putPositionOnHash (p=<optimized out>, tableBase=<optimized out>, tableType=byU32, srcBase=0x0, h=<optimized out>)
at /builds/worker/workspace/build/src/mfbt/lz4.c:666
#7 LZ4_putPosition (p=<optimized out>, tableType=byU32, srcBase=0x0, tableBase=<optimized out>) at /builds/worker/workspace/build/src/mfbt/lz4.c:674
#8 LZ4_compress_generic (cctx=<optimized out>, source=<optimized out>, dest=<optimized out>, inputSize=<optimized out>, inputConsumed=0x0, maxOutputSize=<optimized out>,
outputDirective=limitedOutput, tableType=byU32, dictDirective=usingExtDict, dictIssue=noDictIssue, acceleration=0) at /builds/worker/workspace/build/src/mfbt/lz4.c:837
#9 LZ4_compress_fast_continue (LZ4_stream=0x58016800, source=0x1ffeff9fe8 <error: Cannot access memory at address 0x1ffeff9fe8>, dest=0x0, inputSize=0, maxOutputSize=0,
acceleration=0) at /builds/worker/workspace/build/src/mfbt/lz4.c:1492
#10 0x00000000000179df in LZ4_compress_generic (cctx=<optimized out>, source=<optimized out>, dest=<optimized out>, inputSize=<optimized out>, inputConsumed=0x0,
maxOutputSize=<optimized out>, outputDirective=limitedOutput, tableType=byU32, dictDirective=usingExtDict, dictIssue=noDictIssue,
acceleration=<error reading variable: Cannot access memory at address 0x580908d8>) at /builds/worker/workspace/build/src/mfbt/lz4.c:832
#11 LZ4_compress_fast_continue (LZ4_stream=0x58016800, source=0x1ffeff9fe8 <error: Cannot access memory at address 0x1ffeff9fe8>, dest=0x0, inputSize=0, maxOutputSize=0,
acceleration=<error reading variable: Cannot access memory at address 0x580908d8>) at /builds/worker/workspace/build/src/mfbt/lz4.c:1492
Backtrace stopped: Cannot access memory at address 0x5809091c
(gdb)
Comment 15•6 years ago
How can firefox-bin crash without valgrind catching it?
Comment 16•6 years ago
(In reply to Mike Hommey [:glandium] from comment #15)
How can firefox-bin crash without Valgrind catching it?
Honestly, I am not sure what's going on yet, except that Valgrind/Firefox crashes and generates a core file. My guess that it's firefox-bin is based on it being the only binary with some symbols in the stack trace, but it might be a false positive.
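A quick way to confirm which binary a core actually belongs to, before trusting the symbols, is to ask the core file itself (the paths here are illustrative):
file core            # prints the command line of the process that dumped the core
gdb -batch -ex bt /builds/worker/workspace/build/src/obj-firefox/dist/bin/firefox-bin core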
Updated•6 years ago
Comment 17•6 years ago
If I run the container with the SYS_ADMIN capability, it works. Unfortunately, this brings security concerns. :njn, do you or :jseward have any idea what could be done? I tried to identify precisely what Valgrind needs from CAP_SYS_ADMIN, but no success so far.
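For the record, this is the broad switch that made it work, plus two narrower knobs that are often what debugging tools actually need under Docker; whether they are sufficient here is untested, and the image name and command are placeholders:
docker run --rm -it --cap-add SYS_ADMIN <worker-image> ./mach valgrind-test        # works, but too broad
docker run --rm -it --cap-add SYS_PTRACE --security-opt seccomp=unconfined <worker-image> ./mach valgrind-test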
Comment 18•6 years ago
On looking at this again, it seems to me plausible that the Valgrind run was terminated because of some security problem, per :wcosta's comment 17. I've seen cases before where such runs simply disappear without comment, and this turned out also to be due to Valgrind trying to use facilities that a sandboxed process couldn't, even though Firefox itself, given the same sandboxing constraints, would run OK outside of Valgrind. Besides, if Fx really exited with a signal while running on V (e.g., segfaulting in the normal way), we would have seen some kind of message to that effect.
Comment 19•6 years ago
(In reply to Julian Seward [:jseward] from comment #18)
On looking at this again, it seems to me plausible that the Valgrind run was terminated because of some security problem, per :wcosta's comment 17. I've seen cases before where such runs simply disappear without comment, and this turned out also to be due to Valgrind trying to use facilities that a sandboxed process couldn't, even though Firefox itself, given the same sandboxing constraints, would run OK outside of Valgrind.
So, my investigation showed that Valgrind receives a SIGSEGV while trying to schedule a thread. On the Firefox side, the crash happens inside dlopen while loading libmozgtk.so.
My first guess was that Valgrind was reaching the stack size limit (it is configured to 8MB by default), and the capabilities man page says that if the process reaches its limit, the kernel sends it a SIGSEGV. But even if I configure it to unlimited, Valgrind still crashes. Another restriction I relaxed was the number of open files, but still no luck.
Besides, if Fx really exited with a signal while running on V (e.g., segfaulting in the normal way), we would have seen some kind of message to that effect.
That's because mach valgrind-test swallows the stderr message. If I run Valgrind directly, I can even get a core file, although it doesn't show anything useful.
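For completeness, the two limits relaxed above can be lifted either from the container runtime or inside an interactive task before launching Valgrind (the values and image name are illustrative):
docker run --rm -it --ulimit stack=-1 --ulimit nofile=65536:65536 <worker-image> ...
ulimit -s unlimited   # stack size, inside the task
ulimit -n 65536       # open files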
Comment 20•6 years ago
FWIW, I got a kernel stack of the SIGSEGV:
0xffffffff870ec7e0 : __send_signal+0x0/0x4a0 [kernel]
0xffffffff870ed47f : force_sig_info+0xaf/0xe0 [kernel]
0xffffffff870ee1c9 : force_sig_fault+0x59/0x80 [kernel]
0xffffffff870713b2 : do_page_fault+0x32/0x110 [kernel]
0xffffffff87a0110e : page_fault+0x1e/0x30 [kernel]
0xffffffff87a0110e : page_fault+0x1e/0x30 [kernel] (inexact)
Next week I will continue the investigation.
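For reference, a kernel stack like the one above can be captured with a small SystemTap probe on the signal-delivery path, assuming SystemTap and kernel debuginfo are available on the worker; the probe point and the process-name filter are assumptions:
stap -e 'probe kernel.function("force_sig_fault") {
  if (execname() == "valgrind.bin") { printf("%s (pid %d):\n", execname(), pid()); print_backtrace() }
}'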
Updated•6 years ago
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Updated•5 years ago
Updated•3 years ago