Open Bug 1545094 Opened 4 months ago Updated Last month

Intermittent valgrind-test | non-zero exit codefrom Valgrind

Categories

(Core :: General, defect, P5)

Tracking

(Status: ASSIGNED)

People

(Reporter: intermittent-bug-filer, Assigned: wcosta)

References

Details

(Keywords: intermittent-failure, regression)

Filed by: nbeleuzu [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=240933940&repo=autoland

https://queue.taskcluster.net/v1/task/V6SEPMoaQ_ClPI7FW-IAag/runs/0/artifacts/public/logs/live_backing.log

The issue was caused by https://github.com/taskcluster/docker-worker/releases/tag/v201904171218

Log snippet:

[task 2019-04-17T12:41:30.436Z] 12:41:30 INFO - 0:03.10 --2368-- Reading syms from /lib/x86_64-linux-gnu/libdbus-1.so.3.7.2
[task 2019-04-17T12:41:30.437Z] 12:41:30 INFO - 0:03.10 --2368-- Considering /usr/lib/debug/.build-id/df/dc5a2010a3e0c72da4c5105d9e2cacaa77ae90.debug ..
[task 2019-04-17T12:41:30.437Z] 12:41:30 INFO - 0:03.10 --2368-- .. build-id is valid
[task 2019-04-17T12:41:30.458Z] 12:41:30 INFO - 0:03.12 --2368-- Reading syms from /usr/lib/x86_64-linux-gnu/libharfbuzz.so.0.909.0
[task 2019-04-17T12:41:30.458Z] 12:41:30 INFO - 0:03.12 --2368-- object doesn't have a symbol table
[task 2019-04-17T12:41:30.714Z] 12:41:30 INFO - 0:03.38 TEST-PASS | valgrind-test | valgrind found no errors
[task 2019-04-17T12:41:30.714Z] 12:41:30 INFO - 0:03.38 TEST-UNEXPECTED-FAIL | valgrind-test | non-zero exit codefrom Valgrind
[task 2019-04-17T12:41:30.860Z] 12:41:30 ERROR - Return code: 2
[task 2019-04-17T12:41:30.860Z] 12:41:30 WARNING - setting return code to 2
[task 2019-04-17T12:41:30.860Z] 12:41:30 FATAL - 'mach valgrind-test' did not run successfully. Please check log for errors.
[task 2019-04-17T12:41:30.860Z] 12:41:30 FATAL - Running post_fatal callback...
[task 2019-04-17T12:41:30.860Z] 12:41:30 FATAL - Exiting -1
[task 2019-04-17T12:41:30.860Z] 12:41:30 INFO - [mozharness: 2019-04-17 12:41:30.860217Z] Finished valgrind-test step (failed)

Blocks: 1527107

I investigated this a bit more. The debug symbols for libharfbuzz are not available in Debian, but are in Ubuntu since Xenial. :tomprince, any ideas here?

Flags: needinfo?(mozilla)

I'm not sure if this is related to the missing symbols, and I'm not familiar with this test at all.

Flags: needinfo?(mozilla) → needinfo?(mh+mozilla)

It's probably unrelated, but it's also worth looking into adding debug symbol packages for the packages that are missing them. I will file a separate bug for that. Edit: Filed bug 1554039

As for the error itself, it would seem like Firefox itself is exiting with error code 2 without saying anything. But I might be wrong. Julian, what do you think?

Flags: needinfo?(mh+mozilla) → needinfo?(jseward)

The issue was caused by https://github.com/taskcluster/docker-worker/releases/tag/v201904171218

Practically speaking, what do those changes mean for the docker environment? Could it be e.g. the Firefox sandbox failing during startup because of some subtle container differences?

Flags: needinfo?(mozilla)
Flags: needinfo?(dustin)

Come to think of it, this log doesn't look like valgrind even terminates normally. Maybe it was killed? OOM killer?

(In reply to Mike Hommey [:glandium] from comment #4)

As for the error itself, it would seem like Firefox itself is exiting with error code 2

The code in build/valgrind/mach_commands.py says the error code 2 comes from the script itself. The error code returned by valgrind is eaten. I wrote a patch to display the error code that valgrind returns, which might be useful, but this problem hasn't occurred since April 17, I suppose because the docker-worker change was backed out, so I can't push the patch to try to see if it would give some more useful insight...

Edit: filed bug 1554043
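
As a hedged illustration of the distinction (this is not the actual patch from bug 1554043; the valgrind options and paths below are placeholders), one can run the same command by hand and check valgrind's own exit status rather than the wrapper's:

# Illustrative only; <valgrind options> and the firefox-bin/profile paths stand in for
# the real command line that "mach valgrind-test" builds.
valgrind <valgrind options> dist/bin/firefox-bin -profile /tmp/profile 2> valgrind.log
echo "valgrind exit status: $?"   # this is the value that mach_commands.py currently discards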

(In reply to Mike Hommey [:glandium] from comment #5)

The issue was caused by https://github.com/taskcluster/docker-worker/releases/tag/v201904171218

Practically speaking, what do those changes mean for the docker environment? Could it be e.g. the Firefox sandbox failing during startup because of some subtle container differences?

:wcosta is the one who knows about docker-worker.

Flags: needinfo?(wcosta)
Flags: needinfo?(mozilla)
Flags: needinfo?(dustin)

(In reply to Mike Hommey [:glandium] from comment #6)

Come to think of it, this log doesn't look like valgrind even terminates normally. Maybe it was killed? OOM killer?

Seems plausible to me. If V runs out of memory (i.e., one of its mmap-zero requests fails) then it prints a very obvious message before quitting. But here, it appears to have simply disappeared without a trace, which sounds like the OOM killer.

Flags: needinfo?(jseward)

(In reply to Mike Hommey [:glandium] from comment #6)

Come to think of it, this log doesn't look like valgrind even terminates
normally. Maybe it was killed? OOM killer?

OOM would make the instance reboot and the task finish with "exception" status.

Flags: needinfo?(wcosta)

(In reply to Mike Hommey [:glandium] from comment #7)

(In reply to Mike Hommey [:glandium] from comment #4)

As for the error itself, it would seem like Firefox itself is exiting with error code 2

The code in build/valgrind/mach_commands.py says the error code 2 comes
from the script itself. The error code returned by valgrind is eaten. I
wrote a patch to display the error code that valgrind returns, which might
be useful, but this problem hasn't occurred since April 17, I suppose
because the docker-worker change was backed out, so I can't push the patch
to try to see if it would give some more useful insight...

Edit: filed bug 1554043

Note: this docker-worker release updated the AMI to Ubuntu 18.04 and Docker to 18.09.5.
The ami-test worker type contains the faulty AMIs; you can push there if you want to.

Investigating a little further, I found out that valgrind is crashing. Running it inside an interactive task:

root@d5b6a2187a27:~/workspace/build/src/obj-firefox# valgrind.bin --sym-offsets=yes --smc-check=all-non-file \
  --vex-iropt-register-updates=allregs-at-mem-access --gen-suppressions=all --num-callers=36 --leak-check=full \
  --show-possibly-lost=no --track-origins=yes --trace-children=yes -v --fair-sched=yes --keep-debuginfo=yes \
  --expensive-definedness-checks=yes \
  --suppressions=/builds/worker/workspace/build/src/build/valgrind/cross-architecture.sup \
  --suppressions=/builds/worker/workspace/build/src/build/valgrind/x86_64-pc-linux-gnu.sup \
  /builds/worker/workspace/build/src/obj-firefox/dist/bin/firefox-bin http://127.0.0.1:42897/ -profile /tmp/tmpWSAM_c 2> log.txt
Segmentation fault (core dumped)

The command line was extracted from the ps output when running "mach valgrind-test"

Update: it is firefox-bin and not valgrind that is crashing. Stack trace:

(gdb) bt
#0 0x0000001009a03eb3 in ?? ()
#1 0x0000001008d8df30 in ?? ()
#2 0x0000000000001c10 in ?? ()
#3 0x0000001002008660 in ?? ()
#4 0x0000001002008660 in ?? ()
#5 0x0000001008d8df40 in ?? ()
#6 0x0000000000017a18 in LZ4_putPositionOnHash (p=<optimized out>, tableBase=<optimized out>, tableType=byU32, srcBase=0x0, h=<optimized out>)
at /builds/worker/workspace/build/src/mfbt/lz4.c:666
#7 LZ4_putPosition (p=<optimized out>, tableType=byU32, srcBase=0x0, tableBase=<optimized out>) at /builds/worker/workspace/build/src/mfbt/lz4.c:674
#8 LZ4_compress_generic (cctx=<optimized out>, source=<optimized out>, dest=<optimized out>, inputSize=<optimized out>, inputConsumed=0x0, maxOutputSize=<optimized out>,
outputDirective=limitedOutput, tableType=byU32, dictDirective=usingExtDict, dictIssue=noDictIssue, acceleration=0) at /builds/worker/workspace/build/src/mfbt/lz4.c:837
#9 LZ4_compress_fast_continue (LZ4_stream=0x58016800, source=0x1ffeff9fe8 <error: Cannot access memory at address 0x1ffeff9fe8>, dest=0x0, inputSize=0, maxOutputSize=0,
acceleration=0) at /builds/worker/workspace/build/src/mfbt/lz4.c:1492
#10 0x00000000000179df in LZ4_compress_generic (cctx=<optimized out>, source=<optimized out>, dest=<optimized out>, inputSize=<optimized out>, inputConsumed=0x0,
maxOutputSize=<optimized out>, outputDirective=limitedOutput, tableType=byU32, dictDirective=usingExtDict, dictIssue=noDictIssue,
acceleration=<error reading variable: Cannot access memory at address 0x580908d8>) at /builds/worker/workspace/build/src/mfbt/lz4.c:832
#11 LZ4_compress_fast_continue (LZ4_stream=0x58016800, source=0x1ffeff9fe8 <error: Cannot access memory at address 0x1ffeff9fe8>, dest=0x0, inputSize=0, maxOutputSize=0,
acceleration=<error reading variable: Cannot access memory at address 0x580908d8>) at /builds/worker/workspace/build/src/mfbt/lz4.c:1492
Backtrace stopped: Cannot access memory at address 0x5809091c
(gdb)

How can firefox-bin crash without valgrind catching it?

Flags: needinfo?(jseward)

(In reply to Mike Hommey [:glandium] from comment #15)

How can firefox-bin crash without Valgrind catching it?

Honestly, I am not sure what's going on yet, except that Valgrind/firefox crashes and generates a core file. My guess that it is firefox-bin is based on it being the only binary with symbols in the stack trace, but it might be a false positive.

Assignee: nobody → wcosta
Status: NEW → ASSIGNED

If I run the container with the SYS_ADMIN capability, it works. Unfortunately, this brings security concerns. :njn, do you or :jseward have any idea what could be done? I tried to identify precisely which of the privileges bundled into CAP_SYS_ADMIN Valgrind needs, but with no success so far.
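
A minimal sketch of the workaround, assuming the container is started by hand (the image name is a placeholder; the real docker-worker invocation isn't shown here):

# Grant the container CAP_SYS_ADMIN; this makes the Valgrind run succeed, but the
# capability is far too broad to be acceptable for CI workers.
docker run --rm -it --cap-add=SYS_ADMIN example/desktop-build-image bash
# ...then run "mach valgrind-test" inside the container as usual.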

Flags: needinfo?(n.nethercote)

On looking at this again, it seems to me plausible that the Valgrind run was terminated because of some security problem, per :wcosta's comment 17. I've seen cases before where such runs simply disappear without comment, and this turned out also to be due to Valgrind trying to use facilities that a sandboxed process couldn't, even though Firefox itself, given the same sandboxing constraints, would run OK outside of Valgrind. Besides, if Fx really exited with a signal while running on V (eg, segfaulting in the normal way), we would have seen some kind of messages to that effect.

Flags: needinfo?(jseward)

(In reply to Julian Seward [:jseward] from comment #18)

On looking at this again, it seems to me plausible that the Valgrind run was terminated because of some security problem, per :wcosta's comment 17. I've seen cases before where such runs simply disappear without comment, and this turned out also to be due to Valgrind trying to use facilities that a sandboxed process couldn't, even though Firefox itself, given the same sandboxing constraints, would run OK outside of Valgrind.

So, my investigation showed that Valgrind receives a SIGSEGV while trying to schedule a thread. On the Firefox side, the crash happens inside dlopen while loading libmozgtk.so.

My first guess was that Valgrind was hitting the stack size limit (it is configured to 8MB by default), and the capabilities man page says that if the process exceeds its limit, the kernel sends it a SIGSEGV signal. But even if I configure it to unlimited, Valgrind still crashes. Another restriction I relaxed was the number of open files, but still no luck.
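
A minimal sketch of relaxing those two limits inside the interactive task, assuming a plain shell (the exact values used aren't recorded in this bug):

# Illustrative only: lift the limits in the shell that will launch Valgrind.
ulimit -s unlimited    # stack size; the default soft limit is 8MB
ulimit -n 65536        # open files; 65536 is a placeholder value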

Besides, if Fx really exited with a signal while running on V (eg, segfaulting in the normal way), we would have seen some kind of messages to that effect.

That's because mach valgrind-test swallows the stderr message. If I run Valgrind directly, I can even get a core file; although it doesn't show anything useful.
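
A minimal sketch of getting the core file when running Valgrind directly, assuming core dumps land in the current directory (which binary to hand to gdb is unclear, so it is left as a placeholder):

ulimit -c unlimited                                    # allow core dumps
<valgrind command from the interactive task above> 2> valgrind-stderr.log
echo "exit status: $?"                                 # not swallowed this time
gdb <binary> core                                      # e.g. firefox-bin or valgrind.bin, whichever actually crashed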

FWIW, I got a kernel stack trace of the SIGSEGV:

0xffffffff870ec7e0 : __send_signal+0x0/0x4a0 [kernel]
0xffffffff870ed47f : force_sig_info+0xaf/0xe0 [kernel]
0xffffffff870ee1c9 : force_sig_fault+0x59/0x80 [kernel]
0xffffffff870713b2 : do_page_fault+0x32/0x110 [kernel]
0xffffffff87a0110e : page_fault+0x1e/0x30 [kernel]
0xffffffff87a0110e : page_fault+0x1e/0x30 [kernel] (inexact)
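
One way to capture such a kernel-side backtrace is with SystemTap; this is a minimal sketch, since the tool actually used isn't recorded here:

# Print a kernel backtrace whenever a SIGSEGV is delivered (illustrative one-liner).
stap -e 'probe signal.send { if (sig_name == "SIGSEGV") { printf("%s -> %s\n", execname(), pid_name); print_backtrace() } }'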

Next week I will continue the investigation.

Flags: needinfo?(n.nethercote)