1382280 - Update the version of Valgrind in the desktop-build image to one with unstripped shared libraries

Reporter

Description

•

7 years ago

bug 1338651 is hung up on some new Valgrind failures. bug 1338651 comment 166 shows an unhelpful stack: > ==42858== 16 bytes in 1 blocks are definitely lost in loss record 46 of 256 > ==42858== at 0x4C2B525: operator new(unsigned long) (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) > ==42858== by 0x17509424: ??? > ==42858== by 0x1779BBA5: ??? > ==42858== by 0x16ADCCAA: ??? > ==42858== by 0xFFF0001B2: ??? > ==42858== by 0x772F72656B726F76: ??? > ==42858== by 0x65636170736B726E: ??? > ==42858== by 0x732F646C6975622E: ??? As it turns out, the version of valgrind we've been using for the valgrind *build* jobs (in desktop-test) is some rpm of uncertain provenence (I didn't dig too far): https://dxr.mozilla.org/mozilla-central/rev/1b065ffd8a535a0ad4c39a912af18e948e6a42c1/taskcluster/docker/centos6-build/system-setup.sh#302 sewardj noted that elsewhere in the log Valgrind complained that vgpreload_memcheck-amd64-linux.so was stripped, and when I downloaded the RPM from tooltool and unpacked it, sure enough: $ file usr/lib64/valgrind/vgpreload_core-amd64-linux.so usr/lib64/valgrind/vgpreload_core-amd64-linux.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=6fc247b9ca7a6772443d9e7dc18b30065fa87a03, stripped It turns out that this breaks Valgrind's ability to unwind past things like `operator new`, which is why we get that garbage stack. For the valgrind-mochitest jobs, which run in the Ubuntu test images, we install a newer version of Valgrind from a tarball that sewardj built: https://dxr.mozilla.org/mozilla-central/rev/1b065ffd8a535a0ad4c39a912af18e948e6a42c1/taskcluster/docker/recipes/ubuntu1604-test-system-setup.sh#109 He made sure this build was not stripped, because of this issue. We need to, at the very least, build a new Valgrind for CentOS6 that is not stripped, and switch the centos6-build system-setup to install that instead. That should let us get a useful stack out of this failure and either fix or suppress it. Extra niceties we could do: 1) Update to a much-newer Valgrind--that version is pretty old. sewardj recommended latest stable, which is 3.13.0 at this time. 2) Write a toolchain task to build Valgrind, instead of having someone build it manually and upload a tarball to tooltool. 3) Update the Valgrind version in the Ubuntu testers to match (also making a toolchain task for that binary would be nice).

Mike Hommey [:glandium]

Comment 1

•

7 years ago

3) doesn't seem useful. The result of 2 should be usable on ubuntu testers.

Wander Lairson Costa

Assignee

Updated

•

7 years ago

Assignee: nobody → wcosta

Status: NEW → ASSIGNED

Wander Lairson Costa

Assignee

Comment 2

•

7 years ago

Going to be KISS now, I wrote a proposal [1] to not only valgrind, but all things built inside taskcluster be automatically downloaded as an artifact, avoiding manual, tedious and error prone interventions. [1] https://github.com/taskcluster/taskcluster-rfcs/issues/84

Mike Hommey [:glandium]

Comment 3

•

7 years ago

*cough* *cough* work is under way for that in bug 1313111, where the actual task graph stuff is in bug 1374940, a review away from landing, and have patches mostly ready for clang, gcc, libdmg, hfsplug and cctools-port.

Mike Hommey [:glandium]

Comment 4

•

7 years ago

> and have patches and *I* have patches

Mike Hommey [:glandium]

Comment 5

•

7 years ago

oh, and sccache too. So really, for this bug, you should just care about creating a taskcluster job that builds valgrind.

Julian Seward [:jseward]

Comment 6

•

7 years ago

Upgrading to V latest (3.13.0) is probably also important for bug 1365915.

Blocks: 1365915

(not currently active) Ted Mielczarek

Reporter

Comment 7

•

7 years ago

(In reply to Mike Hommey [:glandium] from comment #1) > 3) doesn't seem useful. The result of 2 should be usable on ubuntu testers. Well, then making both Docker images install the same Valgrind build would be fine, but it's still useful to have a single version of Valgrind in use for all our CI.

(not currently active) Ted Mielczarek

Reporter

Comment 8

•

7 years ago

(In reply to Wander Lairson Costa [:wcosta] from comment #2) > Going to be KISS now, I wrote a proposal [1] to not only valgrind, but all To be clear, it's 100% fine to just fix this bug as filed! I do think writing a toolchain task to do the Valgrind build in taskcluster would be a good payoff, as we've found with our other toolchain tasks, plus it sets things up to work nicely with glandium's work in bug 1313111.

Wander Lairson Costa

Assignee

Comment 9

•

7 years ago

I did rebuilt valgrind with full symbols, but *that* non symbol resolved case still happens... https://treeherder.mozilla.org/#/jobs?repo=try&revision=492a8578d6fae8042a42303f3960cf5b2024defa&selectedJob=116508990 I wonder if valgrind has some way to tell which module the stack frame comes from.

Mike Hommey [:glandium]

Comment 10

•

7 years ago

Old log (from comment 0): > ==42858== at 0x4C2B525: operator new(unsigned long) (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) New log (from comment 9): > ==42927== at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332) So that *did* something, but it didn't fix the missing symbols. You can add -d to the flags passed to valgrind, and it will output some info that should end up interesting. I don't remember if there's something less verbose that would just display the mmapped stuff.

Wander Lairson Costa

Assignee

Comment 11

•

7 years ago

I ran a try push with -d [1] I noticed that the unresolved stack trace started in libLLVM-3.6-mesa.so: [snip] # ************************************ # Notice the address range here... # ************************************ --42915-- Discarding syms at 0x16b7bce0-0x17839bb6 in /usr/lib64/libLLVM-3.6-mesa.so due to munmap() [snip] TEST-UNEXPECTED-FAIL | valgrind-test | 8 bytes in 1 blocks are definitely lost at operator / ??? / ??? / ??? ==42915== 8 bytes in 1 blocks are definitely lost in loss record 29 of 256 ==42915== at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332) # ************************************ # ...these stack frames fall into it # ************************************ ==42915== by 0x4C29DB5: ??? ==42915== by 0x175BC08D: ??? ==42915== by 0x1759E642: ??? ==42915== by 0x175C0914: ??? ==42915== by 0x17839BA5: ??? ==42920== by 0x16B7ACAA: ??? [snip] I then looked for libLLVM-3.6-mesa in logs and found: --42920-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so --42920-- object doesn't have a symbol table Some internet search and I found that doing: debuginfo-install -y mesa-private-llvm Would install debug information for this library (also found out that my former method to install all system debug symbols was flawed). Another try push [2]: --42915-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so --42915-- Considering /usr/lib/debug/.build-id/ec/5d9d345c2d547a19279d0ee052308bb1c77948.debug .. --42915-- .. build-id is valid But I still get the same behavior: [snip] --42915-- Discarding syms at 0x16b7cce0-0x1783abb6 in /usr/lib64/libLLVM-3.6-mesa.so due to munmap() [snip] ==42915== 8 bytes in 1 blocks are definitely lost in loss record 29 of 256 ==42915== at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332) ==42915== by 0x175BD08D: ??? ==42915== by 0x1759F642: ??? ==42915== by 0x175C1914: ??? ==42915== by 0x1783ABA5: ??? [snip] At the moment I am writing this comment, I have a new try push running with more debuginfo packages installed [3]. Will look at results tomorrow morning. Any ideas/suggestions? [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=7380e7a67f31&selectedJob=116649677 [2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=9d515e6abc8fe25324ead5663bbe03b7d36e6592&selectedJob=116750252 [3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=78fd821f493f

Flags: needinfo?(mh+mozilla)

Flags: needinfo?(jseward)

Wander Lairson Costa

Assignee

Comment 12

•

7 years ago

(In reply to Wander Lairson Costa [:wcosta] from comment #11) [snip] > # ************************************ > # ...these stack frames fall into it > # ************************************ > ==42915== by 0x4C29DB5: ??? > ==42915== by 0x175BC08D: ??? > ==42915== by 0x1759E642: ??? > ==42915== by 0x175C0914: ??? > ==42915== by 0x17839BA5: ??? > ==42920== by 0x16B7ACAA: ??? Oops, the correct log stack trace is: ==42920== at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332) ==42920== by 0x175BC08D: ??? ==42920== by 0x1759E642: ??? ==42920== by 0x175C0914: ??? ==42920== by 0x17839BA5: ??? ==42920== by 0x16B7ACAA: ??? [snip]

Julian Seward [:jseward]

Comment 13

•

7 years ago

(In reply to Wander Lairson Costa [:wcosta] from comment #11) > --42915-- Discarding syms at 0x16b7bce0-0x17839bb6 in > /usr/lib64/libLLVM-3.6-mesa.so due to munmap() This is an old (very old) limitation of Valgrind. When a library is dlclosed (which appears at a low level as the munmap() that is mentioned) then Valgrind throws away the debuginfo for the library immediately. As a result, you have correct stacks but they can no longer be symbolised because the relevant debuginfo has been discarded. There's no feasible workaround, apart from "don't unmap any library", which isn't practical because it's often not under our control. We've always put off fixing it because it would be a substantial chunk of work to redo how Valgrind manages debuginfo loading/unloading.

Flags: needinfo?(jseward)

(not currently active) Ted Mielczarek

Reporter

Comment 14

•

7 years ago

Julian: any thoughts on how we can work around this to unblock this work? I'm guessing this is a leak in the Mesa LLVM internals, which is the sort of thing we'd normally just write a suppression for, but given the fact that it gets unloaded before shutdown we don't get a symbolicated stack there.

Flags: needinfo?(jseward)

(not currently active) Ted Mielczarek

Reporter

Comment 15

•

7 years ago

One thing we might try is setting `LD_PRELOAD=/usr/lib64/libLLVM-3.6-mesa.so` in the environment for the Valgrind test run, to keep that shared library from being unloaded. You could do that here: https://dxr.mozilla.org/mozilla-central/rev/7d2e89fb92331d7e4296391213c1e63db628e046/build/valgrind/mach_commands.py#90 Also, after some googling, I would bet that getting a version of Mesa with this patch would fix this leak: https://github.com/mesa3d/mesa/commit/83c62597fc8eb38bf274fa1a3ca03c6adafb4bf9 (but that's idle speculation)

Julian Seward [:jseward]

Comment 16

•

7 years ago

FTR, the underlying Valgrind bug is at https://bugs.kde.org/show_bug.cgi?id=79362. Per irc chat with Ted, I'm contemplating whether it's feasible to partially fix this, so that code addresses in objects that are unmapped are tagged with the object name. This would mean that for example, the offending stack trace, instead of appearing like this 8 bytes in 1 blocks are definitely lost in loss record 29 of 256 at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332) by 0x175B208D: ??? by 0x17594642: ??? by 0x175B6914: ??? by 0x1782FBA5: ??? would appear like this 8 bytes in 1 blocks are definitely lost in loss record 29 of 256 at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332) by 0x175B208D: (in libLLVM.so) by 0x17594642: (in libLLVM.so) by 0x175B6914: (in someOtherUnmappedLib.so) by 0x1782FBA5: (in etc.so) which would give at least some basis on which to write a suppression for the leak, since suppression stack frames can also match object names.

Flags: needinfo?(jseward)

Mike Hommey [:glandium]

Comment 17

•

7 years ago

(In reply to Julian Seward [:jseward] from comment #16) > FTR, the underlying Valgrind bug is at > https://bugs.kde.org/show_bug.cgi?id=79362. > Per irc chat with Ted, I'm contemplating whether it's feasible to partially > fix this, so that code addresses in objects that are unmapped are tagged > with the object name. This would mean that for example, the offending stack > trace, instead of appearing like this Why not just take the fix from that bug?

Flags: needinfo?(mh+mozilla)

Mike Hommey [:glandium]

Comment 18

•

7 years ago

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #15) > One thing we might try is setting > `LD_PRELOAD=/usr/lib64/libLLVM-3.6-mesa.so` in the environment for the > Valgrind test run, to keep that shared library from being unloaded. You > could do that here: > https://dxr.mozilla.org/mozilla-central/rev/ > 7d2e89fb92331d7e4296391213c1e63db628e046/build/valgrind/mach_commands.py#90 > > Also, after some googling, I would bet that getting a version of Mesa with > this patch would fix this leak: > https://github.com/mesa3d/mesa/commit/ > 83c62597fc8eb38bf274fa1a3ca03c6adafb4bf9 > > (but that's idle speculation) Alternatively, can we prevent mesa from being upgraded by whatever docker image changes are happening?

Julian Seward [:jseward]

Comment 19

•

7 years ago

(In reply to Mike Hommey [:glandium] from comment #17) > (In reply to Julian Seward [:jseward] from comment #16) > > FTR, the underlying Valgrind bug is at > > https://bugs.kde.org/show_bug.cgi?id=79362. > Why not just take the fix from that bug? I commented more about it at https://bugs.kde.org/show_bug.cgi?id=79362#c59. I'm fairly sure that fix is wrong -- not wrong in some small detail, but broken at an architectural/concept level, so as to give wrong code-address to function-name/line-number mappings in some circumstances. That said, from the sound of comments in that bug, the fix actually works fairly well. As a short term workaround, we could perhaps use a version of Valgrind with a rebased version of the patch, so as to unblock this bug (1382280, I mean). I wouldn't want the patch upstream in V (if it indeed turns out to be wrong), but this would at least give more time to fix V properly. Ted, wcosta, if I rebase said patch against Valgrind 3.13.0, would you be able to test it easily on automation?

Flags: needinfo?(wcosta)

Flags: needinfo?(ted)

Wander Lairson Costa

Assignee

Comment 20

•

7 years ago

(In reply to Julian Seward [:jseward] from comment #19) > (In reply to Mike Hommey [:glandium] from comment #17) > > (In reply to Julian Seward [:jseward] from comment #16) > > > FTR, the underlying Valgrind bug is at > > > https://bugs.kde.org/show_bug.cgi?id=79362. > > Why not just take the fix from that bug? > > I commented more about it at https://bugs.kde.org/show_bug.cgi?id=79362#c59. > I'm fairly sure that fix is wrong -- not wrong in some small detail, but > broken at an architectural/concept level, so as to give wrong code-address > to function-name/line-number mappings in some circumstances. > > That said, from the sound of comments in that bug, the fix actually works > fairly well. As a short term workaround, we could perhaps use a version > of Valgrind with a rebased version of the patch, so as to unblock this > bug (1382280, I mean). I wouldn't want the patch upstream in V (if it > indeed turns out to be wrong), but this would at least give more time to > fix V properly. > > Ted, wcosta, if I rebase said patch against Valgrind 3.13.0, would you be > able to test it easily on automation? Yep, I think so. Attach it to the bug and NI me and I will test it.

Flags: needinfo?(wcosta)

Gregory Szorc [:gps]

Comment 21

•

7 years ago

(In reply to Mike Hommey [:glandium] from comment #18) > Alternatively, can we prevent mesa from being upgraded by whatever docker > image changes are happening? Yes. We've had to pin packages in Docker images before to avoid unwanted regressions like this. IMO we should split that fix into its own bug and just do it. We can track Valgrind upgrade separately. But let's not let it block restoring some resemblance of sanity to the desktop-build Docker images (which have essentially been broken for a year because of this wonky issue).

Wander Lairson Costa

Assignee

Comment 22

•

7 years ago

(In reply to Gregory Szorc [:gps] from comment #21) > (In reply to Mike Hommey [:glandium] from comment #18) > > Alternatively, can we prevent mesa from being upgraded by whatever docker > > image changes are happening? > > Yes. We've had to pin packages in Docker images before to avoid unwanted > regressions like this. > > IMO we should split that fix into its own bug and just do it. We can track > Valgrind upgrade separately. But let's not let it block restoring some > resemblance of sanity to the desktop-build Docker images (which have > essentially been broken for a year because of this wonky issue). I actually have local patches that freeze the docker image used by valgrind [1]. It won't solve the issues with valgrind (not even upgrading it), but at least won't stop us from upgrading images for other reasons. [1] https://treeherder.allizom.org/#/jobs?repo=try&author=wcosta@mozilla.com&selectedJob=111055993

Mike Hommey [:glandium]

Comment 23

•

7 years ago

(In reply to Wander Lairson Costa [:wcosta] from comment #22) > (In reply to Gregory Szorc [:gps] from comment #21) > > (In reply to Mike Hommey [:glandium] from comment #18) > > > Alternatively, can we prevent mesa from being upgraded by whatever docker > > > image changes are happening? > > > > Yes. We've had to pin packages in Docker images before to avoid unwanted > > regressions like this. > > > > IMO we should split that fix into its own bug and just do it. We can track > > Valgrind upgrade separately. But let's not let it block restoring some > > resemblance of sanity to the desktop-build Docker images (which have > > essentially been broken for a year because of this wonky issue). > > I actually have local patches that freeze the docker image used by valgrind > [1]. It won't solve the issues with valgrind (not even upgrading it), but at > least won't stop us from upgrading images for other reasons. Actually, it doesn't really freeze. It only not-applies the home->builds change ; which means all the other problems that have prevented docker-images from being updated still are problems. Also, AIUI, we now don't really have a reason to do the home->builds change anymore.

Julian Seward [:jseward]

Comment 24

•

7 years ago

(In reply to Wander Lairson Costa [:wcosta] from comment #22) > I actually have local patches that freeze the docker image used by valgrind > [1]. It won't solve the issues with valgrind (not even upgrading it), but at > least won't stop us from upgrading images for other reasons. I think it would be wise for you to move ahead with that plan. I am working on fixing the underlying Valgrind problem, but it is complex, since it requires giving V a model of debuginfo that changes over the life of a process, that it didn't have before. It will take me some days to get that working. The workaround fix I mentioned in comment 19 is too out of date to be usable.

(not currently active) Ted Mielczarek

Reporter

Comment 25

•

7 years ago

> Also, AIUI, we now don't really have a reason to do the home->builds change anymore. Right--in light of bug 1383841, maybe we should abandon that patch series? It sounds like that fixed the perf issue in a different way. The Talos improvements on the bug speak to that. We should still figure out a fix for this bug *somehow*, whether it's pinning the version of mesa or getting a Valgrind fix that lets us suppress the new leak, since we've hit this many times when trying to make changes to desktop-build. I feel like we've tracked the root cause of that bustage down and we should fix it instead of leaving it to bite us again later.

Flags: needinfo?(ted)

Julian Seward [:jseward]

Comment 26

•

7 years ago

(In reply to Wander Lairson Costa [:wcosta] from comment #20) > (In reply to Julian Seward [:jseward] from comment #19) > Yep, I think so. Attach it to the bug and NI me and I will test it. Wander, there is now an initial patch available at https://bugs.kde.org/show_bug.cgi?id=79362#c62. It only works for the Memcheck tool, but that's all you need. The patch applies against the current stable release (3.13.0) and probably against the trunk, although I haven't tried that. To use it you need to add the flag --keep-debuginfo=yes to |valgrind_args| in our tree build/valgrind/mach_commands.py. I would be interested to hear if it helps. I am working to make a more complete and better tested patch, but I think this is functionally what we need.

Wander Lairson Costa

Assignee

Comment 27

•

7 years ago

(In reply to Julian Seward [:jseward] from comment #26) > (In reply to Wander Lairson Costa [:wcosta] from comment #20) > > (In reply to Julian Seward [:jseward] from comment #19) > > Yep, I think so. Attach it to the bug and NI me and I will test it. > > Wander, there is now an initial patch available at > https://bugs.kde.org/show_bug.cgi?id=79362#c62. > It only works for the Memcheck tool, but that's all you need. The > patch applies against the current stable release (3.13.0) and probably > against the trunk, although I haven't tried that. > > To use it you need to add the flag --keep-debuginfo=yes to |valgrind_args| > in our tree build/valgrind/mach_commands.py. I would be interested to hear > if it helps. I am working to make a more complete and better tested patch, > but I think this is functionally what we need. Julian, thanks a lot, will work on applying it and test :-)

Wander Lairson Costa

Assignee

Comment 28

•

7 years ago

I have good and bad news. The good news is that the patch was successfully applied to Valgrind and it now handle unloaded libraries successfully. The bad news is that we only get partial symbol resolution in stack traces, for no clear reason [1]. From logs, I can see Valgrind can't find symbols for three objects: libICE, libasyncns and libogg. CentOS 6 doesn't provide debug info packages for these libraries (don't ask me why). Even so, I have no idea if it is from these libraries the unresolved addresses come from. Julian, any hint from Valgrind logs I can't see? [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d4f241526d0&selectedJob=121050142

Flags: needinfo?(jseward)

Mike Hommey [:glandium]

Comment 29

•

7 years ago

More than those libraries, it's not finding the symbols for /usr/lib64/libLLVM-3.6-mesa.so

Wander Lairson Costa

Assignee

Comment 30

•

7 years ago

There symbol file is there, but for some reason the address is not resolved. --42440-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so --42435-- Reading syms from /lib64/libnss_files-2.12.so --42435-- Considering /usr/lib/debug/.build-id/36/b3bd68cc86738a5131b35566f022ef2004b181.debug .. --42435-- .. build-id is valid --42440-- Considering /usr/lib/debug/.build-id/ec/5d9d345c2d547a19279d0ee052308bb1c77948.debug .. --42440-- .. build-id is valid

Julian Seward [:jseward]

Comment 31

•

7 years ago

It looks as if the patch worked as I intended: --42440-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so .. --42440-- Archiving syms at 0x16b44ce0-0x17802bb6 in /usr/lib64/libLLVM-3.6-mesa.so due to munmap() And there are now some usable-ish suppressions offered for all four leaks. I am unsure why unwinding of the leak stacks ends like this by 0x17589914: _GLOBAL__sub_I_Timer.cpp (Timer.cpp:389) by 0x17802BA5: ??? (in /usr/lib64/libLLVM-3.6-mesa.so) // garbage after this point by 0x16B43CAA: ??? It may be that _GLOBAL__sub_I_Timer.cpp (a strange name for a function) is a global static initialiser and therefore very near the bottom of the stack. And that 0x17802BA5 is an unlabelled address in libLLVM-3.6-mesa.so. Without more information I can't tell. It seems to me we have two possible courses of action: (1) investigate a bit more why the stacks are still bad (2) use what we have and suppress these four leaks We could possibly do (1) by doing another run with more debugging flags, provided you can get me the matching libLLVM-3.6-mesa.so to check against (and I mean). If you want to do that, I suggest: re-run with the flags -v -v --sym-offsets=yes (currently you have -v and -d; -d isn't much use here) and send me libLLVM-3.6-mesa.so Unfortunately that will create a log file of somewhere between 100 and 500 MB (I guess) since -v -v causes it to also log every malloc and free done by Firefox. > can't find symbols for three objects: libICE, libasyncns and > libogg. CentOS 6 doesn't provide debug info packages for these libraries That's bizarre. Are you 100% sure?

Flags: needinfo?(jseward)

Julian Seward [:jseward]

Comment 32

•

7 years ago

(In reply to Julian Seward [:jseward] from comment #31) > provided you can get me the matching libLLVM-3.6-mesa.so to check against > (and I mean). If you want to do that, I suggest: Ach, I should not do bugmail uncaffeinated. I meant to write "and I mean, the *exact same* libLLVM.so".

Wander Lairson Costa

Assignee

Comment 33

•

7 years ago

(In reply to Julian Seward [:jseward] from comment #31) > It looks as if the patch worked as I intended: > > --42440-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so > .. > --42440-- Archiving syms at 0x16b44ce0-0x17802bb6 in > /usr/lib64/libLLVM-3.6-mesa.so due to munmap() > > And there are now some usable-ish suppressions offered for all four leaks. > > I am unsure why unwinding of the leak stacks ends like this > > by 0x17589914: _GLOBAL__sub_I_Timer.cpp (Timer.cpp:389) > by 0x17802BA5: ??? (in /usr/lib64/libLLVM-3.6-mesa.so) > // garbage after this point > by 0x16B43CAA: ??? > > It may be that _GLOBAL__sub_I_Timer.cpp (a strange name for a function) is a > global static initialiser and therefore very near the bottom of the stack. Yeah probably, it comes from libclang [root@e8a0c6d350fa ~]# grep -r --include="*.so.[0-9].*" _GLOBAL__sub_I_Timer / Binary file /home/worker/workspace/build/src/clang/lib/libclang.so.3.9 matches I guess it is derived from the __FILE__ macro. > And that 0x17802BA5 is an unlabelled address in libLLVM-3.6-mesa.so. > Without more information I can't tell. > > It seems to me we have two possible courses of action: > > (1) investigate a bit more why the stacks are still bad > > (2) use what we have and suppress these four leaks > > We could possibly do (1) by doing another run with more debugging flags, > provided you can get me the matching libLLVM-3.6-mesa.so to check against > (and I mean). If you want to do that, I suggest: > > re-run with the flags -v -v --sym-offsets=yes > (currently you have -v and -d; -d isn't much use here) > > and send me libLLVM-3.6-mesa.so > I am running it here https://treeherder.mozilla.org/#/jobs?repo=try&revision=af702626bf03781a26c7b4c77db070ebac237117 libLLVM-3.6-mesa.so should be available as an artifact when the task finishes. > Unfortunately that will create a log file of somewhere between 100 and 500 > MB (I guess) since -v -v causes it to also log every malloc and free done by > Firefox. > > > can't find symbols for three objects: libICE, libasyncns and > > libogg. CentOS 6 doesn't provide debug info packages for these libraries > > That's bizarre. Are you 100% sure? Yes, I looked manually for debuginfo packages for these libs and I can confirm they are not available (or they are very well hidden).

Julian Seward [:jseward]

Comment 34

•

7 years ago

(In reply to Wander Lairson Costa [:wcosta] from comment #33) > I am running it here > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=af702626bf03781a26c7b4c77db070ebac237117 > libLLVM-3.6-mesa.so should be available as an artifact Thanks. I retrieved both the log and library successfully. To recap, we have both symbolisation and unwind failures in libLLVM-3.6-mesa.so: 8 bytes in 1 blocks are definitely lost in loss record 28 of 431 at 0x4C29DB5: operator new(unsigned long)+135 (vg_replace_malloc.c:332) by 0x1758608D: void* llvm::object_creator<std::string>()+13 (ManagedStatic.h:26) by 0x17568642: llvm::ManagedStaticBase::RegisterManagedStatic(void* (*)(), void (*)(void*)) const+114 (ManagedStatic.cpp:39) by 0x1758A914: operator* (ManagedStatic.h:68) by 0x1758A914: getLibSupportInfoOutputFilename (Timer.cpp:40) by 0x1758A914: __static_initialization_and_destruction_0 (Timer.cpp:54) by 0x1758A914: _GLOBAL__sub_I_Timer.cpp+628 (Timer.cpp:389) by 0x17803BA5: ??? (in /usr/lib64/libLLVM-3.6-mesa.so) From some digging, it looks as if libLLVM has neither symbol table nor Dwarf CFI unwind info that covers 0x17803BA5. That address in the unmapped object is 0xf21ba5: --42461-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so --42461-- svma 0x0000263ce0, avma 0x0016b45ce0 bias = 0x0016b45ce0 - 0x0000263ce0 = 0x168e2000 so 0x17803BA5 mapped = 0x17803BA5 - 0x168e2000 unmapped = 0xf21ba5 0xf21ba5 is not covered by any symbol table entry: (addr) (size) 0000000000f21ac0 18 FUNC GLOBAL DEFAULT 12 _ZNSt17bad_function_callD@@libLLVM-3.6-mesa.so 0000000000f21bb8 0 FUNC GLOBAL DEFAULT 13 _fini@@libLLVM-3.6-mesa.so and it's also not covered by any CFI entry: 00127fc0 0000000000000014 00127fc4 FDE cie=00000000 pc=0000000000f21b60..0000000000f21b7f LOC CFA ra 0000000000f21b60 rsp+8 c-8 0000000000f21b64 rsp+32 c-8 0000000000f21b7e rsp+8 c-8 00127fd8 ZERO terminator (end of the CFI info) What's strange is that the address 0xf21ba5 is in clearly-valid code (see attachment), and is even in a call instruction, as expected. But it looks like some boilerplate for PLT linkage or other "not standard" code, so perhaps this is the result either of a LLVM bug or some CentOS/libLLVM build system bug that causes this code to be unlabelled. In any case I don't think there's much we can do about it.

Julian Seward [:jseward]

Comment 35

•

7 years ago

Attached file Disassembly pertaining to comment 34 — Details

Julian Seward [:jseward]

Comment 36

•

7 years ago

Attached patch bug1382280-1.diff (obsolete) — Details — Splinter Review

Wanda, given comment 33, can you try with this patch? To be clear: * m-c + this patch, no other changes * using the patched valgrind 3.13.0 that you already have (from comment 26)

Wander Lairson Costa

Assignee

Comment 37

•

7 years ago

(In reply to Julian Seward [:jseward] from comment #36) > Created attachment 8894456 [details] [diff] [review] > bug1382280-1.diff > > Wanda, given comment 33, can you try with this patch? To be clear: > > * m-c + this patch, no other changes > > * using the patched valgrind 3.13.0 that you already have (from comment 26) It feels like the suppression file worked: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f856bd0c301c984ecbd231c0d51b451266aaeb9

Julian Seward [:jseward]

Comment 38

•

7 years ago

(In reply to Wander Lairson Costa [:wcosta] from comment #37) > It feels like the suppression file worked: Good. Are you OK to move forward with this set of fixes? If yes, what needs to happen? I can get the suppression and command line fixes in comment 36 reviewed. Can you make the patched Valgrind "permanent"?

Flags: needinfo?(wcosta)

Wander Lairson Costa

Assignee

Comment 39

•

7 years ago

(In reply to Julian Seward [:jseward] from comment #38) > (In reply to Wander Lairson Costa [:wcosta] from comment #37) > > It feels like the suppression file worked: > > Good. Are you OK to move forward with this set of fixes? Yep > If yes, what needs to happen? Well, your and mine patch need to land in the same patch set. > I can get the suppression and command line fixes in comment 36 reviewed. Can you make the patched Valgrind "permanent"? Yep, it is already so, the patch is ready for review. I can push your push together with mine, just tell me who should review it.

Flags: needinfo?(wcosta)

Wander Lairson Costa

Assignee

Comment 40

•

7 years ago

Attached patch part 1: Upgrade valgrind r=ted (obsolete) — Details — Splinter Review

Bug 1338651 was backed out because when building a newer image, there was a valgrind leak report that couldn't resolve symbols. Further investigation show the valgrind package installed had symbols stripped. We upgrade valgrind version and build it from source with symbols. We had to build inside the docker image because we need to run "make install". Using "make dist" to generate a tar ball will also run "make docs", and it is hard to make it work because of the outdated texlive package present in CentOS 6.

Wander Lairson Costa

Assignee

Comment 41

•

7 years ago

Attached patch part 2: Install debug information. r=ted (obsolete) — Details — Splinter Review

valgrind test will try to load debug information for the modules present in a stack trace. If it fails to do it, we endup with a stack trace with only memory addresses. We recursively look for all libs in the system common locations, and try to install the corresponding debug information package. This is acomplished with debuginfo-install yum utility script.

Wander Lairson Costa

Assignee

Updated

•

7 years ago

Attachment #8894876 - Flags: review?(ted)

Wander Lairson Costa

Assignee

Updated

•

7 years ago

Attachment #8894877 - Flags: review?(ted)

Wander Lairson Costa

Assignee

Updated

•

7 years ago

Attachment #8894876 - Attachment is obsolete: true

Attachment #8894876 - Flags: review?(ted)

Wander Lairson Costa

Assignee

Updated

•

7 years ago

Attachment #8894877 - Attachment is obsolete: true

Attachment #8894877 - Flags: review?(ted)

Wander Lairson Costa

Assignee

Comment 42

•

7 years ago

Attached patch part 1: Upgrade valgrind r=ted — Details — Splinter Review

Bug 1338651 was backed out because when building a newer image, there was a valgrind leak report that couldn't resolve symbols. Further investigation showed the valgrind package installed had symbols stripped. We upgrade valgrind version and build it from source with symbols. We had to build inside the docker image because we need to run "make install". Using "make dist" to generate a tar ball will also run "make docs", and it is hard to make it work because of the outdated texlive package present in CentOS 6. We also apply a patch [1] to valgrind correctly generate symbols for unloaded objects. [1] https://bugs.kde.org/show_bug.cgi?id=79362#c62

Wander Lairson Costa

Assignee

Comment 43

•

7 years ago

Attached patch part 2: Install debug information. r=ted — Details — Splinter Review

valgrind test will try to load debug information for the modules present in a stack trace. If it fails to do it, we endup with a stack trace with only memory addresses. We install debuginfo for all installed packages and look for all libs in the system common locations, and try to install the corresponding debug information package. These are acomplished with debuginfo-install yum utility script.

Wander Lairson Costa

Assignee

Updated

•

7 years ago

Attachment #8894882 - Flags: review?(ted)

Wander Lairson Costa

Assignee

Updated

•

7 years ago

Attachment #8894883 - Flags: review?(ted)

Julian Seward [:jseward]

Comment 44

•

7 years ago

Attached patch part 3: Update Valgrind suppressions and command line flags. r=n.nethercote. — Details — Splinter Review

Attachment #8894456 - Attachment is obsolete: true

Julian Seward [:jseward]

Updated

•

7 years ago

Attachment #8894914 - Flags: review?(n.nethercote)

Nicholas Nethercote [inactive]

Comment 45

•

7 years ago

Comment on attachment 8894914 [details] [diff] [review] part 3: Update Valgrind suppressions and command line flags. r=n.nethercote. Review of attachment 8894914 [details] [diff] [review]: ----------------------------------------------------------------- ::: build/valgrind/mach_commands.py @@ +118,5 @@ > '--fair-sched=yes', > + # Keep debuginfo after library unmap. See bug 1382280. > + '--keep-debuginfo=yes', > + # Reduce noise level on rustc and/or LLVM compiled code. > + # See bug 1365915 Nit: missing '.'

Attachment #8894914 - Flags: review?(n.nethercote) → review+

(not currently active) Ted Mielczarek

Reporter

Comment 46

•

7 years ago

Comment on attachment 8894882 [details] [diff] [review] part 1: Upgrade valgrind r=ted Review of attachment 8894882 [details] [diff] [review]: ----------------------------------------------------------------- ::: taskcluster/docker/centos6-build/system-setup.sh @@ +324,5 @@ > > +valgrind_version=3.13.0 > +tar -xjf valgrind-$valgrind_version.tar.bz2 > +cd valgrind-$valgrind_version > +patch -p0 < /tmp/valgrind-epochs.patch Can you put a note here about why we're using a patch? Maybe something like: # This patch by Julian Seward allows us to write a suppression for # a leak in a library that gets unloaded before shutdown.

Attachment #8894882 - Flags: review?(ted) → review+

(not currently active) Ted Mielczarek

Reporter

Comment 47

•

7 years ago

Comment on attachment 8894883 [details] [diff] [review] part 2: Install debug information. r=ted Review of attachment 8894883 [details] [diff] [review]: ----------------------------------------------------------------- I'm not going to actually try to review the install-debug-symbols.sh script, I trust that it works. I wish that process was simpler. :-( ::: taskcluster/docker/recipes/install-debug-symbols.sh @@ +1,4 @@ > +#!/bin/bash > +# This Source Code Form is subject to the terms of the Mozilla Public > +# License, v. 2.0. If a copy of the MPL was not distributed with this > +# file, You can obtain one at http://mozilla.org/MPL/2.0/. Might not hurt to name this 'centos-install-debug-symbols.sh' since it's in the generic 'recipes' directory.

Attachment #8894883 - Flags: review?(ted) → review+

Julian Seward [:jseward]

Comment 48

•

7 years ago

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #46) > Can you put a note here about why we're using a patch? Maybe something like: > # This patch by Julian Seward allows us to write a suppression for > # a leak in a library that gets unloaded before shutdown. Plus maybe a link to the upstream bug report https://bugs.kde.org/show_bug.cgi?id=79362 ?

Pulsebot

Comment 49

•

7 years ago

Pushed by wcosta@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/f020b3ad85f5 part 1: Upgrade valgrind r=ted https://hg.mozilla.org/integration/mozilla-inbound/rev/70fc0b62fc25 part 2: Install debug information. r=ted https://hg.mozilla.org/integration/mozilla-inbound/rev/878f3ec362ff Update the version of Valgrind in the desktop-build image to one with unstripped shared libraries (part 3). r=n.nethercote.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 50

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/f020b3ad85f5 https://hg.mozilla.org/mozilla-central/rev/70fc0b62fc25 https://hg.mozilla.org/mozilla-central/rev/878f3ec362ff

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

status-firefox57: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla57

Gregory Szorc [:gps]

Updated

•

7 years ago

Blocks: 1389720

BMO Automation

Updated

•

7 years ago

Product: Core → Firefox Build System

Disassembly pertaining to comment 34 7 years ago Julian Seward [:jseward] 5.64 KB, text/plain		Details
bug1382280-1.diff 7 years ago Julian Seward [:jseward] 2.18 KB, patch		Details \| Diff \| Splinter Review
part 1: Upgrade valgrind r=ted 7 years ago Wander Lairson Costa 200.30 KB, patch		Details \| Diff \| Splinter Review
part 2: Install debug information. r=ted 7 years ago Wander Lairson Costa 6.03 KB, patch		Details \| Diff \| Splinter Review
part 1: Upgrade valgrind r=ted 7 years ago Wander Lairson Costa 200.43 KB, patch	ted : review+	Details \| Diff \| Splinter Review
part 2: Install debug information. r=ted 7 years ago Wander Lairson Costa 6.06 KB, patch	ted : review+	Details \| Diff \| Splinter Review
part 3: Update Valgrind suppressions and command line flags. r=n.nethercote. 7 years ago Julian Seward [:jseward] 2.73 KB, patch	n.nethercote : review+	Details \| Diff \| Splinter Review