Closed Bug 1382280 Opened 7 years ago Closed 7 years ago

Update the version of Valgrind in the desktop-build image to one with unstripped shared libraries

Categories

(Firefox Build System :: General, enhancement)

enhancement
Not set
normal

Tracking

(firefox57 fixed)

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: ted, Assigned: wcosta)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 3 obsolete files)

bug 1338651 is hung up on some new Valgrind failures. bug 1338651 comment 166 shows an unhelpful stack: > ==42858== 16 bytes in 1 blocks are definitely lost in loss record 46 of 256 > ==42858== at 0x4C2B525: operator new(unsigned long) (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) > ==42858== by 0x17509424: ??? > ==42858== by 0x1779BBA5: ??? > ==42858== by 0x16ADCCAA: ??? > ==42858== by 0xFFF0001B2: ??? > ==42858== by 0x772F72656B726F76: ??? > ==42858== by 0x65636170736B726E: ??? > ==42858== by 0x732F646C6975622E: ??? As it turns out, the version of valgrind we've been using for the valgrind *build* jobs (in desktop-test) is some rpm of uncertain provenence (I didn't dig too far): https://dxr.mozilla.org/mozilla-central/rev/1b065ffd8a535a0ad4c39a912af18e948e6a42c1/taskcluster/docker/centos6-build/system-setup.sh#302 sewardj noted that elsewhere in the log Valgrind complained that vgpreload_memcheck-amd64-linux.so was stripped, and when I downloaded the RPM from tooltool and unpacked it, sure enough: $ file usr/lib64/valgrind/vgpreload_core-amd64-linux.so usr/lib64/valgrind/vgpreload_core-amd64-linux.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=6fc247b9ca7a6772443d9e7dc18b30065fa87a03, stripped It turns out that this breaks Valgrind's ability to unwind past things like `operator new`, which is why we get that garbage stack. For the valgrind-mochitest jobs, which run in the Ubuntu test images, we install a newer version of Valgrind from a tarball that sewardj built: https://dxr.mozilla.org/mozilla-central/rev/1b065ffd8a535a0ad4c39a912af18e948e6a42c1/taskcluster/docker/recipes/ubuntu1604-test-system-setup.sh#109 He made sure this build was not stripped, because of this issue. We need to, at the very least, build a new Valgrind for CentOS6 that is not stripped, and switch the centos6-build system-setup to install that instead. That should let us get a useful stack out of this failure and either fix or suppress it. Extra niceties we could do: 1) Update to a much-newer Valgrind--that version is pretty old. sewardj recommended latest stable, which is 3.13.0 at this time. 2) Write a toolchain task to build Valgrind, instead of having someone build it manually and upload a tarball to tooltool. 3) Update the Valgrind version in the Ubuntu testers to match (also making a toolchain task for that binary would be nice).
3) doesn't seem useful. The result of 2 should be usable on ubuntu testers.
Assignee: nobody → wcosta
Status: NEW → ASSIGNED
Going to be KISS now, I wrote a proposal [1] to not only valgrind, but all things built inside taskcluster be automatically downloaded as an artifact, avoiding manual, tedious and error prone interventions. [1] https://github.com/taskcluster/taskcluster-rfcs/issues/84
*cough* *cough* work is under way for that in bug 1313111, where the actual task graph stuff is in bug 1374940, a review away from landing, and have patches mostly ready for clang, gcc, libdmg, hfsplug and cctools-port.
> and have patches and *I* have patches
oh, and sccache too. So really, for this bug, you should just care about creating a taskcluster job that builds valgrind.
Upgrading to V latest (3.13.0) is probably also important for bug 1365915.
Blocks: 1365915
(In reply to Mike Hommey [:glandium] from comment #1) > 3) doesn't seem useful. The result of 2 should be usable on ubuntu testers. Well, then making both Docker images install the same Valgrind build would be fine, but it's still useful to have a single version of Valgrind in use for all our CI.
(In reply to Wander Lairson Costa [:wcosta] from comment #2) > Going to be KISS now, I wrote a proposal [1] to not only valgrind, but all To be clear, it's 100% fine to just fix this bug as filed! I do think writing a toolchain task to do the Valgrind build in taskcluster would be a good payoff, as we've found with our other toolchain tasks, plus it sets things up to work nicely with glandium's work in bug 1313111.
I did rebuilt valgrind with full symbols, but *that* non symbol resolved case still happens... https://treeherder.mozilla.org/#/jobs?repo=try&revision=492a8578d6fae8042a42303f3960cf5b2024defa&selectedJob=116508990 I wonder if valgrind has some way to tell which module the stack frame comes from.
Old log (from comment 0): > ==42858== at 0x4C2B525: operator new(unsigned long) (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so) New log (from comment 9): > ==42927== at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332) So that *did* something, but it didn't fix the missing symbols. You can add -d to the flags passed to valgrind, and it will output some info that should end up interesting. I don't remember if there's something less verbose that would just display the mmapped stuff.
I ran a try push with -d [1] I noticed that the unresolved stack trace started in libLLVM-3.6-mesa.so: [snip] # ************************************ # Notice the address range here... # ************************************ --42915-- Discarding syms at 0x16b7bce0-0x17839bb6 in /usr/lib64/libLLVM-3.6-mesa.so due to munmap() [snip] TEST-UNEXPECTED-FAIL | valgrind-test | 8 bytes in 1 blocks are definitely lost at operator / ??? / ??? / ??? ==42915== 8 bytes in 1 blocks are definitely lost in loss record 29 of 256 ==42915== at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332) # ************************************ # ...these stack frames fall into it # ************************************ ==42915== by 0x4C29DB5: ??? ==42915== by 0x175BC08D: ??? ==42915== by 0x1759E642: ??? ==42915== by 0x175C0914: ??? ==42915== by 0x17839BA5: ??? ==42920== by 0x16B7ACAA: ??? [snip] I then looked for libLLVM-3.6-mesa in logs and found: --42920-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so --42920-- object doesn't have a symbol table Some internet search and I found that doing: debuginfo-install -y mesa-private-llvm Would install debug information for this library (also found out that my former method to install all system debug symbols was flawed). Another try push [2]: --42915-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so --42915-- Considering /usr/lib/debug/.build-id/ec/5d9d345c2d547a19279d0ee052308bb1c77948.debug .. --42915-- .. build-id is valid But I still get the same behavior: [snip] --42915-- Discarding syms at 0x16b7cce0-0x1783abb6 in /usr/lib64/libLLVM-3.6-mesa.so due to munmap() [snip] ==42915== 8 bytes in 1 blocks are definitely lost in loss record 29 of 256 ==42915== at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332) ==42915== by 0x175BD08D: ??? ==42915== by 0x1759F642: ??? ==42915== by 0x175C1914: ??? ==42915== by 0x1783ABA5: ??? [snip] At the moment I am writing this comment, I have a new try push running with more debuginfo packages installed [3]. Will look at results tomorrow morning. Any ideas/suggestions? [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=7380e7a67f31&selectedJob=116649677 [2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=9d515e6abc8fe25324ead5663bbe03b7d36e6592&selectedJob=116750252 [3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=78fd821f493f
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(jseward)
(In reply to Wander Lairson Costa [:wcosta] from comment #11) [snip] > # ************************************ > # ...these stack frames fall into it > # ************************************ > ==42915== by 0x4C29DB5: ??? > ==42915== by 0x175BC08D: ??? > ==42915== by 0x1759E642: ??? > ==42915== by 0x175C0914: ??? > ==42915== by 0x17839BA5: ??? > ==42920== by 0x16B7ACAA: ??? Oops, the correct log stack trace is: ==42920== at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332) ==42920== by 0x175BC08D: ??? ==42920== by 0x1759E642: ??? ==42920== by 0x175C0914: ??? ==42920== by 0x17839BA5: ??? ==42920== by 0x16B7ACAA: ??? [snip]
(In reply to Wander Lairson Costa [:wcosta] from comment #11) > --42915-- Discarding syms at 0x16b7bce0-0x17839bb6 in > /usr/lib64/libLLVM-3.6-mesa.so due to munmap() This is an old (very old) limitation of Valgrind. When a library is dlclosed (which appears at a low level as the munmap() that is mentioned) then Valgrind throws away the debuginfo for the library immediately. As a result, you have correct stacks but they can no longer be symbolised because the relevant debuginfo has been discarded. There's no feasible workaround, apart from "don't unmap any library", which isn't practical because it's often not under our control. We've always put off fixing it because it would be a substantial chunk of work to redo how Valgrind manages debuginfo loading/unloading.
Flags: needinfo?(jseward)
Julian: any thoughts on how we can work around this to unblock this work? I'm guessing this is a leak in the Mesa LLVM internals, which is the sort of thing we'd normally just write a suppression for, but given the fact that it gets unloaded before shutdown we don't get a symbolicated stack there.
Flags: needinfo?(jseward)
One thing we might try is setting `LD_PRELOAD=/usr/lib64/libLLVM-3.6-mesa.so` in the environment for the Valgrind test run, to keep that shared library from being unloaded. You could do that here: https://dxr.mozilla.org/mozilla-central/rev/7d2e89fb92331d7e4296391213c1e63db628e046/build/valgrind/mach_commands.py#90 Also, after some googling, I would bet that getting a version of Mesa with this patch would fix this leak: https://github.com/mesa3d/mesa/commit/83c62597fc8eb38bf274fa1a3ca03c6adafb4bf9 (but that's idle speculation)
FTR, the underlying Valgrind bug is at https://bugs.kde.org/show_bug.cgi?id=79362. Per irc chat with Ted, I'm contemplating whether it's feasible to partially fix this, so that code addresses in objects that are unmapped are tagged with the object name. This would mean that for example, the offending stack trace, instead of appearing like this 8 bytes in 1 blocks are definitely lost in loss record 29 of 256 at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332) by 0x175B208D: ??? by 0x17594642: ??? by 0x175B6914: ??? by 0x1782FBA5: ??? would appear like this 8 bytes in 1 blocks are definitely lost in loss record 29 of 256 at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332) by 0x175B208D: (in libLLVM.so) by 0x17594642: (in libLLVM.so) by 0x175B6914: (in someOtherUnmappedLib.so) by 0x1782FBA5: (in etc.so) which would give at least some basis on which to write a suppression for the leak, since suppression stack frames can also match object names.
Flags: needinfo?(jseward)
(In reply to Julian Seward [:jseward] from comment #16) > FTR, the underlying Valgrind bug is at > https://bugs.kde.org/show_bug.cgi?id=79362. > Per irc chat with Ted, I'm contemplating whether it's feasible to partially > fix this, so that code addresses in objects that are unmapped are tagged > with the object name. This would mean that for example, the offending stack > trace, instead of appearing like this Why not just take the fix from that bug?
Flags: needinfo?(mh+mozilla)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #15) > One thing we might try is setting > `LD_PRELOAD=/usr/lib64/libLLVM-3.6-mesa.so` in the environment for the > Valgrind test run, to keep that shared library from being unloaded. You > could do that here: > https://dxr.mozilla.org/mozilla-central/rev/ > 7d2e89fb92331d7e4296391213c1e63db628e046/build/valgrind/mach_commands.py#90 > > Also, after some googling, I would bet that getting a version of Mesa with > this patch would fix this leak: > https://github.com/mesa3d/mesa/commit/ > 83c62597fc8eb38bf274fa1a3ca03c6adafb4bf9 > > (but that's idle speculation) Alternatively, can we prevent mesa from being upgraded by whatever docker image changes are happening?
(In reply to Mike Hommey [:glandium] from comment #17) > (In reply to Julian Seward [:jseward] from comment #16) > > FTR, the underlying Valgrind bug is at > > https://bugs.kde.org/show_bug.cgi?id=79362. > Why not just take the fix from that bug? I commented more about it at https://bugs.kde.org/show_bug.cgi?id=79362#c59. I'm fairly sure that fix is wrong -- not wrong in some small detail, but broken at an architectural/concept level, so as to give wrong code-address to function-name/line-number mappings in some circumstances. That said, from the sound of comments in that bug, the fix actually works fairly well. As a short term workaround, we could perhaps use a version of Valgrind with a rebased version of the patch, so as to unblock this bug (1382280, I mean). I wouldn't want the patch upstream in V (if it indeed turns out to be wrong), but this would at least give more time to fix V properly. Ted, wcosta, if I rebase said patch against Valgrind 3.13.0, would you be able to test it easily on automation?
Flags: needinfo?(wcosta)
Flags: needinfo?(ted)
(In reply to Julian Seward [:jseward] from comment #19) > (In reply to Mike Hommey [:glandium] from comment #17) > > (In reply to Julian Seward [:jseward] from comment #16) > > > FTR, the underlying Valgrind bug is at > > > https://bugs.kde.org/show_bug.cgi?id=79362. > > Why not just take the fix from that bug? > > I commented more about it at https://bugs.kde.org/show_bug.cgi?id=79362#c59. > I'm fairly sure that fix is wrong -- not wrong in some small detail, but > broken at an architectural/concept level, so as to give wrong code-address > to function-name/line-number mappings in some circumstances. > > That said, from the sound of comments in that bug, the fix actually works > fairly well. As a short term workaround, we could perhaps use a version > of Valgrind with a rebased version of the patch, so as to unblock this > bug (1382280, I mean). I wouldn't want the patch upstream in V (if it > indeed turns out to be wrong), but this would at least give more time to > fix V properly. > > Ted, wcosta, if I rebase said patch against Valgrind 3.13.0, would you be > able to test it easily on automation? Yep, I think so. Attach it to the bug and NI me and I will test it.
Flags: needinfo?(wcosta)
(In reply to Mike Hommey [:glandium] from comment #18) > Alternatively, can we prevent mesa from being upgraded by whatever docker > image changes are happening? Yes. We've had to pin packages in Docker images before to avoid unwanted regressions like this. IMO we should split that fix into its own bug and just do it. We can track Valgrind upgrade separately. But let's not let it block restoring some resemblance of sanity to the desktop-build Docker images (which have essentially been broken for a year because of this wonky issue).
(In reply to Gregory Szorc [:gps] from comment #21) > (In reply to Mike Hommey [:glandium] from comment #18) > > Alternatively, can we prevent mesa from being upgraded by whatever docker > > image changes are happening? > > Yes. We've had to pin packages in Docker images before to avoid unwanted > regressions like this. > > IMO we should split that fix into its own bug and just do it. We can track > Valgrind upgrade separately. But let's not let it block restoring some > resemblance of sanity to the desktop-build Docker images (which have > essentially been broken for a year because of this wonky issue). I actually have local patches that freeze the docker image used by valgrind [1]. It won't solve the issues with valgrind (not even upgrading it), but at least won't stop us from upgrading images for other reasons. [1] https://treeherder.allizom.org/#/jobs?repo=try&author=wcosta@mozilla.com&selectedJob=111055993
(In reply to Wander Lairson Costa [:wcosta] from comment #22) > (In reply to Gregory Szorc [:gps] from comment #21) > > (In reply to Mike Hommey [:glandium] from comment #18) > > > Alternatively, can we prevent mesa from being upgraded by whatever docker > > > image changes are happening? > > > > Yes. We've had to pin packages in Docker images before to avoid unwanted > > regressions like this. > > > > IMO we should split that fix into its own bug and just do it. We can track > > Valgrind upgrade separately. But let's not let it block restoring some > > resemblance of sanity to the desktop-build Docker images (which have > > essentially been broken for a year because of this wonky issue). > > I actually have local patches that freeze the docker image used by valgrind > [1]. It won't solve the issues with valgrind (not even upgrading it), but at > least won't stop us from upgrading images for other reasons. Actually, it doesn't really freeze. It only not-applies the home->builds change ; which means all the other problems that have prevented docker-images from being updated still are problems. Also, AIUI, we now don't really have a reason to do the home->builds change anymore.
(In reply to Wander Lairson Costa [:wcosta] from comment #22) > I actually have local patches that freeze the docker image used by valgrind > [1]. It won't solve the issues with valgrind (not even upgrading it), but at > least won't stop us from upgrading images for other reasons. I think it would be wise for you to move ahead with that plan. I am working on fixing the underlying Valgrind problem, but it is complex, since it requires giving V a model of debuginfo that changes over the life of a process, that it didn't have before. It will take me some days to get that working. The workaround fix I mentioned in comment 19 is too out of date to be usable.
> Also, AIUI, we now don't really have a reason to do the home->builds change anymore. Right--in light of bug 1383841, maybe we should abandon that patch series? It sounds like that fixed the perf issue in a different way. The Talos improvements on the bug speak to that. We should still figure out a fix for this bug *somehow*, whether it's pinning the version of mesa or getting a Valgrind fix that lets us suppress the new leak, since we've hit this many times when trying to make changes to desktop-build. I feel like we've tracked the root cause of that bustage down and we should fix it instead of leaving it to bite us again later.
Flags: needinfo?(ted)
(In reply to Wander Lairson Costa [:wcosta] from comment #20) > (In reply to Julian Seward [:jseward] from comment #19) > Yep, I think so. Attach it to the bug and NI me and I will test it. Wander, there is now an initial patch available at https://bugs.kde.org/show_bug.cgi?id=79362#c62. It only works for the Memcheck tool, but that's all you need. The patch applies against the current stable release (3.13.0) and probably against the trunk, although I haven't tried that. To use it you need to add the flag --keep-debuginfo=yes to |valgrind_args| in our tree build/valgrind/mach_commands.py. I would be interested to hear if it helps. I am working to make a more complete and better tested patch, but I think this is functionally what we need.
(In reply to Julian Seward [:jseward] from comment #26) > (In reply to Wander Lairson Costa [:wcosta] from comment #20) > > (In reply to Julian Seward [:jseward] from comment #19) > > Yep, I think so. Attach it to the bug and NI me and I will test it. > > Wander, there is now an initial patch available at > https://bugs.kde.org/show_bug.cgi?id=79362#c62. > It only works for the Memcheck tool, but that's all you need. The > patch applies against the current stable release (3.13.0) and probably > against the trunk, although I haven't tried that. > > To use it you need to add the flag --keep-debuginfo=yes to |valgrind_args| > in our tree build/valgrind/mach_commands.py. I would be interested to hear > if it helps. I am working to make a more complete and better tested patch, > but I think this is functionally what we need. Julian, thanks a lot, will work on applying it and test :-)
I have good and bad news. The good news is that the patch was successfully applied to Valgrind and it now handle unloaded libraries successfully. The bad news is that we only get partial symbol resolution in stack traces, for no clear reason [1]. From logs, I can see Valgrind can't find symbols for three objects: libICE, libasyncns and libogg. CentOS 6 doesn't provide debug info packages for these libraries (don't ask me why). Even so, I have no idea if it is from these libraries the unresolved addresses come from. Julian, any hint from Valgrind logs I can't see? [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d4f241526d0&selectedJob=121050142
Flags: needinfo?(jseward)
More than those libraries, it's not finding the symbols for /usr/lib64/libLLVM-3.6-mesa.so
There symbol file is there, but for some reason the address is not resolved. --42440-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so --42435-- Reading syms from /lib64/libnss_files-2.12.so --42435-- Considering /usr/lib/debug/.build-id/36/b3bd68cc86738a5131b35566f022ef2004b181.debug .. --42435-- .. build-id is valid --42440-- Considering /usr/lib/debug/.build-id/ec/5d9d345c2d547a19279d0ee052308bb1c77948.debug .. --42440-- .. build-id is valid
It looks as if the patch worked as I intended: --42440-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so .. --42440-- Archiving syms at 0x16b44ce0-0x17802bb6 in /usr/lib64/libLLVM-3.6-mesa.so due to munmap() And there are now some usable-ish suppressions offered for all four leaks. I am unsure why unwinding of the leak stacks ends like this by 0x17589914: _GLOBAL__sub_I_Timer.cpp (Timer.cpp:389) by 0x17802BA5: ??? (in /usr/lib64/libLLVM-3.6-mesa.so) // garbage after this point by 0x16B43CAA: ??? It may be that _GLOBAL__sub_I_Timer.cpp (a strange name for a function) is a global static initialiser and therefore very near the bottom of the stack. And that 0x17802BA5 is an unlabelled address in libLLVM-3.6-mesa.so. Without more information I can't tell. It seems to me we have two possible courses of action: (1) investigate a bit more why the stacks are still bad (2) use what we have and suppress these four leaks We could possibly do (1) by doing another run with more debugging flags, provided you can get me the matching libLLVM-3.6-mesa.so to check against (and I mean). If you want to do that, I suggest: re-run with the flags -v -v --sym-offsets=yes (currently you have -v and -d; -d isn't much use here) and send me libLLVM-3.6-mesa.so Unfortunately that will create a log file of somewhere between 100 and 500 MB (I guess) since -v -v causes it to also log every malloc and free done by Firefox. > can't find symbols for three objects: libICE, libasyncns and > libogg. CentOS 6 doesn't provide debug info packages for these libraries That's bizarre. Are you 100% sure?
Flags: needinfo?(jseward)
(In reply to Julian Seward [:jseward] from comment #31) > provided you can get me the matching libLLVM-3.6-mesa.so to check against > (and I mean). If you want to do that, I suggest: Ach, I should not do bugmail uncaffeinated. I meant to write "and I mean, the *exact same* libLLVM.so".
(In reply to Julian Seward [:jseward] from comment #31) > It looks as if the patch worked as I intended: > > --42440-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so > .. > --42440-- Archiving syms at 0x16b44ce0-0x17802bb6 in > /usr/lib64/libLLVM-3.6-mesa.so due to munmap() > > And there are now some usable-ish suppressions offered for all four leaks. > > I am unsure why unwinding of the leak stacks ends like this > > by 0x17589914: _GLOBAL__sub_I_Timer.cpp (Timer.cpp:389) > by 0x17802BA5: ??? (in /usr/lib64/libLLVM-3.6-mesa.so) > // garbage after this point > by 0x16B43CAA: ??? > > It may be that _GLOBAL__sub_I_Timer.cpp (a strange name for a function) is a > global static initialiser and therefore very near the bottom of the stack. Yeah probably, it comes from libclang [root@e8a0c6d350fa ~]# grep -r --include="*.so.[0-9].*" _GLOBAL__sub_I_Timer / Binary file /home/worker/workspace/build/src/clang/lib/libclang.so.3.9 matches I guess it is derived from the __FILE__ macro. > And that 0x17802BA5 is an unlabelled address in libLLVM-3.6-mesa.so. > Without more information I can't tell. > > It seems to me we have two possible courses of action: > > (1) investigate a bit more why the stacks are still bad > > (2) use what we have and suppress these four leaks > > We could possibly do (1) by doing another run with more debugging flags, > provided you can get me the matching libLLVM-3.6-mesa.so to check against > (and I mean). If you want to do that, I suggest: > > re-run with the flags -v -v --sym-offsets=yes > (currently you have -v and -d; -d isn't much use here) > > and send me libLLVM-3.6-mesa.so > I am running it here https://treeherder.mozilla.org/#/jobs?repo=try&revision=af702626bf03781a26c7b4c77db070ebac237117 libLLVM-3.6-mesa.so should be available as an artifact when the task finishes. > Unfortunately that will create a log file of somewhere between 100 and 500 > MB (I guess) since -v -v causes it to also log every malloc and free done by > Firefox. > > > can't find symbols for three objects: libICE, libasyncns and > > libogg. CentOS 6 doesn't provide debug info packages for these libraries > > That's bizarre. Are you 100% sure? Yes, I looked manually for debuginfo packages for these libs and I can confirm they are not available (or they are very well hidden).
(In reply to Wander Lairson Costa [:wcosta] from comment #33) > I am running it here > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=af702626bf03781a26c7b4c77db070ebac237117 > libLLVM-3.6-mesa.so should be available as an artifact Thanks. I retrieved both the log and library successfully. To recap, we have both symbolisation and unwind failures in libLLVM-3.6-mesa.so: 8 bytes in 1 blocks are definitely lost in loss record 28 of 431 at 0x4C29DB5: operator new(unsigned long)+135 (vg_replace_malloc.c:332) by 0x1758608D: void* llvm::object_creator<std::string>()+13 (ManagedStatic.h:26) by 0x17568642: llvm::ManagedStaticBase::RegisterManagedStatic(void* (*)(), void (*)(void*)) const+114 (ManagedStatic.cpp:39) by 0x1758A914: operator* (ManagedStatic.h:68) by 0x1758A914: getLibSupportInfoOutputFilename (Timer.cpp:40) by 0x1758A914: __static_initialization_and_destruction_0 (Timer.cpp:54) by 0x1758A914: _GLOBAL__sub_I_Timer.cpp+628 (Timer.cpp:389) by 0x17803BA5: ??? (in /usr/lib64/libLLVM-3.6-mesa.so) From some digging, it looks as if libLLVM has neither symbol table nor Dwarf CFI unwind info that covers 0x17803BA5. That address in the unmapped object is 0xf21ba5: --42461-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so --42461-- svma 0x0000263ce0, avma 0x0016b45ce0 bias = 0x0016b45ce0 - 0x0000263ce0 = 0x168e2000 so 0x17803BA5 mapped = 0x17803BA5 - 0x168e2000 unmapped = 0xf21ba5 0xf21ba5 is not covered by any symbol table entry: (addr) (size) 0000000000f21ac0 18 FUNC GLOBAL DEFAULT 12 _ZNSt17bad_function_callD@@libLLVM-3.6-mesa.so 0000000000f21bb8 0 FUNC GLOBAL DEFAULT 13 _fini@@libLLVM-3.6-mesa.so and it's also not covered by any CFI entry: 00127fc0 0000000000000014 00127fc4 FDE cie=00000000 pc=0000000000f21b60..0000000000f21b7f LOC CFA ra 0000000000f21b60 rsp+8 c-8 0000000000f21b64 rsp+32 c-8 0000000000f21b7e rsp+8 c-8 00127fd8 ZERO terminator (end of the CFI info) What's strange is that the address 0xf21ba5 is in clearly-valid code (see attachment), and is even in a call instruction, as expected. But it looks like some boilerplate for PLT linkage or other "not standard" code, so perhaps this is the result either of a LLVM bug or some CentOS/libLLVM build system bug that causes this code to be unlabelled. In any case I don't think there's much we can do about it.
Attached patch bug1382280-1.diff (obsolete) — Splinter Review
Wanda, given comment 33, can you try with this patch? To be clear: * m-c + this patch, no other changes * using the patched valgrind 3.13.0 that you already have (from comment 26)
(In reply to Julian Seward [:jseward] from comment #36) > Created attachment 8894456 [details] [diff] [review] > bug1382280-1.diff > > Wanda, given comment 33, can you try with this patch? To be clear: > > * m-c + this patch, no other changes > > * using the patched valgrind 3.13.0 that you already have (from comment 26) It feels like the suppression file worked: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f856bd0c301c984ecbd231c0d51b451266aaeb9
(In reply to Wander Lairson Costa [:wcosta] from comment #37) > It feels like the suppression file worked: Good. Are you OK to move forward with this set of fixes? If yes, what needs to happen? I can get the suppression and command line fixes in comment 36 reviewed. Can you make the patched Valgrind "permanent"?
Flags: needinfo?(wcosta)
(In reply to Julian Seward [:jseward] from comment #38) > (In reply to Wander Lairson Costa [:wcosta] from comment #37) > > It feels like the suppression file worked: > > Good. Are you OK to move forward with this set of fixes? Yep > If yes, what needs to happen? Well, your and mine patch need to land in the same patch set. > I can get the suppression and command line fixes in comment 36 reviewed. Can you make the patched Valgrind "permanent"? Yep, it is already so, the patch is ready for review. I can push your push together with mine, just tell me who should review it.
Flags: needinfo?(wcosta)
Attached patch part 1: Upgrade valgrind r=ted (obsolete) — Splinter Review
Bug 1338651 was backed out because when building a newer image, there was a valgrind leak report that couldn't resolve symbols. Further investigation show the valgrind package installed had symbols stripped. We upgrade valgrind version and build it from source with symbols. We had to build inside the docker image because we need to run "make install". Using "make dist" to generate a tar ball will also run "make docs", and it is hard to make it work because of the outdated texlive package present in CentOS 6.
valgrind test will try to load debug information for the modules present in a stack trace. If it fails to do it, we endup with a stack trace with only memory addresses. We recursively look for all libs in the system common locations, and try to install the corresponding debug information package. This is acomplished with debuginfo-install yum utility script.
Attachment #8894876 - Flags: review?(ted)
Attachment #8894877 - Flags: review?(ted)
Attachment #8894876 - Attachment is obsolete: true
Attachment #8894876 - Flags: review?(ted)
Attachment #8894877 - Attachment is obsolete: true
Attachment #8894877 - Flags: review?(ted)
Bug 1338651 was backed out because when building a newer image, there was a valgrind leak report that couldn't resolve symbols. Further investigation showed the valgrind package installed had symbols stripped. We upgrade valgrind version and build it from source with symbols. We had to build inside the docker image because we need to run "make install". Using "make dist" to generate a tar ball will also run "make docs", and it is hard to make it work because of the outdated texlive package present in CentOS 6. We also apply a patch [1] to valgrind correctly generate symbols for unloaded objects. [1] https://bugs.kde.org/show_bug.cgi?id=79362#c62
valgrind test will try to load debug information for the modules present in a stack trace. If it fails to do it, we endup with a stack trace with only memory addresses. We install debuginfo for all installed packages and look for all libs in the system common locations, and try to install the corresponding debug information package. These are acomplished with debuginfo-install yum utility script.
Attachment #8894882 - Flags: review?(ted)
Attachment #8894883 - Flags: review?(ted)
Attachment #8894914 - Flags: review?(n.nethercote)
Comment on attachment 8894914 [details] [diff] [review] part 3: Update Valgrind suppressions and command line flags. r=n.nethercote. Review of attachment 8894914 [details] [diff] [review]: ----------------------------------------------------------------- ::: build/valgrind/mach_commands.py @@ +118,5 @@ > '--fair-sched=yes', > + # Keep debuginfo after library unmap. See bug 1382280. > + '--keep-debuginfo=yes', > + # Reduce noise level on rustc and/or LLVM compiled code. > + # See bug 1365915 Nit: missing '.'
Attachment #8894914 - Flags: review?(n.nethercote) → review+
Comment on attachment 8894882 [details] [diff] [review] part 1: Upgrade valgrind r=ted Review of attachment 8894882 [details] [diff] [review]: ----------------------------------------------------------------- ::: taskcluster/docker/centos6-build/system-setup.sh @@ +324,5 @@ > > +valgrind_version=3.13.0 > +tar -xjf valgrind-$valgrind_version.tar.bz2 > +cd valgrind-$valgrind_version > +patch -p0 < /tmp/valgrind-epochs.patch Can you put a note here about why we're using a patch? Maybe something like: # This patch by Julian Seward allows us to write a suppression for # a leak in a library that gets unloaded before shutdown.
Attachment #8894882 - Flags: review?(ted) → review+
Comment on attachment 8894883 [details] [diff] [review] part 2: Install debug information. r=ted Review of attachment 8894883 [details] [diff] [review]: ----------------------------------------------------------------- I'm not going to actually try to review the install-debug-symbols.sh script, I trust that it works. I wish that process was simpler. :-( ::: taskcluster/docker/recipes/install-debug-symbols.sh @@ +1,4 @@ > +#!/bin/bash > +# This Source Code Form is subject to the terms of the Mozilla Public > +# License, v. 2.0. If a copy of the MPL was not distributed with this > +# file, You can obtain one at http://mozilla.org/MPL/2.0/. Might not hurt to name this 'centos-install-debug-symbols.sh' since it's in the generic 'recipes' directory.
Attachment #8894883 - Flags: review?(ted) → review+
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #46) > Can you put a note here about why we're using a patch? Maybe something like: > # This patch by Julian Seward allows us to write a suppression for > # a leak in a library that gets unloaded before shutdown. Plus maybe a link to the upstream bug report https://bugs.kde.org/show_bug.cgi?id=79362 ?
Pushed by wcosta@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/f020b3ad85f5 part 1: Upgrade valgrind r=ted https://hg.mozilla.org/integration/mozilla-inbound/rev/70fc0b62fc25 part 2: Install debug information. r=ted https://hg.mozilla.org/integration/mozilla-inbound/rev/878f3ec362ff Update the version of Valgrind in the desktop-build image to one with unstripped shared libraries (part 3). r=n.nethercote.
Blocks: 1389720
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: