Closed
Bug 1382280
Opened 7 years ago
Closed 7 years ago
Update the version of Valgrind in the desktop-build image to one with unstripped shared libraries
Categories
(Firefox Build System :: General, enhancement)
Firefox Build System
General
Tracking
(firefox57 fixed)
RESOLVED
FIXED
mozilla57
Tracking | Status | |
---|---|---|
firefox57 | --- | fixed |
People
(Reporter: ted, Assigned: wcosta)
References
(Blocks 1 open bug)
Details
Attachments
(4 files, 3 obsolete files)
5.64 KB,
text/plain
|
Details | |
200.43 KB,
patch
|
ted
:
review+
|
Details | Diff | Splinter Review |
6.06 KB,
patch
|
ted
:
review+
|
Details | Diff | Splinter Review |
2.73 KB,
patch
|
n.nethercote
:
review+
|
Details | Diff | Splinter Review |
bug 1338651 is hung up on some new Valgrind failures. bug 1338651 comment 166 shows an unhelpful stack:
> ==42858== 16 bytes in 1 blocks are definitely lost in loss record 46 of 256
> ==42858== at 0x4C2B525: operator new(unsigned long) (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
> ==42858== by 0x17509424: ???
> ==42858== by 0x1779BBA5: ???
> ==42858== by 0x16ADCCAA: ???
> ==42858== by 0xFFF0001B2: ???
> ==42858== by 0x772F72656B726F76: ???
> ==42858== by 0x65636170736B726E: ???
> ==42858== by 0x732F646C6975622E: ???
As it turns out, the version of valgrind we've been using for the valgrind *build* jobs (in desktop-test) is some rpm of uncertain provenence (I didn't dig too far):
https://dxr.mozilla.org/mozilla-central/rev/1b065ffd8a535a0ad4c39a912af18e948e6a42c1/taskcluster/docker/centos6-build/system-setup.sh#302
sewardj noted that elsewhere in the log Valgrind complained that vgpreload_memcheck-amd64-linux.so was stripped, and when I downloaded the RPM from tooltool and unpacked it, sure enough:
$ file usr/lib64/valgrind/vgpreload_core-amd64-linux.so
usr/lib64/valgrind/vgpreload_core-amd64-linux.so: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, BuildID[sha1]=6fc247b9ca7a6772443d9e7dc18b30065fa87a03, stripped
It turns out that this breaks Valgrind's ability to unwind past things like `operator new`, which is why we get that garbage stack.
For the valgrind-mochitest jobs, which run in the Ubuntu test images, we install a newer version of Valgrind from a tarball that sewardj built:
https://dxr.mozilla.org/mozilla-central/rev/1b065ffd8a535a0ad4c39a912af18e948e6a42c1/taskcluster/docker/recipes/ubuntu1604-test-system-setup.sh#109
He made sure this build was not stripped, because of this issue.
We need to, at the very least, build a new Valgrind for CentOS6 that is not stripped, and switch the centos6-build system-setup to install that instead. That should let us get a useful stack out of this failure and either fix or suppress it.
Extra niceties we could do:
1) Update to a much-newer Valgrind--that version is pretty old. sewardj recommended latest stable, which is 3.13.0 at this time.
2) Write a toolchain task to build Valgrind, instead of having someone build it manually and upload a tarball to tooltool.
3) Update the Valgrind version in the Ubuntu testers to match (also making a toolchain task for that binary would be nice).
Comment 1•7 years ago
|
||
3) doesn't seem useful. The result of 2 should be usable on ubuntu testers.
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → wcosta
Status: NEW → ASSIGNED
Assignee | ||
Comment 2•7 years ago
|
||
Going to be KISS now, I wrote a proposal [1] to not only valgrind, but all things built inside taskcluster be automatically downloaded as an artifact, avoiding manual, tedious and error prone interventions.
[1] https://github.com/taskcluster/taskcluster-rfcs/issues/84
Comment 3•7 years ago
|
||
*cough* *cough* work is under way for that in bug 1313111, where the actual task graph stuff is in bug 1374940, a review away from landing, and have patches mostly ready for clang, gcc, libdmg, hfsplug and cctools-port.
Comment 4•7 years ago
|
||
> and have patches
and *I* have patches
Comment 5•7 years ago
|
||
oh, and sccache too.
So really, for this bug, you should just care about creating a taskcluster job that builds valgrind.
Comment 6•7 years ago
|
||
Upgrading to V latest (3.13.0) is probably also important for bug 1365915.
Blocks: 1365915
Reporter | ||
Comment 7•7 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #1)
> 3) doesn't seem useful. The result of 2 should be usable on ubuntu testers.
Well, then making both Docker images install the same Valgrind build would be fine, but it's still useful to have a single version of Valgrind in use for all our CI.
Reporter | ||
Comment 8•7 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #2)
> Going to be KISS now, I wrote a proposal [1] to not only valgrind, but all
To be clear, it's 100% fine to just fix this bug as filed! I do think writing a toolchain task to do the Valgrind build in taskcluster would be a good payoff, as we've found with our other toolchain tasks, plus it sets things up to work nicely with glandium's work in bug 1313111.
Assignee | ||
Comment 9•7 years ago
|
||
I did rebuilt valgrind with full symbols, but *that* non symbol resolved case still happens...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=492a8578d6fae8042a42303f3960cf5b2024defa&selectedJob=116508990
I wonder if valgrind has some way to tell which module the stack frame comes from.
Comment 10•7 years ago
|
||
Old log (from comment 0):
> ==42858== at 0x4C2B525: operator new(unsigned long) (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
New log (from comment 9):
> ==42927== at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332)
So that *did* something, but it didn't fix the missing symbols. You can add -d to the flags passed to valgrind, and it will output some info that should end up interesting. I don't remember if there's something less verbose that would just display the mmapped stuff.
Assignee | ||
Comment 11•7 years ago
|
||
I ran a try push with -d [1]
I noticed that the unresolved stack trace started in libLLVM-3.6-mesa.so:
[snip]
# ************************************
# Notice the address range here...
# ************************************
--42915-- Discarding syms at 0x16b7bce0-0x17839bb6 in /usr/lib64/libLLVM-3.6-mesa.so due to munmap()
[snip]
TEST-UNEXPECTED-FAIL | valgrind-test | 8 bytes in 1 blocks are definitely lost at operator / ??? / ??? / ???
==42915== 8 bytes in 1 blocks are definitely lost in loss record 29 of 256
==42915== at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332)
# ************************************
# ...these stack frames fall into it
# ************************************
==42915== by 0x4C29DB5: ???
==42915== by 0x175BC08D: ???
==42915== by 0x1759E642: ???
==42915== by 0x175C0914: ???
==42915== by 0x17839BA5: ???
==42920== by 0x16B7ACAA: ???
[snip]
I then looked for libLLVM-3.6-mesa in logs and found:
--42920-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so
--42920-- object doesn't have a symbol table
Some internet search and I found that doing:
debuginfo-install -y mesa-private-llvm
Would install debug information for this library (also found out that my former method to install all system debug symbols was flawed).
Another try push [2]:
--42915-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so
--42915-- Considering /usr/lib/debug/.build-id/ec/5d9d345c2d547a19279d0ee052308bb1c77948.debug ..
--42915-- .. build-id is valid
But I still get the same behavior:
[snip]
--42915-- Discarding syms at 0x16b7cce0-0x1783abb6 in /usr/lib64/libLLVM-3.6-mesa.so due to munmap()
[snip]
==42915== 8 bytes in 1 blocks are definitely lost in loss record 29 of 256
==42915== at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332)
==42915== by 0x175BD08D: ???
==42915== by 0x1759F642: ???
==42915== by 0x175C1914: ???
==42915== by 0x1783ABA5: ???
[snip]
At the moment I am writing this comment, I have a new try push running with more debuginfo packages installed [3]. Will look at results tomorrow morning.
Any ideas/suggestions?
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=7380e7a67f31&selectedJob=116649677
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=9d515e6abc8fe25324ead5663bbe03b7d36e6592&selectedJob=116750252
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=78fd821f493f
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(jseward)
Assignee | ||
Comment 12•7 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #11)
[snip]
> # ************************************
> # ...these stack frames fall into it
> # ************************************
> ==42915== by 0x4C29DB5: ???
> ==42915== by 0x175BC08D: ???
> ==42915== by 0x1759E642: ???
> ==42915== by 0x175C0914: ???
> ==42915== by 0x17839BA5: ???
> ==42920== by 0x16B7ACAA: ???
Oops, the correct log stack trace is:
==42920== at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332)
==42920== by 0x175BC08D: ???
==42920== by 0x1759E642: ???
==42920== by 0x175C0914: ???
==42920== by 0x17839BA5: ???
==42920== by 0x16B7ACAA: ???
[snip]
Comment 13•7 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #11)
> --42915-- Discarding syms at 0x16b7bce0-0x17839bb6 in
> /usr/lib64/libLLVM-3.6-mesa.so due to munmap()
This is an old (very old) limitation of Valgrind. When a library is
dlclosed (which appears at a low level as the munmap() that is mentioned)
then Valgrind throws away the debuginfo for the library immediately. As a
result, you have correct stacks but they can no longer be symbolised because
the relevant debuginfo has been discarded.
There's no feasible workaround, apart from "don't unmap any library", which
isn't practical because it's often not under our control. We've always
put off fixing it because it would be a substantial chunk of work
to redo how Valgrind manages debuginfo loading/unloading.
Flags: needinfo?(jseward)
Reporter | ||
Comment 14•7 years ago
|
||
Julian: any thoughts on how we can work around this to unblock this work? I'm guessing this is a leak in the Mesa LLVM internals, which is the sort of thing we'd normally just write a suppression for, but given the fact that it gets unloaded before shutdown we don't get a symbolicated stack there.
Flags: needinfo?(jseward)
Reporter | ||
Comment 15•7 years ago
|
||
One thing we might try is setting `LD_PRELOAD=/usr/lib64/libLLVM-3.6-mesa.so` in the environment for the Valgrind test run, to keep that shared library from being unloaded. You could do that here:
https://dxr.mozilla.org/mozilla-central/rev/7d2e89fb92331d7e4296391213c1e63db628e046/build/valgrind/mach_commands.py#90
Also, after some googling, I would bet that getting a version of Mesa with this patch would fix this leak:
https://github.com/mesa3d/mesa/commit/83c62597fc8eb38bf274fa1a3ca03c6adafb4bf9
(but that's idle speculation)
Comment 16•7 years ago
|
||
FTR, the underlying Valgrind bug is at https://bugs.kde.org/show_bug.cgi?id=79362.
Per irc chat with Ted, I'm contemplating whether it's feasible to partially
fix this, so that code addresses in objects that are unmapped are tagged
with the object name. This would mean that for example, the offending stack
trace, instead of appearing like this
8 bytes in 1 blocks are definitely lost in loss record 29 of 256
at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332)
by 0x175B208D: ???
by 0x17594642: ???
by 0x175B6914: ???
by 0x1782FBA5: ???
would appear like this
8 bytes in 1 blocks are definitely lost in loss record 29 of 256
at 0x4C29DB5: operator new(unsigned long) (vg_replace_malloc.c:332)
by 0x175B208D: (in libLLVM.so)
by 0x17594642: (in libLLVM.so)
by 0x175B6914: (in someOtherUnmappedLib.so)
by 0x1782FBA5: (in etc.so)
which would give at least some basis on which to write a suppression
for the leak, since suppression stack frames can also match object names.
Flags: needinfo?(jseward)
Comment 17•7 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #16)
> FTR, the underlying Valgrind bug is at
> https://bugs.kde.org/show_bug.cgi?id=79362.
> Per irc chat with Ted, I'm contemplating whether it's feasible to partially
> fix this, so that code addresses in objects that are unmapped are tagged
> with the object name. This would mean that for example, the offending stack
> trace, instead of appearing like this
Why not just take the fix from that bug?
Flags: needinfo?(mh+mozilla)
Comment 18•7 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #15)
> One thing we might try is setting
> `LD_PRELOAD=/usr/lib64/libLLVM-3.6-mesa.so` in the environment for the
> Valgrind test run, to keep that shared library from being unloaded. You
> could do that here:
> https://dxr.mozilla.org/mozilla-central/rev/
> 7d2e89fb92331d7e4296391213c1e63db628e046/build/valgrind/mach_commands.py#90
>
> Also, after some googling, I would bet that getting a version of Mesa with
> this patch would fix this leak:
> https://github.com/mesa3d/mesa/commit/
> 83c62597fc8eb38bf274fa1a3ca03c6adafb4bf9
>
> (but that's idle speculation)
Alternatively, can we prevent mesa from being upgraded by whatever docker image changes are happening?
Comment 19•7 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #17)
> (In reply to Julian Seward [:jseward] from comment #16)
> > FTR, the underlying Valgrind bug is at
> > https://bugs.kde.org/show_bug.cgi?id=79362.
> Why not just take the fix from that bug?
I commented more about it at https://bugs.kde.org/show_bug.cgi?id=79362#c59.
I'm fairly sure that fix is wrong -- not wrong in some small detail, but
broken at an architectural/concept level, so as to give wrong code-address
to function-name/line-number mappings in some circumstances.
That said, from the sound of comments in that bug, the fix actually works
fairly well. As a short term workaround, we could perhaps use a version
of Valgrind with a rebased version of the patch, so as to unblock this
bug (1382280, I mean). I wouldn't want the patch upstream in V (if it
indeed turns out to be wrong), but this would at least give more time to
fix V properly.
Ted, wcosta, if I rebase said patch against Valgrind 3.13.0, would you be
able to test it easily on automation?
Flags: needinfo?(wcosta)
Flags: needinfo?(ted)
Assignee | ||
Comment 20•7 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #19)
> (In reply to Mike Hommey [:glandium] from comment #17)
> > (In reply to Julian Seward [:jseward] from comment #16)
> > > FTR, the underlying Valgrind bug is at
> > > https://bugs.kde.org/show_bug.cgi?id=79362.
> > Why not just take the fix from that bug?
>
> I commented more about it at https://bugs.kde.org/show_bug.cgi?id=79362#c59.
> I'm fairly sure that fix is wrong -- not wrong in some small detail, but
> broken at an architectural/concept level, so as to give wrong code-address
> to function-name/line-number mappings in some circumstances.
>
> That said, from the sound of comments in that bug, the fix actually works
> fairly well. As a short term workaround, we could perhaps use a version
> of Valgrind with a rebased version of the patch, so as to unblock this
> bug (1382280, I mean). I wouldn't want the patch upstream in V (if it
> indeed turns out to be wrong), but this would at least give more time to
> fix V properly.
>
> Ted, wcosta, if I rebase said patch against Valgrind 3.13.0, would you be
> able to test it easily on automation?
Yep, I think so. Attach it to the bug and NI me and I will test it.
Flags: needinfo?(wcosta)
Comment 21•7 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #18)
> Alternatively, can we prevent mesa from being upgraded by whatever docker
> image changes are happening?
Yes. We've had to pin packages in Docker images before to avoid unwanted regressions like this.
IMO we should split that fix into its own bug and just do it. We can track Valgrind upgrade separately. But let's not let it block restoring some resemblance of sanity to the desktop-build Docker images (which have essentially been broken for a year because of this wonky issue).
Assignee | ||
Comment 22•7 years ago
|
||
(In reply to Gregory Szorc [:gps] from comment #21)
> (In reply to Mike Hommey [:glandium] from comment #18)
> > Alternatively, can we prevent mesa from being upgraded by whatever docker
> > image changes are happening?
>
> Yes. We've had to pin packages in Docker images before to avoid unwanted
> regressions like this.
>
> IMO we should split that fix into its own bug and just do it. We can track
> Valgrind upgrade separately. But let's not let it block restoring some
> resemblance of sanity to the desktop-build Docker images (which have
> essentially been broken for a year because of this wonky issue).
I actually have local patches that freeze the docker image used by valgrind [1]. It won't solve the issues with valgrind (not even upgrading it), but at least won't stop us from upgrading images for other reasons.
[1] https://treeherder.allizom.org/#/jobs?repo=try&author=wcosta@mozilla.com&selectedJob=111055993
Comment 23•7 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #22)
> (In reply to Gregory Szorc [:gps] from comment #21)
> > (In reply to Mike Hommey [:glandium] from comment #18)
> > > Alternatively, can we prevent mesa from being upgraded by whatever docker
> > > image changes are happening?
> >
> > Yes. We've had to pin packages in Docker images before to avoid unwanted
> > regressions like this.
> >
> > IMO we should split that fix into its own bug and just do it. We can track
> > Valgrind upgrade separately. But let's not let it block restoring some
> > resemblance of sanity to the desktop-build Docker images (which have
> > essentially been broken for a year because of this wonky issue).
>
> I actually have local patches that freeze the docker image used by valgrind
> [1]. It won't solve the issues with valgrind (not even upgrading it), but at
> least won't stop us from upgrading images for other reasons.
Actually, it doesn't really freeze. It only not-applies the home->builds change ; which means all the other problems that have prevented docker-images from being updated still are problems.
Also, AIUI, we now don't really have a reason to do the home->builds change anymore.
Comment 24•7 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #22)
> I actually have local patches that freeze the docker image used by valgrind
> [1]. It won't solve the issues with valgrind (not even upgrading it), but at
> least won't stop us from upgrading images for other reasons.
I think it would be wise for you to move ahead with that plan. I am
working on fixing the underlying Valgrind problem, but it is complex,
since it requires giving V a model of debuginfo that changes over the
life of a process, that it didn't have before. It will take me some
days to get that working. The workaround fix I mentioned in comment 19
is too out of date to be usable.
Reporter | ||
Comment 25•7 years ago
|
||
> Also, AIUI, we now don't really have a reason to do the home->builds change anymore.
Right--in light of bug 1383841, maybe we should abandon that patch series? It sounds like that fixed the perf issue in a different way. The Talos improvements on the bug speak to that. We should still figure out a fix for this bug *somehow*, whether it's pinning the version of mesa or getting a Valgrind fix that lets us suppress the new leak, since we've hit this many times when trying to make changes to desktop-build. I feel like we've tracked the root cause of that bustage down and we should fix it instead of leaving it to bite us again later.
Flags: needinfo?(ted)
Comment 26•7 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #20)
> (In reply to Julian Seward [:jseward] from comment #19)
> Yep, I think so. Attach it to the bug and NI me and I will test it.
Wander, there is now an initial patch available at https://bugs.kde.org/show_bug.cgi?id=79362#c62.
It only works for the Memcheck tool, but that's all you need. The
patch applies against the current stable release (3.13.0) and probably
against the trunk, although I haven't tried that.
To use it you need to add the flag --keep-debuginfo=yes to |valgrind_args|
in our tree build/valgrind/mach_commands.py. I would be interested to hear
if it helps. I am working to make a more complete and better tested patch,
but I think this is functionally what we need.
Assignee | ||
Comment 27•7 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #26)
> (In reply to Wander Lairson Costa [:wcosta] from comment #20)
> > (In reply to Julian Seward [:jseward] from comment #19)
> > Yep, I think so. Attach it to the bug and NI me and I will test it.
>
> Wander, there is now an initial patch available at
> https://bugs.kde.org/show_bug.cgi?id=79362#c62.
> It only works for the Memcheck tool, but that's all you need. The
> patch applies against the current stable release (3.13.0) and probably
> against the trunk, although I haven't tried that.
>
> To use it you need to add the flag --keep-debuginfo=yes to |valgrind_args|
> in our tree build/valgrind/mach_commands.py. I would be interested to hear
> if it helps. I am working to make a more complete and better tested patch,
> but I think this is functionally what we need.
Julian, thanks a lot, will work on applying it and test :-)
Assignee | ||
Comment 28•7 years ago
|
||
I have good and bad news. The good news is that the patch was successfully applied to Valgrind and it now handle unloaded libraries successfully. The bad news is that we only get partial symbol resolution in stack traces, for no clear reason [1]. From logs, I can see Valgrind can't find symbols for three objects: libICE, libasyncns and libogg. CentOS 6 doesn't provide debug info packages for these libraries (don't ask me why). Even so, I have no idea if it is from these libraries the unresolved addresses come from.
Julian, any hint from Valgrind logs I can't see?
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=0d4f241526d0&selectedJob=121050142
Flags: needinfo?(jseward)
Comment 29•7 years ago
|
||
More than those libraries, it's not finding the symbols for /usr/lib64/libLLVM-3.6-mesa.so
Assignee | ||
Comment 30•7 years ago
|
||
There symbol file is there, but for some reason the address is not resolved.
--42440-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so
--42435-- Reading syms from /lib64/libnss_files-2.12.so
--42435-- Considering /usr/lib/debug/.build-id/36/b3bd68cc86738a5131b35566f022ef2004b181.debug ..
--42435-- .. build-id is valid
--42440-- Considering /usr/lib/debug/.build-id/ec/5d9d345c2d547a19279d0ee052308bb1c77948.debug ..
--42440-- .. build-id is valid
Comment 31•7 years ago
|
||
It looks as if the patch worked as I intended:
--42440-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so
..
--42440-- Archiving syms at 0x16b44ce0-0x17802bb6 in /usr/lib64/libLLVM-3.6-mesa.so due to munmap()
And there are now some usable-ish suppressions offered for all four leaks.
I am unsure why unwinding of the leak stacks ends like this
by 0x17589914: _GLOBAL__sub_I_Timer.cpp (Timer.cpp:389)
by 0x17802BA5: ??? (in /usr/lib64/libLLVM-3.6-mesa.so)
// garbage after this point
by 0x16B43CAA: ???
It may be that _GLOBAL__sub_I_Timer.cpp (a strange name for a function) is a
global static initialiser and therefore very near the bottom of the stack.
And that 0x17802BA5 is an unlabelled address in libLLVM-3.6-mesa.so.
Without more information I can't tell.
It seems to me we have two possible courses of action:
(1) investigate a bit more why the stacks are still bad
(2) use what we have and suppress these four leaks
We could possibly do (1) by doing another run with more debugging flags,
provided you can get me the matching libLLVM-3.6-mesa.so to check against
(and I mean). If you want to do that, I suggest:
re-run with the flags -v -v --sym-offsets=yes
(currently you have -v and -d; -d isn't much use here)
and send me libLLVM-3.6-mesa.so
Unfortunately that will create a log file of somewhere between 100 and 500
MB (I guess) since -v -v causes it to also log every malloc and free done by
Firefox.
> can't find symbols for three objects: libICE, libasyncns and
> libogg. CentOS 6 doesn't provide debug info packages for these libraries
That's bizarre. Are you 100% sure?
Flags: needinfo?(jseward)
Comment 32•7 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #31)
> provided you can get me the matching libLLVM-3.6-mesa.so to check against
> (and I mean). If you want to do that, I suggest:
Ach, I should not do bugmail uncaffeinated.
I meant to write "and I mean, the *exact same* libLLVM.so".
Assignee | ||
Comment 33•7 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #31)
> It looks as if the patch worked as I intended:
>
> --42440-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so
> ..
> --42440-- Archiving syms at 0x16b44ce0-0x17802bb6 in
> /usr/lib64/libLLVM-3.6-mesa.so due to munmap()
>
> And there are now some usable-ish suppressions offered for all four leaks.
>
> I am unsure why unwinding of the leak stacks ends like this
>
> by 0x17589914: _GLOBAL__sub_I_Timer.cpp (Timer.cpp:389)
> by 0x17802BA5: ??? (in /usr/lib64/libLLVM-3.6-mesa.so)
> // garbage after this point
> by 0x16B43CAA: ???
>
> It may be that _GLOBAL__sub_I_Timer.cpp (a strange name for a function) is a
> global static initialiser and therefore very near the bottom of the stack.
Yeah probably, it comes from libclang
[root@e8a0c6d350fa ~]# grep -r --include="*.so.[0-9].*" _GLOBAL__sub_I_Timer /
Binary file /home/worker/workspace/build/src/clang/lib/libclang.so.3.9 matches
I guess it is derived from the __FILE__ macro.
> And that 0x17802BA5 is an unlabelled address in libLLVM-3.6-mesa.so.
> Without more information I can't tell.
>
> It seems to me we have two possible courses of action:
>
> (1) investigate a bit more why the stacks are still bad
>
> (2) use what we have and suppress these four leaks
>
> We could possibly do (1) by doing another run with more debugging flags,
> provided you can get me the matching libLLVM-3.6-mesa.so to check against
> (and I mean). If you want to do that, I suggest:
>
> re-run with the flags -v -v --sym-offsets=yes
> (currently you have -v and -d; -d isn't much use here)
>
> and send me libLLVM-3.6-mesa.so
>
I am running it here https://treeherder.mozilla.org/#/jobs?repo=try&revision=af702626bf03781a26c7b4c77db070ebac237117
libLLVM-3.6-mesa.so should be available as an artifact when the task finishes.
> Unfortunately that will create a log file of somewhere between 100 and 500
> MB (I guess) since -v -v causes it to also log every malloc and free done by
> Firefox.
>
> > can't find symbols for three objects: libICE, libasyncns and
> > libogg. CentOS 6 doesn't provide debug info packages for these libraries
>
> That's bizarre. Are you 100% sure?
Yes, I looked manually for debuginfo packages for these libs and I can confirm they are not available (or they are very well hidden).
Comment 34•7 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #33)
> I am running it here
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=af702626bf03781a26c7b4c77db070ebac237117
> libLLVM-3.6-mesa.so should be available as an artifact
Thanks. I retrieved both the log and library successfully. To recap, we have
both symbolisation and unwind failures in libLLVM-3.6-mesa.so:
8 bytes in 1 blocks are definitely lost in loss record 28 of 431
at 0x4C29DB5: operator new(unsigned long)+135 (vg_replace_malloc.c:332)
by 0x1758608D: void* llvm::object_creator<std::string>()+13 (ManagedStatic.h:26)
by 0x17568642: llvm::ManagedStaticBase::RegisterManagedStatic(void* (*)(), void (*)(void*)) const+114 (ManagedStatic.cpp:39)
by 0x1758A914: operator* (ManagedStatic.h:68)
by 0x1758A914: getLibSupportInfoOutputFilename (Timer.cpp:40)
by 0x1758A914: __static_initialization_and_destruction_0 (Timer.cpp:54)
by 0x1758A914: _GLOBAL__sub_I_Timer.cpp+628 (Timer.cpp:389)
by 0x17803BA5: ??? (in /usr/lib64/libLLVM-3.6-mesa.so)
From some digging, it looks as if libLLVM has neither symbol table nor Dwarf CFI
unwind info that covers 0x17803BA5. That address in the unmapped object is 0xf21ba5:
--42461-- Reading syms from /usr/lib64/libLLVM-3.6-mesa.so
--42461-- svma 0x0000263ce0, avma 0x0016b45ce0
bias = 0x0016b45ce0 - 0x0000263ce0 = 0x168e2000
so 0x17803BA5 mapped = 0x17803BA5 - 0x168e2000 unmapped = 0xf21ba5
0xf21ba5 is not covered by any symbol table entry:
(addr) (size)
0000000000f21ac0 18 FUNC GLOBAL DEFAULT 12 _ZNSt17bad_function_callD@@libLLVM-3.6-mesa.so
0000000000f21bb8 0 FUNC GLOBAL DEFAULT 13 _fini@@libLLVM-3.6-mesa.so
and it's also not covered by any CFI entry:
00127fc0 0000000000000014 00127fc4 FDE cie=00000000 pc=0000000000f21b60..0000000000f21b7f
LOC CFA ra
0000000000f21b60 rsp+8 c-8
0000000000f21b64 rsp+32 c-8
0000000000f21b7e rsp+8 c-8
00127fd8 ZERO terminator (end of the CFI info)
What's strange is that the address 0xf21ba5 is in clearly-valid code (see
attachment), and is even in a call instruction, as expected. But it looks
like some boilerplate for PLT linkage or other "not standard" code, so
perhaps this is the result either of a LLVM bug or some CentOS/libLLVM build
system bug that causes this code to be unlabelled. In any case I don't
think there's much we can do about it.
Comment 35•7 years ago
|
||
Comment 36•7 years ago
|
||
Wanda, given comment 33, can you try with this patch? To be clear:
* m-c + this patch, no other changes
* using the patched valgrind 3.13.0 that you already have (from comment 26)
Assignee | ||
Comment 37•7 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #36)
> Created attachment 8894456 [details] [diff] [review]
> bug1382280-1.diff
>
> Wanda, given comment 33, can you try with this patch? To be clear:
>
> * m-c + this patch, no other changes
>
> * using the patched valgrind 3.13.0 that you already have (from comment 26)
It feels like the suppression file worked: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5f856bd0c301c984ecbd231c0d51b451266aaeb9
Comment 38•7 years ago
|
||
(In reply to Wander Lairson Costa [:wcosta] from comment #37)
> It feels like the suppression file worked:
Good. Are you OK to move forward with this set of fixes? If yes, what
needs to happen? I can get the suppression and command line fixes in
comment 36 reviewed. Can you make the patched Valgrind "permanent"?
Flags: needinfo?(wcosta)
Assignee | ||
Comment 39•7 years ago
|
||
(In reply to Julian Seward [:jseward] from comment #38)
> (In reply to Wander Lairson Costa [:wcosta] from comment #37)
> > It feels like the suppression file worked:
>
> Good. Are you OK to move forward with this set of fixes?
Yep
> If yes, what needs to happen?
Well, your and mine patch need to land in the same patch set.
> I can get the suppression and command line fixes in comment 36 reviewed. Can you make the patched Valgrind "permanent"?
Yep, it is already so, the patch is ready for review. I can push your push together with mine, just tell me who should review it.
Flags: needinfo?(wcosta)
Assignee | ||
Comment 40•7 years ago
|
||
Bug 1338651 was backed out because when building a newer image, there
was a valgrind leak report that couldn't resolve symbols. Further
investigation show the valgrind package installed had symbols stripped.
We upgrade valgrind version and build it from source with symbols.
We had to build inside the docker image because we need to run
"make install". Using "make dist" to generate a tar ball will also run
"make docs", and it is hard to make it work because of the outdated
texlive package present in CentOS 6.
Assignee | ||
Comment 41•7 years ago
|
||
valgrind test will try to load debug information for the modules present
in a stack trace. If it fails to do it, we endup with a stack trace with
only memory addresses.
We recursively look for all libs in the system common locations, and try
to install the corresponding debug information package.
This is acomplished with debuginfo-install yum utility script.
Assignee | ||
Updated•7 years ago
|
Attachment #8894876 -
Flags: review?(ted)
Assignee | ||
Updated•7 years ago
|
Attachment #8894877 -
Flags: review?(ted)
Assignee | ||
Updated•7 years ago
|
Attachment #8894876 -
Attachment is obsolete: true
Attachment #8894876 -
Flags: review?(ted)
Assignee | ||
Updated•7 years ago
|
Attachment #8894877 -
Attachment is obsolete: true
Attachment #8894877 -
Flags: review?(ted)
Assignee | ||
Comment 42•7 years ago
|
||
Bug 1338651 was backed out because when building a newer image, there
was a valgrind leak report that couldn't resolve symbols. Further
investigation showed the valgrind package installed had symbols stripped.
We upgrade valgrind version and build it from source with symbols.
We had to build inside the docker image because we need to run
"make install". Using "make dist" to generate a tar ball will also run
"make docs", and it is hard to make it work because of the outdated
texlive package present in CentOS 6.
We also apply a patch [1] to valgrind correctly generate symbols
for unloaded objects.
[1] https://bugs.kde.org/show_bug.cgi?id=79362#c62
Assignee | ||
Comment 43•7 years ago
|
||
valgrind test will try to load debug information for the modules present
in a stack trace. If it fails to do it, we endup with a stack trace with
only memory addresses.
We install debuginfo for all installed packages and look for all libs
in the system common locations, and try to install the corresponding
debug information package.
These are acomplished with debuginfo-install yum utility script.
Assignee | ||
Updated•7 years ago
|
Attachment #8894882 -
Flags: review?(ted)
Assignee | ||
Updated•7 years ago
|
Attachment #8894883 -
Flags: review?(ted)
Comment 44•7 years ago
|
||
Attachment #8894456 -
Attachment is obsolete: true
Updated•7 years ago
|
Attachment #8894914 -
Flags: review?(n.nethercote)
Comment 45•7 years ago
|
||
Comment on attachment 8894914 [details] [diff] [review]
part 3: Update Valgrind suppressions and command line flags. r=n.nethercote.
Review of attachment 8894914 [details] [diff] [review]:
-----------------------------------------------------------------
::: build/valgrind/mach_commands.py
@@ +118,5 @@
> '--fair-sched=yes',
> + # Keep debuginfo after library unmap. See bug 1382280.
> + '--keep-debuginfo=yes',
> + # Reduce noise level on rustc and/or LLVM compiled code.
> + # See bug 1365915
Nit: missing '.'
Attachment #8894914 -
Flags: review?(n.nethercote) → review+
Reporter | ||
Comment 46•7 years ago
|
||
Comment on attachment 8894882 [details] [diff] [review]
part 1: Upgrade valgrind r=ted
Review of attachment 8894882 [details] [diff] [review]:
-----------------------------------------------------------------
::: taskcluster/docker/centos6-build/system-setup.sh
@@ +324,5 @@
>
> +valgrind_version=3.13.0
> +tar -xjf valgrind-$valgrind_version.tar.bz2
> +cd valgrind-$valgrind_version
> +patch -p0 < /tmp/valgrind-epochs.patch
Can you put a note here about why we're using a patch? Maybe something like:
# This patch by Julian Seward allows us to write a suppression for
# a leak in a library that gets unloaded before shutdown.
Attachment #8894882 -
Flags: review?(ted) → review+
Reporter | ||
Comment 47•7 years ago
|
||
Comment on attachment 8894883 [details] [diff] [review]
part 2: Install debug information. r=ted
Review of attachment 8894883 [details] [diff] [review]:
-----------------------------------------------------------------
I'm not going to actually try to review the install-debug-symbols.sh script, I trust that it works. I wish that process was simpler. :-(
::: taskcluster/docker/recipes/install-debug-symbols.sh
@@ +1,4 @@
> +#!/bin/bash
> +# This Source Code Form is subject to the terms of the Mozilla Public
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
Might not hurt to name this 'centos-install-debug-symbols.sh' since it's in the generic 'recipes' directory.
Attachment #8894883 -
Flags: review?(ted) → review+
Comment 48•7 years ago
|
||
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #46)
> Can you put a note here about why we're using a patch? Maybe something like:
> # This patch by Julian Seward allows us to write a suppression for
> # a leak in a library that gets unloaded before shutdown.
Plus maybe a link to the upstream bug report
https://bugs.kde.org/show_bug.cgi?id=79362
?
Comment 49•7 years ago
|
||
Pushed by wcosta@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f020b3ad85f5
part 1: Upgrade valgrind r=ted
https://hg.mozilla.org/integration/mozilla-inbound/rev/70fc0b62fc25
part 2: Install debug information. r=ted
https://hg.mozilla.org/integration/mozilla-inbound/rev/878f3ec362ff
Update the version of Valgrind in the desktop-build image to one with unstripped shared libraries (part 3). r=n.nethercote.
Comment 50•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/f020b3ad85f5
https://hg.mozilla.org/mozilla-central/rev/70fc0b62fc25
https://hg.mozilla.org/mozilla-central/rev/878f3ec362ff
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox57:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Updated•7 years ago
|
Product: Core → Firefox Build System
You need to log in
before you can comment on or make changes to this bug.
Description
•