Closed Bug 945427 Opened 11 years ago Closed 10 years ago

Upgrade Valgrind on build machines to 3.10.0

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

All
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gkw, Assigned: mshal)

References

Details

Attachments

(3 files)

+++ This bug was initially created as a clone of Bug #797573 +++

Valgrind 3.9.0 is out. Perhaps we could upgrade to it?
njn, will this be a good step forward?
Flags: needinfo?(n.nethercote)
I don't think it'll make much difference.  The main changes that are relevant for us:

- Improvements in the heuristics used for "possibly lost" leaks, but we use --show-possibly-lost=no, and I doubt the improvements are enough for us to use --show-possibly-lost=yes.

- Some minor optimizations.

It probably wouldn't hurt to be up-to-date, but it's certainly not urgent.
Flags: needinfo?(n.nethercote)
This doesn't block the general valgrind-on-tbpl work.
No longer blocks: valgrind-on-tbpl
Actually, I just found a reason to do this. Valgrind-3.9.0 has a flag --errors-for-leak-kinds=definite which you need to use in conjunction with --show-possibly-lost=no if you want the error counts to make sense, i.e. be zero when they should be zero. And we need that for bug 957410 to work.

BTW, this is silly behaviour on Valgrind's part (I just complained in https://bugs.kde.org/show_bug.cgi?id=307465), but we're stuck with it for now.
Blocks: 957410
catlee, is this something you could do? http://valgrind.org/downloads/valgrind-3.9.0.tar.bz2 is the tarball.

FWIW, bug 750856 was the bug for updating to 3.8.1, and might be a useful reference.
Flags: needinfo?(catlee)
(In reply to Nicholas Nethercote [:njn] from comment #5)
> FWIW, bug 750856 was the bug for updating to 3.8.1, and might be a useful
> reference.

From bug 750856 comment 24:
> Bug 797573 further upgraded Valgrind to a special 3.8.1 tarball for Mozilla.

Bug 797573 might be a better reference.
(In reply to Nicholas Nethercote [:njn] from comment #4)
> Actually, I just found a reason to do this. Valgrind-3.9.0 has a flag
> --errors-for-leak-kinds=definite which you need to use in conjunction with
> --show-possibly-lost=no if you want the error counts to make sense, i.e. be
> zero when they should be zero. And we need that for bug 957410 to work.

Per PhilippeW's comments at
https://bugs.kde.org/show_bug.cgi?id=307465#c13 maybe it would be
simpler to not use --show-possibly-lost=no and instead work with
--show-leak-kinds= and --errors-for-leak-kinds= since they make the
counting- and showing- behaviour directly controllable.

Note that those flags are new in 3.9.0 and will break runs done with
previous versions.
> Per PhilippeW's comments at
> https://bugs.kde.org/show_bug.cgi?id=307465#c13 maybe it would be
> simpler to not use --show-possibly-lost=no and instead work with
> --show-leak-kinds= and --errors-for-leak-kinds= since they make the
> counting- and showing- behaviour directly controllable.

Sure. --show-leak-kinds=definite is equivalent to --show-possibly-lost=no.

> Note that those flags are new in 3.9.0 and will break runs done with
> previous versions.

Hence this bug.
Sorry, what's the consensus here? Install vanilla valgrind-3.9.0, or do we need a custom build again?
Flags: needinfo?(catlee)
I think our custom build changes some time ago were cherry-picked after 3.8.1 was cut, from in Valgrind trunk which eventually made it into 3.9.0.

Thus, vanilla 3.9.0 *should* be enough.
Build instructions, based on https://bugzilla.mozilla.org/show_bug.cgi?id=797573#c7:

wget http://valgrind.org/downloads/valgrind-3.9.0.tar.bz2
tar -jvxf valgrind-3.9.0.tar.bz2
cp valgrind-3.9.0/valgrind.spec .
mock_mozilla -r mozilla-centos6-x86_64 --init
mock_mozilla -r mozilla-centos6-x86_64 --install gcc mpfr
mock_mozilla -r mozilla-centos6-x86_64 --shell --unpriv 'mkdir -p /builds/rpmbuild/{SOURCES,SPECS}'
mock_mozilla -r mozilla-centos6-x86_64 --copyin --unpriv valgrind-3.9.0.tar.bz2 /builds/rpmbuild/SOURCES
mock_mozilla -r mozilla-centos6-x86_64 --copyin --unpriv valgrind.spec /builds/rpmbuild/SPECS
mock_mozilla -r mozilla-centos6-x86_64 --shell --unpriv --cwd '/builds/rpmbuild/SPECS' 'rpmbuild -ba valgrind.spec'
cp /builds/mock_mozilla/mozilla-centos6-x86_64/root/builds/rpmbuild/RPMS/x86_64/valgrind-3.9.0-1.x86_64.rpm ./

mock_mozilla -r mozilla-centos6-i386 --init
mock_mozilla -r mozilla-centos6-i386 --install gcc mpfr
mock_mozilla -r mozilla-centos6-i386 --shell --unpriv 'mkdir -p /builds/rpmbuild/{SOURCES,SPECS}'
mock_mozilla -r mozilla-centos6-i386 --copyin --unpriv valgrind-3.9.0.tar.bz2 /builds/rpmbuild/SOURCES
mock_mozilla -r mozilla-centos6-i386 --copyin --unpriv valgrind.spec /builds/rpmbuild/SPECS
mock_mozilla -r mozilla-centos6-i386 --shell --unpriv --cwd '/builds/rpmbuild/SPECS' 'rpmbuild -ba valgrind.spec'
cp /builds/mock_mozilla/mozilla-centos6-i386/root/builds/rpmbuild/RPMS/i686/valgrind-3.9.0-1.i686.rpm ./

I'm going to get these deployed onto the Puppet servers, then we can look at changing the packages used by the Valgrind builders.
...except I just noticed the new build instructions on https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/HowTo/Build_RPMs, I need to adjust a few things.
Assignee: nobody → bhearsum
OK. I have packages built now. Is it safe to upgrade at any time, or do we need to do some testing first?
Flags: needinfo?(n.nethercote)
Flags: needinfo?(gary)
No need for a .pp, because this is only installed at build time inside of mock.
Attachment #8365154 - Flags: review?(dustin)
Attachment #8365154 - Flags: review?(dustin) → review+
Attachment #8365154 - Flags: checked-in+
In theory, upgrading should be safe. In practice, it wouldn't hurt to do a test run if that's not too hard :)
Flags: needinfo?(n.nethercote)
(In reply to Nicholas Nethercote [:njn] from comment #16)
> In theory, upgrading should be safe. In practice, it wouldn't hurt to do a
> test run if that's not too hard :)

Turns out that there's no way I can find to test this ahead of time =(. I'm going to land this Monday morning (eastern) and hope for the best. It's easy to back out if something goes wrong.

I think I can put something together to make it possible to test the next upgrade though...
Flags: needinfo?(gary)
If it's easy to back out, that sounds fine.
I think this is better than what we're doing, and should work based on this test using an existing version:
[root@bld-centos6-hp-004.build.scl1.mozilla.com ~]# yum install valgrind-3.8.1-1
Loaded plugins: security
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package valgrind.x86_64 1:3.8.1-1 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

==========================================================================================================================================================================
 Package                             Arch                              Version                              Repository                                               Size
==========================================================================================================================================================================
Installing:
 valgrind                            x86_64                            1:3.8.1-1                            releng-public-CentOS6-x86_64                             14 M

Transaction Summary
==========================================================================================================================================================================
Install       1 Package(s)

Total download size: 14 M
Installed size: 69 M
Is this ok [y/N]: 


Can't be landed until 3.9.0-1 is deployed on Monday morning.
Attachment #8365333 - Flags: review?(catlee)
Attachment #8365333 - Flags: review?(catlee) → review+
Something urgent came up this morning and I wasn't able to land this. I'll try again tomorrow morning.
I got busy with other things again this morning. I'll be landing this around 1pm eastern today (really!)
I have to put this off again. Trees are closed for major issues. I'll try again tomorrow.

I'm very sorry for all the delays in getting this finished.
I landed this but had to back out due to bustage:
+ python2.7 ../src/mach valgrind-test
==544== Memcheck, a memory error detector
==544== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==544== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==544== Command: /builds/slave/m-in-l64-valgrind-000000000000/objdir/dist/bin/firefox http://127.0.0.1:37016/ -profile /tmp/tmp5sKQr1
==544==
==547==
==547== HEAP SUMMARY:
==547==     in use at exit: 216,374 bytes in 634 blocks
==547==   total heap usage: 791 allocs, 157 frees, 237,043 bytes allocated
==547==
==547== LEAK SUMMARY:
==547==    definitely lost: 0 bytes in 0 blocks
==547==    indirectly lost: 0 bytes in 0 blocks
==547==      possibly lost: 1,101 bytes in 32 blocks
==547==    still reachable: 215,273 bytes in 602 blocks
==547==         suppressed: 0 bytes in 0 blocks
==547== Reachable blocks (those to which a pointer was found) are not shown.
==547== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==547==
==547== For counts of detected and suppressed errors, rerun with: -v
==547== ERROR SUMMARY: 32 errors from 32 contexts (suppressed: 6 from 6)
Xlib:  extension "RANDR" missing on display ":2.0".
==544==
==544== HEAP SUMMARY:
==544==     in use at exit: 860,001 bytes in 9,296 blocks
==544==   total heap usage: 1,338,656 allocs, 1,329,360 frees, 955,339,648 bytes allocated
==544==
==544== LEAK SUMMARY:
==544==    definitely lost: 0 bytes in 0 blocks
==544==    indirectly lost: 4,800 bytes in 149 blocks
==544==      possibly lost: 2,041 bytes in 57 blocks
==544==    still reachable: 253,618 bytes in 2,090 blocks
==544==         suppressed: 599,542 bytes in 7,000 blocks
==544== Reachable blocks (those to which a pointer was found) are not shown.
==544== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==544==
==544== For counts of detected and suppressed errors, rerun with: -v
==544== ERROR SUMMARY: 53 errors from 53 contexts (suppressed: 159 from 153)
TEST-UNEXPECTED-FAIL | valgrind-test | non-zero exit code

Full log is here: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1391013612/mozilla-inbound-linux64-valgrind-bm61-build1-build174.txt.gz


If you need a hand debugging this (eg, access to a machine), please let me know.
Flags: needinfo?(n.nethercote)
The problem is something to do with the highly confusing way that Valgrind counts leak errors. I know how to fix it for 3.9.0 -- it requires adding some new, 3.9.0-only flags to the Valgrind invocation. But I also don't understand why things work with 3.8.1.

I'll keep digging, but the good news is that comment 4 was wrong, and this doesn't block bug 957410 after all.
No longer blocks: 957410
Flags: needinfo?(n.nethercote)
(In reply to Nicholas Nethercote [:njn] from comment #24)
> The problem is something to do with the highly confusing way that Valgrind
> counts leak errors. I know how to fix it for 3.9.0 -- it requires adding
> some new, 3.9.0-only flags to the Valgrind invocation. But I also don't
> understand why things work with 3.8.1.
> 
> I'll keep digging, but the good news is that comment 4 was wrong, and this
> doesn't block bug 957410 after all.

OK, thanks. Can you reassign this bug once you've figured out how to work around this?
Assignee: bhearsum → n.nethercote
If we do the next upgrade, please also have a build that has the Valgrind fix in:

https://bugzilla.mozilla.org/show_bug.cgi?id=970643#c49

as well - it helps eliminate asm.js false positives. (either in a new version or a custom 3.9.0 build with only that fix)
3.10.0 is out.

http://www.valgrind.org/downloads/current.html#current
Summary: Upgrade Valgrind on build machines to 3.9.0 → Upgrade Valgrind on build machines to 3.10.0
Depends on: 1069034
One compelling feature of Valgrind 3.10 is that it produces better stack traces. Compare this, from 3.9.0:

> ==19385==    at 0x4C28A49: malloc (vg_replace_malloc.c:270)
> ==19385==    by 0x782D757: moz_xmalloc (mozalloc.cpp:52)
> ==19385==    by 0x84CF3F6: mozilla::net::nsHttpHandler::NewProxiedChannel(nsIURI*, nsIProxyInfo*, unsigned int, nsIURI*, nsIChannel**) (mozalloc.h:201)
> ==19385==    by 0x83A9AAF: nsIOService::NewChannelFromURIWithProxyFlags(nsIURI*, nsIURI*, unsigned int, nsIChannel**) (nsIOService.cpp:591)
> ==19385==    by 0x83A9C31: nsIOService::NewChannelFromURI(nsIURI*, nsIChannel**) (nsIOService.cpp:562)

with this, from 3.10.0:

> ==26147==    at 0x4C2ABBD: malloc (vg_replace_malloc.c:296)
> ==26147==    by 0x4034684: moz_xmalloc (mozalloc.cpp:52)
> ==26147==    by 0x7F390DA: operator new (mozalloc.h:201)
> ==26147==    by 0x7F390DA: mozilla::net::nsHttpHandler::NewProxiedChannel(nsIURI*, nsIProxyInfo*, unsigned int, nsIURI*, nsIChannel**) (nsHttpHandler.cpp:1695)
> ==26147==    by 0x7E1FF6A: nsIOService::NewChannelFromURIWithProxyFlags(nsIURI*, nsIURI*, unsigned int, nsIChannel**) (nsIOService.cpp:591)
> ==26147==    by 0x7E200E9: nsIOService::NewChannelFromURI(nsIURI*, nsIChannel**) (nsIOService.cpp:562)

So I'd like to get it upgraded. I *think* the bustage that happened last time we tried upgrading to 3.9.0 (see comment 23) won't happen this time, because the way the Valgrind test job detects failures has changed.

bhearsum, do you think you could schedule this? Bug 1069034 will need to land first. It would be good if you could try this at a time that we're both online so we could troubleshoot any problems together, but unfortunately we're 10 hours apart so that might be difficult.
Flags: needinfo?(bhearsum)
Sorry for the delay in response. I won't be able to pick this up myself, but I pinged a few other folks that may have time soon.
Flags: needinfo?(bhearsum)
(In reply to Ben Hearsum [:bhearsum] from comment #29)
> Sorry for the delay in response. I won't be able to pick this up myself, but
> I pinged a few other folks that may have time soon.

Ok. Bug 1069034 just landed on mozilla-inbound, so this should be able to happen as soon as it has propagated to the other repos.

mshal, is this something you'd be able to do?
Flags: needinfo?(mshal)
Yep, starting to look at it now.
Flags: needinfo?(mshal)
Attached patch puppet-bug945427Splinter Review
I'm not sure how to test this beforehand, but the packages should already be up on mockbuild-repos and synced to s3.
Attachment #8499577 - Flags: review?(dustin)
Comment on attachment 8499577 [details] [diff] [review]
puppet-bug945427

This looks good.  You could test by manually adding the repo to a mock config and running a build manually.  I think the big risk is that valgrind jobs start failing because of some issue with the upgraded version, but if it's been tested -- either by you or by developers -- then it's probably reasonable to just ship it.
Attachment #8499577 - Flags: review?(dustin) → review+
Comment on attachment 8499577 [details] [diff] [review]
puppet-bug945427

Guess we'll see what breaks...

https://hg.mozilla.org/build/puppet/rev/fe812b06762e
Attachment #8499577 - Flags: checked-in+
I merged the puppet change to production.
Backed out because of:

ERROR: Command failed: 
 # ['/usr/bin/yum', '--installroot', '/builds/mock_mozilla/mozilla-centos6-x86_64/root/', 'install', 'bash', 'bzip2', 'coreutils', 'cpio', 'diffutils', 'fedora-release', 'findutils', 'gawk', 'gmp', 'libstdc++', 'ppl', 'cpp', 'gcc', 'gcc-c++', 'grep', 'gzip', 'info', 'make', 'patch', 'redhat-rpm-config', 'rpm-build', 'sed', 'shadow-utils', 'tar', 'unzip', 'util-linux', 'which', 'xz']
http://mockbuild-repos.pub.build.mozilla.org/custom/valgrind/x86_64/repodata/repomd.xml: [Errno 14] PYCURL ERROR 22 - "The requested URL returned error: 404"
Trying other mirror.
Error: Cannot retrieve repository metadata (repomd.xml) for repository: custom-valgrind. Please verify its path and try again
Attachment #8499577 - Flags: checked-in+
Is it just not supposed to have the arch in the "baseurl" lines? So:

baseurl=%REPOROOT%/custom/valgrind

instead of:

baseurl=%REPOROOT%/custom/valgrind/x86_64
dustin figured out the problem was that I ran the "createrepo --update ." command in the wrong directory - it should be run in valgrind/i386 and valgrind/x86_64 instead of just valgrind. I fixed the mockbuild-repos and re-landed the puppet patch yesterday. It seems builds are starting to pick up the new valgrind:

https://tbpl.mozilla.org/php/getParsedLog.php?id=49679806&tree=Mozilla-Inbound&full=1

valgrind                x86_64 1:3.10.0-1          custom-valgrind        16 M
Attachment #8499577 - Flags: checked-in+
From the logs on inbound:

> ==31963== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info

Woo! Thanks, mshal.
Assignee: n.nethercote → mshal
njn, is there anything else you need here or can we mark it fixed?
Flags: needinfo?(n.nethercote)
Looks good to me! Thank you.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(n.nethercote)
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: