Last Comment Bug 631811 - (valgrind-on-tbpl) [tracking] Turn Valgrind builds into a full-fledged acceptable test
(valgrind-on-tbpl)
: [tracking] Turn Valgrind builds into a full-fledged acceptable test
Status: RESOLVED FIXED
[sg:want P2][MemShrink:P2]
: sec-want
Product: Testing
Classification: Components
Component: New Frameworks (show other bugs)
: Trunk
: x86 Linux
: -- normal with 3 votes (vote)
: ---
Assigned To: Nicholas Nethercote [:njn]
:
:
Mentors:
: 622205 (view as bug list)
Depends on: 948145 631838 631841 631842 631950 635799 637677 639407 639408 639410 663737 668691 696293 696297 696298 696299 696305 706695 793291 793508 793509 793551 793584 794063 797573 801955 822029 823787 837754 844387 854253 874656 885232 934257 934542 940069 944141 946002 946005 946011 947671 950223 950975 952605 956951 956999
Blocks: MemShrinkTools 800435
  Show dependency treegraph
 
Reported: 2011-02-05 10:04 PST by Phil Ringnalda (:philor)
Modified: 2014-01-07 16:14 PST (History)
17 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
tbpl logs after all "definitely lost" errors suppressed (13.35 KB, text/plain)
2012-09-26 12:55 PDT, Gary Kwong [:gkw] [:nth10sd]
no flags Details

Description Phil Ringnalda (:philor) 2011-02-05 10:04:41 PST
Cribbing from my list in http://groups.google.com/group/mozilla.dev.tree-management/msg/834a2146d0bd4894 we need to:

* have it running on try
* come as close as we can to letting developers run it themselves with something like `make valgrind`
* not require pushes to a separate repo (for the suppression files) to fix problems
* maybe improve error reporting with tinderbox logs
* get it green after bug 620828
Comment 1 Nicholas Nethercote [:njn] 2011-03-03 19:42:39 PST
*** Bug 622205 has been marked as a duplicate of this bug. ***
Comment 2 Nicholas Nethercote [:njn] 2011-06-14 19:01:23 PDT
Some clarifications and updates:

- The biggest obstacle:  we need the Valgrind runs on the try server (bug 631838) before enabling them on mozilla-central (they're currently hidden and can be viewed by appending '&ignore=1' to a TBPL URL).  Without that, many developers won't be able to learn about Valgrind failures until their patch lands on mozilla-central, which is not good.  But getting the try server runs happening isn't high on RelEng's priority list, apparently.

- We need to decide what to do about suppressing leaks.  Easiest thing is to suppress all of the current ones.  Even though some of them might be real leaks in our code, at least then any new ones will be detected.

- There's some breakage at the moment:

make[3]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/locales'
make[2]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer'
make tools
make[2]: Entering directory `/builds/slave/valgrind-linux/objdir/browser/installer'
make[2]: Nothing to be done for `tools'.
make[2]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer'
if test -d ../../dist/bin ; then touch ../../dist/bin/.purgecaches ; fi
make[1]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer'
Traceback (most recent call last):
  File "_profile/pgo/profileserver.py", line 57, in <module>
    MOZ_JAR_LOG_DIR = os.path.abspath(os.getenv("JARLOG_DIR"))
  File "/tools/python-2.5.1/lib/python2.5/posixpath.py", line 402, in abspath
    if not isabs(path):
  File "/tools/python-2.5.1/lib/python2.5/posixpath.py", line 49, in isabs
    return s.startswith('/')
AttributeError: 'NoneType' object has no attribute 'startswith'
program finished with exit code 1

  I don't know anything about that.

- philor says the current workload is to open a few sites in the browser, the same thing that is run for PGO.  This is better than nothing, but it'd be good to have something more strenuous.  Tp4 or Tp5, maybe?  Something that results in a runtime that's comparable to other test suites seems reasonable;  maybe 10--15 minutes.
Comment 3 Phil Ringnalda (:philor) 2011-06-14 19:37:21 PDT
The bustage is no big deal, bug 663737. It's only interesting in that I didn't notice it for a couple of days, and I wouldn't have noticed it even then except I was stuck having to watch &noignore=1 for other things. Now I'm not anymore, so I'd only see further bustage if I actually went looking for it, which I very rarely do. The current "once or maybe twice a day, but nobody will see it even when it does run" isn't quite worthless, but it's awfully close.
Comment 4 Jesse Ruderman 2011-06-21 17:51:54 PDT
> We need to decide what to do about suppressing leaks.

Let's suppress existing leaks, and include bug numbers in the suppressions file.
Comment 5 Nicholas Nethercote [:njn] 2011-11-29 14:37:25 PST
jst will talk to joduinn about this.
Comment 6 cmtalbert 2011-11-30 07:58:16 PST
(In reply to Nicholas Nethercote [:njn] from comment #5)
> jst will talk to joduinn about this.

We've got two valgrind machines (one linux, one mac) running all unit tests each day as part of the bughunter automation.  Once the UI is ready from bughunter and you can query these messages and see what is being found, would that be enough to satisfy this goal?  I don't think we can run all unittests through valgrind on a per checkin basis and maintain our standard of reporting results within 2hours of checkin.

The bughunter UI will be ready for use at the end of December 2011.
Comment 7 Jesse Ruderman 2011-11-30 08:25:14 PST
That will probably be sufficient. I haven't seen bughunter mockups or prototypes, so I don't know whether it already does everything I want, but I'm happy to file bugs :)

Does bughunter use the in-tree Valgrind suppression files? Will logs be available? Can I get notified if it goes from green to red / if a new error appears? Can we (eventually) make it test both 32-bit and 64-bit Linux?
Comment 8 John O'Duinn [:joduinn] (please use "needinfo?" flag) 2011-12-02 01:40:05 PST
(In reply to Nicholas Nethercote [:njn] from comment #5)
> jst will talk to joduinn about this.

jst + joduinn met :-) and details are in bug#706695. Worth noting that valgrind builds are already being run once a night on mozilla-central, but are consistently failing - see bug#696299 for more details.
Comment 9 Bob Clary [:bc:] 2011-12-02 08:53:59 PST
(In reply to Jesse Ruderman from comment #7)
> 
> Does bughunter use the in-tree Valgrind suppression files?

Not currently, but that should be easy to add.

> Will logs be available?

Yes.

> Can I get notified if it goes from green to red / if a new error appears?

We don't have automatic notification via email in the current plan, but that is something that could be added later.

> Can we (eventually) make it test both 32-bit and 64-bit Linux?

It really depends on the hardware. But yes, both are possible.
Comment 10 Nicholas Nethercote [:njn] 2012-04-03 16:36:19 PDT
We're downgrading this to MemShrink:P2 because the leaks that Valgrind finds are only a fraction of total leaks.  Having said that, the other stuff Valgrind finds -- undefined memory errors, etc -- are extremely useful, not to mention often security bugs, and so this would still be a very good thing to have.
Comment 11 Gary Kwong [:gkw] [:nth10sd] 2012-09-24 16:11:19 PDT
I moved most of the bugs to be suppressed out into bug 793882.
Comment 12 Gary Kwong [:gkw] [:nth10sd] 2012-09-26 12:55:59 PDT
Created attachment 665095 [details]
tbpl logs after all "definitely lost" errors suppressed

This comes from:

https://tbpl.mozilla.org/php/getParsedLog.php?id=15562383&tree=Firefox&full=1

(Linux mozilla-central valgrind on 2012-09-26 11:25:42 PDT for push 212cf709135c)

64-bit one is at: https://tbpl.mozilla.org/php/getParsedLog.php?id=15562882&tree=Firefox&full=1

The harness is still sending error exit code 1 out even though all errors have been suppessed.
Comment 13 Gary Kwong [:gkw] [:nth10sd] 2012-09-26 12:57:19 PDT
> The harness is still sending error exit code 1 out even though all errors
> have been suppressed.

Julian, is it because --error-exitcode=1 doesn't play nice with --show-possibly-lost=no, in addition to the suppression files?
Comment 14 Jesse Ruderman 2012-09-26 15:30:43 PDT
I think that's what is happening.  https://bugs.kde.org/show_bug.cgi?id=307465
Comment 15 Nicholas Nethercote [:njn] 2012-09-26 16:40:58 PDT
I think you're right.  Julian, I think in get_printing_rules() in mc_leakcheck.c you should probably use the same logic for |count_as_error| as you do for |print_record|.
Comment 16 Nicholas Nethercote [:njn] 2012-10-02 22:44:13 PDT
> The harness is still sending error exit code 1 out even though all errors
> have been suppessed.

I haven't kept track with the many Valgrind-on-TBPL fixes that have been happening.  Is this exit code blocking this bug from completion?  (I see there are other blocking bugs, I'm not sure which of those are critical and which would just be nice...)
Comment 17 Gary Kwong [:gkw] [:nth10sd] 2012-10-02 22:47:30 PDT
> Is this exit code blocking this bug from completion?  (I see
> there are other blocking bugs, I'm not sure which of those are critical and
> which would just be nice...)

As far I can tell, the exit code is the only issue preventing it from turning green. I'm not sure if merely turning green == full-fledged acceptable test, it certainly will be acceptable, but probably not full-fledged enough if the blocking bugs are not yet fixed.
Comment 18 Julian Seward [:jseward] 2012-10-03 11:37:54 PDT
(In reply to Nicholas Nethercote [:njn] from comment #15)
> I think you're right.  Julian, I think in get_printing_rules() in
> mc_leakcheck.c you should probably use the same logic for |count_as_error|
> as you do for |print_record|.

Ah, that makes it very easy.  Thanks!  A patch to exactly to do
exactly that is at https://bugs.kde.org/show_bug.cgi?id=307465#c2
for testing now.  I think it fixes the problem.
Comment 19 Justin Lebar (not reading bugmail) 2013-02-06 15:11:14 PST
Gary, this test is running and finding leaks.  It's not a "full-fleged acceptance test", but I'm tempted to close this bug, since it seems like we're as close as we expect to get.
Comment 20 Gary Kwong [:gkw] [:nth10sd] 2013-02-06 15:13:39 PST
(In reply to Justin Lebar [:jlebar] from comment #19)
> Gary, this test is running and finding leaks.  It's not a "full-fleged
> acceptance test", but I'm tempted to close this bug, since it seems like
> we're as close as we expect to get.

We're close, but not really entirely 100% there yet, since it's not yet exposed to TBPL by default.
Comment 21 Justin Lebar (not reading bugmail) 2013-02-06 15:48:28 PST
Do you still hope that we'll be able to expose it on TBPL by default sometime in the foreseeable future, or have you given up on that cause?
Comment 22 Ed Morley [:emorley] 2013-02-06 15:58:21 PST
(In reply to Justin Lebar [:jlebar] from comment #21)
> Do you still hope that we'll be able to expose it on TBPL by default
> sometime in the foreseeable future, or have you given up on that cause?

Personally I don't see it happening any time soon, see bug 800435 comment 23.
Comment 23 Gary Kwong [:gkw] [:nth10sd] 2013-02-06 16:11:15 PST
(In reply to Justin Lebar [:jlebar] from comment #21)
> Do you still hope that we'll be able to expose it on TBPL by default
> sometime in the foreseeable future, or have you given up on that cause?

In the future, yes, but probably not in the foreseeable future.

Please feel free to resolve this any way you like.
Comment 24 Justin Lebar (not reading bugmail) 2013-02-06 16:28:59 PST
We took the MemShrink tag off, so I guess we don't care whether this gets resolved or not anymore!  :)
Comment 25 Nicholas Nethercote [:njn] 2013-11-19 15:06:57 PST
(This is a copy of gkw's massively informative comment from bug 934257 comment
11.)

The last time this was ever green was likely about 6 months ago, last I filed
was in May 2013, see bug 793882.

Steps:
1 - Go to https://tbpl.mozilla.org/?noignore=1&jobname=valgrind and find the
    first V instance, sometimes I'd have to click the "green arrow" a few times
    to get the first one.

2 - If it's green, check that everything is indeed okay by going to the full
    log, going to the bottom and seeing that the browser quit properly after
    running the PGO tests. See bug 885232 for a "green" V but it really is red.
    Bug 823787 also describes another quirk.

3 - If it's red, click full log to find out why it is red. It usually might be
    a recent error. Or it might be a harness breakage, e.g. bug 854253.

4 - File a bug, e.g. bug 866959, inspect the regression window, e.g.
    http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3ada6a2fd0c6&tochange=64d6d002e888
    and cc the devs likely involved.

5 - add the suppression to any applicable file in
    https://hg.mozilla.org/mozilla-central/file/70de5e24d79b/build/valgrind/

5a - cross-architecture.sup is for suppressions that affect all platforms and
    architectures.

5b - i386-redhat-linux-gnu.sup is for suppressions that affect only x86 Linux.
    (now no longer being run due to bug 873904)

5c - x86_64-redhat-linux-gnu.sup is for suppressions that affect only x64
    Linux.

5d - Bug 817853 turns on --track-origins=yes by default (even though there was
    a speed penalty), because I didn't want to turn it on, wait a day, hope the
    conditional error shows up, get a stack, file it, add the suppression, then
    remove this flag. (I did not add this flag in the past and it got too
    cumbersome)

6 - In the list of bugs present in the suppression files, find out which ones
    were fixed recently, and remove them from the suppression list.

6b - Sometimes a bad changeset is pushed and causes errors in the nightly build
    (which then gets fixed by a subsequent backout *after* the nightly has been
    created), but I inspect this particular nightly build log and report the
    bug.  However, this bug may not exist anymore due to the subsequent
    backout, but I don't know if this is the case anymore, so I still file the
    bug. This should be mitigated by testing-on-checkin rather than
    testing-on-nightly.

7 - Next day, look at the results and cross your fingers for a green build.

Alternative path:
* After pushing a suppression, politely ping a buildduty releng folk to
  custom-start a Valgrind build to see that the suppression works as expected
  in a few hours, rather than waiting one more day, due to bug 793989 (which
  apparently prevents one from doing it oneself).

Other notes:
* Note that as mentioned above, due to bug 873904, 32-bit Linux TBPL builds
  were also disabled. 

* Valgrind 3.9.0 has also been released so probably an upgrade might be
  warranted. The last upgrade was to a special post-3.8.1 build in bug 797573
  about just over a year ago.

* We may want to consider running Valgrind TBPL on more tests (bug
  795124)/platforms/branches, e.g. bug 803763 (Android platforms), bug 803758
  (jsreftests), bug 803757 (crashtests), bug 803739 (xpcshell tests), bug
  801913 (reftests), bug 801911 (jstests and jit-tests), bug 794627/bug 799232
  (mochitests), bug 797158 (B2G tests?), bug 801955 (more branches, e.g.
  mozilla-inbound, fx-team, try, etc.), but we may run into issues like bug
  810753.

* I'm not sure if upgrading the Valgrind TBPL machines would help, but this
  might help mitigate some slowness.
Comment 26 Nicholas Nethercote [:njn] 2013-11-28 14:59:48 PST
FWIW, the runs on m-c are now successful -- no errors reported, and the suppression files have been minimized.  And it takes about 45 minutes for a run, which is pretty reasonable.  Time to start pushing on the infrastructure side.
Comment 27 Nicholas Nethercote [:njn] 2013-12-12 08:36:25 PST
Note to self:  once all the requirements for making this test visible are met, a bug should be filed as described by https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Requesting_changes_in_visibility.
Comment 28 Nicholas Nethercote [:njn] 2013-12-19 19:56:36 PST
All the blocking bugs are now fixed.  I believe the job is ready to be unhidden, but I won't file a bug on that until the new year -- making the job visible just before I disappear for two weeks is a bad idea.
Comment 29 Nicholas Nethercote [:njn] 2014-01-07 16:14:50 PST
Ok, it's really done now.  \o/

Note You need to log in before you can comment on or make changes to this bug.