Closed Bug 631811 (valgrind-on-tbpl) Opened 13 years ago Closed 11 years ago

[tracking] Turn Valgrind builds into a full-fledged acceptable test

Categories

(Testing :: General, defect)

x86
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: n.nethercote)

References

Details

(Keywords: sec-want, Whiteboard: [sg:want P2][MemShrink:P2])

Attachments

(1 obsolete file)

Cribbing from my list in http://groups.google.com/group/mozilla.dev.tree-management/msg/834a2146d0bd4894 we need to:

* have it running on try
* come as close as we can to letting developers run it themselves with something like `make valgrind`
* not require pushes to a separate repo (for the suppression files) to fix problems
* maybe improve error reporting with tinderbox logs
* get it green after bug 620828
Depends on: 631838
Depends on: 631841
Depends on: 631842
Depends on: 631950
Depends on: 635799
Depends on: 637677
Depends on: 639407
Depends on: 639408
Depends on: 639410
Whiteboard: [MemShrink:P1]
Some clarifications and updates:

- The biggest obstacle:  we need the Valgrind runs on the try server (bug 631838) before enabling them on mozilla-central (they're currently hidden and can be viewed by appending '&ignore=1' to a TBPL URL).  Without that, many developers won't be able to learn about Valgrind failures until their patch lands on mozilla-central, which is not good.  But getting the try server runs happening isn't high on RelEng's priority list, apparently.

- We need to decide what to do about suppressing leaks.  Easiest thing is to suppress all of the current ones.  Even though some of them might be real leaks in our code, at least then any new ones will be detected.

- There's some breakage at the moment:

make[3]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/locales'
make[2]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer'
make tools
make[2]: Entering directory `/builds/slave/valgrind-linux/objdir/browser/installer'
make[2]: Nothing to be done for `tools'.
make[2]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer'
if test -d ../../dist/bin ; then touch ../../dist/bin/.purgecaches ; fi
make[1]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer'
Traceback (most recent call last):
  File "_profile/pgo/profileserver.py", line 57, in <module>
    MOZ_JAR_LOG_DIR = os.path.abspath(os.getenv("JARLOG_DIR"))
  File "/tools/python-2.5.1/lib/python2.5/posixpath.py", line 402, in abspath
    if not isabs(path):
  File "/tools/python-2.5.1/lib/python2.5/posixpath.py", line 49, in isabs
    return s.startswith('/')
AttributeError: 'NoneType' object has no attribute 'startswith'
program finished with exit code 1

  I don't know anything about that.

- philor says the current workload is to open a few sites in the browser, the same thing that is run for PGO.  This is better than nothing, but it'd be good to have something more strenuous.  Tp4 or Tp5, maybe?  Something that results in a runtime that's comparable to other test suites seems reasonable;  maybe 10--15 minutes.
The bustage is no big deal, bug 663737. It's only interesting in that I didn't notice it for a couple of days, and I wouldn't have noticed it even then except I was stuck having to watch &noignore=1 for other things. Now I'm not anymore, so I'd only see further bustage if I actually went looking for it, which I very rarely do. The current "once or maybe twice a day, but nobody will see it even when it does run" isn't quite worthless, but it's awfully close.
Depends on: 663737
> We need to decide what to do about suppressing leaks.

Let's suppress existing leaks, and include bug numbers in the suppressions file.
Whiteboard: [MemShrink:P1] → [MemShrink:P1][sg:want P2]
Depends on: 668691
Depends on: 696293
Depends on: 696297
Depends on: 696298
Depends on: 696299
Depends on: 696305
jst will talk to joduinn about this.
(In reply to Nicholas Nethercote [:njn] from comment #5)
> jst will talk to joduinn about this.

We've got two valgrind machines (one linux, one mac) running all unit tests each day as part of the bughunter automation.  Once the UI is ready from bughunter and you can query these messages and see what is being found, would that be enough to satisfy this goal?  I don't think we can run all unittests through valgrind on a per checkin basis and maintain our standard of reporting results within 2hours of checkin.

The bughunter UI will be ready for use at the end of December 2011.
That will probably be sufficient. I haven't seen bughunter mockups or prototypes, so I don't know whether it already does everything I want, but I'm happy to file bugs :)

Does bughunter use the in-tree Valgrind suppression files? Will logs be available? Can I get notified if it goes from green to red / if a new error appears? Can we (eventually) make it test both 32-bit and 64-bit Linux?
(In reply to Nicholas Nethercote [:njn] from comment #5)
> jst will talk to joduinn about this.

jst + joduinn met :-) and details are in bug#706695. Worth noting that valgrind builds are already being run once a night on mozilla-central, but are consistently failing - see bug#696299 for more details.
Depends on: 706695
(In reply to Jesse Ruderman from comment #7)
> 
> Does bughunter use the in-tree Valgrind suppression files?

Not currently, but that should be easy to add.

> Will logs be available?

Yes.

> Can I get notified if it goes from green to red / if a new error appears?

We don't have automatic notification via email in the current plan, but that is something that could be added later.

> Can we (eventually) make it test both 32-bit and 64-bit Linux?

It really depends on the hardware. But yes, both are possible.
We're downgrading this to MemShrink:P2 because the leaks that Valgrind finds are only a fraction of total leaks.  Having said that, the other stuff Valgrind finds -- undefined memory errors, etc -- are extremely useful, not to mention often security bugs, and so this would still be a very good thing to have.
Whiteboard: [MemShrink:P1][sg:want P2] → [MemShrink:P2][sg:want P2]
Blocks: 784681
Alias: valgrind-green
I moved most of the bugs to be suppressed out into bug 793882.
This comes from:

https://tbpl.mozilla.org/php/getParsedLog.php?id=15562383&tree=Firefox&full=1

(Linux mozilla-central valgrind on 2012-09-26 11:25:42 PDT for push 212cf709135c)

64-bit one is at: https://tbpl.mozilla.org/php/getParsedLog.php?id=15562882&tree=Firefox&full=1

The harness is still sending error exit code 1 out even though all errors have been suppessed.
> The harness is still sending error exit code 1 out even though all errors
> have been suppressed.

Julian, is it because --error-exitcode=1 doesn't play nice with --show-possibly-lost=no, in addition to the suppression files?
I think that's what is happening.  https://bugs.kde.org/show_bug.cgi?id=307465
I think you're right.  Julian, I think in get_printing_rules() in mc_leakcheck.c you should probably use the same logic for |count_as_error| as you do for |print_record|.
> The harness is still sending error exit code 1 out even though all errors
> have been suppessed.

I haven't kept track with the many Valgrind-on-TBPL fixes that have been happening.  Is this exit code blocking this bug from completion?  (I see there are other blocking bugs, I'm not sure which of those are critical and which would just be nice...)
> Is this exit code blocking this bug from completion?  (I see
> there are other blocking bugs, I'm not sure which of those are critical and
> which would just be nice...)

As far I can tell, the exit code is the only issue preventing it from turning green. I'm not sure if merely turning green == full-fledged acceptable test, it certainly will be acceptable, but probably not full-fledged enough if the blocking bugs are not yet fixed.
(In reply to Nicholas Nethercote [:njn] from comment #15)
> I think you're right.  Julian, I think in get_printing_rules() in
> mc_leakcheck.c you should probably use the same logic for |count_as_error|
> as you do for |print_record|.

Ah, that makes it very easy.  Thanks!  A patch to exactly to do
exactly that is at https://bugs.kde.org/show_bug.cgi?id=307465#c2
for testing now.  I think it fixes the problem.
No longer blocks: 784681
Gary, this test is running and finding leaks.  It's not a "full-fleged acceptance test", but I'm tempted to close this bug, since it seems like we're as close as we expect to get.
Whiteboard: [MemShrink:P2][sg:want P2] → [sg:want P2]
(In reply to Justin Lebar [:jlebar] from comment #19)
> Gary, this test is running and finding leaks.  It's not a "full-fleged
> acceptance test", but I'm tempted to close this bug, since it seems like
> we're as close as we expect to get.

We're close, but not really entirely 100% there yet, since it's not yet exposed to TBPL by default.
Do you still hope that we'll be able to expose it on TBPL by default sometime in the foreseeable future, or have you given up on that cause?
(In reply to Justin Lebar [:jlebar] from comment #21)
> Do you still hope that we'll be able to expose it on TBPL by default
> sometime in the foreseeable future, or have you given up on that cause?

Personally I don't see it happening any time soon, see bug 800435 comment 23.
Blocks: 800435
No longer depends on: 800435
(In reply to Justin Lebar [:jlebar] from comment #21)
> Do you still hope that we'll be able to expose it on TBPL by default
> sometime in the foreseeable future, or have you given up on that cause?

In the future, yes, but probably not in the foreseeable future.

Please feel free to resolve this any way you like.
We took the MemShrink tag off, so I guess we don't care whether this gets resolved or not anymore!  :)
Depends on: 885232
Depends on: 934257
(This is a copy of gkw's massively informative comment from bug 934257 comment
11.)

The last time this was ever green was likely about 6 months ago, last I filed
was in May 2013, see bug 793882.

Steps:
1 - Go to https://tbpl.mozilla.org/?noignore=1&jobname=valgrind and find the
    first V instance, sometimes I'd have to click the "green arrow" a few times
    to get the first one.

2 - If it's green, check that everything is indeed okay by going to the full
    log, going to the bottom and seeing that the browser quit properly after
    running the PGO tests. See bug 885232 for a "green" V but it really is red.
    Bug 823787 also describes another quirk.

3 - If it's red, click full log to find out why it is red. It usually might be
    a recent error. Or it might be a harness breakage, e.g. bug 854253.

4 - File a bug, e.g. bug 866959, inspect the regression window, e.g.
    http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3ada6a2fd0c6&tochange=64d6d002e888
    and cc the devs likely involved.

5 - add the suppression to any applicable file in
    https://hg.mozilla.org/mozilla-central/file/70de5e24d79b/build/valgrind/

5a - cross-architecture.sup is for suppressions that affect all platforms and
    architectures.

5b - i386-redhat-linux-gnu.sup is for suppressions that affect only x86 Linux.
    (now no longer being run due to bug 873904)

5c - x86_64-redhat-linux-gnu.sup is for suppressions that affect only x64
    Linux.

5d - Bug 817853 turns on --track-origins=yes by default (even though there was
    a speed penalty), because I didn't want to turn it on, wait a day, hope the
    conditional error shows up, get a stack, file it, add the suppression, then
    remove this flag. (I did not add this flag in the past and it got too
    cumbersome)

6 - In the list of bugs present in the suppression files, find out which ones
    were fixed recently, and remove them from the suppression list.

6b - Sometimes a bad changeset is pushed and causes errors in the nightly build
    (which then gets fixed by a subsequent backout *after* the nightly has been
    created), but I inspect this particular nightly build log and report the
    bug.  However, this bug may not exist anymore due to the subsequent
    backout, but I don't know if this is the case anymore, so I still file the
    bug. This should be mitigated by testing-on-checkin rather than
    testing-on-nightly.

7 - Next day, look at the results and cross your fingers for a green build.

Alternative path:
* After pushing a suppression, politely ping a buildduty releng folk to
  custom-start a Valgrind build to see that the suppression works as expected
  in a few hours, rather than waiting one more day, due to bug 793989 (which
  apparently prevents one from doing it oneself).

Other notes:
* Note that as mentioned above, due to bug 873904, 32-bit Linux TBPL builds
  were also disabled. 

* Valgrind 3.9.0 has also been released so probably an upgrade might be
  warranted. The last upgrade was to a special post-3.8.1 build in bug 797573
  about just over a year ago.

* We may want to consider running Valgrind TBPL on more tests (bug
  795124)/platforms/branches, e.g. bug 803763 (Android platforms), bug 803758
  (jsreftests), bug 803757 (crashtests), bug 803739 (xpcshell tests), bug
  801913 (reftests), bug 801911 (jstests and jit-tests), bug 794627/bug 799232
  (mochitests), bug 797158 (B2G tests?), bug 801955 (more branches, e.g.
  mozilla-inbound, fx-team, try, etc.), but we may run into issues like bug
  810753.

* I'm not sure if upgrading the Valgrind TBPL machines would help, but this
  might help mitigate some slowness.
Assignee: nobody → n.nethercote
Whiteboard: [sg:want P2] → [sg:want P2][MemShrink:P2]
Attachment #665095 - Attachment is obsolete: true
FWIW, the runs on m-c are now successful -- no errors reported, and the suppression files have been minimized.  And it takes about 45 minutes for a run, which is pretty reasonable.  Time to start pushing on the infrastructure side.
Depends on: 934542
Depends on: 940069
Depends on: 944141
Depends on: 946002
Depends on: 946005
Depends on: 946011
Depends on: 947671
Alias: valgrind-green → valgrind-on-tbpl
Depends on: 948145
No longer depends on: 945427
Note to self:  once all the requirements for making this test visible are met, a bug should be filed as described by https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Requesting_changes_in_visibility.
Depends on: 950223
Depends on: 950975
All the blocking bugs are now fixed.  I believe the job is ready to be unhidden, but I won't file a bug on that until the new year -- making the job visible just before I disappear for two weeks is a bad idea.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Depends on: 952397
No longer depends on: 952397
Depends on: 952605
Depends on: 956951
Blocks: 956999
No longer blocks: 956999
Depends on: 956999
Ok, it's really done now.  \o/
Component: New Frameworks → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: