Bug 631811 (valgrind-on-tbpl)

[tracking] Turn Valgrind builds into a full-fledged acceptable test

RESOLVED FIXED

Status

Testing
New Frameworks
RESOLVED FIXED
7 years ago
4 years ago

People

(Reporter: philor, Assigned: njn)

Tracking

(Depends on: 1 bug, {sec-want})

Trunk
x86
Linux
sec-want
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [sg:want P2][MemShrink:P2])

Attachments

(1 obsolete attachment)

(Reporter)

Description

7 years ago
Cribbing from my list in http://groups.google.com/group/mozilla.dev.tree-management/msg/834a2146d0bd4894 we need to:

* have it running on try
* come as close as we can to letting developers run it themselves with something like `make valgrind`
* not require pushes to a separate repo (for the suppression files) to fix problems
* maybe improve error reporting with tinderbox logs
* get it green after bug 620828
(Reporter)

Updated

7 years ago
Depends on: 631838
(Reporter)

Updated

7 years ago
Depends on: 631841
(Reporter)

Updated

7 years ago
Depends on: 631842
(Reporter)

Updated

7 years ago
Depends on: 631950
(Reporter)

Updated

6 years ago
Depends on: 635799
(Reporter)

Updated

6 years ago
Depends on: 637677
(Assignee)

Updated

6 years ago
Duplicate of this bug: 622205
(Reporter)

Updated

6 years ago
Depends on: 639407
(Reporter)

Updated

6 years ago
Depends on: 639408
(Reporter)

Updated

6 years ago
Depends on: 639410
(Assignee)

Updated

6 years ago
Whiteboard: [MemShrink:P1]
(Assignee)

Updated

6 years ago
Blocks: 640791
(Assignee)

Comment 2

6 years ago
Some clarifications and updates:

- The biggest obstacle:  we need the Valgrind runs on the try server (bug 631838) before enabling them on mozilla-central (they're currently hidden and can be viewed by appending '&ignore=1' to a TBPL URL).  Without that, many developers won't be able to learn about Valgrind failures until their patch lands on mozilla-central, which is not good.  But getting the try server runs happening isn't high on RelEng's priority list, apparently.

- We need to decide what to do about suppressing leaks.  Easiest thing is to suppress all of the current ones.  Even though some of them might be real leaks in our code, at least then any new ones will be detected.

- There's some breakage at the moment:

make[3]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/locales'
make[2]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer'
make tools
make[2]: Entering directory `/builds/slave/valgrind-linux/objdir/browser/installer'
make[2]: Nothing to be done for `tools'.
make[2]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer'
if test -d ../../dist/bin ; then touch ../../dist/bin/.purgecaches ; fi
make[1]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer'
Traceback (most recent call last):
  File "_profile/pgo/profileserver.py", line 57, in <module>
    MOZ_JAR_LOG_DIR = os.path.abspath(os.getenv("JARLOG_DIR"))
  File "/tools/python-2.5.1/lib/python2.5/posixpath.py", line 402, in abspath
    if not isabs(path):
  File "/tools/python-2.5.1/lib/python2.5/posixpath.py", line 49, in isabs
    return s.startswith('/')
AttributeError: 'NoneType' object has no attribute 'startswith'
program finished with exit code 1

  I don't know anything about that.

- philor says the current workload is to open a few sites in the browser, the same thing that is run for PGO.  This is better than nothing, but it'd be good to have something more strenuous.  Tp4 or Tp5, maybe?  Something that results in a runtime that's comparable to other test suites seems reasonable;  maybe 10--15 minutes.
(Reporter)

Comment 3

6 years ago
The bustage is no big deal, bug 663737. It's only interesting in that I didn't notice it for a couple of days, and I wouldn't have noticed it even then except I was stuck having to watch &noignore=1 for other things. Now I'm not anymore, so I'd only see further bustage if I actually went looking for it, which I very rarely do. The current "once or maybe twice a day, but nobody will see it even when it does run" isn't quite worthless, but it's awfully close.
Depends on: 663737

Comment 4

6 years ago
> We need to decide what to do about suppressing leaks.

Let's suppress existing leaks, and include bug numbers in the suppressions file.
Whiteboard: [MemShrink:P1] → [MemShrink:P1][sg:want P2]
(Reporter)

Updated

6 years ago
Depends on: 668691

Updated

6 years ago
Depends on: 696293

Updated

6 years ago
Depends on: 696297

Updated

6 years ago
Depends on: 696298

Updated

6 years ago
Depends on: 696299

Updated

6 years ago
Depends on: 696305
(Assignee)

Comment 5

6 years ago
jst will talk to joduinn about this.

Comment 6

6 years ago
(In reply to Nicholas Nethercote [:njn] from comment #5)
> jst will talk to joduinn about this.

We've got two valgrind machines (one linux, one mac) running all unit tests each day as part of the bughunter automation.  Once the UI is ready from bughunter and you can query these messages and see what is being found, would that be enough to satisfy this goal?  I don't think we can run all unittests through valgrind on a per checkin basis and maintain our standard of reporting results within 2hours of checkin.

The bughunter UI will be ready for use at the end of December 2011.

Comment 7

6 years ago
That will probably be sufficient. I haven't seen bughunter mockups or prototypes, so I don't know whether it already does everything I want, but I'm happy to file bugs :)

Does bughunter use the in-tree Valgrind suppression files? Will logs be available? Can I get notified if it goes from green to red / if a new error appears? Can we (eventually) make it test both 32-bit and 64-bit Linux?
(In reply to Nicholas Nethercote [:njn] from comment #5)
> jst will talk to joduinn about this.

jst + joduinn met :-) and details are in bug#706695. Worth noting that valgrind builds are already being run once a night on mozilla-central, but are consistently failing - see bug#696299 for more details.

Updated

6 years ago
Depends on: 706695

Comment 9

6 years ago
(In reply to Jesse Ruderman from comment #7)
> 
> Does bughunter use the in-tree Valgrind suppression files?

Not currently, but that should be easy to add.

> Will logs be available?

Yes.

> Can I get notified if it goes from green to red / if a new error appears?

We don't have automatic notification via email in the current plan, but that is something that could be added later.

> Can we (eventually) make it test both 32-bit and 64-bit Linux?

It really depends on the hardware. But yes, both are possible.
(Assignee)

Comment 10

5 years ago
We're downgrading this to MemShrink:P2 because the leaks that Valgrind finds are only a fraction of total leaks.  Having said that, the other stuff Valgrind finds -- undefined memory errors, etc -- are extremely useful, not to mention often security bugs, and so this would still be a very good thing to have.
Whiteboard: [MemShrink:P1][sg:want P2] → [MemShrink:P2][sg:want P2]
Keywords: sec-want
(Reporter)

Updated

5 years ago
Blocks: 784681
Depends on: 793291
Depends on: 793508
Depends on: 793509
Depends on: 793548
Depends on: 793549
Depends on: 793551
Depends on: 793584
Depends on: 793598
Depends on: 793600
Depends on: 793601
Depends on: 793602
Depends on: 793603
Depends on: 793605
Depends on: 793606
Depends on: 793607
Depends on: 793608
Depends on: 793611
Depends on: 793615
Depends on: 793616

Updated

5 years ago
Alias: valgrind-green
No longer depends on: 793535
No longer depends on: 793598
No longer depends on: 793602
No longer depends on: 793533
No longer depends on: 793611
No longer depends on: 793605
No longer depends on: 793607
No longer depends on: 793539
No longer depends on: 793606
No longer depends on: 793549
No longer depends on: 793603
No longer depends on: 793548
No longer depends on: 793608
No longer depends on: 793537
No longer depends on: 793601
No longer depends on: 793536
No longer depends on: 793615
No longer depends on: 793616
No longer depends on: 793600
No longer depends on: 793532
I moved most of the bugs to be suppressed out into bug 793882.
Created attachment 665095 [details]
tbpl logs after all "definitely lost" errors suppressed

This comes from:

https://tbpl.mozilla.org/php/getParsedLog.php?id=15562383&tree=Firefox&full=1

(Linux mozilla-central valgrind on 2012-09-26 11:25:42 PDT for push 212cf709135c)

64-bit one is at: https://tbpl.mozilla.org/php/getParsedLog.php?id=15562882&tree=Firefox&full=1

The harness is still sending error exit code 1 out even though all errors have been suppessed.
> The harness is still sending error exit code 1 out even though all errors
> have been suppressed.

Julian, is it because --error-exitcode=1 doesn't play nice with --show-possibly-lost=no, in addition to the suppression files?

Comment 14

5 years ago
I think that's what is happening.  https://bugs.kde.org/show_bug.cgi?id=307465
(Assignee)

Comment 15

5 years ago
I think you're right.  Julian, I think in get_printing_rules() in mc_leakcheck.c you should probably use the same logic for |count_as_error| as you do for |print_record|.
(Assignee)

Comment 16

5 years ago
> The harness is still sending error exit code 1 out even though all errors
> have been suppessed.

I haven't kept track with the many Valgrind-on-TBPL fixes that have been happening.  Is this exit code blocking this bug from completion?  (I see there are other blocking bugs, I'm not sure which of those are critical and which would just be nice...)
> Is this exit code blocking this bug from completion?  (I see
> there are other blocking bugs, I'm not sure which of those are critical and
> which would just be nice...)

As far I can tell, the exit code is the only issue preventing it from turning green. I'm not sure if merely turning green == full-fledged acceptable test, it certainly will be acceptable, but probably not full-fledged enough if the blocking bugs are not yet fixed.
(In reply to Nicholas Nethercote [:njn] from comment #15)
> I think you're right.  Julian, I think in get_printing_rules() in
> mc_leakcheck.c you should probably use the same logic for |count_as_error|
> as you do for |print_record|.

Ah, that makes it very easy.  Thanks!  A patch to exactly to do
exactly that is at https://bugs.kde.org/show_bug.cgi?id=307465#c2
for testing now.  I think it fixes the problem.
Depends on: 797573
Depends on: 794063
Depends on: 800435
Depends on: 801955
Depends on: 822029
Depends on: 823787

Updated

5 years ago
No longer blocks: 784681
Depends on: 837754
Gary, this test is running and finding leaks.  It's not a "full-fleged acceptance test", but I'm tempted to close this bug, since it seems like we're as close as we expect to get.
Whiteboard: [MemShrink:P2][sg:want P2] → [sg:want P2]
(In reply to Justin Lebar [:jlebar] from comment #19)
> Gary, this test is running and finding leaks.  It's not a "full-fleged
> acceptance test", but I'm tempted to close this bug, since it seems like
> we're as close as we expect to get.

We're close, but not really entirely 100% there yet, since it's not yet exposed to TBPL by default.
Do you still hope that we'll be able to expose it on TBPL by default sometime in the foreseeable future, or have you given up on that cause?
(In reply to Justin Lebar [:jlebar] from comment #21)
> Do you still hope that we'll be able to expose it on TBPL by default
> sometime in the foreseeable future, or have you given up on that cause?

Personally I don't see it happening any time soon, see bug 800435 comment 23.

Updated

4 years ago
Blocks: 800435
No longer depends on: 800435
(In reply to Justin Lebar [:jlebar] from comment #21)
> Do you still hope that we'll be able to expose it on TBPL by default
> sometime in the foreseeable future, or have you given up on that cause?

In the future, yes, but probably not in the foreseeable future.

Please feel free to resolve this any way you like.
We took the MemShrink tag off, so I guess we don't care whether this gets resolved or not anymore!  :)
Depends on: 844387
Depends on: 854253
Depends on: 874656

Updated

4 years ago
Depends on: 885232

Updated

4 years ago
Depends on: 934257
(Assignee)

Comment 25

4 years ago
(This is a copy of gkw's massively informative comment from bug 934257 comment
11.)

The last time this was ever green was likely about 6 months ago, last I filed
was in May 2013, see bug 793882.

Steps:
1 - Go to https://tbpl.mozilla.org/?noignore=1&jobname=valgrind and find the
    first V instance, sometimes I'd have to click the "green arrow" a few times
    to get the first one.

2 - If it's green, check that everything is indeed okay by going to the full
    log, going to the bottom and seeing that the browser quit properly after
    running the PGO tests. See bug 885232 for a "green" V but it really is red.
    Bug 823787 also describes another quirk.

3 - If it's red, click full log to find out why it is red. It usually might be
    a recent error. Or it might be a harness breakage, e.g. bug 854253.

4 - File a bug, e.g. bug 866959, inspect the regression window, e.g.
    http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3ada6a2fd0c6&tochange=64d6d002e888
    and cc the devs likely involved.

5 - add the suppression to any applicable file in
    https://hg.mozilla.org/mozilla-central/file/70de5e24d79b/build/valgrind/

5a - cross-architecture.sup is for suppressions that affect all platforms and
    architectures.

5b - i386-redhat-linux-gnu.sup is for suppressions that affect only x86 Linux.
    (now no longer being run due to bug 873904)

5c - x86_64-redhat-linux-gnu.sup is for suppressions that affect only x64
    Linux.

5d - Bug 817853 turns on --track-origins=yes by default (even though there was
    a speed penalty), because I didn't want to turn it on, wait a day, hope the
    conditional error shows up, get a stack, file it, add the suppression, then
    remove this flag. (I did not add this flag in the past and it got too
    cumbersome)

6 - In the list of bugs present in the suppression files, find out which ones
    were fixed recently, and remove them from the suppression list.

6b - Sometimes a bad changeset is pushed and causes errors in the nightly build
    (which then gets fixed by a subsequent backout *after* the nightly has been
    created), but I inspect this particular nightly build log and report the
    bug.  However, this bug may not exist anymore due to the subsequent
    backout, but I don't know if this is the case anymore, so I still file the
    bug. This should be mitigated by testing-on-checkin rather than
    testing-on-nightly.

7 - Next day, look at the results and cross your fingers for a green build.

Alternative path:
* After pushing a suppression, politely ping a buildduty releng folk to
  custom-start a Valgrind build to see that the suppression works as expected
  in a few hours, rather than waiting one more day, due to bug 793989 (which
  apparently prevents one from doing it oneself).

Other notes:
* Note that as mentioned above, due to bug 873904, 32-bit Linux TBPL builds
  were also disabled. 

* Valgrind 3.9.0 has also been released so probably an upgrade might be
  warranted. The last upgrade was to a special post-3.8.1 build in bug 797573
  about just over a year ago.

* We may want to consider running Valgrind TBPL on more tests (bug
  795124)/platforms/branches, e.g. bug 803763 (Android platforms), bug 803758
  (jsreftests), bug 803757 (crashtests), bug 803739 (xpcshell tests), bug
  801913 (reftests), bug 801911 (jstests and jit-tests), bug 794627/bug 799232
  (mochitests), bug 797158 (B2G tests?), bug 801955 (more branches, e.g.
  mozilla-inbound, fx-team, try, etc.), but we may run into issues like bug
  810753.

* I'm not sure if upgrading the Valgrind TBPL machines would help, but this
  might help mitigate some slowness.
Assignee: nobody → n.nethercote
(Assignee)

Updated

4 years ago
Whiteboard: [sg:want P2] → [sg:want P2][MemShrink:P2]
(Assignee)

Updated

4 years ago
Attachment #665095 - Attachment is obsolete: true
(Assignee)

Comment 26

4 years ago
FWIW, the runs on m-c are now successful -- no errors reported, and the suppression files have been minimized.  And it takes about 45 minutes for a run, which is pretty reasonable.  Time to start pushing on the infrastructure side.
(Assignee)

Updated

4 years ago
Depends on: 934542
(Assignee)

Updated

4 years ago
Depends on: 940069
(Assignee)

Updated

4 years ago
Depends on: 944141
Depends on: 945427
(Assignee)

Updated

4 years ago
Depends on: 946002
(Assignee)

Updated

4 years ago
Depends on: 946005
(Assignee)

Updated

4 years ago
Depends on: 946011
(Assignee)

Updated

4 years ago
Depends on: 947671
(Assignee)

Updated

4 years ago
Alias: valgrind-green → valgrind-on-tbpl
(Assignee)

Updated

4 years ago
Depends on: 948145
(Assignee)

Updated

4 years ago
No longer depends on: 945427
(Assignee)

Comment 27

4 years ago
Note to self:  once all the requirements for making this test visible are met, a bug should be filed as described by https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Requesting_changes_in_visibility.
(Assignee)

Updated

4 years ago
Depends on: 950223
(Assignee)

Updated

4 years ago
Depends on: 950975
(Assignee)

Comment 28

4 years ago
All the blocking bugs are now fixed.  I believe the job is ready to be unhidden, but I won't file a bug on that until the new year -- making the job visible just before I disappear for two weeks is a bad idea.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
(Assignee)

Updated

4 years ago
Depends on: 952397
(Assignee)

Updated

4 years ago
No longer depends on: 952397
(Assignee)

Updated

4 years ago
Depends on: 952605
(Assignee)

Updated

4 years ago
Depends on: 956951
(Assignee)

Updated

4 years ago
Blocks: 956999
(Assignee)

Updated

4 years ago
No longer blocks: 956999
Depends on: 956999
(Assignee)

Comment 29

4 years ago
Ok, it's really done now.  \o/
You need to log in before you can comment on or make changes to this bug.