Cribbing from my list in http://groups.google.com/group/mozilla.dev.tree-management/msg/834a2146d0bd4894 we need to: * have it running on try * come as close as we can to letting developers run it themselves with something like `make valgrind` * not require pushes to a separate repo (for the suppression files) to fix problems * maybe improve error reporting with tinderbox logs * get it green after bug 620828
Some clarifications and updates: - The biggest obstacle: we need the Valgrind runs on the try server (bug 631838) before enabling them on mozilla-central (they're currently hidden and can be viewed by appending '&ignore=1' to a TBPL URL). Without that, many developers won't be able to learn about Valgrind failures until their patch lands on mozilla-central, which is not good. But getting the try server runs happening isn't high on RelEng's priority list, apparently. - We need to decide what to do about suppressing leaks. Easiest thing is to suppress all of the current ones. Even though some of them might be real leaks in our code, at least then any new ones will be detected. - There's some breakage at the moment: make: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/locales' make: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer' make tools make: Entering directory `/builds/slave/valgrind-linux/objdir/browser/installer' make: Nothing to be done for `tools'. make: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer' if test -d ../../dist/bin ; then touch ../../dist/bin/.purgecaches ; fi make: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer' Traceback (most recent call last): File "_profile/pgo/profileserver.py", line 57, in <module> MOZ_JAR_LOG_DIR = os.path.abspath(os.getenv("JARLOG_DIR")) File "/tools/python-2.5.1/lib/python2.5/posixpath.py", line 402, in abspath if not isabs(path): File "/tools/python-2.5.1/lib/python2.5/posixpath.py", line 49, in isabs return s.startswith('/') AttributeError: 'NoneType' object has no attribute 'startswith' program finished with exit code 1 I don't know anything about that. - philor says the current workload is to open a few sites in the browser, the same thing that is run for PGO. This is better than nothing, but it'd be good to have something more strenuous. Tp4 or Tp5, maybe? Something that results in a runtime that's comparable to other test suites seems reasonable; maybe 10--15 minutes.
The bustage is no big deal, bug 663737. It's only interesting in that I didn't notice it for a couple of days, and I wouldn't have noticed it even then except I was stuck having to watch &noignore=1 for other things. Now I'm not anymore, so I'd only see further bustage if I actually went looking for it, which I very rarely do. The current "once or maybe twice a day, but nobody will see it even when it does run" isn't quite worthless, but it's awfully close.
> We need to decide what to do about suppressing leaks. Let's suppress existing leaks, and include bug numbers in the suppressions file.
jst will talk to joduinn about this.
(In reply to Nicholas Nethercote [:njn] from comment #5) > jst will talk to joduinn about this. We've got two valgrind machines (one linux, one mac) running all unit tests each day as part of the bughunter automation. Once the UI is ready from bughunter and you can query these messages and see what is being found, would that be enough to satisfy this goal? I don't think we can run all unittests through valgrind on a per checkin basis and maintain our standard of reporting results within 2hours of checkin. The bughunter UI will be ready for use at the end of December 2011.
That will probably be sufficient. I haven't seen bughunter mockups or prototypes, so I don't know whether it already does everything I want, but I'm happy to file bugs :) Does bughunter use the in-tree Valgrind suppression files? Will logs be available? Can I get notified if it goes from green to red / if a new error appears? Can we (eventually) make it test both 32-bit and 64-bit Linux?
(In reply to Nicholas Nethercote [:njn] from comment #5) > jst will talk to joduinn about this. jst + joduinn met :-) and details are in bug#706695. Worth noting that valgrind builds are already being run once a night on mozilla-central, but are consistently failing - see bug#696299 for more details.
(In reply to Jesse Ruderman from comment #7) > > Does bughunter use the in-tree Valgrind suppression files? Not currently, but that should be easy to add. > Will logs be available? Yes. > Can I get notified if it goes from green to red / if a new error appears? We don't have automatic notification via email in the current plan, but that is something that could be added later. > Can we (eventually) make it test both 32-bit and 64-bit Linux? It really depends on the hardware. But yes, both are possible.
We're downgrading this to MemShrink:P2 because the leaks that Valgrind finds are only a fraction of total leaks. Having said that, the other stuff Valgrind finds -- undefined memory errors, etc -- are extremely useful, not to mention often security bugs, and so this would still be a very good thing to have.
I moved most of the bugs to be suppressed out into bug 793882.
Created attachment 665095 [details] tbpl logs after all "definitely lost" errors suppressed This comes from: https://tbpl.mozilla.org/php/getParsedLog.php?id=15562383&tree=Firefox&full=1 (Linux mozilla-central valgrind on 2012-09-26 11:25:42 PDT for push 212cf709135c) 64-bit one is at: https://tbpl.mozilla.org/php/getParsedLog.php?id=15562882&tree=Firefox&full=1 The harness is still sending error exit code 1 out even though all errors have been suppessed.
> The harness is still sending error exit code 1 out even though all errors > have been suppressed. Julian, is it because --error-exitcode=1 doesn't play nice with --show-possibly-lost=no, in addition to the suppression files?
I think that's what is happening. https://bugs.kde.org/show_bug.cgi?id=307465
I think you're right. Julian, I think in get_printing_rules() in mc_leakcheck.c you should probably use the same logic for |count_as_error| as you do for |print_record|.
> The harness is still sending error exit code 1 out even though all errors > have been suppessed. I haven't kept track with the many Valgrind-on-TBPL fixes that have been happening. Is this exit code blocking this bug from completion? (I see there are other blocking bugs, I'm not sure which of those are critical and which would just be nice...)
> Is this exit code blocking this bug from completion? (I see > there are other blocking bugs, I'm not sure which of those are critical and > which would just be nice...) As far I can tell, the exit code is the only issue preventing it from turning green. I'm not sure if merely turning green == full-fledged acceptable test, it certainly will be acceptable, but probably not full-fledged enough if the blocking bugs are not yet fixed.
(In reply to Nicholas Nethercote [:njn] from comment #15) > I think you're right. Julian, I think in get_printing_rules() in > mc_leakcheck.c you should probably use the same logic for |count_as_error| > as you do for |print_record|. Ah, that makes it very easy. Thanks! A patch to exactly to do exactly that is at https://bugs.kde.org/show_bug.cgi?id=307465#c2 for testing now. I think it fixes the problem.
Gary, this test is running and finding leaks. It's not a "full-fleged acceptance test", but I'm tempted to close this bug, since it seems like we're as close as we expect to get.
(In reply to Justin Lebar [:jlebar] from comment #19) > Gary, this test is running and finding leaks. It's not a "full-fleged > acceptance test", but I'm tempted to close this bug, since it seems like > we're as close as we expect to get. We're close, but not really entirely 100% there yet, since it's not yet exposed to TBPL by default.
Do you still hope that we'll be able to expose it on TBPL by default sometime in the foreseeable future, or have you given up on that cause?
(In reply to Justin Lebar [:jlebar] from comment #21) > Do you still hope that we'll be able to expose it on TBPL by default > sometime in the foreseeable future, or have you given up on that cause? Personally I don't see it happening any time soon, see bug 800435 comment 23.
(In reply to Justin Lebar [:jlebar] from comment #21) > Do you still hope that we'll be able to expose it on TBPL by default > sometime in the foreseeable future, or have you given up on that cause? In the future, yes, but probably not in the foreseeable future. Please feel free to resolve this any way you like.
We took the MemShrink tag off, so I guess we don't care whether this gets resolved or not anymore! :)
(This is a copy of gkw's massively informative comment from bug 934257 comment 11.) The last time this was ever green was likely about 6 months ago, last I filed was in May 2013, see bug 793882. Steps: 1 - Go to https://tbpl.mozilla.org/?noignore=1&jobname=valgrind and find the first V instance, sometimes I'd have to click the "green arrow" a few times to get the first one. 2 - If it's green, check that everything is indeed okay by going to the full log, going to the bottom and seeing that the browser quit properly after running the PGO tests. See bug 885232 for a "green" V but it really is red. Bug 823787 also describes another quirk. 3 - If it's red, click full log to find out why it is red. It usually might be a recent error. Or it might be a harness breakage, e.g. bug 854253. 4 - File a bug, e.g. bug 866959, inspect the regression window, e.g. http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3ada6a2fd0c6&tochange=64d6d002e888 and cc the devs likely involved. 5 - add the suppression to any applicable file in https://hg.mozilla.org/mozilla-central/file/70de5e24d79b/build/valgrind/ 5a - cross-architecture.sup is for suppressions that affect all platforms and architectures. 5b - i386-redhat-linux-gnu.sup is for suppressions that affect only x86 Linux. (now no longer being run due to bug 873904) 5c - x86_64-redhat-linux-gnu.sup is for suppressions that affect only x64 Linux. 5d - Bug 817853 turns on --track-origins=yes by default (even though there was a speed penalty), because I didn't want to turn it on, wait a day, hope the conditional error shows up, get a stack, file it, add the suppression, then remove this flag. (I did not add this flag in the past and it got too cumbersome) 6 - In the list of bugs present in the suppression files, find out which ones were fixed recently, and remove them from the suppression list. 6b - Sometimes a bad changeset is pushed and causes errors in the nightly build (which then gets fixed by a subsequent backout *after* the nightly has been created), but I inspect this particular nightly build log and report the bug. However, this bug may not exist anymore due to the subsequent backout, but I don't know if this is the case anymore, so I still file the bug. This should be mitigated by testing-on-checkin rather than testing-on-nightly. 7 - Next day, look at the results and cross your fingers for a green build. Alternative path: * After pushing a suppression, politely ping a buildduty releng folk to custom-start a Valgrind build to see that the suppression works as expected in a few hours, rather than waiting one more day, due to bug 793989 (which apparently prevents one from doing it oneself). Other notes: * Note that as mentioned above, due to bug 873904, 32-bit Linux TBPL builds were also disabled. * Valgrind 3.9.0 has also been released so probably an upgrade might be warranted. The last upgrade was to a special post-3.8.1 build in bug 797573 about just over a year ago. * We may want to consider running Valgrind TBPL on more tests (bug 795124)/platforms/branches, e.g. bug 803763 (Android platforms), bug 803758 (jsreftests), bug 803757 (crashtests), bug 803739 (xpcshell tests), bug 801913 (reftests), bug 801911 (jstests and jit-tests), bug 794627/bug 799232 (mochitests), bug 797158 (B2G tests?), bug 801955 (more branches, e.g. mozilla-inbound, fx-team, try, etc.), but we may run into issues like bug 810753. * I'm not sure if upgrading the Valgrind TBPL machines would help, but this might help mitigate some slowness.
FWIW, the runs on m-c are now successful -- no errors reported, and the suppression files have been minimized. And it takes about 45 minutes for a run, which is pretty reasonable. Time to start pushing on the infrastructure side.
Note to self: once all the requirements for making this test visible are met, a bug should be filed as described by https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Requesting_changes_in_visibility.
All the blocking bugs are now fixed. I believe the job is ready to be unhidden, but I won't file a bug on that until the new year -- making the job visible just before I disappear for two weeks is a bad idea.
Ok, it's really done now. \o/