Closed Bug 934257 Opened 11 years ago Closed 11 years ago

Give up on running Valgrind

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: philor, Unassigned)

References

()

Details

(Whiteboard: [capacity])

We're unwilling to run it other than as a nightly.

Because of that, it will never be unhidden.

Because of that, it will never be dealt with by sheriffs and automation fixes for it will always be P∞.

The inevitable result is exactly what we have: it's gkw's job to look at the logs, it's gkw's job to file bugs, it's gkw's job to figure out what caused bustage, it's gkw's job to patch automation, it's gkw's job to fix suppression files, absolutely everything to do with it in any way is gkw's job.

We should just be honest about it, shut it off and wontfix all the bugs about it, and tell him that it's also his job to run it himself.
> The inevitable result is exactly what we have: it's gkw's job to look at the
> logs, it's gkw's job to file bugs, it's gkw's job to figure out what caused
> bustage, it's gkw's job to patch automation, it's gkw's job to fix
> suppression files, absolutely everything to do with it in any way is gkw's
> job.

I have been increasingly moving attention to other aspects of Mozilla especially these few months. I picked up TBPL Valgrind back when nobody was looking at it some time ago, so it had only been myself looking at it (because nobody else was).

> 
> We should just be honest about it, shut it off and wontfix all the bugs
> about it, and tell him that it's also his job to run it himself.

... and I don't think it's fair to put all this on me and me myself alone.

About the future, let's leave it up to our new Low-Level Tools' overlords, or JS folks themselves, to make a decision.
Flags: needinfo?(n.nethercote)
Flags: needinfo?(jorendorff)
That URL doesn't seem to show anything...or is that the point?  (Did we turn this off already?)
Flags: needinfo?(gary)
No, we haven't yet turned this off - currently we only run this nightly, so click the bottom green arrow until you see a V letter.

(Last one I see is https://tbpl.mozilla.org/?jobname=valgrind&showall=1&rev=47c8e9b16918 )
Flags: needinfo?(gary)
If we don't monitor valgrind, someone else will. I think we have little choice but to automate.

Gary, what exactly are the manual tasks involved in checking Valgrind logs every night? Can you describe it in enough detail that I could do it? (Or really: what would it take to automate all of it?)
Flags: needinfo?(jorendorff) → needinfo?(gary)
I agree that having one person who has to track it is sucky.

And it's generally true that a test that is run only once per day isn't a good fit for TBPL, because when it turns orange/red it's hard to know which change caused the failure.  However, Valgrind has a potential escape from that trouble because you can suppress errors.  So a possible procedure when it fails is this:

- sheriff notices failure
- sheriff adds suppression (which is present in the output) to the suppression file
- sheriff files a bug about the failure
- the person responsible fixes the failure at their leisure, and then removes the suppression

Obviously, this is additional work for sheriffs and hopefully something that would only happen occasionally.  E.g. if it's once per month it's not bad;  if it's every second day it is bad.  gkw, do you have rough data on the frequency?  IIRC you had this green at one point.

Another thing:  looking at the failure in the run linked to by comment 3, there are bazillions of "possibly lost" leak warnings.  I thought we were running with --show-possibly-lost=no so that they wouldn't show up.

Anyway, I am happy to take this over from gkw and try to get some traction.  I'll talk to gkw more about this offline.
Flags: needinfo?(n.nethercote)
Flags: needinfo?(gary)
(In reply to Nicholas Nethercote [:njn] from comment #5)
> So a possible procedure when it fails is this:
> 
> - sheriff notices failure
> - sheriff adds suppression (which is present in the output) to the
> suppression file

I agree frequency of needing to do this will play a big part in the hassle/viability - however this would represent a significant change to the current 'the tree can't be broken' model (https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy ; see bug 800435 for context specific to Valgrind) - and that's before we consider things like what we do at weekends when only devs not familiar with the suppression process are around etc.
(In reply to Nicholas Nethercote [:njn] from comment #5)
> sheriff sheriff sheriff

You missed some steps there, because you don't believe me about the automation being completely the responsibility of the person who says "no, don't shut these runs off," despite that being true. It's more like

- sheriff notices the run is red from a 1200 second timeout
- sheriff notices that the timeout isn't visible because the logs exceed the maximum size, and have for months now despite not turning red
- sheriff fixes bug 885232
- sheriff fixes the massive spew that overflows the logs
- sheriff fixes the timeout

> - the person responsible fixes the failure at their leisure, and then
> removes the suppression

This part interests me too. It looks from the dependencies of bug 793882 like we've fixed six bugs, INVA/WFM some more, and for the most part, the leisure at which someone is going to fix the bugs this finds never actually comes.
philor, thanks for the additional context.  I'm just going to investigate things, try to understand them better, and determine if there's a reasonable path forward that gets us Valgrind coverage without unduely burdening sheriffs.  Until I've done that, let's not give up.
Giving up on Valgrind would be terrible for the tree.  V reliably catches bugs that would otherwise show up as intermittent failures.  (ASan + MSan + LSan can too, but of those, we only have ASan working.)
Did it catch things which aren't in the dependency tree of bug 793882? Not counting the last five months, when nobody has looked at it at all, but before that.
(In reply to Jason Orendorff [:jorendorff] from comment #4)
> Gary, what exactly are the manual tasks involved in checking Valgrind logs
> every night? Can you describe it in enough detail that I could do it? (Or
> really: what would it take to automate all of it?)

Sorry for the delay in responding, I was catching up on backlog.

The last time this was ever green was likely about 6 months ago, last I filed was in May 2013, see bug 793882.

Steps:
1 - Go to https://tbpl.mozilla.org/?noignore=1&jobname=valgrind and find the first V instance, sometimes I'd have to click the "green arrow" a few times to get the first one.
2 - If it's green, check that everything is indeed okay by going to the full log, going to the bottom and seeing that the browser quit properly after running the PGO tests. See bug 885232 for a "green" V but it really is red. Bug 823787 also describes another quirk.
3 - If it's red, click full log to find out why it is red. It usually might be a recent error. Or it might be a harness breakage, e.g. bug 854253.
4 - File a bug, e.g. bug 866959, inspect the regression window, e.g. http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3ada6a2fd0c6&tochange=64d6d002e888 and cc the devs likely involved.

5 - add the suppression to any applicable file in https://hg.mozilla.org/mozilla-central/file/70de5e24d79b/build/valgrind/
5a - cross-architecture.sup is for suppressions that affect all platforms and architectures.
5b - i386-redhat-linux-gnu.sup is for suppressions that affect only x86 Linux. (now no longer being run due to bug 873904)
5c - x86_64-redhat-linux-gnu.sup is for suppressions that affect only x64 Linux.
5d - Bug 817853 turns on --track-origins=yes by default (even though there was a speed penalty), because I didn't want to turn it on, wait a day, hope the conditional error shows up, get a stack, file it, add the suppression, then remove this flag. (I did not add this flag in the past and it got too cumbersome)

6 - In the list of bugs present in the suppression files, find out which ones were fixed recently, and remove them from the suppression list.
6b - Sometimes a bad changeset is pushed and causes errors in the nightly build (which then gets fixed by a subsequent backout *after* the nightly has been created), but I inspect this particular nightly build log and report the bug. However, this bug may not exist anymore due to the subsequent backout, but I don't know if this is the case anymore, so I still file the bug. This should be mitigated by testing-on-checkin rather than testing-on-nightly.

7 - Next day, look at the results and cross your fingers for a green build.

Alternative path:
* After pushing a suppression, politely ping a buildduty releng folk to custom-start a Valgrind build to see that the suppression works as expected in a few hours, rather than waiting one more day, due to bug 793989 (which apparently prevents one from doing it oneself).

Other notes:
* Note that as mentioned above, due to bug 873904, 32-bit Linux TBPL builds were also disabled. 
* Valgrind 3.9.0 has also been released so probably an upgrade might be warranted. The last upgrade was to a special post-3.8.1 build in bug 797573 about just over a year ago.
* We may want to consider running Valgrind TBPL on more tests (bug 795124)/platforms/branches, e.g. bug 803763 (Android platforms), bug 803758 (jsreftests), bug 803757 (crashtests), bug 803739 (xpcshell tests), bug 801913 (reftests), bug 801911 (jstests and jit-tests), bug 794627/bug 799232 (mochitests), bug 797158 (B2G tests?), bug 801955 (more branches, e.g. mozilla-inbound, fx-team, try, etc.), but we may run into issues like bug 810753.
* I'm not sure if upgrading the Valgrind TBPL machines would help, but this might help mitigate some slowness.

> gkw, do you have rough data on the
> frequency?  IIRC you had this green at one point.

Depending on the state of the tree, it probably breaks 1-2 times a week, when we're lucky, maybe once every fortnight, but I don't really recall a month when nothing broke.

> Another thing:  looking at the failure in the run linked to by comment 3,
> there are bazillions of "possibly lost" leak warnings.  I thought we were
> running with --show-possibly-lost=no so that they wouldn't show up.

Bug 934542 tracks this.
Flags: needinfo?(n.nethercote)
(In reply to Phil Ringnalda (:philor) from comment #10)
> Did it catch things which aren't in the dependency tree of bug 793882? Not
> counting the last five months, when nobody has looked at it at all, but
> before that.

Yes, see:

https://bugzilla.mozilla.org/buglist.cgi?keywords=valgrind%2C%20testcase%2C%20&f1=blocked&keywords_type=allwords&emailreporter1=1&list_id=8478818&o1=notsubstring&emailtype1=exact&query_format=advanced&email1=gary%40rumblingedge.com&v1=793882&component=JavaScript%20Engine&component=JavaScript%20Engine%3A%20JIT&product=Core

which lists 38 bugs currently, including some found recently.
Also, the Valgrind TBPL script is at:

http://hg.mozilla.org/build/tools/file/535d6fd19a1f/scripts/valgrind/valgrind.sh

which probably should be converted to mozharness (bug 909915).

I'm not sure what time the Valgrind TBPL builds are run. I'm guessing the usual nightly build schedule of 3am+? Pacific Time, and the Valgrind PGO tests used to take ~30-odd minutes IIRC (before the crazy "possibly-lost" errors started showing up irregardless of the presence of --show-possibly-lost=no).
Thanks, gkw!  Very helpful.

Getting the options working properly (bug 934542) is definitely the first order of business.
Flags: needinfo?(n.nethercote)
I've taken ownership of bug 631811, and so will close this bug.

Briefly, my goal is to get the existing Valgrind PGO runs happening on every push, on Linux64 only.  (I can see that the "run once a day and push suppressions on breakage" idea won't fly.)  They currently take ~1 hour, which is reasonable -- similar to the static analysis runs -- and Linux is our most scalable build/test platform.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.