Closed Bug 801955 Opened 10 years ago Closed 9 years ago

Run Valgrind builds on all mozilla-central-based trees (inbound, try, fx-team, etc)

Categories

(Release Engineering :: General, defect, P3)

All
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gkw, Assigned: catlee)

References

Details

(Whiteboard: [valgrind])

Attachments

(2 files, 1 obsolete file)

We should adapt Valgrind tbpl builds to run on other branches. Some that I can think of, off the top of my head, include:

* mozilla-inbound
* fx-team
* services-central

and possibly:

* mozilla-aurora
* mozilla-beta
* mozilla-release
(any future ESR branch)
I think we should stick to trunk for now - given that unless we backport fixes, the Valgrind runs on them are going to be red until everything works its way through the trains.

At the minimum we should run them on:
* mozilla-central
* mozilla-inbound
* fx-team
* try

...but I suppose we could enable on all trunk repos if the load wasn't an issue.

Also, to reduce load, we could enable them on Try, but not include in the default 'all' TryChooser group, meaning people have to request them specifically (if this is easily doable).

Adjusting summary to mention Try (also, TBPL is just the dashboard, it's the buildbot scheduling that needs to be adjusted).
Blocks: 800435
Summary: Adapt Valgrind tbpl builds to run on other branches, e.g. mozilla-inbound, fx-team etc. → Run Valgrind builds on more trunk trees + Try
You mean you Need a way to exclude jobs from Trychooser's "all"? Bug 691177.
Ah yes that was the bug (I'm even CCed) :-)
Enabling on mozilla-inbound seems like the most important of those cases in comment 1.  As for enabling it on try... I'm trying to imagine the use case whereby someone would decide that their patch needs to enable the Valgrind testing.  It wouldn't hurt to have, but I suspect it wouldn't get used much.
Also, bug 631838 is the bug to get Valgrind builds on Try.
(In reply to Nicholas Nethercote [:njn] from comment #4)
> As for enabling it on try... I'm trying to imagine the use case
> whereby someone would decide that their patch needs to enable the Valgrind
> testing.  It wouldn't hurt to have, but I suspect it wouldn't get used much.

The use-case is "something in the last 24 hours turned Valgrind red, but we don't know what and have no way to run on Try".

Running on Try is normally a prerequisite for unhiding things on TBPL (though as mentioned in bug 800435, the suppression file does make this slightly different; though without Try and thus a simple way to bisect, I imagine new suppressed failures will just end of being filed and ignored).
> I imagine new suppressed failures will just end of being filed and ignored).

We track them in bug 793882.
(In reply to Gary Kwong [:gkw, :nth10sd] from comment #7)
> > I imagine new suppressed failures will just end of being filed and ignored).
> 
> We track them in bug 793882.

Yeah I realise - and you've done an amazing job of getting everything filed and suppressed so we have Valgrind green - but I'm just wondering how sustainable it is for only a couple of people to have a working Valgrind environment to determine regression-ranges in the future? It just seems like running on Try would alleviate some of the load on you...
> Yeah I realise - and you've done an amazing job of getting everything filed
> and suppressed so we have Valgrind green

Thank you! I wouldn't have been able to do it without help from you and some of the other folks around here though. It's a team effort.

> but I'm just wondering how
> sustainable it is for only a couple of people to have a working Valgrind
> environment to determine regression-ranges in the future? It just seems like
> running on Try would alleviate some of the load on you...

Yes, I agree. Now if only someone could make some magic and help us put Valgrind on Try.. :)
> The use-case is "something in the last 24 hours turned Valgrind red, but we
> don't know what and have no way to run on Try".
> 
> Running on Try is normally a prerequisite for unhiding things on TBPL

Gotcha.  Thanks!
Clarifying summary.

This bug is the major blocker that blocks unhiding a green tree on TBPL.
See Also: → 631838
Summary: Run Valgrind builds on more trunk trees + Try → Run Valgrind builds on mozilla-inbound, fx-team and Try
How about we run valgrind per-push on these branches?
(In reply to Chris AtLee [:catlee] from comment #12)
> How about we run valgrind per-push on these branches?

If Releng is fine with the load, I'm fine with it. :) The concern in the past was that it would take up too much resources, but now that they're on AWS we might be able to scale up easily.

Adjusting summary.
Summary: Run Valgrind builds on mozilla-inbound, fx-team and Try → Run Valgrind builds on mozilla-inbound, fx-team and Try, per push
Priority: -- → P3
Whiteboard: [valgrind]
Getting Valgrind green on tbpl is a Q4 goal for us, and this bug is a major blocker to that.

Chris, may I know what else needs to be done to help move this bug forward?
Flags: needinfo?(catlee)
(In reply to Gary Kwong [:gkw, :nth10sd] from comment #14)
> Getting Valgrind green on tbpl is a Q4 goal for us, and this bug is a major
> blocker to that.
> 
> Chris, may I know what else needs to be done to help move this bug forward?

There's certainly a cost associated with running these per-checkin. Running both linux-32 and linux-64 results in about 2 hours of compute time per push.

Is it worth it? Are we catching good bugs with valgrind?

Can we choose a different schedule (e.g. every 4 hours) that keeps it visible on the tree, but doesn't use as many resources?
Flags: needinfo?(catlee)
> Is it worth it? Are we catching good bugs with valgrind?

Yes, we've found bug 795395, bug 795631 and bug 803386 - all initially marked security-sensitive (some still are!)


> There's certainly a cost associated with running these per-checkin. Running
> both linux-32 and linux-64 results in about 2 hours of compute time per push.

This will benefit the sheriffs as they can be clear as to which checkin caused the error.

> Can we choose a different schedule (e.g. every 4 hours) that keeps it
> visible on the tree, but doesn't use as many resources?

I'll let the sheriffs chime in on this. Ed, maybe you have some ideas?
Flags: needinfo?(bmo)
That depends on whether we've done a complete 180 degree turn from the plan as laid out in bug 800435, which was that whenever it failed, we would just file a bug and push a suppression.

If we have, and it's now to be something which may not ever be made to fail and a push which causes it to fail must be backed out, then we want it running on push on every push on every tree.

If not, then we're perfectly happy with nightly-only on mozilla-central (well, unless we're also expected to determine who is at fault, along with filing the bug and pushing the suppression), and the only question is whether or not it has to be made visible in order to shift the burden of having to file the bug and push the suppression onto edmorley except when he's gone when we'd push it onto me or RyanVM.
> That depends on whether we've done a complete 180 degree turn from the plan
> as laid out in bug 800435, which was that whenever it failed, we would just
> file a bug and push a suppression.

I think file-a-bug-and-push-a-suppression is probably the better way to go, for the following reasons.

- Valgrind is resource-hungry even on the short test runs that we currently have enabled.  We have bugs open to run it on much larger test suites, for which running on every push seems like a bad idea.  

- I suspect we'll get failures only moderately often.

- A nice thing about Valgrind is that when it does complain it's usually not that hard to work out the cause -- i.e. not having the exact changeset that caused the problem probably isn't dire.

Please speak up if you disagree with these reasons!
It's nice to almost have a consensus. I agree with philor and njn.

If edmorley is fine with the plan to have Valgrind running on more branches once a day but not per push, and to file a bug and add a suppression each time one shows up, I'd say let's go with it. (Ed, please change the summary to remove "per push" as well)
Going by comments 17-19, I think we should:

1) Just run once per day, but possibly on more trees (though how likely are say changes to browser/ in fx-team likely to cause Valgrind regressions?).

2) Unhide to start with, until we get an idea as to how often it's going to break.

3) Once we have the answer to #2, make a final call as to whether it stays unhidden, given the extra burden on (majority volunteer) sheriffs & the fact that people like the jetpack team manage to maintain their hidden builds pretty well. (And also the possible slippery slope of off-loading team-specific tree tasks onto sheriffs).
Flags: needinfo?(bmo)
You forgot the most important slippery slope: right now, there are exactly zero visible builds which you can ignore and just push on top of unstarred unfiled undealt-with red or orange.

Because we are not doing anything about these failures other than hiding them via the suppression file, there is absolutely no point in telling someone who has a patch to push which is blocking a beta but needs to land on the trunk first that they have to learn the (knowing us, undocumented, or documented where nobody knows how to find it) method for hiding this orange, so this will be the first visible build for which it's permissible to push on unstarred failure.

And, why make it visible? What is there to be gained by that, in exchange for even more devaluation of the already deeply devalued tree rules?

(And, followup, why are we discussing this in an ever-increasing number of different bugs, rather than a newsgroup?)
I happened to chance on another failure - suppressed at https://hg.mozilla.org/mozilla-central/rev/a50a77da3cf9

Let's aim to run on multiple branches and on Try first, if per-push is not recommended, we can not do that now, and when circumstances change in the future, we can revisit that decision another day.
Summary: Run Valgrind builds on mozilla-inbound, fx-team and Try, per push → Run Valgrind builds on mozilla-inbound, fx-team and Try
Product: mozilla.org → Release Engineering
Duplicate of this bug: 631838
Assignee: nobody → catlee
catlee volunteered to do this.

As for exactly which trees need it,  https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#3.29_Runs_on_mozilla-central_and_all_trees_that_merge_into_it says:

> 3) Runs on mozilla-central and all trees that merge into it
> 
>     Otherwise job failures when tree X merges into mozilla-central will not
>     be attributable to a single changeset, resulting in either tree closure
>     or backout of the entire merge (see requirement #2).
>
>     When filing the release engineering bug to enable your job on all the
>     required trees, ask to enable it on "mozilla-central based trees" and
>     release engineering will enable it in the default config from which all
>     trunk trees inherit (unless the various tree owners have explicitly opted
>     out). As a rough guide, mozilla-central based trees include
>     mozilla-inbound, fx-team, services-central, ionmonkey, graphics as well
>     as many of the other project/disposable repositories.
Summary: Run Valgrind builds on mozilla-inbound, fx-team and Try → Run Valgrind builds on all mozilla-central-based trees (inbound, try, fx-team, etc)
this changes the default on project branches to True. All the release branches are set as False already, so this patch ends up enabling valgrind builds on all non-release branches.
Attachment #8343717 - Flags: review?(jhopkins)
Attachment #8343717 - Flags: review?(jhopkins) → review+
In production
It's working on m-c, but not on m-i, for example:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&showall=1&jobname=valgrind

There are multiple of these in the log:

  abort: unknown revision '6ecda297fb371a680efcd49e4a90e8164c561498'!

philor suggested it's because m-c is being cloned?
Yes, m-c is being cloned because of http://hg.mozilla.org/build/tools/file/d21da1c08036/scripts/valgrind/valgrind.sh#l7

I'm working on a patch
Attachment #8344843 - Flags: review?(nthomas)
Comment on attachment 8344843 [details] [diff] [review]
Use build properties to determine which repo to clone

r+ on the later version from IRC (https://diff.pastebin.mozilla.org/3744166) which 

>diff --git a/scripts/valgrind/valgrind.sh b/scripts/valgrind/valgrind.sh
>+    wget -O branches.json $BRANCHES_JSON

swaps this to the copy of tools we're running out of:
 BRANCHES_JSON=$SCRIPTS_DIR/buildfarm/maintenance/production-branches.json
 HG_REPO=$($JSONTOOL -k ${branch}.repo branches.json)
and no wget.

>+    echo python $SCRIPTS_DIR/clobberer/clobberer.py -s scripts -s $(basename $PROPERTIES_FILE) \
>+        $CLOBBERER_URL $branch "$builder" $builddir $slavename $master)

And removed this echo.
Attachment #8344843 - Flags: review?(nthomas) → review+
Use the production-branches.json we already have! Also, remove extra echos.
Attachment #8344843 - Attachment is obsolete: true
Attachment #8344865 - Flags: review?(nthomas)
Attachment #8344865 - Flags: review?(nthomas) → review+
Attachment #8344865 - Flags: checked-in+
Appears to be working -- Valgrind is running successfully (well, failing, but not due to buildbot config issues) on mozilla-inbound and fx-team.

If I do a |-p all| push to try, will Valgrind run there?
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.