Bug 800435 - Unhide green Valgrind tbpl builds
Status: RESOLVED INCOMPLETE
Product: Tree Management
Classification: Other
Component: Visibility Requests
Version: Trunk
Platform: All Linux
Importance: -- normal
Target Milestone: ---
Assigned To: Nobody; OK to take it and work on it
Mentors:
Depends on: valgrind-on-tbpl 801955
Blocks:
Reported: 2012-10-11 10:26 PDT by Gary Kwong [:gkw] [:nth10sd]
Modified: 2014-12-23 17:03 PST
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Description Gary Kwong [:gkw] [:nth10sd] 2012-10-11 10:26:02 PDT
Valgrind builds on:

https://tbpl.mozilla.org/?noignore=1&jobname=valgrind

are green as of:

https://tbpl.mozilla.org/?noignore=1&jobname=valgrind&rev=2fae8bd461da

Please unhide them.
Comment 1 Gary Kwong [:gkw] [:nth10sd] 2012-10-11 10:35:16 PDT
(by unhide I mean take them off the noignore section)
Comment 2 :Ehsan Akhgari 2012-10-11 16:55:05 PDT
Done.  For future reference, you can click on Tree Info and then Adjust Hidden Builders and do it yourself.
Comment 3 Gary Kwong [:gkw] [:nth10sd] 2012-10-11 16:59:02 PDT
(In reply to Ehsan Akhgari [:ehsan] from comment #2)
> Done.  For future reference, you can click on Tree Info and then Adjust
> Hidden Builders and do it yourself.

I don't have a sheriff password.

Please also unhide "Linux x86-64 mozilla-central valgrind".
Comment 4 :Ehsan Akhgari 2012-10-11 17:00:46 PDT
(In reply to Gary Kwong [:gkw, :nth10sd] from comment #3)
> (In reply to Ehsan Akhgari [:ehsan] from comment #2)
> > Done.  For future reference, you can click on Tree Info and then Adjust
> > Hidden Builders and do it yourself.
> 
> I don't have a sheriff password.

Lies!  ;-)

> Please also unhide "Linux x86-64 mozilla-central valgrind".

Done.
Comment 5 Phil Ringnalda (:philor) 2012-10-11 23:36:28 PDT
Um. A job which is not hidden on mozilla-central is a tier 1 job which may not be broken, and when it fails the cause must be immediately backed out.

Valgrind doesn't run on try, it doesn't run on inbound, it doesn't run on fx-team, it doesn't run on services-central, it doesn't run on-push, it is not tier-1.

Rehidden.
Comment 6 Gary Kwong [:gkw] [:nth10sd] 2012-10-12 00:03:17 PDT
> Valgrind doesn't run on try, it doesn't run on inbound, it doesn't run on
> fx-team, it doesn't run on services-central, it doesn't run on-push, it is
> not tier-1.
> 
> Rehidden.

Valgrind builds can eventually be made to run on try and all the other branches, but it will take up a lot of resources to be run on-push, and we never planned for it to be run on-push.

Does this mean it can never be tier-1, and never be unhidden?
Comment 7 Phil Ringnalda (:philor) 2012-10-12 00:23:36 PDT
No, the other option would be to make it the first thing ever which is tier 1 despite running only once a day (desktop nightly builds don't really count, because they are 99% identical to jobs that run throughout the day, and the other 1% is absolutely essential even if it's miserable to have it broken by something within the last 24 hours).

Well, the first thing since the Netscape days - from what I hear, they used to close the tree to build nightly builds, manually test them, and then chase after people who committed during the previous day to figure out who had broken what, because they didn't have on-push tests to speak of. Seemed like a pretty miserable way to develop, to me.

But that would be fairly similar to what would need to happen with a 3am nightly-only tier 1 job - at 4am, edmorley would close mozilla-central, close mozilla-inbound since that would almost certainly be where the bustage had come from, and begin bisecting, all the while leaving mozilla-central and mozilla-inbound closed. Of course, it would be pointless to have him do the bisecting, since Valgrind is not his field of expertise, and it would be pointless to have mozilla-central and mozilla-inbound closed, since the reason you close a tree when it is busted is to prevent adding more undetectable bustage, but piling bustage on bustage is exactly what a once-a-day job is all about. So, you could make it the only visible thing which does not require an immediate backout or tree closure, and does not require the sheriff to figure out what busted it, at which point... why is it visible again?
Comment 8 Nicholas Nethercote [:njn] 2012-10-12 01:25:42 PDT
So you're saying that a test can/should only be visible if it's on-push?
Comment 9 Gary Kwong [:gkw] [:nth10sd] 2012-10-12 01:54:04 PDT
The way to deal with Valgrind issues could be to file a bug and add a suppression for the new bug, then (get Releng to) retrigger only Valgrind builds after the suppression is landed as a DONTBUILD.

Assuming it is the only problem found, the retriggered Valgrind build should be green again, and we'd have a bug on file to chase down the problem.

Just giving a bit of overview here.
Comment 10 Ed Morley [:emorley] 2012-10-12 01:56:39 PDT
(In reply to Nicholas Nethercote [:njn] from comment #8)
> So you're saying that a test can/should only be visible if it's on-push?

Up until now, pretty much yes (PGO is every 3/6 hours depending on tree; Nightly is unavoidable + 99% similar as philor said). They also need to run on Try + all trunk trees that merge into mozilla-central. 

I've thought for a while it would be useful to have this documented somewhere, since it comes up most times we add something new to TBPL that isn't running on all trees (last one was Marionette iirc). 

/me adds to pile of sheriffing things to update on the wiki.
Comment 11 Nicholas Nethercote [:njn] 2012-10-12 02:56:47 PDT
Gary has a good point in comment 9 -- this case is a bit different because there's a mechanical process for getting the test green again while yielding a bug on file.  IIRC we run with --gen-suppressions=yes so Valgrind even spits out the necessary suppression.
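
A rough sketch of the mechanical step discussed in comment 9, assuming the Valgrind log contains brace-delimited suppression blocks on their own unprefixed lines, as Valgrind emits them when suppression generation is on. The function names and the bug label are hypothetical and are not part of any existing Mozilla tooling; the snippet only pulls out the blocks and gives them a name so they could be pasted into a suppression file.

```python
PLACEHOLDER = "<insert_a_suppression_name_here>"


def extract_suppressions(log_text):
    """Return the unique suppression blocks found in a Valgrind log, in order.

    Assumes each block starts on a line containing only "{" and ends on a
    line containing only "}" (hypothetical helper, not Mozilla tooling).
    """
    blocks = []
    current = None
    for line in log_text.splitlines():
        stripped = line.strip()
        if current is None:
            if stripped == "{":
                current = [line.rstrip()]
        else:
            current.append(line.rstrip())
            if stripped == "}":
                block = "\n".join(current)
                if block not in blocks:  # the same error is often reported repeatedly
                    blocks.append(block)
                current = None
    return blocks


def name_suppression(block, label):
    """Replace Valgrind's placeholder name with a label such as a bug number."""
    return block.replace(PLACEHOLDER, label, 1)


if __name__ == "__main__":
    sample_log = "\n".join([
        "==123== Conditional jump or move depends on uninitialised value(s)",
        "==123==    at 0x4005F0: main (example.c:5)",
        "{",
        "   <insert_a_suppression_name_here>",
        "   Memcheck:Cond",
        "   fun:main",
        "}",
    ])
    for block in extract_suppressions(sample_log):
        # "Bug NNNNNN" is a stand-in for whatever bug gets filed.
        print(name_suppression(block, "Bug NNNNNN"))
```

The named block would then be appended to the suppression file and landed as described in comment 9. Whether the real logs match this layout would need checking against an actual Valgrind run.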
Comment 12 Ed Morley [:emorley] 2012-10-12 03:11:51 PDT
I agree that it's more of a grey area than for other cases, but it still puts extra load on our (other than myself) volunteer sheriffs, who until now have not had to actively check things into mozilla-central on a daily basis just to maintain the green status quo. 

That said, given that the Valgrind builds only run once a day & will presumably complete within my timezone, I guess perhaps it may not affect other sheriffs as much (other than weekends). I also don't have any idea how often it will turn red - if infrequently, then perhaps we are worrying over nothing (especially if it's just a case of copy-pasta-ing a snippet from the log & filing a short bug).

Philor, RyanVM, thoughts?
Comment 13 Andrew McCreight [:mccr8] 2012-10-12 04:09:59 PDT
Is there some reason to unhide it?  If any time it goes orange, you are just going to push an autogenerated patch to ignore the failure, I'm not sure what the value is in showing it.
Comment 14 Ryan VanderMeulen [:RyanVM] 2012-10-12 04:56:08 PDT
(In reply to Andrew McCreight [:mccr8] from comment #13)

What he said. If we're pushing changes to the suppression file to keep things green, then it seems that we're effectively ignoring it anyway. Seems like it would just end up causing confusion amongst people to have it showing. I agree that it's probably best left hidden by default unless we get to a point of running it at the same frequency as all of our other regression tests.
Comment 15 Ryan VanderMeulen [:RyanVM] 2012-10-12 06:06:42 PDT
That said, maybe there's a compromise here. If enough machine resources could be allocated to run Valgrind builds on the main branches (m-c, m-i, m-a, m-b, m-r, m-esr10/17, s-c, fx-team, try maybe optionally), could we have an arrangement similar to PGO builds where they run as often as possible, but not necessarily on each push? That would at least narrow down a regression window to a shorter timeframe.
Comment 16 Nicholas Nethercote [:njn] 2012-10-12 13:54:47 PDT
> Is there some reason to unhide it?

If it's hidden, will anyone notice if it stops being green?

Running the tests more often would be great, if possible.  How long do the tests take?

> I also don't have any idea how often it will turn red - if
> infrequently, then perhaps we are worrying over nothing

Indeed.  If it turns out to happen frequently, we could go to plan B.
Comment 17 Gary Kwong [:gkw] [:nth10sd] 2012-10-12 13:57:55 PDT
> Running the tests more often would be great, if possible.  How long do the
> tests take?

They take < 1 hour now, but that's because they're only running PGO tests. The more tests we run, the longer it would take.

Julian has done tests on our mochitest suite w/ Valgrind and we think it takes about 11-12 hours.

> Indeed.  If it turns out to happen frequently, we could go to plan B.

fwiw, we had another green Valgrind build this morning. I'd say how often it turns red depends on what the developers land.
Comment 18 Nicholas Nethercote [:njn] 2012-10-15 10:43:51 PDT
> fwiw, we had another green Valgrind build this morning.

And we've had them for the five days since, AFAICT.  

(Well, https://tbpl.mozilla.org/?noignore=1&jobname=valgrind&rev=942ed5747b63 is red today but that looks like "abort: data/toolkit/Makefile.in.i@1b08914858da: no match found!" is the problem -- maybe an infrastructure problem?)
Comment 19 Julian Seward [:jseward] 2012-10-15 10:58:02 PDT
(In reply to Gary Kwong [:gkw, :nth10sd] from comment #17)
> Julian has done tests on our mochitest suite w/ Valgrind and we think it
> takes about 11-12 hours.

Roughly 10 CPU hours and 2.5GB real memory, for x86_64-linux built at
gcc -O2, running on a 3.47 GHz Core i5.

Being able to run Mochitests here would be awesome (for lack of a
better word :).  It routinely picks up new bugs on the once-per-month
basis that I've been running it by hand for a while.  It will take
some effort to get it green, but it'd be well worth the effort.
Comment 20 Gary Kwong [:gkw] [:nth10sd] 2012-10-15 11:58:10 PDT
> (Well,
> https://tbpl.mozilla.org/?noignore=1&jobname=valgrind&rev=942ed5747b63 is
> red today but that looks like "abort:
> data/toolkit/Makefile.in.i@1b08914858da: no match found!" is the problem --
> maybe an infrastructure problem?)

The subsequent rebuilds are all green.
Comment 21 :Ehsan Akhgari 2012-10-15 12:13:52 PDT
(In reply to comment #20)
> > (Well,
> > https://tbpl.mozilla.org/?noignore=1&jobname=valgrind&rev=942ed5747b63 is
> > red today but that looks like "abort:
> > data/toolkit/Makefile.in.i@1b08914858da: no match found!" is the problem --
> > maybe an infrastructure problem?)
> 
> The subsequent rebuilds are all green.

That's hg repository corruption.
Comment 22 Ed Morley [:emorley] 2012-10-16 06:30:21 PDT
(In reply to Nicholas Nethercote [:njn] from comment #18)
> (Well,
> https://tbpl.mozilla.org/?noignore=1&jobname=valgrind&rev=942ed5747b63 is
> red today but that looks like "abort:
> data/toolkit/Makefile.in.i@1b08914858da: no match found!" is the problem --
> maybe an infrastructure problem?)

That run should have retried, but didn't due to mock; filed bug 802114.
Comment 23 Ed Morley [:emorley] 2013-02-06 15:57:37 PST
Like any other build/testsuite, displaying by default is dependent on running per push (see bug 801955), which I don't see happening given the resource usage and comments from releng. Please reopen the bug if it does.

In a TBPLv2 world (TBPL rewrite, happening this year), Valgrind is the kind of job that could be made more visible if we implement team-specific view modes etc. In the meantime, you'll just need to use &noignore=1.
