Closed Bug 952605 Opened 11 years ago Closed 10 years ago

Add Valgrind output support to TBPL

Categories

(Tree Management Graveyard :: TBPL, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

Attachments

(1 file, 1 obsolete file)

In the Valgrind-on-TBPL job, Currently we pass --error-exitcode=1 to Valgrind, 
which means that any Valgrind error causes a non-zero exit code.  This turns
the job red, and we print "TEST-UNEXPECTED-FAILURE | valgrind-test | non-zero
exit code", which shows up in the TBPL summary.

Unfortunately, this conflates errors found by Valgrind with other failures
(e.g. Valgrind crashes).  Also, if any child processes (from forking) have
errors, I think they won't turn the job non-green as they should, because their
non-zero exit code won't be seen by the invoking process.

So here's a proposal to fix these problems.

- Remove --error-exitcode=1 from Valgrind's args.

- Add /ERROR SUMMARY: [1-9]\d+ errors from \d+ contexts/ to
  https://hg.mozilla.org/webtools/tbpl/file/tip/php/inc/GeneralErrorFilter.php?
  This will identify any Valgrind errors, even in child processes.  (Will this
  change trigger red-ness or orange-ness?  Not that it really matters, because
  both red and orange runs need to be looked at, but it would be nice it it
  could be gotten right.)

- If |mach valgrind-test| returns non-zero, print "TEST-UNEXPECTED-FAIL |
  valgrind-test | non-zero exit code".  This will catch Valgrind crashes and
  other similar problems.  (And I should do this in the mach command, rather
  than in valgrind.sh as is currently done.)

Some cases to test.

- No Valgrind errors:  should be green.

- Valgrind error in the parent:  should be orange, with the ERROR SUMMARY line
  highlighted.

- Valgrind error in a forked child:  ditto.

- Valgrind crash:  should be red, and print TEST-UNEXPECTED-FAILURE.

Assuming this works, the Valgrind job still would meet the letter but not the
spirit of the requirements at
https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#6.29_Outputs_failures_in_a_TBPL-starrable_
format.
For example, if an error occurs, the "ERROR SUMMARY" line won't identify it,
and the log will have to be opened.  But the ASan jobs have the same property,    
AFAICT, so maybe this isn't a problem.  (I don't see how to avoid this problem    
without writing a parser for Valgrind's output, which I don't particularly want   
to do.)                                                                           
                                                                                  
Sheriffs, does this sound like it would work and be acceptable?
Flags: needinfo?(ryanvm)
Flags: needinfo?(emorley)
From what I understand from comment 0, this seems fine :-)
Thank you for working on this!
Flags: needinfo?(emorley)
Flags: needinfo?(ryanvm)
edmorley, how are TBPL changes deployed? Is landing the patch enough, or is
there a separate deployment step?
Attachment #8355929 - Flags: review?(emorley)
Landing in the TBPL repo gets you on to the staging instance: https://tbpl-dev.allizom.org/

Once you've confirmed that everything works as expected, you need to file a bug to get it into production. See bug 949255 as the most recent example.
Can I push to try on the staging instance?  That would let me properly check that it's working.
The staging instance uses the live trees. Your Try push will work the same on tbpl-dev as the regular tbpl :)
Depends on: 952397
Whoops, these strings don't appear at the start of a line, so the regexp
shouldn't start with |^|.
Attachment #8355936 - Flags: review?(emorley)
Attachment #8355929 - Attachment is obsolete: true
Attachment #8355929 - Flags: review?(emorley)
> - If |mach valgrind-test| returns non-zero, print "TEST-UNEXPECTED-FAIL |
>   valgrind-test | non-zero exit code".  This will catch Valgrind crashes and
>   other similar problems.  (And I should do this in the mach command, rather
>   than in valgrind.sh as is currently done.)

I did this in bug 952397.
Comment on attachment 8355936 [details] [diff] [review]
Highlight Valgrind's errors summary line when a non-zero number of errors are present.

Review of attachment 8355936 [details] [diff] [review]:
-----------------------------------------------------------------

::: php/inc/GeneralErrorFilter.php
@@ +45,5 @@
>         || preg_match("/ error\([0-9]*\):/", $line)  // . . . . . . . . . . . . C
>         || preg_match("/ error R?C[0-9]*:/", $line) //  . . . . . . . . . . . . MSVC
>         || preg_match("/^[A-Za-z]+Error:/", $line)  // . . . . . . . . . .  . . Python exceptions
>         || preg_match("/^BaseException:/", $line) // . . . . . . . . . .  . . . Python exception
> +       || preg_match("/ERROR SUMMARY: [1-9]\d+ errors from \d+ contexts/", $line) // Valgrind errors

This will only match double-digit error counts, did you mean this?
[1-9]\d*
Attachment #8355936 - Flags: review?(emorley) → review+
> This will only match double-digit error counts, did you mean this?

I did not!  Thanks for the catch.

http://hg.mozilla.org/webtools/tbpl/rev/3c03d7a0962b
I did some test pushes, and it didn't quite work as expected.  For example:

  https://tbpl-dev.allizom.org/?tree=Try&rev=c5cb9f15974b&showall=1

This push had two patches:
- One that removed the use of --error-exitcode=1
- One that introduced two deliberate errors for Valgrind to complain about

The following line was correctly highlighted in bold in the output, and appeared in the summary:

  ==46261== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 71 from 65)

So the regexp is working, which is god.

However, the run was green, which is bad.  Does the exit code need to be non-zero for it turn orange or red?  I'll try adding --error-exitcode=1 back in...
(In reply to Nicholas Nethercote [:njn] from comment #10)
> Does the exit code need to be non-zero for it turn orange or red?

Yes
What causes red and what causes orange?  IME, non-zero exit always causes red, and I don't know what causes orange.
(In reply to Nicholas Nethercote [:njn] from comment #12)
> What causes red and what causes orange?  IME, non-zero exit always causes
> red, and I don't know what causes orange.

Valgrind builds have never been orange IIRC.

Ideally, I'd suggested the following colour scheme:

Red: Broken build, compile.
Orange: Valgrind test failure.
Green: Everything works properly.
IIUC, it depends on the buildbot return value:
https://mxr.mozilla.org/webtools-central/source/tbpl/dataimport/import-buildbot-data.py#80

But Ed can probably clarify better tomorrow.
> IIUC, it depends on the buildbot return value:
> https://mxr.mozilla.org/webtools-central/source/tbpl/dataimport/import-buildbot-data.py#80

Aha! That makes sense. valgrind.sh currently returns 2 if Valgrind finds any errors, which explains why it shows up red.
> Red: Broken build, compile.
> Orange: Valgrind test failure.
> Green: Everything works properly.

Yes, I think I can do that, though if Valgrind crashes that can't be easily distinguished from Valgrind reporting errors and so will also show up as orange.  But that's a rare case.
> But that's a rare case.

I agree, Valgrind crashing while running Firefox or the tests is extremely rare. In that case, the bug can always be forwarded to a Valgrind dev / upstream.
I just landed a patch (which gkw r+'d via IRC) to make valgrind.sh pass through |mach valgrind-test|'s return value.  This will allow Valgrind errors to turn the TBPL job orange instead of red.

https://hg.mozilla.org/build/tools/rev/42ae433da5fc
So, to conclude:

- The "ERROR SUMMARY" highlighting is working.

- We should keep the --error-exitcode=1.

I filed bug 956951 to get the TBPL change into production.  That's all that's needed here.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Hmm, in this run the exit code from Valgrind was 1:

  https://tbpl-dev.allizom.org/php/getParsedLog.php?id=32608368&tree=Try&full=1#error0

but it still showed up as red.  And I checked the logs; every step had an exit code of zero except for the Valgrind step.

Not that it really matters -- both red and orange need investigating -- but still, surprising.
Component: General Automation → Tinderboxpushlog
Product: Release Engineering → Webtools
QA Contact: catlee
Summary: Valgrind-on-TBPL: fix up output and failure detection → Add Valgrind output support to TBPL
Depends on: 956951
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #14)
> IIUC, it depends on the buildbot return value:
> https://mxr.mozilla.org/webtools-central/source/tbpl/dataimport/import-
> buildbot-data.py#80
> 
> But Ed can probably clarify better tomorrow.

The integer referred to by import-buildbot-data.py isn't the exit code, it's just the psuedo-enum representation from:
http://hg.mozilla.org/build/buildbot/file/4b5a85b140b3/master/buildbot/status/builder.py#l25

The job state is set by buildbot's evaluateCommand() which uses log_eval_func, if set. I'm not sure what the default is for our buildbot configs, but if a custom mapping is required (you can use a combination of exit code and buildbot log parsing), you'll need to set log_eval_func on the valgrind addStep, roughly here (double check with someone from releng):
https://hg.mozilla.org/build/buildbotcustom/file/a7ba78a35d87/process/factory.py#l1493

An example custom log_eval_func instance that uses rc_eval_func() (return code):
https://mxr.mozilla.org/build-central/source/buildbotcustom/misc.py#2676
Product: Webtools → Tree Management
Version: other → unspecified
Product: Tree Management → Tree Management Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: