Closed Bug 823787 Opened 13 years ago Closed 12 years ago

Valgrind job given 'success' result by buildbot when it should be 'failure'

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 631842

People

(Reporter: gkw, Assigned: n.nethercote)

References

Details

Attachments

(3 files, 1 obsolete file)

Attached file ending portion of log
In https://tbpl.mozilla.org/php/getParsedLog.php?id=18125018&tree=Firefox&full=1 : there is a Valgrind error present (filed as bug 823782) However, in its tbpl display, it is shown as green, which is wrong: https://tbpl.mozilla.org/?noignore=1&jobname=valgrind&rev=21195f52311c If there is an error present, it should be shown as red. Is this related to the first-run instance of Firefox?
TBPL only displays the result that buildbot gives it. At the top of the log there is: { builder: mozilla-central-linux64-valgrind slave: bld-linux64-ec2-334 starttime: 1356002675.67 results: success (0) buildid: 20121220030912 builduid: ffff565924214e9ba3187dd8a3c6f4e4 revision: 21195f52311c } The results row is what TBPL uses, and comes from buildbot.
Component: Tinderboxpushlog → Release Engineering: Automation (General)
Product: Webtools → mozilla.org
QA Contact: catlee
Version: Trunk → other
Summary: Valgrind on tbpl shows green when it should be red. → Valgrind job given 'success' result by buildbot when it should be 'failure'
The main Firefox process (PID 7098) exits without leaks, so the Python caller sees an exit code of zero. What are the other processes (PIDs 7099 and 7120)? They both appear to be forked from Firefox. One of them should be Firefox's glxinfo process (see bug 639842), but what's the other one? If we decide we don't care about these other processes, we can just add --child-silent-after-fork=yes to the Valgrind command.
Product: mozilla.org → Release Engineering
Valgrind runs use the |pgo-profile-run| make target. It's returning 0 on failure. That make target executes build/pgo/profileserver.py, which doesn't appear to have any way to return a non-zero exit code. It should. It uses a FirefoxRunner object to do most of the work... ted, is there a way to get the exit code for a FirefoxRunner-managed command invocation?
Flags: needinfo?(ted)
runner.wait() returns the returncode: https://github.com/mozilla/mozbase/blob/master/mozrunner/mozrunner/runner.py#L69 We just need to do something like: ret = runner.wait() ... sys.exit(ret) Looks like I lost that in my bug 746244 refactoring. Oops!
Flags: needinfo?(ted)
I normalize because |status| was 256 on a failing run locally, and sys.exit(256) causes the process to exit with a 0, because 255 is the max exit code. (Perhaps FirefoxRunner shouldn't return a number greater than 255? But that's beyond the scope of this bug.) Also, should the |finally| case return a non-zero exit code?
Attachment #8343964 - Flags: review?(ted)
Assignee: nobody → n.nethercote
Status: NEW → ASSIGNED
Minor tweak: an exit code of 255 worked for me in bash, but the Python docs say "Most systems require it to be in the range 0-127 and produce undefined results otherwise", so I changed the comment accordingly.
Attachment #8343967 - Flags: review?(ted)
Attachment #8343964 - Attachment is obsolete: true
Attachment #8343964 - Flags: review?(ted)
ted: review ping. This is a tiny patch and I'd *really* like for Valgrind-on-TBPL colours to be trustworthy now that bug 948145 appears to be fixed.
Flags: needinfo?(ted)
Comment on attachment 8343967 [details] [diff] [review] Make profilerserver.py use a non-zero exit code on failure. Review of attachment 8343967 [details] [diff] [review]: ----------------------------------------------------------------- Sorry, must have missed the original review request email. ::: build/pgo/profileserver.py @@ +69,2 @@ > httpd.stop() > + if (status != 0): nit: no parens necessary @@ +69,4 @@ > httpd.stop() > + if (status != 0): > + status = 1 # normalize status, in case it's larger than 127 > + sys.exit(status) I'm hoping that Python is sane and the finally block gets invoked even though we sys.exit. (I suspect so.)
Attachment #8343967 - Flags: review?(ted) → review+
Flags: needinfo?(ted)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #10) > @@ +69,4 @@ > > httpd.stop() > > + if (status != 0): > > + status = 1 # normalize status, in case it's larger than 127 > > + sys.exit(status) > > I'm hoping that Python is sane and the finally block gets invoked even > though we sys.exit. (I suspect so.) Got curious and checked: http://docs.python.org/2/library/sys.html#sys.exit "This is implemented by raising the SystemExit exception, so cleanup actions specified by finally clauses of try statements are honored, and it is possible to intercept the exit attempt at an outer level." So yeah :-)
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Backed out in http://hg.mozilla.org/mozilla-central/rev/8b5875dc7e31 for turning Windows PGO builds red. https://tbpl.mozilla.org/php/getParsedLog.php?id=31909776&tree=Mozilla-Central https://tbpl.mozilla.org/php/getParsedLog.php?id=31910914&tree=Mozilla-Central This was near the end of the logs in the failed jobs: Could not pull for prefType 64: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIToolkitProfileService.selectedProfile]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: resource://gre/modules/WindowsPrefSync.jsm :: <TOP_LEVEL> :: line 79" data: no] Could not pull for prefType 128: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIToolkitProfileService.selectedProfile]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: resource://gre/modules/WindowsPrefSync.jsm :: <TOP_LEVEL> :: line 79" data: no] Could not pull for prefType 32: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIToolkitProfileService.selectedProfile]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: resource://gre/modules/WindowsPrefSync.jsm :: <TOP_LEVEL> :: line 79" data: no] c:\builds\moz2_slave\m-cen-w32-pgo-0000000000000000\build\testing\testsuite-targets.mk:404:0: command 'c:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/obj-firefox/_virtualenv/Scripts/python.exe c:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/build/pgo/profileserver.py 10' failed, return code 1 0 c:\builds\moz2_slave\m-cen-w32-pgo-0000000000000000\build\client.mk:247:0: command 'MOZ_PGO_INSTRUMENTED=1 JARLOG_FILE=jarlog/en-US.log EXTRA_TEST_ARGS=10 C:/mozilla-build/python27/python.exe c:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/build/pymake/pymake/../make.py -C obj-firefox pgo-profile-run' failed, return code 2 c:\builds\moz2_slave\m-cen-w32-pgo-0000000000000000\build\client.mk:185:0: command 'C:/mozilla-build/python27/python.exe c:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/build/pymake/pymake/../make.py -f c:/builds/moz2_slave/m-cen-w32-pgo-0000000000000000/build/client.mk profiledbuild' failed, return code 2 program finished with exit code 2 elapsedTime=1029.565000
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Note to self: if you want PGO to work on your try push, dd "mk_add_options MOZ_PGO=1" to build/mozconfig.common.override.
Do those show up in a normal PGO build log?
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #15) > Do those show up in a normal PGO build log? Yes. https://tbpl.mozilla.org/php/getParsedLog.php?id=31926519&tree=Mozilla-Central&full=1 is an example. So these jobs have probably been returning a non-zero exit code for ages, but we didn't notice. KWierso, are you certain those lines are responsible, or was it an educated guess?
Flags: needinfo?(kwierso)
(In reply to Nicholas Nethercote [:njn] from comment #16) > KWierso, are you certain those lines are responsible, or was it an educated > guess? Unsure, I've never really looked at the PGO logs.
Flags: needinfo?(kwierso)
FWIW, when the attached patch is applied and a failing Valgrind run occurs, the highlighted output on TBPL is just this: make: *** [pgo-profile-run] Error 1 which isn't informative, but hopefully good enough.
In bug 631842 I'm going to clone profileserver.py as valgrind_test.py. Then I can add the checking of the exit code to valgrind_test.py but leave profileserver.py alone, which avoids the problems with Windows PGO builds. (It will also allow the code run by the Valgrind job to change independently of the PGO job, which is a good thing.) Once that's done, this bug will be moot. Moot!
> In bug 631842 I'm going to clone profileserver.py as valgrind_test.py. So I'll mark this is a dup.
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → DUPLICATE
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: