684328 - buildbot should support a 'remove failing jobs from the system as quickly as possible' option

Reporter

Description

•

13 years ago

make[3]: Leaving directory `/builds/slave/try-lnx64/build'
TEST-UNEXPECTED-FAIL | check-sync-dirs.py | build file copies are not in sync
TEST-INFO | check-sync-dirs.py | file(s) found in:               /builds/slave/try-lnx64/build/js/src/config
TEST-INFO | check-sync-dirs.py | differ from their originals in: /builds/slave/try-lnx64/build/config
TEST-INFO | check-sync-dirs.py | differing file:                 ./makefiles/autotargets.mk
In general, the files in '/builds/slave/try-lnx64/build/js/src/config'
should always be exact copies of originals in '/builds/slave/try-
lnx64/build/config'.  A change made to one should also be made to the
other.  See 'check-sync-dirs.py' for more details.
make[2]: *** [realbuild] Error 1

Joey Armstrong [:joey]

Reporter

Comment 1

•

13 years ago

The build system should support a 'delete jobs at the earliest sign of failure' option.

Most of the time it is useful to have the system return as much status (make -k) as possible.
Another helpful mode is not wasting resources when jobs are doomed to failure.

Checkin tests are a good example.  js/src/config contains a unit test, 'check-sync-dirs.py', that verifies that makefiles, build/autoconf/* and similar files are all at the same version.

When config/rules.mk is modified to test a few edits but you forget to copy the edits under js/src, the checkin test will force a failure.  At this point testing is incomplete because only part of the tree will be building with the new makefiles.  The failure condition is known very early, yet the current behavior is to continue on and run the job unless there is manual intervention.
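To illustrate the kind of early check described above, here is a minimal sketch of a copy-versus-original comparison. It is not the actual check-sync-dirs.py; the function name `find_out_of_sync` and its behavior are assumptions made for the example.

```python
import filecmp
import os


def find_out_of_sync(original_dir, copy_dir):
    """Return sorted relative paths of files under copy_dir that differ
    from (or are missing in) original_dir.

    Hypothetical helper for illustration; not the real check-sync-dirs.py.
    """
    differing = []
    for root, _dirs, files in os.walk(copy_dir):
        for name in files:
            copy_path = os.path.join(root, name)
            rel = os.path.relpath(copy_path, copy_dir)
            orig_path = os.path.join(original_dir, rel)
            # shallow=False compares file contents, not just os.stat() data.
            if not os.path.isfile(orig_path) or not filecmp.cmp(
                orig_path, copy_path, shallow=False
            ):
                differing.append(rel)
    return sorted(differing)
```

A non-empty result is known within seconds of checkout, which is the point at which a doomed job could be dropped.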

Being able to completely flush failing jobs like these from the try server would free up resources for other jobs to make use of.

Default behavior should remain the same -- run to completion.  An override option like --exit-on-failure could conditionally enable the flush-on-failure behavior.
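A rough sketch of how such an opt-in flag might map onto the make invocation, assuming that "run to completion" corresponds to passing `-k` (keep going) to make. The option name comes from the proposal above; `parse_try_args` and `make_command` are hypothetical names, not existing try-server code.

```python
import argparse


def parse_try_args(argv):
    """Parse the hypothetical --exit-on-failure option (off by default)."""
    parser = argparse.ArgumentParser(prog="try-options")
    parser.add_argument(
        "--exit-on-failure",
        action="store_true",
        help="stop at the first error instead of gathering all status",
    )
    return parser.parse_args(argv)


def make_command(target, exit_on_failure):
    """Build the make command line for one target.

    Default: 'make -k <target>' so make keeps going after errors.
    With exit_on_failure: plain 'make <target>', which stops at the first error.
    """
    cmd = ["make"]
    if not exit_on_failure:
        cmd.append("-k")
    cmd.append(target)
    return cmd
```

For example, `make_command("check", parse_try_args(["--exit-on-failure"]).exit_on_failure)` yields `["make", "check"]`, while an empty argument list keeps `-k`.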

Chris AtLee [:catlee]

Comment 2

•

13 years ago

Not sure exactly what you're asking for here. Which jobs should be removed from the system?

Chris Cooper [:coop] (he/him)

Updated

•

13 years ago

Priority: -- → P5

Whiteboard: [buildapi][selfserve]

Joey Armstrong [:joey]

Reporter

Comment 3

•

12 years ago

Bug 733172 would be one example:

If a failure occurs during setup (in this case hg was not able to check out/apply source changes for a submission), the resulting job will likely be a no-op that needlessly holds onto resources that could be better used processing another queued job:

abort: unknown revision 'd1fa189b44cea5f8110c2f4789bffaee66ab132b'!


Also there is an entire class of unit/check tests that can (and, IMHO, should) be considered fatal, i.e. removed from the queue ASAP on error.  One example: checkin tests for makefile logic.  If these tests begin failing, something is seriously wrong or will be broken by the current checkin.

Binaries built in this state should be considered unreliable because some fundamental assumptions about the environment or building have been violated.

But again, aside from preventing garbage from making its way in, the goal is also to free up resources quickly for the next job in line.  This could be conditional: fail early/often unless make -k is required (or the reverse: gather all status possible by default and add an extra try arg to allow early exit on the first sign of error).

Phil Ringnalda (:philor)

Comment 4

•

12 years ago

If you look at the log for one of the hg failures, like https://tbpl.mozilla.org/php/getParsedLog.php?id=9831245&tree=Try, you'll see that it did bail as quickly as possible (because the hg buildstep is haltOnFailure=True).

I *think* what you really want is to split make check into two pieces, a haltOnFailure=True `make sanity-check` which runs before the build is packaged and uploaded, which can stop packaging, and then a separate `make check` with the tests that are just tests which runs after packaging and uploading.

That's completely doable now from a releng perspective, but releng just runs commands which exist, so whether it's actually `make sanity-check` which knows what directories to run `make check` in or a single directory, a command which runs those tests would need to exist before it becomes a releng bug.
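The ordering described in this comment can be modelled with a toy step runner. This is only a sketch of halt-on-failure semantics in plain Python, not Buildbot's API; `Step`, `run_build` and the step names are made up for the example, and `make sanity-check` does not exist yet.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    name: str
    run: Callable[[], bool]        # returns True on success
    halt_on_failure: bool = False  # mirrors the idea of haltOnFailure=True


def run_build(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run steps in order; after a halting failure, mark the rest as skipped."""
    results = []
    halted = False
    for step in steps:
        if halted:
            results.append((step.name, "skipped"))
            continue
        ok = step.run()
        results.append((step.name, "success" if ok else "failure"))
        if not ok and step.halt_on_failure:
            halted = True
    return results


# Hypothetical ordering: sanity checks gate packaging; plain tests run last.
steps = [
    Step("compile", lambda: True, halt_on_failure=True),
    Step("sanity-check", lambda: False, halt_on_failure=True),
    Step("package-and-upload", lambda: True, halt_on_failure=True),
    Step("check", lambda: True),
]
```

With the failing sanity-check above, `run_build(steps)` reports package-and-upload and check as skipped, so nothing built in a suspect state gets uploaded.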

Joey Armstrong [:joey]

Reporter

Comment 5

•

12 years ago

(In reply to Phil Ringnalda (:philor) from comment #4)
> If you look at the log for one of the hg failures, like
> https://tbpl.mozilla.org/php/getParsedLog.php?id=9831245&tree=Try, you'll
> see that it did bail as quickly as possible (because the hg buildstep is
> haltOnFailure=True).

Have a peek at the raw log rather than the formatted version.  After the initial failure is logged, a long stream of hg commands is still run (linux64 dump below).

All of them fail for the same basic reason over and over: either a changeset could not be checked out or the filesystem on disk is not in a good state.  The build environment is crud...

Passing -k on make calls may be contributing to some of the behavior.  Maybe a new flag could be added for try: submissions that could conditionally inhibit passing -k and bail out as early as possible.


% zcat bad/d1fa189b44ce/try-linux64-build3856.txt.gz | ./parse.pl

abort: unknown revision 'd1fa189b44cea5f8110c2f4789bffaee66ab132b'!
abort: unknown revision 'd1fa189b44cea5f8110c2f4789bffaee66ab132b'!
not found!
abort: destination '/builds/hg-shared/try' is not empty
abort: destination '/builds/hg-shared/try' is not empty
abort: No such file or directory: /builds/hg-shared/try/.hg/store/00changelog.i.a
abort: No such file or directory: /builds/hg-shared/try/.hg/store/00changelog.i.a
abort: No such file or directory: /builds/hg-shared/try/.hg/00manifest.d
abort: No such file or directory: /builds/hg-shared/try/.hg/00manifest.d
not found!
abort: integrity check failed on 00manifest.i:74182!
abort: integrity check failed on 00manifest.i:74182!
abort: No such file or directory: /builds/hg-shared/try/.hg/store/00manifest.d
abort: No such file or directory: /builds/hg-shared/try/.hg/store/00manifest.d
not found!
abort: No such file or directory: /builds/hg-shared/try/.hg/store/00changelog.i.a
abort: No such file or directory: /builds/hg-shared/try/.hg/store/00changelog.i.a
abort: connection ended unexpectedly
abort: connection ended unexpectedly
not found!
abort: No such file or directory: /builds/hg-shared/try/.hg/store/00manifest.d
abort: No such file or directory: /builds/hg-shared/try/.hg/store/00manifest.d
abort: No such file or directory: /builds/hg-shared/try/.hg/store/00changelog.i.a
abort: No such file or directory: /builds/hg-shared/try/.hg/store/00changelog.i.a
program finished with exit code 1
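
The parse.pl script used above is not attached to this bug. As a rough stand-in, a filter along these lines would pull similar lines out of a gzipped raw log; the patterns are guessed from the dump and `failure_lines` is a hypothetical name.

```python
import gzip
import re

# Matches buildbot-style "program finished with exit code N" for nonzero N.
NONZERO_EXIT = re.compile(r"program finished with exit code [1-9]\d*")


def failure_lines(path):
    """Yield stripped lines from a gzipped log that look like failures.

    The prefixes are assumptions based on the dump in this comment.
    """
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
        for line in log:
            text = line.strip()
            if text.startswith(("abort:", "not found!")) or NONZERO_EXIT.match(text):
                yield text
```

Counting how many lines follow the first `abort:` gives a quick measure of how long a job kept running after it was already doomed.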


> I *think* what you really want is to split make check into two pieces, a
> haltOnFailure=True `make sanity-check` which runs before the build is
> packaged and uploaded, which can stop packaging, and then a separate `make
> check` with the tests that are just tests which runs after packaging and
> uploading.

If 'check' is currently testing more than checkin/smoke tests then yes, it would probably make sense to split them to have finer-grained control over testing and allow an early failure valve.

But I am not ready to make the jump into testing.  There are more bugs open related to failures in this ticket.  This job should have failed outright because of the hg/setup failure.

Of all the platforms this job ran on *ONLY THREE* reported the error.  All others reported green/success -- which suggests that status is being overlooked or masked somewhere.  That might also be part of what prevents exiting early on failure (?).

> That's completely doable now from a releng perspective, but releng just runs
> commands which exist, so whether it's actually `make sanity-check` which
> knows what directories to run `make check` in or a single directory, a
> command which runs those tests would need to exist before it becomes a
> releng bug.

Ed Morley [:emorley]

Updated

•

12 years ago

Keywords: buildapi

Whiteboard: [buildapi][selfserve]

Nobody; OK to take it and work on it

Assignee

Updated

•

11 years ago

Product: mozilla.org → Release Engineering

Chris Cooper [:coop] (he/him)

Updated

•

10 years ago

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → INCOMPLETE

Categories

(Release Engineering :: General, defect, P5)

Tracking

(Not tracked)

People

(Reporter: joey, Unassigned)

Details

(Keywords: buildapi)

Security

(public)
