Closed Bug 733172 Opened 12 years ago Closed 12 years ago

try server: inconsistent platform status reported for a "failing" job

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: joey, Unassigned)

Details

https://build.mozilla.org/buildapi/self-serve/try/rev/d1fa189b44ce

Try args: try: -b do -e -p all -u none -t none 

Per the failure in bug #733050, the underlying problem is that try was unable to check out a changeset and reported a few command errors:

=======================================================================
abort: unknown revision 'd1fa189b44cea5f8110c2f4789bffaee66ab132b'!

hg: unknown command 'share'
command: hg path default
command: cwd: e:\builds\moz2_slave\try-w64\build
command: output:
============================================================================

This is a low level/setup problem that should have been logged as a failure on all platforms.  3 distinct statuses were logged:
  1 - infra exception, win64
  2 - busted, linux64 opt/debug
 10 - green/success, everything else

Email status:
   3 - error: win64, linux64-{debug,build}
  11 - non-failing email, all other platforms

Status reporting ideally would be consistent, since this particular error was present in all logs.  At a minimum, all platforms should have reported a failure status; otherwise it looks as though failures are being overlooked.
(In reply to Joey Armstrong [:joey] from comment #2)
> https://build.mozilla.org/buildapi/self-serve/try/rev/d1fa189b44ce
> 
> Try args: try: -b do -e -p all -u none -t none 
> 
> Per the failure in bug #733050 - underlying problem: try was unable to
> checkout a change set and reported a few command errors:
> 
> =======================================================================
> abort: unknown revision 'd1fa189b44cea5f8110c2f4789bffaee66ab132b'!
> 
> hg: unknown command 'share'
> command: hg path default
> command: cwd: e:\builds\moz2_slave\try-w64\build
> command: output:
> ============================================================================
> 
> This is a low level/setup problem that should have been logged as a failure
> on all platforms. 

found in triage - I *think* this belongs in new component RelEng:DevTools
Component: Release Engineering → Release Engineering: Developer Tools
QA Contact: release → lsblakk
https://tbpl.mozilla.org/?tree=Try&rev=d1fa189b44ce

Why should it have been reported as a failure on all platforms? It *wasn't* - the two Linux64 slaves did fail the checkout step, because when they tried to update their shared repo they hit failures, and then failed when they tried to just straight clone try from hg.m.o; the rest did not. There are typically all sorts of ugly-looking things in the hg clone step, because it tries multiple layers of things to avoid ever having to hit hg.m.o, but sometimes it can't avoid it. What matters is the outcome of the last thing it tried in that step, which worked for everything else and didn't work for those two Linux64 runs. But it isn't "a low level/setup problem", because it isn't just one setup: it's 14 separate machines with local repos in separate states, all trying to update. Some got it, a couple didn't.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID
(In reply to Phil Ringnalda (:philor) from comment #3)
> https://tbpl.mozilla.org/?tree=Try&rev=d1fa189b44ce
> 
> Why should it have been reported as a failure on all platforms? It *wasn't*
> - the two Linux64 slaves did fail the checkout step, because when they tried
> to update their shared repo they hit failures, and then failed when they
> tried to just straight clone try from hg.m.o, the rest did not.

Then I would suggest that status reporting was not generated or handled correctly in this case and needs to be fixed (possibly along with capturing the commands that contributed to the odd state).

If I submit a job to test edits and that job is able to run with something other than what was submitted, what exactly are we testing?  Worse, if linux64 had not been in the mix to record the failure, would the job have reported success, giving a false positive?

At the least, it sounds like all of the sandbox setup commands may need to be gathered into a central script where hard failures, such as a failed hg checkout in a job, can be handled.

A larger concern with a job like this reporting success is what other failure conditions may be flying under the radar when a platform reports success despite very blatant errors in its logs.
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Joey, I don't think that the job was run against 'something other than what was submitted' - we just try various ways to get the right repo/cset for the try push.  I'm not sure what you expect to see happen here. As philor pointed out, the step only reports an error when all attempts have failed, so if one method succeeds, that is considered a valid run for that platform and builder.
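For anyone reading this later, here is a rough Python sketch of the "succeed if any attempt works" behaviour described above. It is only an illustration, not the actual buildbot or hgtool code: the names `run_checkout`, `failing`, and `working`, and the two strategy labels, are hypothetical stand-ins. It shows why error text from an early attempt can sit in the log of a job that still ends green.

```python
from typing import Callable, List, Tuple

# A strategy is a (label, callable) pair; the callable raises on failure.
Strategy = Tuple[str, Callable[[], None]]


def run_checkout(strategies: List[Strategy]) -> Tuple[bool, List[str]]:
    """Try each strategy in order and stop at the first one that works.

    Returns (succeeded, noise), where noise holds the error messages
    from the attempts that failed before the outcome was decided.
    """
    noise: List[str] = []
    for name, attempt in strategies:
        try:
            attempt()
        except RuntimeError as exc:
            # A failed attempt is logged, but it is not fatal by itself.
            noise.append(f"{name}: {exc}")
        else:
            return True, noise
    # Only reached when every attempt failed.
    return False, noise


def failing(message: str) -> Callable[[], None]:
    """Build a hypothetical strategy that always fails with `message`."""
    def attempt() -> None:
        raise RuntimeError(message)
    return attempt


def working() -> None:
    """Hypothetical strategy that always succeeds."""
    return None


# Most platforms: the first attempt fails noisily, a later one succeeds.
ok, log = run_checkout([
    ("update shared repo", failing("abort: unknown revision")),
    ("clone from hg.m.o", working),
])

# The two Linux64 runs: every attempt fails, so the step fails.
all_ok, all_log = run_checkout([
    ("update shared repo", failing("abort: unknown revision")),
    ("clone from hg.m.o", failing("clone failed")),
])
```

In the first call the step counts as a success even though its log contains an "unknown revision" line, which matches what the green platforms showed here. The sketch does not check that the working directory ended up at the requested revision; whether the real step does that is not stated in this bug.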
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID
Product: mozilla.org → Release Engineering
Component: Tools → General