Closed Bug 938543 Opened 11 years ago Closed 10 years ago

when blobber fails to upload talos results in a green run on tbpl

Component: Release Engineering :: General
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WORKSFORME
Reporter: jmaher
Assignee: Unassigned

I have been seeing this in testing .etl file uploads, but there was a crash on try server for an existing test and it was trying to upload the crash report and failed.  So we have talos crashing and blobber failing, but we still return 0 and the job is green?

here is a link to the try server job where it failed to upload a crash report:
https://tbpl.mozilla.org/php/getParsedLog.php?id=30553443&tree=Try#error0

here is a link to a try server job where it has failed to upload a .etl file:
https://tbpl.mozilla.org/php/getParsedLog.php?id=30540421&tree=Try&full=1

I believe the logic we should use is:
talosreturncode = runtalos()
blobberreturncode = runblobber()
if failed(talosreturncode) return talosreturncode
return blobberreturncode
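The logic proposed above could be sketched in Python roughly as follows. `run_talos` and `run_blobber` are hypothetical stand-ins passed in as callables, not real mozharness functions, and the only assumption about return codes is that 0 means success.

```python
from typing import Callable


def failed(return_code: int) -> bool:
    """Assumption: any non-zero return code counts as a failure."""
    return return_code != 0


def combined_return_code(run_talos: Callable[[], int],
                         run_blobber: Callable[[], int]) -> int:
    """Run talos, then the blobber upload, and pick the code to report.

    Both callables are hypothetical placeholders for the real steps.
    A talos failure always wins; the blobber code is reported only
    when talos itself succeeded.
    """
    talos_return_code = run_talos()
    blobber_return_code = run_blobber()
    if failed(talos_return_code):
        return talos_return_code
    return blobber_return_code
```

With this ordering a failed upload can no longer hide a talos failure, and a job where only the upload failed stops reporting 0.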
In general, we usually get told not to set WARNINGS (orange) or FAILURE (red) for infra issues. Should we use EXCEPTION (which ends up purple) instead?
we could ignore the blobber error code and just pass the talos return code through, then. My vote: if talos fails, report it as such (red or orange), and if blobber fails, report it as purple.
I think that's a good plan long-term, but there are still some things to be worked out before that. Until we're ready to rely 100% on blobber, I'd prefer if failures didn't affect the status of the build.
so to confirm, we will still report the original error code of the test job, and ignore any and all failures from blobber?
I think the exact final behaviour is TBD, but there are a few options if blobber fails:
- turn the job red (like graph server post failures)
- turn the job orange
- turn the job purple (infra failure)
- silently ignore the failure and report the original error code of the test job

I don't think we'd automatically retry on blobber failure.

I'm leaning towards orange or purple...or some kind of "infra warning" colour that we don't have yet.
thinking larger now: all our blobber uploads happen when we have failures, so technically we should be turning the job orange (screenshots, .etl files) or red (crash reports).  Right now blobber is masking real test failures (on non-production branches)
(In reply to Joel Maher (:jmaher) from comment #6)
> thinking larger now: all our blobber uploads happen when we have failures,
> so technically we should be turning the job orange (screenshots, .etl files)
> or red (crash reports).  Right now blobber is masking real test failures (on
> non-production branches)

I think *currently* all our blobber uploads are when we have failures.
In the future we could potentially upload logs, localconfig.json, buildprops.json, or other files, even on success.
I haven't looked at where and how blobber is deciding its status, but ignorantly assuming it's a buildstep with a worst_status, it seems like it just needs to put EXCEPTION after WARNING and FAILURE instead of before in a custom worst_status to implement comment 0's "purple if we were green before blobber failed, red or orange if we were that before blobber failed." And it's true what they say, ignorance is bliss.
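A minimal sketch of that idea, assuming the numeric result codes mirror Buildbot's (SUCCESS, WARNINGS, FAILURE, SKIPPED, EXCEPTION, RETRY as 0 through 5). The constants are redefined locally so the example is self-contained, and `blobber_worst_status` with its severity ranking is hypothetical, not existing buildbot or blobber code.

```python
# Assumption: these values mirror Buildbot's result constants; they are
# redefined here so the sketch does not depend on buildbot being installed.
SUCCESS, WARNINGS, FAILURE, SKIPPED, EXCEPTION, RETRY = range(6)

# Hypothetical ranking for the blobber step only: EXCEPTION outranks
# SUCCESS but sits below WARNINGS and FAILURE.
_SEVERITY = {SUCCESS: 0, EXCEPTION: 1, WARNINGS: 2, FAILURE: 3}


def blobber_worst_status(previous: int, blobber: int) -> int:
    """Combine the job status so far with the blobber step's status.

    Statuses outside the ranking (SKIPPED, RETRY) leave the previous
    status untouched.
    """
    if previous not in _SEVERITY or blobber not in _SEVERITY:
        return previous
    return max(previous, blobber, key=_SEVERITY.__getitem__)
```

Under this ranking a green job turns purple when the upload fails, while an orange or red job keeps its colour.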
To sum up:

Current situation: blobber upload success or failure doesn't impact job status one way or another

Desired future state: failure on blobber upload should turn a green job purple, otherwise leave job status alone

Is that accurate?
just saw another case where blobber turned a job green that should have been red.
catlee, were the blobber and mozharness scripts changed to preserve the original return code?  funny how we both commented at the same time.
Have a link to the job from comment #10? At this point blobber shouldn't be affecting the job status one way or another.
what's wrong with that? unzip fails, exits with 9, and mozharness then exits with 9.
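That kind of exit-code propagation can be illustrated with a small Python sketch. The failing child below is a stand-in that simply exits with 9 (it does not invoke unzip), and `run_step` and `main` are hypothetical helpers, not mozharness APIs.

```python
import subprocess
import sys
from typing import Sequence


def run_step(command: Sequence[str]) -> int:
    """Run a command and return its exit status unchanged."""
    return subprocess.run(list(command)).returncode


# Stand-in for a failing unzip: a child process that exits with 9.
failing_step = [sys.executable, "-c", "import sys; sys.exit(9)"]


def main() -> int:
    """Return the child's status so a wrapper script can exit with it."""
    return run_step(failing_step)
```

A wrapper script would finish with `sys.exit(main())`, so its caller sees the same status as the failing step.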
ok, I was mistaken.  I don't have access to the webserver, so I was relying on an irc conversation that could easily be out of context.

given this information, I would wager that comment 9 is accurate.
catlee, please see bug 946922, we still have blobber messing up the return codes.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Component: General Automation → General