Closed Bug 1512733 Opened 6 years ago Closed 4 years ago

Docker Worker tasks can be successful even when the task's artifacts do not exist on-disk

Categories

(Taskcluster :: Workers, enhancement)


Tracking

(Not tracked)

RESOLVED MOVED

People

(Reporter: Callek, Unassigned)

References

Details

(Whiteboard: [releaseduty])

Over in Bug 1512732 we hit a release issue on a docker-worker (gecko-3-b-linux) worker where an explicitly specified artifact wasn't found. The downstream task then tried to use it and couldn't proceed because the download was missing. I'd argue we shouldn't consider a task successful if an explicitly defined artifact is missing.
Whiteboard: [releaseduty]
Hey Wander, what's the best way to handle this?

Over here[0] you see we fail to create and upload an expected artifact:

[taskcluster 2018-12-06 02:35:27.495Z] === Task Finished ===
[taskcluster 2018-12-06 02:35:27.497Z] Artifact "releng/partner/mozilla-EME-free/mozilla-EME-free/ga-IE/target.dmg" not found at "/builds/worker/workspace/build/outputs/mozilla-EME-free/mozilla-EME-free/ga-IE/target.dmg"
[taskcluster 2018-12-06 02:35:27.498Z] Artifact "releng/partner" not found at "/builds/worker/artifacts/"
[taskcluster 2018-12-06 02:35:28.078Z] Successful task run with exit code: 0 completed in 52.526 seconds

https://github.com/taskcluster/docker-worker/blob/ef974e09acbf6d36c641af0468893e9355f951cc/src/lib/features/artifacts.js#L53-L54

We would like the task to fail as it breaks downstream tasks in a release graph.

[0] https://taskcluster-artifacts.net/WxeF9Z_eTYyN5yE8VSL3nQ/0/public/logs/live_backing.log
Flags: needinfo?(wcosta)
(In reply to Jordan Lund (:jlund) from comment #1)

> We would like the task to fail as it breaks downstream tasks in a release graph.

Sorry for the delay. We are aware of this problem, and it has never been fixed because I was told there are a bunch of tasks that would be broken because they specify invalid artifacts.
Flags: needinfo?(wcosta)

Hit this again:

RyanVM> rerunning FHsXDzyTSx2JDdx5ZAlR_Q
11:19:42
<•jlund> Jordan Lund 2019-01-08T19:15:19 ERROR - Hit ScriptWorkerException: 'path public/build/uk/target.complete.mar not in signing:repackage aUEmuNSLR6eBUNlufRH_4A chain of trust artifacts!'
https://irccloud.mozilla.com/pastebin/VlkYMwrY/
11:26:50 https://taskcluster-artifacts.net/aUEmuNSLR6eBUNlufRH_4A/0/public/logs/live_backing.log
11:27:06 I have a feeling that rerun won't work based on earlier task
11:28:32
<RyanVM> jlund: yeah
11:29:05 jlund: so, rerun aUEmuNSLR6eBUNlufRH_4A, then rerun FHsXDzyTSx2JDdx5ZAlR_Q ?
11:31:03
<•jlund> Jordan Lund I think so. I think this is: https://bugzilla.mozilla.org/show_bug.cgi?id=1512733
11:31:17
<RyanVM> ok, aUEmuNSLR6eBUNlufRH_4A rerun
11:31:25 will attempt the latter after that one finishes

Hit this again, second release in a row. It breaks many downstream tasks after they start.

https://taskcluster-artifacts.net/W68_TQjBTWGdkeN9eNOvUQ/0/public/logs/live_backing.log

I think generic-worker behaves differently than this. Until that switch takes place, perhaps the task itself could check that the artifact exists and, if not, exit with a nonzero status?
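A minimal sketch of that in-task guard, in Python, assuming the task hard-codes or otherwise knows its expected output paths (the paths below are copied from the logs in this bug and are only illustrative; a real task would derive them from its own configuration, or from the task definition as sketched later in this bug):

#!/usr/bin/env python3
# Sketch: fail the task run if expected artifacts are missing on disk.
# EXPECTED_ARTIFACTS is illustrative; a real task would compute this list.
import os
import sys

EXPECTED_ARTIFACTS = [
    "/builds/worker/workspace/build/outputs/hsb/target.dmg",
    "/builds/worker/workspace/build/outputs/hsb/target.complete.mar",
]

missing = [p for p in EXPECTED_ARTIFACTS if not os.path.exists(p)]
for p in missing:
    print("expected artifact missing: %s" % p, file=sys.stderr)
if missing:
    sys.exit(1)  # nonzero exit makes the worker report the run as failed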

Summary: Docker Worker reports green even when an artifact of type "file" is not found. → Docker Worker tasks can be successful even when the task's artifacts fail to upload
Component: Worker → Workers

hit again today:

[task 2019-03-07T17:10:05.161Z] 17:10:05 INFO - Return code: 0
[task 2019-03-07T17:10:05.161Z] 17:10:05 INFO - [mozharness: 2019-03-07 17:10:05.161883Z] Finished repackage step (success)
[task 2019-03-07T17:10:05.168Z] cleanup
[task 2019-03-07T17:10:05.168Z] + cleanup
[task 2019-03-07T17:10:05.168Z] + local rv=0
[task 2019-03-07T17:10:05.168Z] + cleanup_xvfb
[task 2019-03-07T17:10:05.168Z] pidof Xvfb
[task 2019-03-07T17:10:05.168Z] ++ pidof Xvfb
[task 2019-03-07T17:10:05.169Z] + local xvfb_pid=
[task 2019-03-07T17:10:05.169Z] + local vnc=false
[task 2019-03-07T17:10:05.169Z] + local interactive=false
[task 2019-03-07T17:10:05.169Z] + '[' -n '' ']'
[task 2019-03-07T17:10:05.169Z] + exit 0
[fetches 2019-03-07T17:10:05.169Z] removing /builds/worker/fetches
[fetches 2019-03-07T17:10:05.178Z] finished
[taskcluster 2019-03-07 17:10:05.481Z] === Task Finished ===
[taskcluster 2019-03-07 17:10:05.484Z] Artifact "public/logs" not found at "/builds/worker/logs/"
[taskcluster 2019-03-07 17:10:05.485Z] Artifact "public/build/hsb/target.dmg" not found at "/builds/worker/workspace/build/outputs/hsb/target.dmg"
[taskcluster 2019-03-07 17:10:05.486Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2019-03-07 17:10:05.487Z] Artifact "public/build/hsb/target.complete.mar" not found at "/builds/worker/workspace/build/outputs/hsb/target.complete.mar"
[taskcluster 2019-03-07 17:10:06.174Z] Successful task run with exit code: 0 completed in 79.956 seconds

jlund> wonder if we have hit Bug 1512733 - Docker Worker tasks can be successful even when the task's artifacts fail to upload
10:07:36 https://taskcluster-web.netlify.com/tasks/TzwwZ_srRj-4W6_KhGgwtw/runs/0/logs/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FTzwwZ_srRj-4W6_KhGgwtw%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log
10:08:34 dep task that had a good cmd run (green) but artifact upload errors
10:11:12 downstream failing: https://taskcluster-web.netlify.com/tasks/EVZKYYxQRLCFFd4-aET0pw/runs/0/logs/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FEVZKYYxQRLCFFd4-aET0pw%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Fchain_of_trust.log
10:11:32 rerunning dep task that was green

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #5)

> I think generic-worker behaves differently than this. Until that switch takes place, perhaps the task itself could check that the artifact exists and, if not, exit with a nonzero status?

Possible. Is there a way to determine which artifacts are defined, in-task, or would we need to download the task definition for $TASKID?
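The task definition (including payload.artifacts for docker-worker) can be fetched from the Queue, so the check could enumerate the declared artifacts instead of hard-coding paths. A rough sketch, assuming the worker exposes $TASK_ID in the environment and using the queue.taskcluster.net root URL seen in the links above:

#!/usr/bin/env python3
# Sketch: download the task definition and verify each declared file
# artifact exists at its on-disk path before the task exits.
import json
import os
import sys
import urllib.request

task_id = os.environ["TASK_ID"]  # assumes the worker sets TASK_ID
url = "https://queue.taskcluster.net/v1/task/%s" % task_id
with urllib.request.urlopen(url) as resp:
    task = json.load(resp)

# docker-worker payloads map artifact name -> {type, path, expires}
artifacts = task.get("payload", {}).get("artifacts", {})
missing = [(name, a["path"]) for name, a in artifacts.items()
           if a.get("type") == "file" and not os.path.exists(a["path"])]

for name, path in missing:
    print("declared artifact %s missing at %s" % (name, path), file=sys.stderr)
if missing:
    sys.exit(1)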

I was thinking of addressing the specific cases where this occurs -- but if it occurs as a result of a programming error, then maybe that's not practical (or at least, the eventual 404 is error-message enough). Are these failures intermittent or somehow otherwise not due to a task that promises artifacts it doesn't deliver?

the three times I've seen this during a release, it has been:

docker-worker, gecko-3-b-linux, fulfilling a macosx-related l10n task, running taskcluster/scripts/builder/repackage.sh

so there are correlations between the tasks.

https://taskcluster-web.netlify.com/tasks/WxeF9Z_eTYyN5yE8VSL3nQ
https://taskcluster-web.netlify.com/tasks/aUEmuNSLR6eBUNlufRH_4A
https://taskcluster-web.netlify.com/tasks/TzwwZ_srRj-4W6_KhGgwtw/

Yeah, it looks like all 3 are running mozharness/scripts/repackage.py, which simplifies things.
If we can add a check to that script to verify the artifacts are in the expected location(s), we may be good from the release perspective.

Hit again on this push: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&resultStatus=success%2Ctestfailed%2Cbusted%2Cexception&searchStr=linux%2Cx64%2Cshippable%2Copt%2Cgd&revision=57763214c9e85c28922c3eb5471357b698109f5d&selectedJob=250274480

mar signing exception: https://tools.taskcluster.net/groups/B80yniJbQLqMVrKAd9t5pw/tasks/e7daH3XOS3Cxu-ydjNCvyw/runs/2/logs/public%2Flogs%2Fchain_of_trust.log#L118

first run on Rpk successful: https://tools.taskcluster.net/groups/B80yniJbQLqMVrKAd9t5pw/tasks/ZvKbFn4aQhiVFN9LUAWwfg/runs/0/logs/public%2Flogs%2Flive_backing.log#L440
with
[task 2019-06-06T00:10:14.762Z] 00:10:14 INFO - [mozharness: 2019-06-06 00:10:14.762091Z] Finished repackage step (success)
[task 2019-06-06T00:10:14.784Z] cleanup
[task 2019-06-06T00:10:14.784Z] + cleanup
[task 2019-06-06T00:10:14.784Z] + local rv=0
[task 2019-06-06T00:10:14.784Z] + cleanup_xvfb
[task 2019-06-06T00:10:14.784Z] pidof Xvfb
[task 2019-06-06T00:10:14.784Z] ++ pidof Xvfb
[task 2019-06-06T00:10:14.785Z] + local xvfb_pid=
[task 2019-06-06T00:10:14.785Z] + local vnc=false
[task 2019-06-06T00:10:14.785Z] + local interactive=false
[task 2019-06-06T00:10:14.785Z] + '[' -n '' ']'
[task 2019-06-06T00:10:14.785Z] + exit 0
[fetches 2019-06-06T00:10:14.786Z] removing /builds/worker/fetches
[fetches 2019-06-06T00:10:14.795Z] finished
[taskcluster 2019-06-06 00:10:15.098Z] === Task Finished ===
[taskcluster 2019-06-06 00:10:15.100Z] Artifact "public/logs" not found at "/builds/worker/logs/"
[taskcluster 2019-06-06 00:10:15.102Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2019-06-06 00:10:15.103Z] Artifact "public/build/gd/target.complete.mar" not found at "/builds/worker/workspace/build/outputs/gd/target.complete.mar"
[taskcluster 2019-06-06 00:10:15.842Z] Successful task run with exit code: 0 completed in 103.976 seconds

Rerunning the repack job did the trick and the other green jobs followed.

See Also: → 1562993
Blocks: 1585578

happened again: https://firefoxci.taskcluster-artifacts.net/VD8zBY0vSGeOeLTb7wGV6g/0/public/logs/live_backing.log

The above task should not have been green. It broke the following downstream task: https://firefoxci.taskcluster-artifacts.net/VD8zBY0vSGeOeLTb7wGV6g/0/public/logs/live_backing.log

trick was to rerun --force VD8zBY0vSGeOeLTb7wGV6g and then rerun VD8zBY0vSGeOeLTb7wGV6g

IMO:

  1. We need to determine how long docker-worker will exist.

    a. if we replace it with generic-worker sooner, awesome, that will fix the issue.
    b. if docker-worker is longer-lived, then

  2. we could craft a docker-worker release with this issue fixed, and create new AMIs. We can then roll out those AMIs to test workerTypes, and see what tasks break. Once we have a list, we should fix those, and roll out completely.

No longer blocks: 1585578
Blocks: 1578289

Still biting us. I'm leaning towards one of:

  1. logging something visible whenever this happens; we can parse logs through stackdriver or papertrail or whatever to figure out whether fixing this will break more things than we expect (see the sketch after this list), or
  2. just fixing it and dealing with the fallout. We may want to let this ride the trains, which may mean a new set of pools, or
  3. just fixing it and rolling it out to Try to see if anything unexpected breaks in a full push and/or staging release.
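For option 1, the worker already emits a recognizable log line for each missing artifact, so a rough sketch of the log scan (the URL is one of the failing runs cited above; point it at whatever log source or aggregator is actually in use):

#!/usr/bin/env python3
# Sketch: scan a live_backing.log for the 'Artifact ... not found' lines
# docker-worker emits, to estimate how many tasks a strict fix would turn red.
import re
import urllib.request

LOG_URL = ("https://firefoxci.taskcluster-artifacts.net/"
           "VD8zBY0vSGeOeLTb7wGV6g/0/public/logs/live_backing.log")
PATTERN = re.compile(r'Artifact "([^"]+)" not found at "([^"]+)"')

with urllib.request.urlopen(LOG_URL) as resp:
    for raw in resp:
        m = PATTERN.search(raw.decode("utf-8", "replace"))
        if m:
            print("missing artifact %s (expected at %s)" % m.groups())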

Another option may be to add a config -- either worker config or in the task -- to control this behavior. It would be good to be consistent across worker implementations.

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #10)

> I was thinking of addressing the specific cases where this occurs -- but if it occurs as a result of a programming error, then maybe that's not practical (or at least, the eventual 404 is error-message enough). Are these failures intermittent or somehow otherwise not due to a task that promises artifacts it doesn't deliver?

I never got an answer to this -- why are tasks executing and returning a zero exit status without creating the files they should create?

Summary: Docker Worker tasks can be successful even when the task's artifacts fail to upload → Docker Worker tasks can be successful even when the task's artifacts do not exist on-disk

In this latest instance, https://firefoxci.taskcluster-artifacts.net/Wppwg3n3Td2-lhBn_PWh9w/0/public/logs/live_backing.log is the run with missing artifacts, and https://firefoxci.taskcluster-artifacts.net/Wppwg3n3Td2-lhBn_PWh9w/1/public/logs/live_backing.log is the run with existing artifacts. Nothing's jumping out at me, other than the last few lines of run 0.

Huh, so maybe something is deleting the container too early. Or maybe Docker's just being Docker and working 80% of the time. That suggests this might be harder to fix than it seems!

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → MOVED