Docker Worker tasks can be successful even when the task's artifacts do not exist on-disk
Categories
(Taskcluster :: Workers, enhancement)
Tracking
(Not tracked)
People
(Reporter: Callek, Unassigned)
References
Details
(Whiteboard: [releaseduty])
Comment 1•6 years ago
Comment 2•6 years ago
Comment 3•6 years ago
Hit this again:
<RyanVM> rerunning FHsXDzyTSx2JDdx5ZAlR_Q
11:19:42 <•jlund> 2019-01-08T19:15:19 ERROR - Hit ScriptWorkerException: 'path public/build/uk/target.complete.mar not in signing:repackage aUEmuNSLR6eBUNlufRH_4A chain of trust artifacts!'
         https://irccloud.mozilla.com/pastebin/VlkYMwrY/
11:26:50 <•jlund> https://taskcluster-artifacts.net/aUEmuNSLR6eBUNlufRH_4A/0/public/logs/live_backing.log
11:27:06 <•jlund> I have a feeling that rerun won't work based on earlier task
11:28:32 <RyanVM> jlund: yeah
11:29:05 <RyanVM> jlund: so, rerun aUEmuNSLR6eBUNlufRH_4A, then rerun FHsXDzyTSx2JDdx5ZAlR_Q?
11:31:03 <•jlund> I think so. I think this is: https://bugzilla.mozilla.org/show_bug.cgi?id=1512733
11:31:17 <RyanVM> ok, aUEmuNSLR6eBUNlufRH_4A rerun
11:31:25 <RyanVM> will attempt the latter after that one finishes
Comment 4•6 years ago
Hit this again, second release in a row. It breaks many downstream tasks after they start.
https://taskcluster-artifacts.net/W68_TQjBTWGdkeN9eNOvUQ/0/public/logs/live_backing.log
Comment 5•6 years ago
I think generic-worker behaves differently than this. Until that switch takes place, perhaps the task itself could check that the artifact exists and, if not, exit with a nonzero status?
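A minimal sketch of that in-task check, in Python. The EXPECTED list is hypothetical (a real task would have to supply its own paths); the example paths are copied from the missing-artifact log lines later in this bug.

#!/usr/bin/env python3
# Sketch of comment 5's suggestion: re-check the expected outputs at the
# end of the task and exit nonzero if any are missing, so docker-worker
# marks the run failed instead of green.
import os
import sys

# Hypothetical list; paths mirror the "Artifact ... not found" log lines
# pasted elsewhere in this bug.
EXPECTED = [
    "/builds/worker/logs/",
    "/builds/worker/artifacts/",
    "/builds/worker/workspace/build/outputs/uk/target.complete.mar",
]

missing = [p for p in EXPECTED if not os.path.exists(p)]
for path in missing:
    print("ERROR - expected artifact path missing: %s" % path, file=sys.stderr)
sys.exit(1 if missing else 0)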
Comment 8•6 years ago
Hit again today:
[task 2019-03-07T17:10:05.161Z] 17:10:05 INFO - Return code: 0
[task 2019-03-07T17:10:05.161Z] 17:10:05 INFO - [mozharness: 2019-03-07 17:10:05.161883Z] Finished repackage step (success)
[task 2019-03-07T17:10:05.168Z] cleanup
[task 2019-03-07T17:10:05.168Z] + cleanup
[task 2019-03-07T17:10:05.168Z] + local rv=0
[task 2019-03-07T17:10:05.168Z] + cleanup_xvfb
[task 2019-03-07T17:10:05.168Z] pidof Xvfb
[task 2019-03-07T17:10:05.168Z] ++ pidof Xvfb
[task 2019-03-07T17:10:05.169Z] + local xvfb_pid=
[task 2019-03-07T17:10:05.169Z] + local vnc=false
[task 2019-03-07T17:10:05.169Z] + local interactive=false
[task 2019-03-07T17:10:05.169Z] + '[' -n '' ']'
[task 2019-03-07T17:10:05.169Z] + exit 0
[fetches 2019-03-07T17:10:05.169Z] removing /builds/worker/fetches
[fetches 2019-03-07T17:10:05.178Z] finished
[taskcluster 2019-03-07 17:10:05.481Z] === Task Finished ===
[taskcluster 2019-03-07 17:10:05.484Z] Artifact "public/logs" not found at "/builds/worker/logs/"
[taskcluster 2019-03-07 17:10:05.485Z] Artifact "public/build/hsb/target.dmg" not found at "/builds/worker/workspace/build/outputs/hsb/target.dmg"
[taskcluster 2019-03-07 17:10:05.486Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2019-03-07 17:10:05.487Z] Artifact "public/build/hsb/target.complete.mar" not found at "/builds/worker/workspace/build/outputs/hsb/target.complete.mar"
[taskcluster 2019-03-07 17:10:06.174Z] Successful task run with exit code: 0 completed in 79.956 seconds
<jlund> wonder if we have hit Bug 1512733 - Docker Worker tasks can be successful even when the task's artifacts fail to upload
10:07:36 <jlund> https://taskcluster-web.netlify.com/tasks/TzwwZ_srRj-4W6_KhGgwtw/runs/0/logs/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FTzwwZ_srRj-4W6_KhGgwtw%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log
10:08:34 <jlund> dep task that had a good cmd run (green) but artifact upload errors
10:11:12 <jlund> downstream failing: https://taskcluster-web.netlify.com/tasks/EVZKYYxQRLCFFd4-aET0pw/runs/0/logs/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FEVZKYYxQRLCFFd4-aET0pw%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Fchain_of_trust.log
10:11:32 <jlund> rerunning dep task that was green
Comment 9•6 years ago
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #5)
> I think generic-worker behaves differently than this. Until that switch takes place, perhaps the task itself could check that the artifact exists and, if not, exit with a nonzero status?

Possible. Is there a way to determine which artifacts are defined, in-task, or would we need to download the task definition for $TASKID?
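One possible answer, sketched in Python below: the task definition is public, so a task can fetch its own definition from the queue and read payload.artifacts. This assumes the worker exports a TASK_ID environment variable, and it uses the queue endpoint from this bug's era.

import json
import os
import urllib.request

# Fetch this task's own definition from the queue (public, no auth needed).
task_id = os.environ["TASK_ID"]  # assumed to be set by the worker
url = "https://queue.taskcluster.net/v1/task/%s" % task_id
with urllib.request.urlopen(url) as resp:
    task = json.load(resp)

# docker-worker payloads map artifact names to {type, path, expires}.
for name, artifact in task.get("payload", {}).get("artifacts", {}).items():
    print(name, "->", artifact.get("path"))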
Comment 10•6 years ago
I was thinking of addressing the specific cases where this occurs -- but if it occurs as a result of a programming error, then maybe that's not practical (or at least, the eventual 404 is error-message enough). Are these failures intermittent or somehow otherwise not due to a task that promises artifacts it doesn't deliver?
Comment 11•6 years ago
The three release-time occurrences I've seen have all had the same shape: docker-worker, gecko-3-b-linux, fulfilling a macosx-related l10n task, running taskcluster/scripts/builder/repackage.sh.
So there are correlations between the tasks:
https://taskcluster-web.netlify.com/tasks/WxeF9Z_eTYyN5yE8VSL3nQ
https://taskcluster-web.netlify.com/tasks/aUEmuNSLR6eBUNlufRH_4A
https://taskcluster-web.netlify.com/tasks/TzwwZ_srRj-4W6_KhGgwtw/
Comment 12•6 years ago
Yeah, it looks like all 3 are running mozharness/scripts/repackage.py, which simplifies things.
If we can add a check to that script to verify the artifacts are in the expected location(s), we may be good from the release perspective.
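A sketch of what that check could look like. The verify_outputs helper and the expected list are illustrative, not the script's real API; the paths are copied from the missing-artifact lines in comment 8. A mozharness script would report the failure through self.fatal(), which logs FATAL and exits nonzero.

import os
import sys

def verify_outputs(paths):
    # Return the expected output paths that do not exist on disk.
    return [p for p in paths if not os.path.exists(p)]

# Illustrative paths, taken from the missing-artifact lines in comment 8.
expected = [
    "/builds/worker/workspace/build/outputs/hsb/target.dmg",
    "/builds/worker/workspace/build/outputs/hsb/target.complete.mar",
]

missing = verify_outputs(expected)
if missing:
    # self.fatal(...) in the real script; sys.exit stands in for it here.
    print("FATAL - missing expected outputs: " + ", ".join(missing))
    sys.exit(2)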
Comment 13•5 years ago
MAR signing exception: https://tools.taskcluster.net/groups/B80yniJbQLqMVrKAd9t5pw/tasks/e7daH3XOS3Cxu-ydjNCvyw/runs/2/logs/public%2Flogs%2Fchain_of_trust.log#L118
First run on Rpk successful: https://tools.taskcluster.net/groups/B80yniJbQLqMVrKAd9t5pw/tasks/ZvKbFn4aQhiVFN9LUAWwfg/runs/0/logs/public%2Flogs%2Flive_backing.log#L440
with
[task 2019-06-06T00:10:14.762Z] 00:10:14 INFO - [mozharness: 2019-06-06 00:10:14.762091Z] Finished repackage step (success)
[task 2019-06-06T00:10:14.784Z] cleanup
[task 2019-06-06T00:10:14.784Z] + cleanup
[task 2019-06-06T00:10:14.784Z] + local rv=0
[task 2019-06-06T00:10:14.784Z] + cleanup_xvfb
[task 2019-06-06T00:10:14.784Z] pidof Xvfb
[task 2019-06-06T00:10:14.784Z] ++ pidof Xvfb
[task 2019-06-06T00:10:14.785Z] + local xvfb_pid=
[task 2019-06-06T00:10:14.785Z] + local vnc=false
[task 2019-06-06T00:10:14.785Z] + local interactive=false
[task 2019-06-06T00:10:14.785Z] + '[' -n '' ']'
[task 2019-06-06T00:10:14.785Z] + exit 0
[fetches 2019-06-06T00:10:14.786Z] removing /builds/worker/fetches
[fetches 2019-06-06T00:10:14.795Z] finished
[taskcluster 2019-06-06 00:10:15.098Z] === Task Finished ===
[taskcluster 2019-06-06 00:10:15.100Z] Artifact "public/logs" not found at "/builds/worker/logs/"
[taskcluster 2019-06-06 00:10:15.102Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2019-06-06 00:10:15.103Z] Artifact "public/build/gd/target.complete.mar" not found at "/builds/worker/workspace/build/outputs/gd/target.complete.mar"
[taskcluster 2019-06-06 00:10:15.842Z] Successful task run with exit code: 0 completed in 103.976 seconds
Rerunning the repack job did the trick and the other green jobs followed.
Comment hidden (Intermittent Failures Robot)
Comment 15•5 years ago
Happened again: https://firefoxci.taskcluster-artifacts.net/VD8zBY0vSGeOeLTb7wGV6g/0/public/logs/live_backing.log
The above task should not have been green. It broke the following task downstream: https://firefoxci.taskcluster-artifacts.net/VD8zBY0vSGeOeLTb7wGV6g/0/public/logs/live_backing.log
The trick was to rerun --force VD8zBY0vSGeOeLTb7wGV6g, and then rerun VD8zBY0vSGeOeLTb7wGV6g.
Comment 16•5 years ago
IMO:
- We need to determine how long docker-worker will exist.
  a. If we replace it with generic-worker sooner, awesome; that will fix the issue.
  b. If docker-worker is longer-lived, then we could craft a docker-worker release with this issue fixed, and create new AMIs. We can then roll out those AMIs to test workerTypes, and see what tasks break. Once we have a list, we should fix those, and roll out completely.
Comment 17•4 years ago
Still biting us. I'm leaning towards either:
- logging something visible whenever this happens; we can parse logs through stackdriver or papertrail or whatever to figure out whether fixing this will break more things than we expect (see the sketch after this list), or
- just fixing it and dealing with the fallout. We may want to let this ride the trains, which may mean a new set of pools.
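A sketch of the log-parsing idea, assuming a corpus of live_backing.log files has already been exported from stackdriver/papertrail. The failure mode always emits lines of the form seen in comment 8 ('Artifact "..." not found at "..."') after "=== Task Finished ===", so counting those lines approximates the blast radius of making them fatal.

import re
import sys

NOT_FOUND = re.compile(r'Artifact "(?P<name>[^"]+)" not found at "(?P<path>[^"]+)"')

hits = 0
for logfile in sys.argv[1:]:
    with open(logfile, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = NOT_FOUND.search(line)
            if m:
                hits += 1
                print("%s: %s missing at %s" % (logfile, m.group("name"), m.group("path")))
print("%d missing-artifact lines across %d logs" % (hits, len(sys.argv) - 1))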
Comment 18•4 years ago
- Just fixing it, and rolling out to Try to see whether anything unexpected breaks in a full push and/or staging release.
Comment hidden (Intermittent Failures Robot)
Comment 20•4 years ago
Another option may be to add a config -- either worker config or in the task -- to control this behavior. It would be good to be consistent across worker implementations.
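A purely hypothetical shape for such a knob, shown as the payload fragment a worker could honor. Neither docker-worker nor generic-worker is known to support this key; it just illustrates semantics both implementations could share.

# Hypothetical task payload fragment (Python literal for illustration only):
task_payload_fragment = {
    "features": {
        # If true, any declared artifact missing at upload time fails the task.
        "failOnMissingArtifacts": True,
    }
}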
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #10)
> I was thinking of addressing the specific cases where this occurs -- but if it occurs as a result of a programming error, then maybe that's not practical (or at least, the eventual 404 is error-message enough). Are these failures intermittent or somehow otherwise not due to a task that promises artifacts it doesn't deliver?

I never got an answer to this -- why are tasks executing and returning a zero exit status without creating the files they should create?
Comment 21•4 years ago
In this latest instance, https://firefoxci.taskcluster-artifacts.net/Wppwg3n3Td2-lhBn_PWh9w/0/public/logs/live_backing.log is the run with missing artifacts, and https://firefoxci.taskcluster-artifacts.net/Wppwg3n3Td2-lhBn_PWh9w/1/public/logs/live_backing.log is the run with existing artifacts. Nothing's jumping out at me, other than the last few lines of run 0.
Comment 22•4 years ago
Huh, so maybe something is deleting the container too early. Or maybe Docker's just being Docker and working 80% of the time. That suggests this might be harder to fix than it seems!
Comment 23•4 years ago