Closed Bug 1512733 Opened 6 years ago Closed 4 years ago

Docker Worker tasks can be successful even when the task's artifacts do not exist on-disk

Categories

(Taskcluster :: Workers, enhancement)


Tracking

(Not tracked)

RESOLVED MOVED

People

(Reporter: Callek, Unassigned)

References

Details

(Whiteboard: [releaseduty])

Over in Bug 1512732 we hit a release issue on a docker-worker (gecko-3-b-linux) worker where an explicitly specified artifact wasn't found. The downstream task then tried to use it and couldn't proceed because the download was missing. I'd argue we shouldn't consider a task successful if an explicitly defined artifact is missing.
Whiteboard: [releaseduty]
Hey Wander, what's the best way to handle this?

Over here[0] you see we fail to create and upload an expected artifact:

[taskcluster 2018-12-06 02:35:27.495Z] === Task Finished ===
[taskcluster 2018-12-06 02:35:27.497Z] Artifact "releng/partner/mozilla-EME-free/mozilla-EME-free/ga-IE/target.dmg" not found at "/builds/worker/workspace/build/outputs/mozilla-EME-free/mozilla-EME-free/ga-IE/target.dmg"
[taskcluster 2018-12-06 02:35:27.498Z] Artifact "releng/partner" not found at "/builds/worker/artifacts/"
[taskcluster 2018-12-06 02:35:28.078Z] Successful task run with exit code: 0 completed in 52.526 seconds

https://github.com/taskcluster/docker-worker/blob/ef974e09acbf6d36c641af0468893e9355f951cc/src/lib/features/artifacts.js#L53-L54

We would like the task to fail as it breaks downstream tasks in a release graph.

[0] https://taskcluster-artifacts.net/WxeF9Z_eTYyN5yE8VSL3nQ/0/public/logs/live_backing.log
Flags: needinfo?(wcosta)
(In reply to Jordan Lund (:jlund) from comment #1)

> We would like the task to fail as it breaks downstream tasks in a release graph.

Sorry for the delay. We are aware of this problem, and it has never been fixed because I was told there are a bunch of tasks that would be broken because they specify invalid artifacts.
Flags: needinfo?(wcosta)

Hit this again:

RyanVM> rerunning FHsXDzyTSx2JDdx5ZAlR_Q
11:19:42
<•jlund> Jordan Lund 2019-01-08T19:15:19 ERROR - Hit ScriptWorkerException: 'path public/build/uk/target.complete.mar not in signing:repackage aUEmuNSLR6eBUNlufRH_4A chain of trust artifacts!'
https://irccloud.mozilla.com/pastebin/VlkYMwrY/
11:26:50 https://taskcluster-artifacts.net/aUEmuNSLR6eBUNlufRH_4A/0/public/logs/live_backing.log
11:27:06 I have a feeling that rerun won't work based on earlier task
11:28:32
<RyanVM> jlund: yeah
11:29:05 jlund: so, rerun aUEmuNSLR6eBUNlufRH_4A, then rerun FHsXDzyTSx2JDdx5ZAlR_Q ?
11:31:03
<•jlund> Jordan Lund I think so. I think this is: https://bugzilla.mozilla.org/show_bug.cgi?id=1512733
11:31:17
<RyanVM> ok, aUEmuNSLR6eBUNlufRH_4A rerun
11:31:25 will attempt the latter after that one finishes

Hit this again, second release in a row. It breaks many downstream tasks after they start.

https://taskcluster-artifacts.net/W68_TQjBTWGdkeN9eNOvUQ/0/public/logs/live_backing.log

I think generic-worker behaves differently than this. Until that switch takes place, perhaps the task itself could check that the artifact exists and, if not, exit with a nonzero status?
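A minimal sketch of that in-task guard, in Python, assuming the task hard-codes or otherwise knows its expected output paths (the paths below are copied from the logs in this bug and are only illustrative; a real task would derive them from its own configuration, or from the task definition as sketched later in this bug):

#!/usr/bin/env python3
# Sketch: fail the task run if expected artifacts are missing on disk.
# EXPECTED_ARTIFACTS is illustrative; a real task would compute this list.
import os
import sys

EXPECTED_ARTIFACTS = [
    "/builds/worker/workspace/build/outputs/hsb/target.dmg",
    "/builds/worker/workspace/build/outputs/hsb/target.complete.mar",
]

missing = [p for p in EXPECTED_ARTIFACTS if not os.path.exists(p)]
for p in missing:
    print("expected artifact missing: %s" % p, file=sys.stderr)
if missing:
    sys.exit(1)  # nonzero exit makes the worker report the run as failed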

Summary: Docker Worker reports green even when an artifact of type "file" is not found. → Docker Worker tasks can be successful even when the task's artifacts fail to upload
Component: Worker → Workers

hit again today:

[task 2019-03-07T17:10:05.161Z] 17:10:05 INFO - Return code: 0
[task 2019-03-07T17:10:05.161Z] 17:10:05 INFO - [mozharness: 2019-03-07 17:10:05.161883Z] Finished repackage step (success)
[task 2019-03-07T17:10:05.168Z] cleanup
[task 2019-03-07T17:10:05.168Z] + cleanup
[task 2019-03-07T17:10:05.168Z] + local rv=0
[task 2019-03-07T17:10:05.168Z] + cleanup_xvfb
[task 2019-03-07T17:10:05.168Z] pidof Xvfb
[task 2019-03-07T17:10:05.168Z] ++ pidof Xvfb
[task 2019-03-07T17:10:05.169Z] + local xvfb_pid=
[task 2019-03-07T17:10:05.169Z] + local vnc=false
[task 2019-03-07T17:10:05.169Z] + local interactive=false
[task 2019-03-07T17:10:05.169Z] + '[' -n '' ']'
[task 2019-03-07T17:10:05.169Z] + exit 0
[fetches 2019-03-07T17:10:05.169Z] removing /builds/worker/fetches
[fetches 2019-03-07T17:10:05.178Z] finished
[taskcluster 2019-03-07 17:10:05.481Z] === Task Finished ===
[taskcluster 2019-03-07 17:10:05.484Z] Artifact "public/logs" not found at "/builds/worker/logs/"
[taskcluster 2019-03-07 17:10:05.485Z] Artifact "public/build/hsb/target.dmg" not found at "/builds/worker/workspace/build/outputs/hsb/target.dmg"
[taskcluster 2019-03-07 17:10:05.486Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2019-03-07 17:10:05.487Z] Artifact "public/build/hsb/target.complete.mar" not found at "/builds/worker/workspace/build/outputs/hsb/target.complete.mar"
[taskcluster 2019-03-07 17:10:06.174Z] Successful task run with exit code: 0 completed in 79.956 seconds

jlund> wonder if we have hit Bug 1512733 - Docker Worker tasks can be successful even when the task's artifacts fail to upload
10:07:36 https://taskcluster-web.netlify.com/tasks/TzwwZ_srRj-4W6_KhGgwtw/runs/0/logs/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FTzwwZ_srRj-4W6_KhGgwtw%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Flive.log
10:08:34 dep task that had a good cmd run (green) but artifact upload errors
10:11:12 downstream failing: https://taskcluster-web.netlify.com/tasks/EVZKYYxQRLCFFd4-aET0pw/runs/0/logs/https%3A%2F%2Fqueue.taskcluster.net%2Fv1%2Ftask%2FEVZKYYxQRLCFFd4-aET0pw%2Fruns%2F0%2Fartifacts%2Fpublic%2Flogs%2Fchain_of_trust.log
10:11:32 rerunning dep task that was green

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #5)

> I think generic-worker behaves differently than this. Until that switch takes place, perhaps the task itself could check that the artifact exists and, if not, exit with a nonzero status?

Possible. Is there a way to determine which artifacts are defined, in-task, or would we need to download the task definition for $TASKID?
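The task definition (including payload.artifacts for docker-worker) can be fetched from the Queue, so the check could enumerate the declared artifacts instead of hard-coding paths. A rough sketch, assuming the worker exposes $TASK_ID in the environment and using the queue.taskcluster.net root URL seen in the links above:

#!/usr/bin/env python3
# Sketch: download the task definition and verify each declared file
# artifact exists at its on-disk path before the task exits.
import json
import os
import sys
import urllib.request

task_id = os.environ["TASK_ID"]  # assumes the worker sets TASK_ID
url = "https://queue.taskcluster.net/v1/task/%s" % task_id
with urllib.request.urlopen(url) as resp:
    task = json.load(resp)

# docker-worker payloads map artifact name -> {type, path, expires}
artifacts = task.get("payload", {}).get("artifacts", {})
missing = [(name, a["path"]) for name, a in artifacts.items()
           if a.get("type") == "file" and not os.path.exists(a["path"])]

for name, path in missing:
    print("declared artifact %s missing at %s" % (name, path), file=sys.stderr)
if missing:
    sys.exit(1)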

I was thinking of addressing the specific cases where this occurs -- but if it occurs as a result of a programming error, then maybe that's not practical (or at least, the eventual 404 is error-message enough). Are these failures intermittent or somehow otherwise not due to a task that promises artifacts it doesn't deliver?

the three times I've seen this during a release, it has been:

docker-worker, gecko-3-b-linux, fulfilling a macosx-related l10n task, running taskcluster/scripts/builder/repackage.sh

so there are correlations between the tasks.

https://taskcluster-web.netlify.com/tasks/WxeF9Z_eTYyN5yE8VSL3nQ
https://taskcluster-web.netlify.com/tasks/aUEmuNSLR6eBUNlufRH_4A
https://taskcluster-web.netlify.com/tasks/TzwwZ_srRj-4W6_KhGgwtw/

Yeah, it looks like all 3 are running mozharness/scripts/repackage.py, which simplifies things.
If we can add a check to that script to verify the artifacts are in the expected location(s), we may be good from the release perspective.

Hit again on this push: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&resultStatus=success%2Ctestfailed%2Cbusted%2Cexception&searchStr=linux%2Cx64%2Cshippable%2Copt%2Cgd&revision=57763214c9e85c28922c3eb5471357b698109f5d&selectedJob=250274480

mar signing exception: https://tools.taskcluster.net/groups/B80yniJbQLqMVrKAd9t5pw/tasks/e7daH3XOS3Cxu-ydjNCvyw/runs/2/logs/public%2Flogs%2Fchain_of_trust.log#L118

first run on Rpk successful: https://tools.taskcluster.net/groups/B80yniJbQLqMVrKAd9t5pw/tasks/ZvKbFn4aQhiVFN9LUAWwfg/runs/0/logs/public%2Flogs%2Flive_backing.log#L440
with
[task 2019-06-06T00:10:14.762Z] 00:10:14 INFO - [mozharness: 2019-06-06 00:10:14.762091Z] Finished repackage step (success)
[task 2019-06-06T00:10:14.784Z] cleanup
[task 2019-06-06T00:10:14.784Z] + cleanup
[task 2019-06-06T00:10:14.784Z] + local rv=0
[task 2019-06-06T00:10:14.784Z] + cleanup_xvfb
[task 2019-06-06T00:10:14.784Z] pidof Xvfb
[task 2019-06-06T00:10:14.784Z] ++ pidof Xvfb
[task 2019-06-06T00:10:14.785Z] + local xvfb_pid=
[task 2019-06-06T00:10:14.785Z] + local vnc=false
[task 2019-06-06T00:10:14.785Z] + local interactive=false
[task 2019-06-06T00:10:14.785Z] + '[' -n '' ']'
[task 2019-06-06T00:10:14.785Z] + exit 0
[fetches 2019-06-06T00:10:14.786Z] removing /builds/worker/fetches
[fetches 2019-06-06T00:10:14.795Z] finished
[taskcluster 2019-06-06 00:10:15.098Z] === Task Finished ===
[taskcluster 2019-06-06 00:10:15.100Z] Artifact "public/logs" not found at "/builds/worker/logs/"
[taskcluster 2019-06-06 00:10:15.102Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2019-06-06 00:10:15.103Z] Artifact "public/build/gd/target.complete.mar" not found at "/builds/worker/workspace/build/outputs/gd/target.complete.mar"
[taskcluster 2019-06-06 00:10:15.842Z] Successful task run with exit code: 0 completed in 103.976 seconds

Rerunning the repack job did the trick and the other green jobs followed.

See Also: → 1562993
Blocks: 1585578

happened again: https://firefoxci.taskcluster-artifacts.net/VD8zBY0vSGeOeLTb7wGV6g/0/public/logs/live_backing.log

The above task should not have been green. It broke the following downstream task: https://firefoxci.taskcluster-artifacts.net/VD8zBY0vSGeOeLTb7wGV6g/0/public/logs/live_backing.log

trick was to rerun --force VD8zBY0vSGeOeLTb7wGV6g and then rerun VD8zBY0vSGeOeLTb7wGV6g

IMO:

  1. We need to determine how long docker-worker will exist.

    a. if we replace it with generic-worker sooner, awesome, that will fix the issue.
    b. if docker-worker is longer-lived, then

  2. we could craft a docker-worker release with this issue fixed, and create new AMIs. We can then roll out those AMIs to test workerTypes, and see what tasks break. Once we have a list, we should fix those, and roll out completely.

No longer blocks: 1585578
Blocks: 1578289

Still biting us. I'm leaning towards one of:

  1. logging something visible whenever this happens; we can parse logs through stackdriver or papertrail or whatever to figure out whether fixing this will break more things than we expect (see the sketch after this list), or
  2. just fixing it and dealing with the fallout. We may want to let this ride the trains, which may mean a new set of pools, or
  3. just fixing it and rolling it out to Try to see if anything unexpected breaks in a full push and/or staging release.
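For option 1, the worker already emits a recognizable log line for each missing artifact, so a rough sketch of the log scan (the URL is one of the failing runs cited above; point it at whatever log source or aggregator is actually in use):

#!/usr/bin/env python3
# Sketch: scan a live_backing.log for the 'Artifact ... not found' lines
# docker-worker emits, to estimate how many tasks a strict fix would turn red.
import re
import urllib.request

LOG_URL = ("https://firefoxci.taskcluster-artifacts.net/"
           "VD8zBY0vSGeOeLTb7wGV6g/0/public/logs/live_backing.log")
PATTERN = re.compile(r'Artifact "([^"]+)" not found at "([^"]+)"')

with urllib.request.urlopen(LOG_URL) as resp:
    for raw in resp:
        m = PATTERN.search(raw.decode("utf-8", "replace"))
        if m:
            print("missing artifact %s (expected at %s)" % m.groups())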

Another option may be to add a config -- either worker config or in the task -- to control this behavior. It would be good to be consistent across worker implementations.

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #10)

> I was thinking of addressing the specific cases where this occurs -- but if it occurs as a result of a programming error, then maybe that's not practical (or at least, the eventual 404 is error-message enough). Are these failures intermittent or somehow otherwise not due to a task that promises artifacts it doesn't deliver?

I never got an answer to this -- why are tasks executing and returning a zero exit status without creating the files they should create?

Summary: Docker Worker tasks can be successful even when the task's artifacts fail to upload → Docker Worker tasks can be successful even when the task's artifacts do not exist on-disk

In this latest instance, https://firefoxci.taskcluster-artifacts.net/Wppwg3n3Td2-lhBn_PWh9w/0/public/logs/live_backing.log is the run with missing artifacts, and https://firefoxci.taskcluster-artifacts.net/Wppwg3n3Td2-lhBn_PWh9w/1/public/logs/live_backing.log is the run with existing artifacts. Nothing's jumping out at me, other than the last few lines of run 0.

Huh, so maybe something is deleting the container too early. Or maybe Docker's just being Docker and working 80% of the time. That suggests this might be harder to fix than it seems!

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → MOVED