Closed Bug 1360198 Opened 2 years ago Closed Last year

Don't explicitly set artifact expiry for generic-worker tasks, if it is just task expiry

Categories

(Taskcluster :: Services, enhancement)

enhancement
Not set

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1486532

People

(Reporter: pmoore, Assigned: pmoore)

References

Details

Attachments

(1 file)

The generic-worker supports an optional `expiry` field for artifacts that allows the task author to expire a task artifact prior to the task expiry.

The benefit of this feature is that expensive artifacts that might not be required for so long can be expired, although the task definition can live around for longer to know what the task was, and what it did.

This comes at some cost however: when visiting a task which has expired artifacts, it might not be clear initially that there were artifacts there, and they expired - instead it can be difficult to deduce whether the artifact was there, and expired, or was never there. Also maybe the artifact was not there, and was uploaded as an error artifact, and then expired, or maybe it was an S3 artifact, and expired. In general it can be confusing to have artifacts expiring, so this feature should be used with extreme caution.

Furthermore, when recreating a task from a resolved task, via the Task Inspector link(s) that take you to the Task Creator, the artifact expiry timestamps will not be considered when the Task Creator refreshes timestamps. This leads to recreated tasks often having artifacts expire before the task expires, even if in the original task, they expired together with the task.

In order to avoid this pitfall (and also users believing that artifact expiry is a mandatory field) I propose we remove the artifact expiry timestamps from the task payload for generic-worker tasks, in every case where it matches the task expiry, and potentially other cases too when it might be confusing that the artifact expires before the task.

This will lead to a more compact task definition, less manual errors when users interact with the Task Creator, and fewer matters to concern users in general.

I understand in particular :grenade encountered several frustrating delays due to encountering these types of issues, so I think this is already affecting real-world users.
Note, I left the line in the code, commented out - to be clear so that people don't blindly add it back in again. Some people might prefer not to have commented out code checked in, however I think in this case, it is beneficial because it draws the developers attention to the comment above it, when they go to write a similar line and notice it is already there, but commented out.

Not everyone might share my view though - but I wanted to justify the use of it, as typically I also prefer not to leave commented-out code checked in.
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Attachment #8862418 - Flags: review?(dustin)
Local test was successful:


  "test-windows10-64-asan/opt-cppunit": {
    "attributes": {
      "build_platform": "win64-asan",
      "build_type": "opt",
      "e10s": false,
      "kind": "test",
      "run_on_projects": [],
      "test_chunk": "1",
      "test_platform": "windows10-64-asan/opt",
      "unittest_flavor": "cppunittest",
      "unittest_suite": "cppunittest",
      "unittest_try_name": "cppunit"
    },
    "dependencies": {},
    "kind": "test",
    "label": "test-windows10-64-asan/opt-cppunit",
    "optimizations": [
      [
        "seta"
      ]
    ],
    "task": {
      "created": {
        "relative-datestamp": "0 seconds"
      },
      "deadline": {
        "relative-datestamp": "1 day"
      },
      "expires": {
        "relative-datestamp": "14 days"
      },
      "extra": {
        "chunks": {
          "current": 1,
          "total": 1
        },
        "suite": {
          "flavor": "cppunittest",
          "name": "cppunittest"
        },
        "treeherder": {
          "collection": {
            "asan": true
          },
          "groupName": "Executed by TaskCluster",
          "groupSymbol": "tc",
          "jobKind": "test",
          "machine": {
            "platform": "windows10-64"
          },
          "symbol": "Cpp",
          "tier": 2
        },
        "treeherderEnv": [
          "production",
          "staging"
        ]
      },
      "metadata": {
        "description": "CPP Unit Tests ([Treeherder push](https://treeherder.mozilla.org/#/jobs?repo=try&revision=2cfdc494e3f06c02db70baf6b1c3bb309418ca7c))",
        "name": "test-windows10-64-asan/opt-cppunit",
        "owner": "pmoore@mozilla.com",
        "source": "https://hg.mozilla.org/try//file/2cfdc494e3f06c02db70baf6b1c3bb309418ca7c/taskcluster/ci/test"
      },
      "payload": {
        "artifacts": [
          {
            "name": "public/logs/localconfig.json",
            "path": "logs/localconfig.json",
            "type": "file"
          },
          {
            "name": "public/logs/log_critical.log",
            "path": "logs/log_critical.log",
            "type": "file"
          },
          {
            "name": "public/logs/log_error.log",
            "path": "logs/log_error.log",
            "type": "file"
          },
          {
            "name": "public/logs/log_fatal.log",
            "path": "logs/log_fatal.log",
            "type": "file"
          },
          {
            "name": "public/logs/log_info.log",
            "path": "logs/log_info.log",
            "type": "file"
          },
          {
            "name": "public/logs/log_raw.log",
            "path": "logs/log_raw.log",
            "type": "file"
          },
          {
            "name": "public/logs/log_warning.log",
            "path": "logs/log_warning.log",
            "type": "file"
          },
          {
            "name": "public/test_info",
            "path": "build/blobber_upload_dir",
            "type": "directory"
          }
        ],
        "command": [
          {
            "task-reference": "c:\\mozilla-build\\python\\python.exe -u mozharness\\scripts\\desktop_unittest.py --cfg mozharness\\configs\\unittests\\win_taskcluster_unittest.py --cppunittest-suite=cppunittest --no-read-buildbot-config --installer-url https://queue.taskcluster.net/v1/task/<build>/artifacts/public/build/firefox-55.0a1.en-US.win64-asan.zip --test-packages-url https://queue.taskcluster.net/v1/task/<build>/artifacts/public/build/firefox-55.0a1.en-US.win64-asan.test_packages.json --download-symbols ondemand"
          }
        ],
        "env": {},
        "maxRunTime": 3600,
        "mounts": [
          {
            "content": {
              "artifact": "public/build/mozharness.zip",
              "taskId": {
                "task-reference": "<build>"
              }
            },
            "directory": ".",
            "format": "zip"
          }
        ],
        "osGroups": []
      },
      "provisionerId": "aws-provisioner-v1",
      "routes": [
        "tc-treeherder.v2.try.2cfdc494e3f06c02db70baf6b1c3bb309418ca7c.185644",
        "tc-treeherder-stage.v2.try.2cfdc494e3f06c02db70baf6b1c3bb309418ca7c.185644"
      ],
      "scopes": [],
      "tags": {
        "createdForUser": "pmoore@mozilla.com"
      },
      "workerType": "gecko-t-win10-64-gpu-c"
    }
  },


As you see from this example, expiry is now not included as an artifact field.

Tested with:

./mach taskgraph tasks -J -p ~/Downloads/parameters.yml
Comment on attachment 8862418 [details] [diff] [review]
bug1360198_gecko_v1.patch

Review of attachment 8862418 [details] [diff] [review]:
-----------------------------------------------------------------

Bug 1359468 would fix the issue with the task creator.  And the issue of artifacts vanishing could, I think, be addressed with a queue change (to keep the Artifacts table row around and just delete the storage, then show that with strikethrough in the UI).

In general I find arguments of the form "a user could make the following error ___ so let's take the tool away" fairly unconvincing, unless there is absolutely no correct use for the functionality.  I think there are legitimate cases for allowing artifacts to expire before a task does, and those cases can be made more usable without removing the functionality.

That said, until I looked at the patch I thought you were advocating removing support for artifact.expires from the generic-worker implementation.  I am perfectly happy with removing the value from gecko task definitions.

Please do update the comment to remove the note about task-creator updating the timestamps, though, as we will likely fix that another way.  How about:

 # Use the default artifact expiration (the task's expiration time)
Attachment #8862418 - Flags: review?(dustin) → review+
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9224f83eb55c
don't set artifact expiry in generic-worker tasks if equal to task expiry,r=dustin
sorry had to back this out since this caused problems like [taskcluster 2017-05-02T09:01:32.665Z] TASK FAIL since the task payload is invalid. See errors:
[taskcluster 2017-05-02T09:01:32.665Z] - expires: expires is required
[taskcluster 2017-05-02T09:01:32.665Z] Task not successful due to following exception(s):
[taskcluster 2017-05-02T09:01:32.665Z] Exception 1)
[taskcluster 2017-05-02T09:01:32.665Z] Validation of payload failed for task BUQC0barRTSgVcG25J97dg
[taskcluster 2017-05-02T09:01:32.665Z]
Flags: needinfo?(pmoore)
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ff6260097f15
Backed out changeset 9224f83eb55c for breaking taskcluster builds
(In reply to Carsten Book [:Tomcat] from comment #5)
> sorry had to back this out since this caused problems like [taskcluster
> 2017-05-02T09:01:32.665Z] TASK FAIL since the task payload is invalid. See
> errors:
> [taskcluster 2017-05-02T09:01:32.665Z] - expires: expires is required
> [taskcluster 2017-05-02T09:01:32.665Z] Task not successful due to following
> exception(s):
> [taskcluster 2017-05-02T09:01:32.665Z] Exception 1)
> [taskcluster 2017-05-02T09:01:32.665Z] Validation of payload failed for task
> BUQC0barRTSgVcG25J97dg
> [taskcluster 2017-05-02T09:01:32.665Z]

Whooooooops my fault! Sorry!!!

I was sure generic worker used task expiry as default value for task expiry, but it looks like I didn't implement that feature.

I've added to generic worker now in [1].

It may be a while before this is rolled out globally to all windows and OS X workers used in gecko pushes, but I'll track progress in this bug. Once it is rolled out everywhere it needs to be, we can reland the patch.

In this patch, I also added a test[2] which demonstrates the new feature works as expected.

----

1) https://github.com/taskcluster/generic-worker/commit/0082386fe918c747bab84ac3a7b361d3021e6d3b
2) https://github.com/taskcluster/generic-worker/blob/0082386fe918c747bab84ac3a7b361d3021e6d3b/artifacts_test.go#L287-L313
Flags: needinfo?(pmoore)
Oh dear oh dear oh dear:


[taskcluster 2017-06-16T16:39:05.260Z] Worker Type (gecko-1-b-win2012) settings:
[taskcluster 2017-06-16T16:39:05.260Z]   {
[taskcluster 2017-06-16T16:39:05.260Z]     "aws": {
[taskcluster 2017-06-16T16:39:05.260Z]       "ami-id": "ami-ebad8cfd",
[taskcluster 2017-06-16T16:39:05.260Z]       "availability-zone": "us-east-1a",
[taskcluster 2017-06-16T16:39:05.260Z]       "instance-id": "i-0fd982070ba43cb88",
[taskcluster 2017-06-16T16:39:05.260Z]       "instance-type": "c4.4xlarge",
[taskcluster 2017-06-16T16:39:05.260Z]       "local-ipv4": "172.31.0.225",
[taskcluster 2017-06-16T16:39:05.260Z]       "public-hostname": "ec2-52-91-241-19.compute-1.amazonaws.com",
[taskcluster 2017-06-16T16:39:05.260Z]       "public-ipv4": "52.91.241.19"
[taskcluster 2017-06-16T16:39:05.260Z]     },
[taskcluster 2017-06-16T16:39:05.260Z]     "config": {
[taskcluster 2017-06-16T16:39:05.260Z]       "deploymentId": "38e2c816634d",
[taskcluster 2017-06-16T16:39:05.260Z]       "runTasksAsCurrentUser": false
[taskcluster 2017-06-16T16:39:05.261Z]     },
[taskcluster 2017-06-16T16:39:05.261Z]     "generic-worker": {
[taskcluster 2017-06-16T16:39:05.261Z]       "go-arch": "amd64",
[taskcluster 2017-06-16T16:39:05.261Z]       "go-os": "windows",
[taskcluster 2017-06-16T16:39:05.261Z]       "go-version": "go1.7.5",
[taskcluster 2017-06-16T16:39:05.261Z]       "release": "https://github.com/taskcluster/generic-worker/releases/tag/v10.0.5",
[taskcluster 2017-06-16T16:39:05.261Z]       "version": "10.0.5"
[taskcluster 2017-06-16T16:39:05.261Z]     },
[taskcluster 2017-06-16T16:39:05.261Z]     "machine-setup": {
[taskcluster 2017-06-16T16:39:05.261Z]       "ami-created": "2017-06-15 21:59:56.395Z",
[taskcluster 2017-06-16T16:39:05.261Z]       "manifest": "https://github.com/mozilla-releng/OpenCloudConfig/blob/38e2c816634d8540881f2a2d1ba96d47b0520045/userdata/Manifest/gecko-1-b-win2012.json"
[taskcluster 2017-06-16T16:39:05.261Z]     }
[taskcluster 2017-06-16T16:39:05.261Z]   }
[taskcluster 2017-06-16T16:39:05.261Z] Task ID: DhDgSKVYR2iWR4SsA0TOjA
[taskcluster 2017-06-16T16:39:05.261Z] === Task Starting ===
[taskcluster 2017-06-16T16:39:05.271Z] Task not successful due to following exception(s):
[taskcluster 2017-06-16T16:39:05.271Z] Exception 1)
[taskcluster 2017-06-16T16:39:05.271Z] Malformed payload: artifact 'a.txt' expires before task deadline (0001-01-01T00:00:00.000Z is before 2017-06-17T16:14:09.929Z)
[taskcluster 2017-06-16T16:39:05.271Z] 



I was expecting just to close this bug - but looks like there was a problem which bizarrely wasn't picked up by the unit tests, so I'll:

* make sure the unit tests catch the bug
* fix the bug
* make sure the unit tests are fixed

Strange - I was sure I had this tested and fixed. Oh well.

Late for me now, so will pick this up next week.
Commit pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/4e5a41e9848038f024c095ea07d46642da8c8a97
Bug 1360198 - do not compare artifact expiry to task deadline and task expiry if artifact expiry is not set
(In reply to Pete Moore [:pmoore][:pete] from comment #13)
> Oh dear oh dear oh dear:
> 
> 
> [taskcluster 2017-06-16T16:39:05.260Z] Worker Type (gecko-1-b-win2012)
> settings:
> [taskcluster 2017-06-16T16:39:05.260Z]   {
> [taskcluster 2017-06-16T16:39:05.260Z]     "aws": {
> [taskcluster 2017-06-16T16:39:05.260Z]       "ami-id": "ami-ebad8cfd",
> [taskcluster 2017-06-16T16:39:05.260Z]       "availability-zone":
> "us-east-1a",
> [taskcluster 2017-06-16T16:39:05.260Z]       "instance-id":
> "i-0fd982070ba43cb88",
> [taskcluster 2017-06-16T16:39:05.260Z]       "instance-type": "c4.4xlarge",
> [taskcluster 2017-06-16T16:39:05.260Z]       "local-ipv4": "172.31.0.225",
> [taskcluster 2017-06-16T16:39:05.260Z]       "public-hostname":
> "ec2-52-91-241-19.compute-1.amazonaws.com",
> [taskcluster 2017-06-16T16:39:05.260Z]       "public-ipv4": "52.91.241.19"
> [taskcluster 2017-06-16T16:39:05.260Z]     },
> [taskcluster 2017-06-16T16:39:05.260Z]     "config": {
> [taskcluster 2017-06-16T16:39:05.260Z]       "deploymentId": "38e2c816634d",
> [taskcluster 2017-06-16T16:39:05.260Z]       "runTasksAsCurrentUser": false
> [taskcluster 2017-06-16T16:39:05.261Z]     },
> [taskcluster 2017-06-16T16:39:05.261Z]     "generic-worker": {
> [taskcluster 2017-06-16T16:39:05.261Z]       "go-arch": "amd64",
> [taskcluster 2017-06-16T16:39:05.261Z]       "go-os": "windows",
> [taskcluster 2017-06-16T16:39:05.261Z]       "go-version": "go1.7.5",
> [taskcluster 2017-06-16T16:39:05.261Z]       "release":
> "https://github.com/taskcluster/generic-worker/releases/tag/v10.0.5",
> [taskcluster 2017-06-16T16:39:05.261Z]       "version": "10.0.5"
> [taskcluster 2017-06-16T16:39:05.261Z]     },
> [taskcluster 2017-06-16T16:39:05.261Z]     "machine-setup": {
> [taskcluster 2017-06-16T16:39:05.261Z]       "ami-created": "2017-06-15
> 21:59:56.395Z",
> [taskcluster 2017-06-16T16:39:05.261Z]       "manifest":
> "https://github.com/mozilla-releng/OpenCloudConfig/blob/
> 38e2c816634d8540881f2a2d1ba96d47b0520045/userdata/Manifest/gecko-1-b-win2012.
> json"
> [taskcluster 2017-06-16T16:39:05.261Z]     }
> [taskcluster 2017-06-16T16:39:05.261Z]   }
> [taskcluster 2017-06-16T16:39:05.261Z] Task ID: DhDgSKVYR2iWR4SsA0TOjA
> [taskcluster 2017-06-16T16:39:05.261Z] === Task Starting ===
> [taskcluster 2017-06-16T16:39:05.271Z] Task not successful due to following
> exception(s):
> [taskcluster 2017-06-16T16:39:05.271Z] Exception 1)
> [taskcluster 2017-06-16T16:39:05.271Z] Malformed payload: artifact 'a.txt'
> expires before task deadline (0001-01-01T00:00:00.000Z is before
> 2017-06-17T16:14:09.929Z)
> [taskcluster 2017-06-16T16:39:05.271Z] 
> 
> 
> 
> I was expecting just to close this bug - but looks like there was a problem
> which bizarrely wasn't picked up by the unit tests, so I'll:
> 
> * make sure the unit tests catch the bug
> * fix the bug
> * make sure the unit tests are fixed
> 
> Strange - I was sure I had this tested and fixed. Oh well.
> 
> Late for me now, so will pick this up next week.

This issue should now be fixed in generic-worker 10.1.4. I'm currently rolling out generic-worker 10.1.4 to gecko-1-b-win2012-beta via OpenCloudConfig to test it.

Once that is rolled out globally across all worker types, we can look at relanding the gecko patch which removes the artifact expiries in the task definitions generated by the gecko decision task for generic-worker tasks.
(In reply to [github robot] from comment #16)
> Commit pushed to master at https://github.com/taskcluster/generic-worker
> 
> https://github.com/taskcluster/generic-worker/commit/
> df1564663e1adfe3c252fc84e544615e22b098e2
> Bug 1360198 - bug fix - directory artifacts also do not require explicit
> expiry

Gah, bugs!

Created 10.1.5 release to include fix.... Will test that too.
Is this fixed now, then?
Flags: needinfo?(pmoore)
I'll check! I'm pretty sure it landed in generic-worker and is working, but I don't remember if the gecko patch landed or not.
That said, we still have generic-worker 8 out there on windows testers, so I'd better make that a dependency...
Depends on: 1399401
Flags: needinfo?(pmoore)
With bug 1399401 deployed now, I've made a new try push to see if this works ok now:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=df273a7f666f936150846d58348281418726753f
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0222a82d1e90
don't set expiry for generic-worker task artifacts,r=dustin
Pushed by pmoore@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/186241aaea35
improve code comments per review feedback,r=dustin
Depends on: 1443589
Thanks Dorel. I'll try again once bug 1443589 has landed.
Flags: needinfo?(pmoore)
This is one of the optional properties that gets removed in bug 1486532, so marking this as a duplicate.
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → DUPLICATE
Duplicate of bug: 1486532
Component: Integration → Services
You need to log in before you can comment on or make changes to this bug.