Closed Bug 1251515 (opened 4 years ago, closed 4 years ago)

artifactsTask is resolved regardless of parent's task result

Categories

(Release Engineering :: Release Automation: Other, defect)


Tracking

(firefox47 fixed)

RESOLVED FIXED

People

(Reporter: rail, Assigned: rail)

Attachments

(1 file)

Consider the following scenario:

1) l10n repack fails to repack some locales
https://tools.taskcluster.net/task-inspector/#qVvN5b_ARhyoWUwBnCgFQA/0 is an example
run 0 log: http://archive.mozilla.org/pub/firefox/tinderbox-builds/date-l10n/release-date_firefox_win64_l10n_repack-bm74-build1-build62.txt.gz

2) it uses a separate task to collect all artifacts
3) regardless of the result the artifacts task is marked as resolved
4) l10n repack task reruns
5) l10n repack rerun tries to use the same artifacts task run ID and fails:

http://archive.mozilla.org/pub/firefox/tinderbox-builds/date-l10n/release-date_firefox_win64_l10n_repack-bm91-build1-build46.txt.gz

17:50:28    FATAL - TaskclusterRestFailure: Run 0 was already claimed by another worker.
17:50:28    FATAL - ----
17:50:28    FATAL - errorCode:  RequestConflict
17:50:28    FATAL - statusCode: 409
17:50:28    FATAL - requestInfo:
17:50:28    FATAL -   method:   claimTask
17:50:28    FATAL -   params:   {"taskId":"zt9QjqefRtumeaYM9QzIiw","runId":"0"}
17:50:28    FATAL -   payload:  {
17:50:28    FATAL -   "workerGroup": "buildbot",
17:50:28    FATAL -   "workerId": "buildbot"
17:50:28    FATAL - }
17:50:28    FATAL -   time:     2016-02-26T01:50:29.592Z
17:50:28    FATAL - details:
17:50:28    FATAL - {
17:50:28    FATAL -   "runId": 0
17:50:28    FATAL - }
17:50:28    FATAL - Running post_f


runId 0 is the wrong ID here: the artifacts task was never rescheduled, because it had already been resolved during the first run.

We should probably not resolve the artifacts task if any of the locales are busted.
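
A minimal sketch of one way to act on this proposal, not the attached patch itself. The `RecordingQueue` class is a hypothetical stand-in for a Taskcluster queue client so the example runs offline, `resolve_artifacts_task` is an illustrative name, and the values of `SUCCESS_STR` and `FAILURE_STR` are assumed. The idea is to report the artifacts run as failed whenever any locale failed, and as completed only when every locale succeeded.

```python
# Assumed result markers; the bug only names FAILURE_STR, not its value.
SUCCESS_STR = "Success"
FAILURE_STR = "Failed"


class RecordingQueue:
    """Hypothetical stand-in for a queue client; it only records calls."""

    def __init__(self):
        self.calls = []

    def reportCompleted(self, task_id, run_id):
        self.calls.append(("reportCompleted", task_id, run_id))

    def reportFailed(self, task_id, run_id):
        self.calls.append(("reportFailed", task_id, run_id))


def resolve_artifacts_task(queue, task_id, run_id, locale_results):
    """Resolve the artifacts run according to the per-locale results.

    Returns the sorted list of failed locales (empty when all succeeded).
    """
    failed = sorted(
        locale for locale, result in locale_results.items()
        if result == FAILURE_STR
    )
    if failed:
        # At least one locale is busted: do not mark the run as completed.
        queue.reportFailed(task_id, run_id)
    else:
        queue.reportCompleted(task_id, run_id)
    return failed
```

With results such as `{"de": SUCCESS_STR, "fr": FAILURE_STR}` the sketch records a single `reportFailed` call and returns `["fr"]`.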
Assignee: nobody → rail
Attached patch artifacts.diff
Attachment #8724551 - Flags: review?(jlund)
Comment on attachment 8724551
artifacts.diff

Review of attachment 8724551:
-----------------------------------------------------------------

great solution!

::: testing/mozharness/scripts/desktop_l10n.py
@@ +504,5 @@
>          self.set_buildbot_property(prop_key, prop_value, write_to_file=True)
>          BaseScript.add_failure(self, locale, message=message, **kwargs)
>  
> +    def query_failed_locales(self):
> +        return [l for l, res in self.locales_property.items() if

are we going to use this list anywhere? or should we simplify this to:

return FAILURE_STR in self.locales_property.values()
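
To illustrate the trade-off raised in this review, here is a small sketch comparing the two forms. Both functions take the locale results as a plain dict instead of reading `self.locales_property`, the name `has_failed_locales` is hypothetical, and the value of `FAILURE_STR` is assumed. The list form keeps the failed locale names for logging, while the boolean form only answers whether anything failed.

```python
FAILURE_STR = "Failed"  # assumed value; the review only names the constant


def query_failed_locales(locales_property):
    """List form: return the names of the locales that failed, sorted."""
    return sorted(
        locale for locale, result in locales_property.items()
        if result == FAILURE_STR
    )


def has_failed_locales(locales_property):
    """Boolean form suggested in the review: did any locale fail?"""
    return FAILURE_STR in locales_property.values()
```

Because an empty list is falsy, `if query_failed_locales(...)` and `if has_failed_locales(...)` take the same branch; the list form is only worth keeping if the names are used somewhere.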
Attachment #8724551 - Flags: review?(jlund) → review+
https://hg.mozilla.org/mozilla-central/rev/4ca277948a42
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
backed out for causing problems on m-c like https://treeherder.mozilla.org/logviewer.html#?job_id=3384796&repo=mozilla-central

05:27 <&garndt> it's not liking that hyphen there
05:29 <&garndt> which looks to be a long dash (\u2013)
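
A quick way to catch this kind of character before pushing is to scan the commit message for anything outside ASCII. This is only a sketch: `find_non_ascii` and `asciify_dashes` are hypothetical helper names, and the sample message in the usage note is made up, not the actual commit message.

```python
def find_non_ascii(message):
    """Return (index, character, escape) for every non-ASCII character."""
    return [
        (index, char, "\\u%04x" % ord(char))
        for index, char in enumerate(message)
        if ord(char) > 127
    ]


def asciify_dashes(message):
    """Replace en and em dashes with a plain ASCII hyphen."""
    return message.replace("\u2013", "-").replace("\u2014", "-")
```

For a made-up message like `"Bug 1251515 \u2013 fix"`, `find_non_ascii` reports the dash at index 12, and `asciify_dashes` returns a message on which `find_non_ascii` finds nothing.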
Flags: needinfo?(rail)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Pushed again as https://hg.mozilla.org/integration/mozilla-inbound/rev/ad0a07c60528, with the long dash in the commit message replaced
Flags: needinfo?(rail)
https://hg.mozilla.org/mozilla-central/rev/ad0a07c60528
Status: REOPENED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Comment on attachment 8724551
artifacts.diff

I think there might be fallout, or at least a follow-up patch required

job:
http://buildbot-master74.bb.releng.usw2.mozilla.com:8001/builders/release-date_firefox_win32_l10n_repack/builds/78/steps/run_script/logs/stdio

log snippet:
20:51:24     INFO - Resolving -iVt-s89T1qml_wXkap_oQ, run 2. Full task:
20:51:24     INFO - {u'status': {u'workerType': u'buildbot', u'taskGroupId': u'3IWQbJzETh-E9hsQjQDqQA', u'runs': [{u'scheduled': u'2016-03-02T00:05:09.298Z', u'reasonCreated': u'scheduled', u'takenUntil': u'2016-03-02T02:03:29.499Z', u'started': u'2016-03-02T01:43:29.594Z', u'workerId': u'buildbot', u'reasonResolved': u'claim-expired', u'workerGroup': u'buildbot', u'state': u'exception', u'runId': 0, u'resolved': u'2016-03-02T02:03:30.379Z'}, {u'scheduled': u'2016-03-02T02:03:30.379Z', u'reasonCreated': u'retry', u'takenUntil': u'2016-03-02T04:02:59.654Z', u'started': u'2016-03-02T03:43:00.020Z', u'workerId': u'buildbot', u'reasonResolved': u'claim-expired', u'workerGroup': u'buildbot', u'state': u'exception', u'runId': 1, u'resolved': u'2016-03-02T04:03:01.115Z'}, {u'scheduled': u'2016-03-02T04:03:01.115Z', u'reasonCreated': u'retry', u'state': u'pending', u'runId': 2}], u'expires': u'3016-03-02T00:04:13.634Z', u'retriesLeft': 3, u'state': u'pending', u'schedulerId': u'task-graph-scheduler', u'deadline': u'2016-03-06T00:04:13.596Z', u'taskId': u'-iVt-s89T1qml_wXkap_oQ', u'provisionerId': u'null-provisioner'}}
20:51:24     INFO - Starting new HTTPS connection (1): queue.taskcluster.net
c:\builds\moz2_slave\rel-date_fx_w32_l10n_rpk-00000\build\venv\Lib\site-packages\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
20:51:24    FATAL - Uncaught exception: Traceback (most recent call last):
20:51:24    FATAL -   File "c:\builds\moz2_slave\rel-date_fx_w32_l10n_rpk-00000\scripts\mozharness\base\script.py", line 1765, in run
20:51:24    FATAL -     self.run_action(action)
20:51:24    FATAL -   File "c:\builds\moz2_slave\rel-date_fx_w32_l10n_rpk-00000\scripts\mozharness\base\script.py", line 1707, in run_action
20:51:24    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
20:51:24    FATAL -   File "c:\builds\moz2_slave\rel-date_fx_w32_l10n_rpk-00000\scripts\mozharness\base\script.py", line 1647, in _possibly_run_method
20:51:24    FATAL -     return getattr(self, method_name)()
20:51:24    FATAL -   File "scripts/scripts/desktop_l10n.py", line 1046, in taskcluster_upload
20:51:24    FATAL -     artifacts_tc.report_completed(artifacts_task)
20:51:24    FATAL -   File "c:\builds\moz2_slave\rel-date_fx_w32_l10n_rpk-00000\scripts\mozharness\mozilla\taskcluster_helper.py", line 135, in report_completed
20:51:24    FATAL -     self.taskcluster_queue.reportCompleted(task_id, run_id)
20:51:24    FATAL -   File "c:\builds\moz2_slave\rel-date_fx_w32_l10n_rpk-00000\build\venv\Lib\site-packages\taskcluster\client.py", line 455, in apiCall
20:51:24    FATAL -     return self._makeApiCall(e, *args, **kwargs)
20:51:24    FATAL -   File "c:\builds\moz2_slave\rel-date_fx_w32_l10n_rpk-00000\build\venv\Lib\site-packages\taskcluster\client.py", line 232, in _makeApiCall
20:51:24    FATAL -     return self._makeHttpRequest(entry['method'], route, payload)
20:51:24    FATAL -   File "c:\builds\moz2_slave\rel-date_fx_w32_l10n_rpk-00000\build\venv\Lib\site-packages\taskcluster\client.py", line 424, in _makeHttpRequest
20:51:24    FATAL -     superExc=rerr
20:51:24    FATAL - TaskclusterRestFailure: Run 2 on task -iVt-s89T1qml_wXkap_oQ is resolved or not running.
20:51:24    FATAL - ----
20:51:24    FATAL - errorCode:  RequestConflict
20:51:24    FATAL - statusCode: 409
20:51:24    FATAL - requestInfo:
20:51:24    FATAL -   method:   reportCompleted
20:51:24    FATAL -   params:   {"taskId":"-iVt-s89T1qml_wXkap_oQ","runId":"2"}
20:51:24    FATAL -   payload:  {}
20:51:24    FATAL -   time:     2016-03-02T04:51:26.463Z
20:51:24    FATAL - details:
20:51:24    FATAL - {
20:51:24    FATAL -   "taskId": "-iVt-s89T1qml_wXkap_oQ",
20:51:24    FATAL -   "runId": 2
20:51:24    FATAL - }
20:51:24    FATAL - Running post_fatal callback...
20:51:24    FATAL - Exiting -1
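
In the status dump above, run 2 is still in state `pending` when the script calls `reportCompleted`, which matches the "resolved or not running" conflict. A hedged sketch of a guard that inspects the latest run first is shown below. It assumes the status dict has the shape printed in the log (trimmed here to the relevant keys); `latest_run` and `can_report_completed` are hypothetical helpers, not part of mozharness or the Taskcluster client.

```python
def latest_run(task):
    """Return the run entry with the highest runId from a task status dict."""
    return max(task["status"]["runs"], key=lambda run: run["runId"])


def can_report_completed(task):
    """Only a run that is currently running can be reported as completed."""
    return latest_run(task)["state"] == "running"


# Status trimmed from the log above to the keys this sketch uses.
task = {
    "status": {
        "taskId": "-iVt-s89T1qml_wXkap_oQ",
        "runs": [
            {"runId": 0, "state": "exception", "reasonResolved": "claim-expired"},
            {"runId": 1, "state": "exception", "reasonResolved": "claim-expired"},
            {"runId": 2, "state": "pending"},
        ],
    }
}
```

Here `latest_run(task)["runId"]` is 2 and `can_report_completed(task)` is false, so a caller following this sketch would claim the run before trying to resolve it.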
does taskcluster_queue.reportFailed(taskid) allow for that taskid to be used again?
(In reply to Jordan Lund (:jlund) from comment #12)
> does taskcluster_queue.reportFailed(taskid) allow for that taskid to be used
> again?

Yes, TC reschedules another run. I checked this explicitly.

comment 11 is actually bug 1252725 and there is a patch ;)