Closed Bug 1250458 Opened 4 years ago Closed 4 years ago

taskcluster upload should be able to cope with slow network

Categories

(Release Engineering :: Applications: MozharnessCore, defect, major)

Tracking

(firefox48 fixed)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: nthomas)

Attachments

(1 file, 1 obsolete file)

The slow network in bug 1250374 revealed that if we take more than 20 minutes to upload to taskcluster, we'll fail to reclaim the task, it'll expire, and everything will go pear-shaped. See bug 1250374 comment #8.

https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/building/buildbase.py#1541
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/taskcluster_helper.py#12

mshal set this up originally but I think he's on a work-week this week.
See Also: → 1250374
Dropping severity since this is not actively blocking anything other than resiliency of our network
Severity: blocker → major
IMO the easiest thing to do is to call reclaimTask in between each file, which would mean the per-file limit is 20 minutes instead of a per-job limit of 20 minutes. It would be better still if there's an easy way to periodically call reclaimTask in a separate thread or something, but off-hand I don't know how hard that would be to do.
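
For reference, the "separate thread" idea could look roughly like the sketch below. It is only an illustration, not code from mozharness: PeriodicReclaimer is a hypothetical name, the reclaim callable is whatever ends up wrapping reclaimTask, and it assumes Python 3 threading. Any error handling around the reclaim call is left out.

```python
import threading


class PeriodicReclaimer:
    """Hypothetical helper: call `reclaim` every `interval` seconds in a
    background thread for as long as the `with` block is running."""

    def __init__(self, reclaim, interval):
        self._reclaim = reclaim      # zero-argument callable, e.g. a reclaimTask wrapper
        self._interval = interval    # seconds between reclaim calls
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait() returns False on timeout and True once stop is set,
        # so the loop reclaims on every timeout and exits promptly on stop.
        # Note: an exception raised by reclaim would end this thread.
        while not self._stop.wait(self._interval):
            self._reclaim()

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop.set()
        self._thread.join()
        return False  # never swallow exceptions from the with-body
```

The upload loop would then sit inside a `with PeriodicReclaimer(reclaim, interval)` block, with an interval comfortably below the 20 minute claim window. This removes the per-file limit too, at the cost of a second thread talking to the queue.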
+1. It would help the most common failure modes without over-complicating this logic.
There are other things we could do here, but let's grab the low-hanging fruit.
Assignee: nobody → nthomas
Status: NEW → ASSIGNED
Attachment #8722797 - Flags: review?(mshal)
Huh, I thought we would've needed to add a new reclaim_task method in taskcluster_helper.py to call http://docs.taskcluster.net/queue/api-docs/#reclaimTask

:jonasfj, does calling claimTask again effectively do the same thing here as reclaimTask as far as resetting the timer?
Flags: needinfo?(jopsen)
@mshal,
You are right, claimTask != reclaimTask. Hmm, I see I can't refer to the docs, as I didn't write any...

> :jonasfj, does calling claimTask again effectively do the same thing here as reclaimTask as far as
> resetting the timer?
Calling claimTask(taskId, runId) on a task and run that is already running will return 409, conflict.

To postpone the takenUntil timestamp, call reclaimTask(taskId, runId).
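
To make the difference concrete, here is a toy in-memory model of the two calls as described in this comment. It is not the Taskcluster queue or its client: ToyQueue, Conflict and CLAIM_WINDOW are made-up names, the 20 minute window is taken from the bug description, and expiry is deliberately not modelled.

```python
from datetime import datetime, timedelta

CLAIM_WINDOW = timedelta(minutes=20)  # assumption: the window from the bug description


class Conflict(Exception):
    """Stands in for an HTTP 409 response in this toy model."""


class ToyQueue:
    """Toy model only: tracks a takenUntil timestamp per (taskId, runId)."""

    def __init__(self):
        self._taken_until = {}

    def claim_task(self, task_id, run_id, now):
        key = (task_id, run_id)
        if key in self._taken_until:
            # Claiming a run that is already running is a conflict.
            raise Conflict("run %r is already claimed" % (key,))
        self._taken_until[key] = now + CLAIM_WINDOW
        return self._taken_until[key]

    def reclaim_task(self, task_id, run_id, now):
        key = (task_id, run_id)
        # Reclaiming postpones takenUntil for a run we already hold.
        # (A run that was never claimed simply raises KeyError here.)
        self._taken_until[key]
        self._taken_until[key] = now + CLAIM_WINDOW
        return self._taken_until[key]
```

In this model a second claim_task on the same run raises Conflict, while reclaim_task 15 minutes into the claim moves takenUntil to 35 minutes after the original claim.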
Flags: needinfo?(jopsen)
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #7)
> Calling claimTask(taskId, runId) on a task and run that is already running
> will return 409, conflict.

Hmm, that doesn't seem to jibe with nthomas's try push: it looks like it is successful (or something is silently ignoring the error).

> 
> To postpone the takenUntil timestamp, call reclaimTask(taskId, runId).

So, I think we'll want a reclaim_task in taskcluster_helper that does something like:

        self.taskcluster_queue.reclaimTask(
            task['status']['taskId'],
            task['status']['runs'][-1]['runId'])

(untested)
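
Put together with the "reclaim between each file" idea, the shape might be something like this sketch. It is equally untested against the real code: UploadHelper, upload_all and the upload_file callable are hypothetical stand-ins rather than the actual contents of taskcluster_helper.py; only the reclaimTask(taskId, runId) call and the task['status'] layout come from the snippet above.

```python
class UploadHelper:
    """Hypothetical stand-in for the helper class in taskcluster_helper.py."""

    def __init__(self, queue, upload_file):
        self.taskcluster_queue = queue   # object exposing reclaimTask(taskId, runId)
        self._upload_file = upload_file  # hypothetical callable doing the actual transfer

    def reclaim_task(self, task):
        # Postpone takenUntil for the most recent run of this task.
        status = task['status']
        return self.taskcluster_queue.reclaimTask(
            status['taskId'],
            status['runs'][-1]['runId'])

    def upload_all(self, task, paths):
        for path in paths:
            # Reclaim before each file so the claim window applies
            # per file rather than to the whole upload step.
            self.reclaim_task(task)
            self._upload_file(task, path)
```

A single file that takes longer than the claim window would still expire the task with this approach; it only covers the case where the total upload is slow.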
Attachment #8722797 - Flags: review?(mshal) → feedback+
Duplicate of this bug: 1262911
Attachment #8722797 - Attachment is obsolete: true
Comment on attachment 8739729 [details]
MozReview Request: Bug 1250458 - Reclaim task before file uploads r=nthomas

The same approach works fine in create_reference_artifact() in the same helper.
Comment on attachment 8739729 [details]
MozReview Request: Bug 1250458 - Reclaim task before file uploads r=nthomas

https://reviewboard.mozilla.org/r/45353/#review41893

lgtm
Attachment #8739729 - Flags: review?(nthomas) → review+
Attachment #8739729 - Flags: checked-in+
https://hg.mozilla.org/mozilla-central/rev/b85e9878c32f
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED