Closed Bug 1712333 Opened 3 years ago Closed 3 years ago

Intermittent Decision task failure | HTTPError: 400 Client Error: Bad Request for url (due to dependency which expires before task deadline)

Categories

(Firefox Build System :: Task Configuration, defect, P2)

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1690947

People

(Reporter: aryx, Unassigned)

References

Details

Attachments

(1 obsolete file)

Many of the Gecko decision tasks on the Try repositories are failing (example) because this fetch-civet-source task expires in less than 24 hours, which is earlier than the deadline of the tasks in recent Try pushes.

Try push artifacts expire after 4 weeks. Should a fetch-civet-source task be scheduled on mozilla-central (where artifacts only expire after a year by default)?

For now, a manually requested fetch-civet-source should resolve the issue for the next 27 days.
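
The 27-day figure follows from the two numbers above. A minimal sketch of the arithmetic, assuming a task's deadline is 24 hours after the push and using a made-up creation time for the manually requested task:

```python
from datetime import datetime, timedelta, timezone

ARTIFACT_LIFETIME = timedelta(weeks=4)  # Try expiry, as stated above
TASK_DEADLINE = timedelta(days=1)       # assumption: deadline is 24 hours after the push


def usable_until(created: datetime) -> datetime:
    """Last moment a new push can still depend on a task created at `created`."""
    return created + ARTIFACT_LIFETIME - TASK_DEADLINE


# Hypothetical creation time of the manually requested fetch task.
created = datetime(2021, 5, 21, 15, 0, tzinfo=timezone.utc)
window = usable_until(created) - created
print(window.days)  # 27
```

After that window closes, any push whose tasks depend on the indexed task would run into the same problem again.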

Summary: there should also be a fetch-civet-source artifact which doesn't expire in the next 24 hours because that breaks decision tasks → there should always be a fetch-civet-source artifact which doesn't expire in the next 24 hours because that breaks decision tasks

Here's the relevant error:

[task 2021-05-21T14:02:28.072Z] http://taskcluster:80 "PUT /queue/v1/task/b9P7J3BBR2O25rzrLjr3fA HTTP/1.1" 400 6616
[task 2021-05-21T14:02:28.072Z] `task.dependencies` references tasks that expires
[task 2021-05-21T14:02:28.072Z] before `task.deadline` this is not allowed, see tasks: 
[task 2021-05-21T14:02:28.072Z]  * fIF6MHloTKyf9q5_zVpkwg,
[task 2021-05-21T14:02:28.072Z] All taskIds in `task.dependencies` **must** have
[task 2021-05-21T14:02:28.072Z] `task.expires` greater than the `deadline` for this task.
[task 2021-05-21T14:02:28.072Z] 
[task 2021-05-21T14:02:28.072Z] 
[task 2021-05-21T14:02:28.072Z] ---
[task 2021-05-21T14:02:28.072Z] 
[task 2021-05-21T14:02:28.072Z] * method:     createTask
[task 2021-05-21T14:02:28.072Z] * errorCode:  InputError
[task 2021-05-21T14:02:28.072Z] * statusCode: 400
[task 2021-05-21T14:02:28.072Z] * time:       2021-05-21T14:02:28.079Z
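
The rule in that error message can be sketched as follows. This only illustrates the comparison the message describes and is not the Taskcluster implementation; the helper names and the example timestamps are assumptions (the dependency's real expiry is not shown in the log).

```python
from datetime import datetime, timezone


def parse_ts(value: str) -> datetime:
    """Parse a timestamp such as 2021-05-21T14:02:28.072Z into an aware datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def expiring_dependencies(task: dict, dependency_expires: dict) -> list:
    """Return the dependencies whose expiry is not greater than the task's deadline."""
    deadline = parse_ts(task["deadline"])
    return [
        task_id
        for task_id in task["dependencies"]
        if parse_ts(dependency_expires[task_id]) <= deadline
    ]


task = {
    "deadline": "2021-05-22T14:02:28.072Z",  # assumed: one day after the failed request
    "dependencies": ["fIF6MHloTKyf9q5_zVpkwg"],
}
# Made-up expiry that falls before the deadline above.
expires = {"fIF6MHloTKyf9q5_zVpkwg": "2021-05-22T03:00:00.000Z"}
print(expiring_dependencies(task, expires))
```

A non-empty result corresponds to the case where the request is rejected with the 400 shown above.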

My theory is that this check started happening on the taskcluster side when we recently upgraded taskcluster. The index-search optimizer never bothers to check that the expiry of the dependency is greater than the deadline of the current task.

If my theory is correct, we'll see the same issue happening with other fetch and toolchain tasks as well (it's just that fetch-civet happened to be the first one to approach expiry since the taskcluster upgrade). So I think we'll have to prioritize this.

I think there are two ways to fix this:

  1. Turn off the check for Gecko. This isn't ideal since the check does prevent an intermittent failure case... though not one that has seemed to be very harmful thus far, in Gecko at least.
  2. Fix the IndexSearch optimizer to take the expiry of the dependency and deadline of the current task into account.

#2 is ideal, though it may take a bit of finagling to access the dependent's deadline from within the optimizer.
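
Option 2 could look roughly like the sketch below. It is not the Gecko IndexSearch code: `find_replacement`, the `lookup` callable and the index path are hypothetical stand-ins, and the caller is assumed to pass in the latest deadline among the tasks that depend on the one being replaced.

```python
from datetime import datetime, timezone
from typing import Callable, Iterable, Optional


def find_replacement(
    index_paths: Iterable[str],
    deadline: datetime,
    lookup: Callable[[str], Optional[dict]],
) -> Optional[str]:
    """Return the taskId of the first indexed task that outlives `deadline`, else None."""
    for path in index_paths:
        found = lookup(path)
        if found is None:
            continue  # nothing indexed under this path
        if found["expires"] <= deadline:
            continue  # expires too early for the dependents; do not reuse it
        return found["taskId"]
    return None


# Hypothetical index contents; the path and the expiry are made up.
index = {
    "example.fetch-civet-source": {
        "taskId": "fIF6MHloTKyf9q5_zVpkwg",
        "expires": datetime(2021, 5, 22, 3, 0, tzinfo=timezone.utc),
    }
}
deadline = datetime(2021, 5, 22, 14, 0, tzinfo=timezone.utc)
print(find_replacement(index.keys(), deadline, index.get))  # None
```

Returning `None` here means the task is not optimized away, so a fresh fetch task gets scheduled instead of depending on one that is about to expire.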

Priority: -- → P2
Summary: there should always be a fetch-civet-source artifact which doesn't expire in the next 24 hours because that breaks decision tasks → Intermittent Decision task failure | HTTPError: 400 Client Error: Bad Request for url (due to dependency which expires before task deadline)

Just to clarify why there isn't any issue with fetch-civet.

Prior to the taskcluster upgrade, this would have still worked (since the task hadn't expired yet). As soon as the task expired, new decision tasks would have failed to find a replacement via the index-search optimizer, and we would simply have scheduled a new task to run.

I think that fetch-civet task likely should have a much longer expiry, but it's not the root cause of the issue. Just the thing that triggers it.

> this would have still worked

Not quite -- if the dependent task runs after the fetch-civet task has expired, then it would fail.

Correct, the IndexSearch optimizer in Gecko should have been comparing expiry and deadline all along. This error is pointing out a flaw that has always existed in our optimizer.

Aryx pointed out that we've hit this failure in the past (and "fixed" it by increasing the expiry). So it's not a regression from the upgrade after all, and likely just very rare (and a coincidence that it happened shortly after the upgrade).

We should increase the fetch-civet expiry either way, but it might also be worth solving the root issue properly this time.

See Also: → 1617030
See Also: → 1618067

I got a bit nerd sniped here. The attached patch is untested and won't work yet, but should be pretty close to what we need. Posting it to phabricator now as there's a chance I'll let it slip.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → DUPLICATE

Comment on attachment 9223027 [details]
WIP: Bug 1712333 - Take dependent deadlines into account when deciding whether to replace a task

Revision D115726 was moved to bug 1690947. Setting attachment 9223027 [details] to obsolete.

Attachment #9223027 - Attachment is obsolete: true