Closed Bug 1278406 Opened 3 years ago Closed 3 years ago

Gecko decision tasks take 8 minutes making HTTP requests to queue

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

References

(Blocks 1 open bug)

Details

(Whiteboard: [optimization])

Attachments

(1 file)

From https://public-artifacts.taskcluster.net/D7BjgysjRnuNVG3rLhsGZw/0/public/logs/live_backing.log:

[34m 0:31.94(B[m Starting new HTTP connection (1): taskcluster
[34m 0:33.16(B[m "PUT /queue/v1/task/eYYlnt-bRWOfh0qsEhK7ZA HTTP/1.1" 200 524
[34m 0:33.16(B[m Creating task with taskId fIt_6K9qT9u7RkH4U6lKnQ for TaskLabel==AIED_qR-RPqZ3C47ttkTiQ
...
[34m 8:38.19(B[m Creating task with taskId drTF3b2bQYyNv-onxXtJQg for TaskLabel==eynGHW5vQqWir0PPmrem-w
[34m 8:39.18(B[m "PUT /queue/v1/task/drTF3b2bQYyNv-onxXtJQg HTTP/1.1" 200 378
[taskcluster 2016-06-06 21:40:04.807Z] === Task Finished ===

8 minutes making HTTP requests is a rather long time. We can do better.

dustin: can you brain dump what you said in IRC? I didn't really follow.
Flags: needinfo?(dustin)
This would be better if the requests were made concurrently (in a limited fashion -- 1000's of concurrent requests will cause timeouts, I bet).  You mentioned requests has built-in support for such a thing.
Component: Queue → Task Configuration
Flags: needinfo?(dustin)
Concurrent requests will only fix part of the problem.

Why do we need to make dozens of HTTP requests at all? Wouldn't it be more efficient for the queue to have a "bulk submit" API? That feels like fewer moving parts and increases the probability of an all-or-nothing end state. I like atomic operations.
But if we must do concurrent requests, https://github.com/ross/requests-futures would likely work.

The requests library is thread safe. But I think you need to provide your own thread/process pool. Probably better off using something pre-built.
The queue is making HTTP requests to Azure on the backend anyway, so a bulk API wouldn't save anything.

The task-graph-scheduler, which we had been using up until a week or so ago, made parallel requests:

https://github.com/taskcluster/task-graph-scheduler/blob/e4874e8170d55d0e506cb63dd2eeb9f744342574/scheduler/helpers.js#L180
  return Promise.all(input.tasks.map(function(taskNode) {
    return options.queue.defineTask(
      taskNode.taskId,
      taskNode.task

with no limit on the concurrency (but using the TC client library, which will retry 500's, thus handling timeouts; the decision tasks are using the Go version of that library).

I mentioned in irc, this timing used to be invisible to the user: the docker-worker would call scheduler.extendTaskGraph *after* the task had completed, so the time taken making that call was not counted in the time for the decision task.

So yes, whatever is easiest to use in-tree (that is, requiring the least vendoring) would be best here.
https://github.com/ross/requests-futures/blob/master/requests_futures/sessions.py is a very minimal wrapper around a basic futures executor. We already have futures vendored. This should be trivial to make concurrent...
Assignee: nobody → gps
Status: NEW → ASSIGNED
Currently, Gecko decision tasks spend ~8 minutes making HTTP requests to the
queue API. Let's throw a thread pool at it so tasks are submitted faster.

I'd prefer the queue offer a batch submit API. But it appears its operators
are not willing to offer such an API at this time.

Review commit: https://reviewboard.mozilla.org/r/58086/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/58086/
Attachment #8760510 - Flags: review?(dustin)
Comment on attachment 8760510 [details]
Bug 1278406 - Use a thread pool for submitting tasks to queue;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/58086/diff/1-2/
https://reviewboard.mozilla.org/r/58086/#review54960

https://queue.taskcluster.net/v1/task/AozSXU0ATji_xubqlO72Bg/runs/0/artifacts/public%2Flogs%2Flive_backing.log says queue API calls are back under 60s again. Appears to be a roughly 8x speedup. We can't fire HTTP requests as fast as the server will accept them because you can't create the task on the server until the dependent task is created. A bulk submit API would cut down on the round trip latency. But I suppose things are fast enough with this patch... for now.
Comment on attachment 8760510 [details]
Bug 1278406 - Use a thread pool for submitting tasks to queue;

https://reviewboard.mozilla.org/r/58086/#review54984

The tone of last sentence of the commit message feels out of place, particularly since I've indicated several times why that wouldn't actually be an advantage, but that will be lost in the sands of time soon enough.  And I've certainly been guilty of passive-aggressive comments from time to time.

::: taskcluster/taskgraph/create.py:42
(Diff revision 2)
> +            task_id = label_to_taskid[label]
>  
> -        _create_task(session, label_to_taskid[label], label, task_def)
> +            # We can't submit a task until its dependencies have been submitted.
> +            deps_fs = [fs[dep] for dep in deps_by_name.values()]
> +            for f in futures.as_completed(deps_fs):
> +                f.result()

This may still end up blocking more than necessary, since the post-order traversal doesn't totally order the tasks.  But is probably adequate for the purpose.  A better model might be to chain the futures into a DAG patterned after the graph.  If you'd like to give that a shot, I'd be happy to r? it.
Attachment #8760510 - Flags: review?(dustin) → review+
https://reviewboard.mozilla.org/r/58086/#review54984

Sorry - I didn't mean to sound passive aggressive in the commit message. Reading it again, it definitely comes across that way. I'll change the language. FTR, I definitely understand where you are coming from.
https://reviewboard.mozilla.org/r/58086/#review54984

> This may still end up blocking more than necessary, since the post-order traversal doesn't totally order the tasks.  But is probably adequate for the purpose.  A better model might be to chain the futures into a DAG patterned after the graph.  If you'd like to give that a shot, I'd be happy to r? it.

This feels like scope bloat. Let's add an inline comment to call out the potential optimization and call it a day?
Comment on attachment 8760510 [details]
Bug 1278406 - Use a thread pool for submitting tasks to queue;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/58086/diff/2-3/
Attachment #8760510 - Flags: review?(gps)
Comment on attachment 8760510 [details]
Bug 1278406 - Use a thread pool for submitting tasks to queue;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/58086/diff/3-4/
(In reply to Gregory Szorc [:gps] from comment #11)
> This feels like scope bloat. Let's add an inline comment to call out the
> potential optimization and call it a day?

I agree, and I like the comment.
Comment on attachment 8760510 [details]
Bug 1278406 - Use a thread pool for submitting tasks to queue;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/58086/diff/4-5/
Comment on attachment 8760510 [details]
Bug 1278406 - Use a thread pool for submitting tasks to queue;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/58086/diff/5-6/
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/aba39b9c043c
Use a thread pool for submitting tasks to queue; r=dustin
Blocks: thunder-try
Whiteboard: [optimization]
https://hg.mozilla.org/mozilla-central/rev/aba39b9c043c
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Attachment #8760510 - Flags: review?(gps)
Product: TaskCluster → Firefox Build System
You need to log in before you can comment on or make changes to this bug.