Closed Bug 1405072 Opened 8 years ago Closed 7 years ago

Iteration support in python client library

Categories

(Taskcluster :: Services, enhancement, P5)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sfraser, Assigned: jhford, Mentored)

References

Details

The call I use most often in the python taskcluster client is queue.listTaskGroup. Once it supports the continuation token, it would be very useful if there was a higher level wrapper around it. It's likely we'll have several different pieces of code around the place to deal with the continuation token, when most of the use cases will want either the full task graph, or up to the specified limit. It would be useful if the python client could handle the token on our behalf, so we don't need to see it.
Mentor: dustin, jhford
Hi Dustin, I am newbie to this project and would like to work on this bug. Can I proceed working on this one?
Flags: needinfo?(dustin)
Absolutely! What simon is looking for here is a function which will call another API function like https://docs.taskcluster.net/reference/platform/taskcluster-auth/references/api#azureTables https://docs.taskcluster.net/reference/platform/taskcluster-queue/references/api#listArtifacts https://docs.taskcluster.net/reference/platform/taskcluster-queue/references/api#listTaskGroup https://docs.taskcluster.net/reference/core/taskcluster-secrets/references/api#list https://docs.taskcluster.net/reference/core/taskcluster-index/references/api#list-namespaces repeatedly, using the continuationToken option to build up the whole list in memory and then returning it. So this might look like queue = taskcluster.Queue() taskcluster.getAll(queue.listArtifacts, taskId, runId) so that would call queue.listArtifacts repeatedly, with taskId and runId, until it had fetched all of the artifacts, and then return the whole bunch. There are some complications here! For example, listArtifacts returns an object with an `artifacts` key, while listTaskGroup returns an object with a `tasks` key. How can this generic `getAll` function know what lists to put together to return "the whole bunch"? We do not want to make special-cases for each API method. How do you think we could solve this? Another complication is that the index.listNamespaces API method takes its continuationToken argument in the payload, instead of as a query parameter. I believe that is the only method to do so -- perhaps we could just modify it to behave like the others.
Flags: needinfo?(dustin)
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #2) Thanks Dustin. Before getting into the details of the issue. I would be grateful if you can help me out with a one query. a - how should I go about setting up development environment for this particular TaskCluster project ? > Absolutely! > > What simon is looking for here is a function which will call another API > function > .... > repeatedly, using the continuationToken option to build up the whole list in > memory and then returning it. > > So this might look like > > queue = taskcluster.Queue() > taskcluster.getAll(queue.listArtifacts, taskId, runId) > > so that would call queue.listArtifacts repeatedly, with taskId and runId, > until it had fetched all of the artifacts, and then return the whole bunch. > > There are some complications here! For example, listArtifacts returns an > object with an `artifacts` key, while listTaskGroup returns an object with a > `tasks` key. How can this generic `getAll` function know what lists to put > together to return "the whole bunch"? We do not want to make special-cases > for each API method. How do you think we could solve this? > Possible solutions to this : a - hold these possible keys in a constant list, i.e ["tasks", "artifacts", "secrets", "namespaces", ...] and iterate over this list to lookup the object(`listArtifacts or listTaskGroup`) passed to `getAll` method. b - pass the key name as the method parameter, i.e taskcluster.getAll(queue.listArtifacts, key, taskId, runId) where `key` is "tasks" or "artifacts" whichever is the key to get data from the first argument. c - pass a unique string to the client and lookup its values in an object with the hash as the key and the keyword to lookup with as its value, i.e taskcluster.getAll(queue.listArtifacts, hash, taskId, runId) lookup the key from this constant object { "hash1": "tasks", "hash2": "artifacts", "hash3": "secrets", "hash4": "namespaces" } b & c are almost on similar lines. Do let me know what you think and if a better solution can be found. > Another complication is that the index.listNamespaces API method takes its > continuationToken argument in the payload, instead of as a query parameter. > I believe that is the only method to do so -- perhaps we could just modify > it to behave like the others. Do let me know if modifying index.listNamespaces API method signature is within the scope of this contribution.
Flags: needinfo?(dustin)
If it'll be a generic wrapper to handle continuation tokens, I don't know if this would be possible, but: It would be nice if the result could be pythonic, such as 'iterate over all tasks returned by listTaskGroup' I'd started fiddling with https://github.com/srfraser/taskhuddler to answer some of my own questions about the release graphs, but it's not the generic client that taskcluster-client.py should be, as it involves knowing about how we structure things in order to answer useful questions.
(In reply to tuhina chatterjee from comment #3) > b - pass the key name as the method parameter, i.e > > c - pass a unique string to the client and lookup its values in an object > with the hash as > > b & c are almost on similar lines. Do let me know what you think and if a > better solution can be found. As a preference, I'd like to not have to tell the library how to do its job when I make calls to it. We could recursively merge dictionaries and extend lists - it seems unlikely there'd be duplicate entries in any lists returned.
I agree that an iterator would be better -- one that fetches a batch of results, yields them one-by-one, then fetches the next batch, and so on. And, you're right -- we should do this the right way. That means including enough metadata in the API definition that the client can auto-generate this sort of function. That's a bigger project than something we can expect of an Outreachy applicant (but you're welcome to work on it!) However, updating the index.listNamespaces API method is a good task. Please file a new bug to track that, and put the bug number in a comment on this one so that we can find it. Please also look through the other services on https://docs.taskcluster.net/reference to see if there are other API methods with continuation tokens that use POST instead of GET.
Flags: needinfo?(dustin)
Oops, I phrased that wrong: it's bigger than a "getting started" contribution. It would actually make a good Outreachy or GSoC project :)
..also, that kind of makes this not a very good mentored bug :)
Mentor: dustin, jhford
(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #6) That's a bigger project than something we can expect of > an Outreachy applicant (but you're welcome to work on it!) Is it possible that I take this up as an Outreachy project ? And I can probably select another good first bug to complete the application process ? > However, updating the index.listNamespaces API method is a good task. > Please file a new bug to track that, and put the bug number in a comment on > this one so that we can find it. Please also look through the other > services on https://docs.taskcluster.net/reference to see if there are other > API methods with continuation tokens that use POST instead of GET. This is the link to the bug https://bugzilla.mozilla.org/show_bug.cgi?id=1439058
Bug 1439058 will make a good first bug. We'll stick with the existing project for this round of Outreachy, though. Setting up for and mentoring an Outreachy project is a lot of work, and we're already set up for that project. Perhaps in a future round? Or, if you're selected, perhaps that can be a good follow-up project if you finish the main project quickly :)
Summary: [python client] Higher level calls in taskcluster-client.py → Iteration support in client libraries
The python client provides pagination support. https://github.com/taskcluster/taskcluster-client.py#pagination It looks like this is in reference to the Python client, if so I think this can be resolved as fixed.
Flags: needinfo?(sfraser)
(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #12) > The python client provides pagination support. > > https://github.com/taskcluster/taskcluster-client.py#pagination > > It looks like this is in reference to the Python client, if so I think this > can be resolved as fixed. This is a request for the python client to not present the continuation token, but instead to handle the api in a more pythonic way: for example as a generator. At the moment we're hand-rolling our own continuationToken handler for every script, and for some use cases writing our own higher level wrapper around taskcluster-client.py, since most of our use cases require fetching the whole graph. It would be useful if we didn't have to do this - folding it into taskcluster-client.py seems reasonable, rather than have everyone reinvent that particular wheel.
Flags: needinfo?(sfraser)
John: would you be willing to mentor this bug?
Flags: needinfo?(jhford)
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #13) > This is a request for the python client to not present the continuation > token, but instead to handle the api in a more pythonic way: for example as > a generator. > > At the moment we're hand-rolling our own continuationToken handler for every > script, and for some use cases writing our own higher level wrapper around > taskcluster-client.py, since most of our use cases require fetching the > whole graph. > > It would be useful if we didn't have to do this - folding it into > taskcluster-client.py seems reasonable, rather than have everyone reinvent > that particular wheel. The pagination support that I mentioned is not for presenting the continuation token, but rather taking a callable which is called on each response body. Here's the example of this callable support: import taskcluster queue = taskcluster.Queue() responses = [] def handle_page(y): print("%d tasks fetched" % len(y.get('tasks', []))) responses.append(y) queue.listTaskGroup('JzTGxwxhQ76_Tt1dxkaG5g', paginationHandler=handle_page) tasks = 0 for response in responses: tasks += len(response.get('tasks', [])) print("%d requests fetch %d tasks" % (len(responses), tasks)) I'm happy for us to have a generator based interface as well, or in place of this support if it's still desirable. (In reply to Chris Cooper [:coop] pronoun: he from comment #14) > John: would you be willing to mentor this bug? Sure!
Flags: needinfo?(jhford)
This bug seems to be about the python client, so changing the title to reflect that. The python client already has the requested support, as mentioned in comment 15. Let's open new bugs for possibly changing the pagination interface, or if there's problems doing pagination. I'm going to mark this bug as resolved because we have support for the requested interface.
Assignee: nobody → jhford
Status: NEW → RESOLVED
Closed: 7 years ago
Priority: -- → P5
Resolution: --- → FIXED
Summary: Iteration support in client libraries → Iteration support in python client library
This was the new bug to talk about changing the pagination interface :) I didn't request callbacks, so I don't know where that requested interface came from.
With a new Outreachy season approaching, I'm going to re-open this based on comment #7. Simon: if you need this more quickly, we would certainly accept/review patches, although it seems like you and John are still talking past each other about what's required. I'd encourage the two of you to hop on vidyo to sort it out.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee: jhford → nobody
Mentor: jhford
I wrote a comment but it seems that it didn't make its way into bugzilla, so here's version 2 :) My understanding is that this bug is about abstracting the low level continuation token away from users of the taskcluster-client.py library, which have support for. This support was added in an unrelated bug which was wrapping up right around when this bug was filed, which is why it used a different interface. Here's a usage example: import taskcluster queue = taskcluster.Queue() pages = [] queue.listTaskGroup('JzTGxwxhQ76_Tt1dxkaG5g', paginationHandler=pages.append) I'm finding this bug difficult to follow, but I think the new request is for us to add a generator interface to the client. That's a great idea! I've filed bug 1488482 to track this work. Simon, if the generator interface is the only outstanding request here, I think that the new bug will address that. Please let me know if there's anything other than a generator interface that I've missed, otherwise I don't think there's anything left in this bug.
Assignee: nobody → jhford
Flags: needinfo?(sfraser)
See Also: → 1488482
I think a higher level interface for common queries would be useful, but we can discuss that in the new bug if it's easier.
Flags: needinfo?(sfraser)
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Component: Client Libraries → Services
You need to log in before you can comment on or make changes to this bug.