Bug 1184209 (Closed)
Opened 9 years ago · Closed 8 years ago
reference-format: Handle end-points that returns a stream (Discussion)
Component: Taskcluster :: Services (defect)
Status: RESOLVED FIXED
Reporter: jonasfj · Assignee: Unassigned
Some API end-points, like getArtifact, return a data stream: the request is redirected to a binary file. We should denote this in the API reference, and clients should implement a way to download a binary response as a stream. Client libraries should rely on the position in the stream to implement automatic retries behind the scenes; I think that could work. Ideas for what to call this in the reference format, and how to implement it client-side, are welcome. IMO, we should make the investment and handle retries for people.
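A minimal sketch of the position-based retry idea above, assuming the source can be reopened at a byte offset (for HTTP, by issuing a GET with a `Range: bytes=<offset>-` header). All names here are illustrative, not part of any Taskcluster client:

```python
# Hypothetical sketch of position-based retries for a streaming end-point.
# `open_from(offset)` is any caller-supplied callable that returns an
# iterator of byte chunks starting at `offset`; for HTTP it would issue a
# GET with a "Range: bytes=<offset>-" header.
def resumable_stream(open_from, max_retries=5):
    """Yield chunks, reopening the source at the last good offset on error."""
    offset = 0
    failures = 0
    while True:
        try:
            for chunk in open_from(offset):
                offset += len(chunk)
                failures = 0        # progress was made, reset the retry budget
                yield chunk
            return                  # source exhausted without error
        except IOError:
            failures += 1
            if failures > max_retries:
                raise
```

The consumer sees a single uninterrupted stream; the retries happen behind the scenes, which is the behaviour the description asks client libraries to provide.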
Comment 1•9 years ago
I agree. I've reverted the change to stream=True in the python client. What about calling it 'streaming_output' and making it a boolean? If we default to False, we wouldn't need to modify existing APIs to support this behaviour.
Reporter
Comment 3•9 years ago
Hmm... I don't like adding a new boolean, because that sort of allows invalid data that still satisfies the JSON schema for reference formats. Maybe we introduce a new type: "stream"? Ie.:

entries: [
  {
    type:        'function',
    method:      ...,
    route:       ...,
    args:        ...,
    name:        ...,
    scopes:      ...,
    input:       ...,
    output:      ...,
    title:       ...,
    description: ...
  },
  {
    type:        'stream',
    method:      ...,
    route:       ...,
    args:        ...,
    name:        ...,
    scopes:      ...,
    // NO input or output
    title:       ...,
    description: ...
  }
]

I'm not so worried about compatibility; there are only about 3 implementations, and I've designed mine to ignore "types" it doesn't understand :)

@pmoore, jhford, what do you think? And how do you propose we implement this new kind of entry as an API method? Is it a method that just returns a signed URL, or do we try to do streaming with automatic retries behind the scenes?
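For illustration, the two entry shapes proposed above could be contrasted like this. This is a hypothetical sketch with placeholder routes, names, and schema URLs, not the real Taskcluster reference format:

```python
# Hypothetical reference entries contrasting the existing 'function' type
# with the proposed 'stream' type. All field values are placeholders.
entries = [
    {
        "type": "function",
        "method": "post",
        "route": "/example/<exampleId>",
        "args": ["exampleId"],
        "name": "exampleCall",
        "scopes": [["example:scope"]],
        "input": "https://schemas.example/request.json",
        "output": "https://schemas.example/response.json",
        "title": "Example Call",
        "description": "A normal JSON-in/JSON-out end-point.",
    },
    {
        "type": "stream",
        "method": "get",
        "route": "/example/<exampleId>/data",
        "args": ["exampleId"],
        "name": "exampleStream",
        "scopes": [["example:scope"]],
        # deliberately NO input or output: the body is an opaque stream
        "title": "Example Stream",
        "description": "An end-point whose response is a byte stream.",
    },
]

# A client designed to ignore entry types it doesn't understand would
# simply filter on the types it knows:
known_types = {"function", "stream"}
usable = [e for e in entries if e["type"] in known_types]
```

Because the stream entry simply omits `input`/`output`, the schema can forbid those keys for `type: 'stream'`, which avoids the invalid-but-schema-valid combinations a boolean flag would allow.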
Flags: needinfo?(pmoore)
Flags: needinfo?(jhford)
Comment 4•9 years ago
I'm happy to do it either way. Having type: 'stream' is going to require more changes around docs generation and the like, but it might simplify the python client a little. I think it would be strange to send back just a signed URL, but automated retries might be a lot of work to deal with. I think making this a per-client decision might be the best bet, since not every networking library makes it reasonable to implement the retry logic.
Flags: needinfo?(jhford)
Reporter
Comment 5•9 years ago
Hmm... Well, the downside to returning a signed URL is that it only works for GET requests, and we really should allow the "stream" type to work for POST, etc., too, just in case we one day need it. But maybe the initial implementation just returns a stream of some kind and lets clients deal with the retry logic.
Comment 6•9 years ago
I like the idea of handling the retries in the client, and just exposing a stream that hides the retry logic. The stream is also nice for not swallowing memory (as opposed to, say, populating a byte[]). I'm happy to give it a shot. I also like having an extra "type" to identify a stream. It will be a fun piece to work on. =)

However: by translating the HTTP response into a stream, we lose the benefit of the HTTP header metadata, such as content-length. Maybe we should return an array of HTTP headers plus a stream for the content. This is essentially what an HTTP response is (headers + output stream), so maybe we should instead return an actual HTTP response that uses an underlying HTTP transport that implements retries. That way, to all intents and purposes, it looks like a simple HTTP response; the retries are done behind the scenes, and the client does not need to be concerned with them.

If we do go the stream + headers route (rather than the HTTP transport route), we'll need to make sure that we throw I/O exceptions if we can't get data from the source URL (after retrying), rather than just hanging or closing the stream.
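The "headers + stream" shape could be as small as the sketch below. Names are made up for illustration; the point is that header metadata like content-length survives the translation, and a short read raises an I/O error instead of silently closing the stream:

```python
# Hypothetical response-like wrapper: exposes HTTP header metadata
# alongside the body stream, and fails loudly on a truncated stream.
class StreamedResponse:
    def __init__(self, headers, chunks):
        self.headers = dict(headers)   # e.g. contains "content-length"
        self._chunks = iter(chunks)

    @property
    def content_length(self):
        value = self.headers.get("content-length")
        return int(value) if value is not None else None

    def iter_content(self):
        delivered = 0
        for chunk in self._chunks:
            delivered += len(chunk)
            yield chunk
        # throw rather than hang/close if the body ended short of the
        # advertised length (the failure mode the comment warns about)
        expected = self.content_length
        if expected is not None and delivered < expected:
            raise IOError(
                "stream ended after %d of %d bytes" % (delivered, expected))
```

A retrying transport would sit underneath and feed `chunks`; the caller just sees headers plus an iterable body, close to what an ordinary HTTP response offers.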
Flags: needinfo?(pmoore)
Reporter
Comment 7•9 years ago
It seems we have agreement that we should add a new type: "stream". Then we can always work out how client libraries implement it.

---

The downside I see is that we can't have an API end-point that you POST JSON to and get a stream back from, or the other way around. I don't have use-cases; I'm just thinking we should try to be as generic as possible. Perhaps, instead, it should be:

  input:  'stream' or 'https://schemas....'
  output: 'stream' or 'https://schemas....'

Or maybe:

  input:  {type: 'application/octet'}
  input:  {type: 'application/json', schema: 'https://schemas....'}
  output: {type: 'application/octet'}
  output: {type: 'application/json', schema: 'https://schemas....'}

Granted, that would break existing client libraries, but it might be a better approach, as we could support more result types in the future and mix input and output.
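The input/output-as-objects idea above can be sketched with the check a client might run before dispatching a request. This is a hypothetical illustration; the entry, routes, and schema URL are placeholders:

```python
# Hypothetical entry using the proposed input/output objects: JSON in,
# opaque byte stream out. Values are placeholders, not real Taskcluster data.
entry = {
    "name": "exampleStream",
    "input": {"type": "application/json",
              "schema": "https://schemas.example/request.json"},
    "output": {"type": "application/octet"},
}

def wants_stream(io_spec):
    """True when the declared content type calls for streaming I/O.

    `io_spec` is an input/output object as above, or None for end-points
    that have no body in that direction.
    """
    return io_spec is not None and io_spec["type"] == "application/octet"
```

Because each direction carries its own type, a JSON request body and a streamed response body (or vice versa) can coexist on one end-point, which is exactly the mixing the comment asks for.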
Comment 8•9 years ago
Why can't we have an endpoint that you POST JSON to and receive a stream from? Is it because we would only have a single stream: true flag? What about having:

  input:          <as now>
  inputIsStream:  false
  output:         <as now>
  outputIsStream: true

and using output: undefined for the cases where streaming is enabled on a given endpoint.
Reporter
Comment 9•9 years ago
Yeah, we should probably do something along the lines of what you suggest there. I would prefer input/output to be objects, for the sake of not having invalid data that still satisfies the schema.
Comment 10•9 years ago
Maybe the references/schemas should define the service endpoints, rather than the client library. As far as the API endpoint goes, it is returning an HTTP response that may have a content-type other than 'application/json', so the more I think about it, I'm not sure we should describe it as a stream in the API docs, but let the client docs manage that.

In bug 1190016 I've just proposed that the creation of s3 artifacts becomes a PUT of metadata to the queue endpoint (defineArtifact) that gets updated when the publish is performed. The GET then returns the same metadata. In this way the Queue offers a standard RESTful resource publication mechanism, albeit for the metadata.

I think if we are going to provide the streaming in the client, the client should simply get the URL from the metadata, and handle the publish itself (rather than getting a 30x redirect). This way we keep the docs clean (the API endpoints are only concerned with metadata) and the publish mechanism is internal to the client. It would therefore be documented in the client, and only referred to in the HTTP endpoint docs, which could simply say "please see the client-specific documentation about publishing an artifact using this metadata" with potential links to the clients.

This also allows the various clients (node, python, go, java) to implement these features in different ways, or not to implement them at all. I think overall that would be cleaner/simpler.
Comment 11•9 years ago
Maybe a topic for a group meeting, ideally before John goes on PTO!
Comment 12•8 years ago
Commits pushed to master at https://github.com/taskcluster/taskcluster-lib-api

https://github.com/taskcluster/taskcluster-lib-api/commit/c511a9849be03bf802ead788c90399573bdda595
Fix bug 1184209

https://github.com/taskcluster/taskcluster-lib-api/commit/caf8e12efdc796f4262fe3d425d04033d31a52a0
Merge pull request #33 from taskcluster/bug1184209 (Fix bug 1184209)
Updated•8 years ago
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Assignee
Updated•5 years ago
Component: Client Libraries → Services