Closed Bug 1184209 Opened 9 years ago Closed 8 years ago

reference-format: Handle end-points that return a stream (Discussion)

Categories

(Taskcluster :: Services, defect)

Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jonasfj, Unassigned)

References

Details

Some API end-points, like getArtifact, return a data stream...
The request is redirected to a binary file...

We should denote this in the API reference and clients should implement ways to download a binary response as a stream.

Client libraries should rely on the position in the stream to implement automatic retries behind the scenes.
Or something like that; I think it could work.

Ideas for how/what to call this in the reference format, and how to implement it client-side, are welcome.

IMO, we should make the investment and handle retries for people.
I agree.  I've reverted the change to stream=True in the python client.
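A minimal sketch of the position-based retry idea above, assuming the transport supports resuming from an offset (e.g. via an HTTP Range header). `fetch` here is a hypothetical stand-in for whatever HTTP call a client library makes, not a real API:

```python
# Sketch only: resume an interrupted download from the last byte received.
# `fetch(offset)` must return an iterable of byte chunks starting at
# `offset` (e.g. by sending `Range: bytes=<offset>-`).

def download_with_resume(fetch, max_retries=5):
    """Yield chunks from `fetch`, retrying from the current position."""
    position = 0
    retries = 0
    while True:
        try:
            for chunk in fetch(position):
                position += len(chunk)
                yield chunk
            return  # stream finished cleanly
        except IOError:
            retries += 1
            if retries > max_retries:
                raise  # surface the failure instead of hanging
```

Because the consumer sees one continuous stream, the retry logic stays entirely behind the scenes.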

What about calling it 'streaming_output' and have it be a boolean?  If we default to False, we wouldn't need to modify existing APIs to support this behaviour.
Hmm... I don't like adding a new boolean, because that would allow invalid data that still satisfies
the JSON schema for reference formats.

Maybe, we introduce a new type: "stream" ?

I.e.:
entries: [
  {
    type: 'function',
    method: ...
    route: ...
    args: ...
    name: ...
    scopes: ...
    input: ...
    output: ...
    title: ...
    description: ...
  }, {
    type: 'stream',
    method: ...
    route: ...
    args: ...
    name: ...
    scopes: ... // NO input or output
    title: ...
    description: ...
  },
]

I'm not so worried about compatibility; there are only about 3 implementations, and I've designed my
implementation to ignore "types" it doesn't understand :)
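As a rough illustration of that forward-compatibility point (the function and variable names here are hypothetical), a client-side loader can simply skip entry types it does not recognize, so introducing type: 'stream' would not break existing clients:

```python
# Sketch: tolerate unknown entry types so that adding type: 'stream' to the
# reference format does not break clients built against the older format.
KNOWN_TYPES = {'function', 'stream'}

def load_entries(reference):
    """Split a reference's entries into understood and skipped ones."""
    loaded, skipped = [], []
    for entry in reference.get('entries', []):
        (loaded if entry.get('type') in KNOWN_TYPES else skipped).append(entry)
    return loaded, skipped
```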

@pmoore, jhford, what do you think?
And how do you propose we implement this new kind of entry as an API method: is it an API method that just
returns a signed URL, or do we try to do streaming with automatic retries behind the scenes?
Flags: needinfo?(pmoore)
Flags: needinfo?(jhford)
I'm happy to do it either way.  Having type: stream is going to require more changes around docs generation and the like, but it might simplify the python client a little.

I think it'd be strange to send back just a signed URL, but it might be a lot of work to deal with the automated retries.  I think having this be a per-client thing might be the best bet, since not all networking libraries make it reasonable to implement the retry logic.
Flags: needinfo?(jhford)
Hmm... Well, the downside to returning a signed URL is that this will only work with GET requests.
And we really should allow the "stream" type to work for POST, etc., too, just in case we one day need it.

But maybe the initial implementation is to just return a stream of some kind, and let clients deal with
retry logic.
I like the idea of handling the retries in the client, and just exposing a stream that hides the retry logic. The stream is also nice for not swallowing memory (i.e. as opposed to, say, populating a byte[]).

I'm happy to give it a shot. Also I like having an extra "type" to identify stream. Will be a fun piece to work on. =)

However: by translating the http response into a stream, we do lose the benefit of the http header metadata, such as content-length. Maybe we should return an array of http headers, plus a stream to the content. This is essentially what an http response is (headers + output stream) - so maybe we should instead return an actual http response that uses an underlying http transport that implements retries. This way, to all intents and purposes, it looks like a simple http response, and the retries are done behind the scenes, so the client does not need to be concerned with it.

If we do go the stream + headers route (rather than the http transport route), we'll need to make sure that we throw I/O exceptions if we can't get data from the source url (after retrying) - rather than just hanging, or closing the stream.
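One possible shape for the "headers + stream" idea above; `StreamedResponse` is a hypothetical name for illustration, not part of any existing client:

```python
# Sketch: an http-response-like object that keeps header metadata (e.g.
# content-length) alongside a chunk iterator. The iterator is expected to
# raise IOError once retries are exhausted, rather than hanging or
# silently closing the stream.
class StreamedResponse:
    def __init__(self, headers, chunks):
        self.headers = headers
        self._chunks = chunks

    @property
    def content_length(self):
        value = self.headers.get('content-length')
        return int(value) if value is not None else None

    def __iter__(self):
        return iter(self._chunks)
```

A caller could then iterate the response for chunks while still reading metadata like content-length from the headers.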
Flags: needinfo?(pmoore)
It seems we have agreement that we should introduce a new type: "stream".

Then we can figure out how client libraries should implement it.

---
The downside I see is that we can't have an API end-point that you POST JSON to and get a stream back from.
Or the other way around.

I don't have use-cases; I'm just thinking we should try to be as generic as possible.
Perhaps, instead it should be:
  input: 'stream' or 'https://schemas....'
  output: 'stream' or 'https://schemas....'
Or maybe it should be:
  input: {type: 'application/octet-stream'}
  input: {type: 'application/json', schema: 'https://schemas....'}
  output: {type: 'application/octet-stream'}
  output: {type: 'application/json', schema: 'https://schemas....'}
Granted, that would break existing client libraries. But it might be a better way, as we could support
more result types in the future and mix input and output.
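To make the earlier "invalid data that still satisfies the schema" concern concrete: with a discriminated type, the allowed input/output fields can be keyed off the type, which a standalone boolean flag cannot express. A hedged sketch (the field rules and names are assumptions, not the actual schema):

```python
# Sketch: which input/output fields each entry type may carry. A 'stream'
# entry carries no input/output schemas; a 'function' entry may carry both.
# With a plain boolean flag, nothing would stop an entry from carrying both
# the flag and an output schema.
ALLOWED_IO = {
    'function': {'input', 'output'},
    'stream': set(),
}

def io_fields_valid(entry):
    """Reject entries whose input/output fields don't match their type."""
    allowed = ALLOWED_IO.get(entry.get('type'))
    if allowed is None:
        return False  # unknown type
    present = {key for key in ('input', 'output') if key in entry}
    return present <= allowed
```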
Why can't we have an endpoint that can POST JSON and receive a stream?  Is it because we would only have a single stream: true thing?

What about having:


input: <as now>
inputIsStream: false
output: <as now>
outputIsStream: true

and use output: undefined for those cases where streaming is enabled on a given endpoint.
Yeah, we should probably do something along the lines of what you suggest there.
I would prefer input/output being objects for the sake of not having invalid data that still satisfies the schema.
Maybe the references/schemas should define the service endpoints, rather than the client library. As far as the API endpoint goes, it is returning an HTTP response that may have a content-type other than 'application/json', so the more I think about it, I'm not sure we should describe it as a stream in the API docs, but let the client docs manage that.

In bug 1190016 I've just proposed that the creation of s3 artifacts becomes a PUT of metadata to the queue endpoint (defineArtifact) that gets updated when the publish is performed. The GET then returns the same metadata. In this way the Queue offers a standard RESTful resource publication mechanism, albeit for the metadata. I think if we are going to provide the streaming in the client, the client should simply get the URL from the metadata, and handle the publish itself (rather than getting a 30x redirect).

This way we keep the docs clean (the API endpoints are only concerned with metadata) and the publish mechanism is internal to the client. It would therefore be documented in the client, and only referred to in the HTTP endpoints docs. The docs for the endpoints could simply say "please see the client-specific documentation about publishing an artifact using this metadata" with potential links to the clients. This also allows the various clients (node, python, go, java) to implement these features in different ways, or not to implement at all.

I think overall that would be cleaner/simpler.
Maybe a topic for a group meeting, ideally before John goes on PTO!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Client Libraries → Services