Closed Bug 693812 Opened 13 years ago Closed 6 years ago

[meta] Better streaming of collection POSTs

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gps, Unassigned)

Details

(Whiteboard: [storage-api-addition?])

This is a meta bug to track improvements to Sync collection POSTs so they can stream better, likely leading to lower request latency and a better Sync experience.

The current collection POST API (/1.1/username/storage/collection) isn't very stream friendly. Issues:

1) The server isn't allowed (due to a lack of wording in the spec [1]) to process records as anything other than a single batch unit.
2) The payload being a single JSON array isn't very friendly for streaming (it likely requires manual serialization tricks to work around errors while reading/writing the JSON string, which is inefficient).
3) Enforcing a maximum record upload size before serializing the JSON POST body isn't clear cut (we can guesstimate well enough, but that's an approximation; computers like being exact, especially if we have middleware enforcing limits).
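To illustrate issue 3, a minimal sketch of per-record serialization: if each record is serialized on its own, its exact wire size is known before anything is sent, rather than estimated from a whole-array dump. The limit value here is hypothetical, purely for illustration.

```python
import json

# Hypothetical limit for illustration; real deployments enforce
# their own values in middleware.
MAX_RECORD_BYTES = 256 * 1024

def serialize_records(records, max_bytes=MAX_RECORD_BYTES):
    """Serialize each record individually so its exact byte size is
    known up front, instead of guesstimating from a combined array."""
    blobs = []
    for record in records:
        blob = json.dumps(record, separators=(",", ":")).encode("utf-8")
        if len(blob) > max_bytes:
            raise ValueError("record exceeds %d bytes" % max_bytes)
        blobs.append(blob)
    return blobs
```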

In the ideal world, the entire flow of collection POSTs from client to server would stream well. It would have the following characteristics:

1) Client would put WBO data on the wire as soon as it is ready for upload. Currently, the client buffers all outgoing records and then puts them on the wire during a later Sync phase [2].
2) Server would slurp WBOs off the wire as they become available and start processing immediately, without waiting for all data to become available. Currently, the Storage API 1.1 does not allow such behavior. In addition, server-storage waits for the entire HTTP request body to become available before doing any processing [3].
3) The entire HTTP flow would speak Transfer-Encoding: chunked [4].
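For reference, the chunked framing from [4] is simple to produce; a minimal sketch of building such a body from an iterable of already-serialized parts (this is just the wire format, not a full HTTP client):

```python
def encode_chunked(parts):
    """Frame an iterable of byte strings as a Transfer-Encoding: chunked
    body per RFC 2616 section 3.6.1: each chunk is its size in hex, CRLF,
    the data, CRLF; a zero-size chunk terminates the body."""
    body = b""
    for part in parts:
        if not part:
            continue  # a zero-size chunk would terminate the body early
        body += ("%x\r\n" % len(part)).encode("ascii") + part + b"\r\n"
    return body + b"0\r\n\r\n"
```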

These changes would likely benefit clients on slow networks (e.g. mobile) the most since they would start (and thus finish) their data transmission sooner. A snappier Sync is a happier Sync.

Making this change requires a bunch of separate tasks to be completed. And, for us to see the full benefit, they must all be done. These should be tracked as separate bugs. I'm describing the issues here, but holding out on filing bugs until we are ready to implement them. The tasks are:

1) [everyone] Create new storage API version that explicitly states that server can start processing records before it has received the entire HTTP request body. Will require thought on the best way to deal with errors, especially in cases where clients could encounter error mid-upload and abort the stream. This new API should also consider modifying the media type of the request body to something more suitable for streaming for cases where streaming JSON serialization is not available. The application/whoisi and application/newlines media types from the existing storage 1.1 GET collections API would be suitable. Multipart MIME would be another possibility. While we are revving the storage API, we might also consider moving the "keys," "crypto," and "clients" collections outside of storage, as well as other ideas floating around in various people's minds.

2) [server] Implement support for new storage API version. Ideally, actually start I/O operations before reading the entire HTTP request body.

3) [client] Support and switch to new storage API version. Modify the sync implementation to put WBOs on the wire as soon as they are available, rather than only after they are all available.

4) [ops] Deploy new server software. Ensure HTTP middleware preserves chunked transfer encoding, doesn't unnecessarily buffer HTTP requests when proxying, etc.

Of these, part of 3 can be done today. We could theoretically change the client to put data on the wire (encoded in the current JSON array format) as soon as it is available. This would require us to manually terminate the request string with a "]" and ensure the client always sends valid JSON, but that's not a big deal. The hard part is doing the streaming on the client, which would require some refactoring of how the engines do sync (which is what rnewman and philikon are up to in the async rewrite!).
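The "manual ]" trick above can be sketched as a generator that yields array fragments one record at a time (Python here just for illustration; the real client is JavaScript):

```python
import json

def stream_json_array(records):
    """Yield fragments of a JSON array one record at a time, so each WBO
    can go on the wire as soon as it is serialized; the closing "]" is
    appended manually once the record stream ends."""
    yield "["
    first = True
    for record in records:
        if not first:
            yield ","
        first = False
        yield json.dumps(record, separators=(",", ":"))
    yield "]"
```

Joining all the fragments always yields valid JSON, even for an empty record stream.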

Anyway, this will probably be filed away for a while. Hopefully I recorded enough to preserve for posterity.

[1] http://docs.services.mozilla.com/storage/apis-1.1.html
[2] https://hg.mozilla.org/services/services-central/file/72ceb8048b70/services/sync/modules/engines.js#l1098
[3] https://hg.mozilla.org/services/server-storage/file/457707cce3dd/syncstorage/controller.py#l326
[4] http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.6.1
It might also be worth considering implementing "application/newlines" for POSTs. That way GETs and POSTs would have a nice symmetry to them.
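A minimal sketch of that symmetry: one JSON object per line, which makes records independently framable in both directions. (Illustrative only; the real wire format is whatever the spec ends up saying.)

```python
import json

def encode_newlines(records):
    """application/newlines framing: one JSON object per line. json.dumps
    escapes any literal newline inside a payload as '\\n', so splitting
    the body on newlines always recovers whole records."""
    return "".join(json.dumps(r, separators=(",", ":")) + "\n" for r in records)

def decode_newlines(body):
    """Inverse: split on lines and parse each record independently."""
    return [json.loads(line) for line in body.splitlines() if line]
```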
Whiteboard: [storage-api-addition?]
(In reply to Philipp von Weitershausen [:philikon] from comment #1)
> It might also be worth considering implementing "application/newlines" for
> POSTs. That way GETs and POSTs would have a nice symmetry to them.

If we can tolerate the addition to storage 1.1, that would build a bridge that the client can cross sooner. However, the server is still blocked on performing I/O before the entire request arrives, as that is not allowed by the existing 1.1 specification. That being said, streaming support on the client has more of an impact than it does on the server (according to conventional thinking), so +1 to doing it against 1.1.
Putting this somewhere more likely to get eyes, and asking rfkelly to chime in!
Component: General → Server: Sync
Flags: needinfo?(rfkelly)
OS: Mac OS X → All
Hardware: x86 → All
Sync2.0 allows application/newlines input for POST, so that's a good start already :-)

I don't understand the specific concern behind "Server isn't allowed (due to lack of wording in the spec) to process records not as a single batch unit".  What wording should be present in the spec to allow this, and is it also missing in Sync2.0?

The current server stack will not make it easy to slurp data off the wire while it's being written.  It should be doable, but both WSGI and nginx make it fiddly.  However, this should not block the client from sending data in a streaming fashion and having it buffered in the server.  Enabling streaming reads on the server at some future date should in theory be transparent to the client.
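For the record, the server-side shape isn't much code once the WSGI fiddliness is solved; a sketch of incrementally consuming an application/newlines body from a WSGI-style input stream (whether reads actually stream depends on the WSGI server and any proxy in front of it):

```python
def iter_newline_records(input_stream, chunk_size=8192):
    """Read an application/newlines body in bounded chunks, yielding
    each complete line as it arrives rather than buffering the whole
    request body before processing starts."""
    buf = b""
    while True:
        chunk = input_stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            if line:
                yield line
    if buf:
        yield buf  # trailing record without a final newline
```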

My biggest concern here is with chunked-encoding.  AFAIK it's not terribly well supported, particularly for *input* as well as output, and some folks consider it a security hole due to slowloris-type attacks.  Will chunked-encoding be a required part of any conforming sync server, or an optional feature that allows more efficient uploads?

If chunked-encoding is a hard requirement, this may be a non-starter due to third party installs being unwilling or unable to support it.
Flags: needinfo?(rfkelly)
(In reply to Ryan Kelly [:rfkelly] from comment #4)
> If chunked-encoding is a hard requirement, this may be a non-starter due to
> third party installs being unwilling or unable to support it.

I hope this didn't sound too negative - I can definitely see the benefits here, and am interested in exploring the approach!  But if chunked-encoding input is the only way to do this properly, we'll have to (a) add verbiage to the spec stating that all servers must support it, and (b) investigate how easy this would be for third-party deployments.

It would be cool if the client could fall back to non-chunked input if the server doesn't play along, but that may introduce far too much complication in the client code.
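The fallback itself could be fairly small; a hypothetical sketch of the control flow, where both sender callables are stand-ins for real transport code and 411 (Length Required) is used as the signal that the server insists on a Content-Length:

```python
def upload_with_fallback(chunks, send_chunked, send_buffered):
    """Try a chunked upload first; on 411 Length Required, join the
    chunks and retry as an ordinary fixed-length POST. The complication
    is less this logic than plumbing two upload paths through the
    client's sync engines."""
    status = send_chunked(chunks)
    if status == 411:
        status = send_buffered(b"".join(chunks))
    return status
```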

Gonna just go ahead and close this...
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Product: Cloud Services → Cloud Services Graveyard