Closed Bug 768655 Opened 13 years ago Closed 11 years ago

Provide API for more efficient implementation of "commands" functionality

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

VERIFIED WONTFIX

People

(Reporter: rfkelly, Assigned: rfkelly)

Details

(Whiteboard: [qa-])

Sync client request: ability to filter results based on an arbitrary field in tbe BSO. Specific use-case is for the upcoming "commands" API, so that individual clents can avoid transferring (potentially large) command objects that are not itended for them.
Assignee: nobody → rfkelly
Blocks: 720964
This is... worrying. We used to have these filter fields, kind of, in the form of various db columns. It was a mess and getting rid of them was a big win. Payloads are encrypted, so you'd have to staple something onto them (and run the risk of not getting enough objects to fill the request. It's messy) and the server would have to do a bunch of decryption. Would this apply to just one collection? What would stop me from asking for all my history with nonsensical=1 just to watch the db churn? Sync is designed for lots of small records where we don't introspect data. If we're moving to big records where we need to introspect, we may be better served by a separate service API.
(In reply to Toby Elliott [:telliott] from comment #1) > We used to have these filter fields, kind of, in the form of various db > columns. It was a mess and getting rid of them was a big win. Payloads are > encrypted, so you'd have to staple something onto them Some context that was missing from my original comment: there would be no additional indexing and no additional database fields. The client would put some unencrypted data into the payload, and the server would scan the payloads before sending them back to the client, discarding any that do not match the given query. That's inefficient, but it may be a win over sending lots of irrelevant data to the client. Ideally, the API would make it very clear that this is an inefficient operation that involves scanning all the data. APIs that look efficient but actually aren't are dangerous. Possible problem: The payload is just a string, there is no requirement for it to be a JSON object. So "filtering by a field in the payload" doesn't make sense as part of the spec. Filtering the payload by a regex does, but that may not correspond well enough to the semantics desired by the client. > and run the risk of not getting enough objects to fill the request. I don't understand what you mean here, please elaborate. > Would this apply to just one collection? I wouldn't count on it, this is the sort of feature that tends to find other uses once it becomes available... > Sync is designed for lots of small records where we don't introspect data. > If we're moving to big records where we need to introspect, we may be better > served by a separate service API. This is an interesting thought, also in the context of Bug 768663. If the shape of the API is the same, we could also consider a separate backend implementation for collections with these differing characteristics. (In the same way that tabs are currently stored in memcache rather than in the database)
Adjusting bug title because BSOs don't have arbitrary "fields", it's all about what's in the payload.
Summary: Server-side filtering by arbitrary BSO fields → Server-side filtering based on BSO payload contents
(In reply to Ryan Kelly [:rfkelly] from comment #2) > (In reply to Toby Elliott [:telliott] from comment #1) > > and run the risk of not getting enough objects to fill the request. > > I don't understand what you mean here, please elaborate. > Give me the top 50 items sorted by sortindex, filtered by key. How many items do I need to grab from the db? > > Would this apply to just one collection? > > I wouldn't count on it, this is the sort of feature that tends to find other > uses once it becomes available... Right, so we can't count on there not being 25K of these objects on the server to sift through. > This is an interesting thought, also in the context of Bug 768663. If the > shape of the API is the same, we could also consider a separate backend > implementation for collections with these differing characteristics. (In > the same way that tabs are currently stored in memcache rather than in the > database) This would also be a possibility, but runs into the previous problem. The shape of the API differs, because some collections support this filter and others don't
I should back up here and explain the use-case as I understand it in more detail. FF Sync current has a notion of "commands" which can be sent to individual clients to instruct them to e.g. wipe their sync data. These are currently managed as part of the client records within the "clients" collection, but this is suboptimal (:gps et al, why exactly?). Proposal is to store commands as individual records in their own "commands" collection. The operation that each client wants to perform is "get all the commands for client XYZ", where XYZ is a unique identifier for that client. This will be a subset of all the records in the "commands" collection. How might we implement this operation? Here are some options that spring to mind: The Status Quo -------------- The client fetches all records in the "commands" collection, then filters them locally to extract just the commands it needs. If command records are small this is probably OK, but if they are large then it's pretty wasteful. Generic Payload Matching ------------------------ Add a filtering option so that the client can ask for BSOs with payload matching a certain condition. It could be as simple as "payload contains string XYZ" like so: GET /storage/commands?newer=12345&contains=XYZ This is rather inelegant and costly if the payloads are large. Assuming some structure on the payload could make it more elegant but also more costly to implement. Dedicated BSO Field ------------------- Add another metadata field named "tag", with a corresponding database column. This defaults to None, can be set to an arbitrary string by the client, and you can filter records based on their tag: GET /storage/commands?newer=12345&tag=XYZ Use a Different System for Commands ----------------------------------- If we accept that commands are sufficiently unlike normal sync data, we could build a dedicated service to deal with them properly. Something based on queuey perhaps? Thoughts?
(In reply to Ryan Kelly [:rfkelly] from comment #5) > FF Sync current has a notion of "commands" which can be sent to individual > clients to instruct them to e.g. wipe their sync data. These are currently > managed as part of the client records within the "clients" collection, but > this is suboptimal (:gps et al, why exactly?). Proposal is to store > commands as individual records in their own "commands" collection. It is suboptimal because there are race conditions between multiple clients updating the shared client record. If two clients sync at the same time, one client may issue a command (insertion into existing record) and another may delete a command because it processed an incoming one. We can address this with If-Unmodified-Since and reconciling, but it is ugly. I'd just prefer that clients were authoritative over their own, independent record and individual commands could be inserted/modified/deleted at well without any chance for conflict. Another alternative is for commands with large payloads the command record itself could be a "pointer" that says "fetch this other record in this other collection." That results in another round trip. And, it still results in clients fetching all commands and dropping some needlessly on the floor. We also may want to use this feature for tab syncing. We would store a common identifier in related tabs (likely part of the same tab group) in the unencrypted metadata. Then, "tabs from other computers" could selectively retrieve tabs based on the group they were in.
I like the "tag" idea. If you could implement tags as sets, that would be ideal. Although, by the time you do that, what you've essentially invented is storage of arbitrary top-level fields. Heh.
Note that "filtered by key" implies a sorted scan of the entire dataset. If you say LIMIT 50 and there's 49 matching rows out of 100_000, it will scan all 100_000 rows. I don't believe mysql is particularly able to very efficiently index this for both the TEXT index required for efficient substring matching and also the TIME index required for efficient sorting. It doesn't seem like *new* clients are expected to care about commands, only *existing* clients. If that's the case, then default new clients to the latest ttl and skip downloading the entire table. That would also makes this a pure queuing solution for existing clients, with no relevance to new ones. The level of complexity desired by this ticket is best addressed by implementing a true queueing backend. Of course, that has its own ball of problems - you'll have to create one queue per tag, and feed all of them copies of the same input packet. Or you'll have to find a queue that also indexes an arbitrary field and can filter efficiently - which sounds a lot like MySQL with a latest_ttl per client. Do you prefer inefficiency or simplicity? I see a lot of theoretical concerns above. If 99.99% of users are being penalized by downloading tens of thousands unnecessary records to get the one record that matters, then all of this might be worth it. If 99.99% of users are downloading ten records, it probably is not.
The use case here feels pretty far-afield from core sync, and we should probably just talk about what we'd build if we were starting from scratch rather than try to wedge it in, then see what of our current projects we can use.
Whiteboard: [qa?]
Whiteboard: [qa?] → [qa-]
Based on out-of-band discussions, the consensus seems to be that we should use a separate API for this rather than trying to change syncstorage to accommodate. I am changing the bug title to reflect this and unblocking Bug 766026.
No longer blocks: 720964
Summary: Server-side filtering based on BSO payload contents → Provide API for more efficient implementation of "commands" functionality
We can reboot this discussion in a fresh bug for the Brave New FxA world, no point hanging onto the old one in the meantime.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
yea. sounds good.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard
You need to log in before you can comment on or make changes to this bug.