Closed Bug 766026 Opened 12 years ago Closed 10 years ago

Gracefully handle lots of records with closely-spaced ttls

Categories

(Cloud Services Graveyard :: Server: Sync, defect, P3)

x86
Linux
defect

Tracking

(Not tracked)

VERIFIED WONTFIX

People

(Reporter: rfkelly, Assigned: rfkelly)

References

Details

(Whiteboard: [qa+])

Some users currently have many thousands of records with closely-spaced ttls, and it's bogging down their queries.  We need a better way to deal with this.  Possibilities:

 * implement some sort of internal pagination, so that each individual query is limited to a certain number of results (a rough sketch follows this list)
 * implement row-count quota so nobody's store can grow this big in future
 * completely change how we handle ttls so that we don't have to scan through them on each query...?
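
To make the pagination option concrete, here is a rough sketch of keyset pagination over a composite (ttl, id) index, along the lines atoll suggests in the IRC scrollback below.  The wbo84 table and column names come from the log; the helper itself is purely illustrative, not actual server code:

    # Hypothetical sketch: keyset pagination over a composite (ttl, id) index,
    # so no single query has to walk the whole ~300k-row result set at once.
    BATCH_SIZE = 100

    BATCH_QUERY = """
        SELECT id, ttl, parentid, predecessorid, sortindex, modified, payload
        FROM wbo84
        WHERE username = %s AND collection = %s
          AND (ttl, id) > (%s, %s)
        ORDER BY ttl, id
        LIMIT %s
    """

    def iter_live_items(conn, username, collection, now):
        """Yield non-expired items in small batches instead of one huge row scan."""
        last_ttl, last_id = now, ""
        while True:
            cur = conn.cursor()
            cur.execute(BATCH_QUERY,
                        (username, collection, last_ttl, last_id, BATCH_SIZE))
            rows = cur.fetchall()
            if not rows:
                return
            for row in rows:
                yield row
            # Resume from the last (ttl, id) seen, so each individual query
            # only touches at most BATCH_SIZE index entries.
            last_id, last_ttl = rows[-1][0], rows[-1][1]

Each request then touches at most BATCH_SIZE index entries, at the cost of fiddlier code in the common path (the trade-off discussed in the scrollback).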


IRC scrollback for posterity:
------------------------------------------------
<atoll> [11:30:12] rfkelly: hello, do you have a few minutes to chat sync and batching thing?
<rfkelly> [11:30:22] sure
<atoll> [11:30:35] ~300k objects across 10 seconds of ttl space
<atoll> [11:31:06] in a wbo (or bso?) table
<rfkelly> [11:31:07] wow
<atoll> [11:31:13] appears to break one of the rowscan indexes
<atoll> [11:31:21] i'm not sure how the user did it, because encrypted
<atoll> [11:31:30] maybe data import from another browser!
<atoll> [11:31:41] "we read you back your entire life, in 10 seconds. on mars."
<atoll> [11:31:42] TOTAL RECALL
<atoll> [11:31:49] and this is the mysql server's head exploding
<atoll> [11:32:05] in any case, rather than make users bother about this
<atoll> [11:32:37] we could alter (ttl_idx) to be (ttl, id)
<atoll> [11:32:59] and scan new rows WHERE ttl > ? AND id > ? LIMIT 100
<atoll> [11:33:06] with some sort of python batching on the ?, ? counters
<atoll> [11:33:18] that's about the only way i've figured out yet for this particular variation of mysql
<rfkelly> [11:33:23] what query is breaking because of this?
<atoll> [11:33:38] SELECT id, parentid, predecessor WHERE username, collection = 4, ttl > ?
* atoll [11:33:50] has it here somewhere
<atoll> [11:33:59] EXPLAIN SELECT wbo84.id, wbo84.parentid, wbo84.predecessorid, wbo84.sortindex, wbo84.modified, wbo84.payload FROM wbo84 WHERE wbo84.username = '7489684' AND wbo84.collection = 4 AND wbo84.ttl > 1340063880;
<atoll> [11:34:23] it takes a long time for mysql to full scan the ~300k rows returned
<atoll> [11:34:38] and doing any sort of batching by "one second at a time" with WHERE ttl = ?
<atoll> [11:34:45] wouldn't help this case, because still tens of thousands of rows per second
<atoll> [11:35:18] (ttl, id) means we can scan *within* the tens of thousands of rows in any single second
<rfkelly> [11:35:47] right
<rfkelly> [11:35:56] ah, "id" here is the wbo id
* atoll [11:36:00] nods
<rfkelly> [11:36:01] for some reason I was thinking userid
<atoll> [11:36:10] username is misleadingly that
<atoll> [11:36:42] so, this only happens once in a very long while.
<atoll> [11:36:48] we could reset, which would throw away the data.
<atoll> [11:36:52] this might be quite the opposite of desired.
<atoll> [11:37:18] i've a long-term idea about # of objects quota
<atoll> [11:37:28] (soft quota, beyond which we prune by ttl at our leisure)
<rfkelly> [11:37:32] I'm just worried about complicating the common case to take care of this edge case - it sounds like quite fiddly code as opposed to a single select statement
<atoll> [11:37:41] yep! i don't like it either.
<rfkelly> [11:37:56] then again, if we just drop the data, they might just upload it again with the same properties
<rfkelly> [11:38:11] is it only slowing down the query for this user, or for all users on the database?
<atoll> [11:38:22] the client repeats it often enough to create 10 replicas
<atoll> [11:38:26] at which point, all IO threads saturate
<atoll> [11:38:29] i'm looking into ways to prevent that somehow
<rfkelly> [11:38:57] ah, so request times out, client tries again, but the original query keeps running?
* atoll [11:39:02] nods
<atoll> [11:39:08] there's a patch somewhere to trap that, i think
<atoll> [11:39:12] but the worker may not yet know the client timed out
<rfkelly> [11:39:16] yeah, I recall a but for that as well
<rfkelly> [11:39:22] s/but/bug/
<atoll> [11:40:15] query time is 700+ seconds somehow
<atoll> [11:40:48] worker timeouts are set at 600 seconds
<atoll> [11:41:15] 4 users so far
<atoll> [11:41:20] over.. a year?
<atoll> [11:41:23] several months?
<rfkelly> [11:42:59] I'm leaning towards clearing these users' data, and seeing if it gets reuploaded with similar properties
<rfkelly> [11:43:26] obviously we need a better fix, but it might buy us a bit of time to work it out
<rfkelly> [11:43:41] plus "sync is not a backup" blah blah
<atoll> [11:43:45] yep, at this very slow user rate a migrate is the right path
<atoll> [11:43:55] manual DELETE required, to avoid complications later
<rfkelly> [11:45:06] ok, this sounds like the right path to me
<atoll> [11:46:40] it's a much more interesting long-term than short-term problem
<rfkelly> [11:46:45] indeed
<atoll> [11:46:50] in each case, "nuke all but a few of the hundreds of thousands of items" is the desired result anyways
<rfkelly> [11:47:31] row count quota sounds like a very good idea
<rfkelly> [11:47:45] assuming we can implement it cheaply with some help from memcache
<atoll> [12:36:50] food for schema thought
<mconnor> [12:59:48] rfkelly: sync is not a backup blah blah blah for now
<mconnor> [12:59:49] :)
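
For the record, here is a minimal sketch of the row-count soft-quota idea from the scrollback above, using a memcached counter so the common-case check stays cheap.  The key layout, quota value, memcached address, and the count_rows_in_db callable are assumptions for illustration, not an agreed design:

    # Hypothetical sketch of a soft row-count quota backed by memcache.
    import memcache

    ROW_QUOTA = 100000  # soft limit; beyond it we prune by ttl at our leisure

    # Address is an assumption; in production this would be the existing pool.
    mc = memcache.Client(["127.0.0.1:11211"])

    class OverQuotaError(Exception):
        """Raised when a write would push a collection past its soft row quota."""

    def check_row_quota(userid, collection, num_new_items, count_rows_in_db):
        """Cheap quota check; count_rows_in_db is a fallback callable for cache misses."""
        key = "rowcount:%s:%s" % (userid, collection)
        count = mc.get(key)
        if count is None:
            # Cache miss: do one (rare) real COUNT(*) and seed the counter.
            count = int(count_rows_in_db(userid, collection))
            mc.set(key, count)
        if count + num_new_items > ROW_QUOTA:
            raise OverQuotaError("user %s, collection %s: over %d rows"
                                 % (userid, collection, ROW_QUOTA))
        mc.incr(key, num_new_items)

Keeping the counter in memcache means the write path never needs a COUNT(*) of its own, which is the "cheaply, with some help from memcache" part.
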
Whiteboard: [qa+]
While there are some good ideas above, I think we also ought to be looking at client-side mitigation approaches here. It'd be good if the client noticed that there were 300k records and said "hmm, maybe not all/all at once". Does anyone really need to sync 300K history entries? That's one hell of a tail.

Obviously that's only going to work for clients we control, but that does represent the vast majority. Saying "Don't sync more than 10K items initially" or something would probably do a lot of good here.
Sure, we can put rules in our clients that say "don't upload >N records in one sync." That being said, the fewer of these rules we have, the better.

If the client needs to work around a server limitation, I'd say that is a problem with the server or the protocol, not the client.

We do have a problem with the data representation of history records. I'm optimistic we can solve this problem with changes to Sync's storage schema on the server, e.g. by storing places entries and visits in separate collections. We have a number of options here.

Anyway, I'd plead for this problem to be solved on the server so things "just work" and so any willy-nilly client can't DoS the server. Speaking of DoS, this bug essentially outlines steps to take down Mozilla's Sync service. Should this bug be publicly viewable?
Oh, we'll make server changes to "officially" solve this. I'm just questioning the value of uploading 300K history records in the first place.
Blocks: 784598
Priority: -- → P3
I don't think this is relevant in the new-sync world.  We *always* order queries by "modified" now so the ttl-based query in this bug will not show up.  We also have improved pagination and the option to do it transparently, per Bug 1007987.  So I'm closing this out.
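
For context, roughly the query shape in the new-sync world (the bso table and column names are assumed from memory of the sync1.5 schema, not taken from this bug):

    # Rough new-sync query shape: always ordered by "modified" with an explicit
    # LIMIT, so even a very large collection never triggers an unbounded scan
    # over the ttl index.  Table/column names are assumptions.
    NEW_SYNC_QUERY = """
        SELECT id, modified, sortindex, payload
        FROM bso
        WHERE userid = %s AND collection = %s
          AND ttl > %s          -- expiry still filtered, but not the sort key
          AND modified > %s     -- 'newer' parameter supplied by the client
        ORDER BY modified DESC
        LIMIT %s                -- continued via an offset token (Bug 1007987)
    """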
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Fine with me.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard