Closed
Bug 693375
Opened 14 years ago
Closed 7 years ago
Defer deletes to operational discretion
Categories
(Cloud Services Graveyard :: Server: Sync, defect, P4)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: telliott, Assigned: rfkelly)
References
Details
(Whiteboard: [qa-][sync-scale])
Ops has requested that we not do deletes directly: they tie up the server and often time out connections.
Obviously this is irrelevant for most users, so it should be an add-on to the server. Ops will take responsibility for actually doing the deletions, though we need to make it easy for them to do so.
Reporter
Comment 1•14 years ago
[note: this solution requires membase, so will be blocked on that deployment]
Having pondered it for a bit, I think this is the easiest thing:
We maintain a hash in membase of deletion timestamps for each collection (as well as a master timestamp in the event of a DELETE /storage).
When a delete request comes in, if the flag is set to enable this:
1) grab the hash from membase and update the one item
2) write it back if unchanged since the read, with a very lengthy TTL
3) return success
When a get request comes in, if the flag is set to enable this:
1) grab the hash from membase.
2) figure out max(collection timestamp, master timestamp)
3) if that value is greater than the user-passed 'newer' parameter, rewrite 'newer' to that and pass it along into the query
As a result of these two pieces, ops can loop through users and delete in small batches as resources are available. Since many records will be overwritten during the resync process, the actual number of deletions ends up much lower.
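The delete/get flow above can be sketched in Python. This is only an illustration of the idea, with a plain dict standing in for membase and hypothetical names throughout (`record_delete`, `effective_newer`, the `"__all__"` master key); the real implementation would use membase's check-and-set semantics on write.

```python
import time

# Stand-in for membase: maps userid -> {collection: deletion_timestamp},
# with a special "__all__" key for the master (DELETE /storage) timestamp.
deletion_cache = {}

def record_delete(userid, collection=None, now=None):
    # Delete request: just record the timestamp, return success immediately.
    now = now if now is not None else time.time()
    stamps = deletion_cache.setdefault(userid, {})
    stamps["__all__" if collection is None else collection] = now

def effective_newer(userid, collection, newer):
    # Get request: raise the client-supplied 'newer' parameter to at least
    # max(collection timestamp, master timestamp), so logically-deleted
    # rows are filtered out of the query results.
    stamps = deletion_cache.get(userid, {})
    deleted_at = max(stamps.get(collection, 0), stamps.get("__all__", 0))
    return max(newer, deleted_at)

record_delete("user1", "bookmarks", now=1500)
print(effective_newer("user1", "bookmarks", 1000))  # raised to 1500
print(effective_newer("user1", "history", 1000))    # untouched: 1000
```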
Risks:
The system is not recoverable if we lose the timestamp hash: ops will have no way to know what still needs deleting, and bogus records will be returned to the client. We have no other source of truth for it, so it needs to be handled carefully.
Reporter
Updated•13 years ago
Assignee: nobody → rkelly
Updated•13 years ago
Whiteboard: [qa-]
Assignee
Comment 2•13 years ago
After discussion with atoll and telliott, it will be better to store this information in the database itself. Something like:
collection_deleted (userid INTEGER, collection_id INTEGER, deletion_time INTEGER)
Then when querying things from the collection, we filter out anything earlier than the deletion timestamp. It can be pulled out into the query with a sub-select along the lines of "WHERE timestamp > (SELECT deletion_time FROM collection_deleted WHERE user_id=%s AND collection_id=%s)".
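The in-database variant can be sketched with SQLite standing in for the production MySQL (schema and function names here are illustrative, not the shipped code). The delete becomes a single-row write to `collection_deleted`, and reads filter through the sub-select:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE bso (user_id INTEGER, collection_id INTEGER, id TEXT, timestamp INTEGER);
CREATE TABLE collection_deleted (user_id INTEGER, collection_id INTEGER,
                                 deletion_time INTEGER,
                                 PRIMARY KEY (user_id, collection_id));
""")

def delete_collection(con, user_id, collection_id, now):
    # Record the delete instead of running an expensive DELETE over bso rows.
    con.execute("INSERT OR REPLACE INTO collection_deleted VALUES (?, ?, ?)",
                (user_id, collection_id, now))

def get_items(con, user_id, collection_id):
    # Filter out anything earlier than the recorded deletion time (if any).
    return con.execute(
        """
        SELECT id FROM bso
        WHERE user_id = ? AND collection_id = ?
          AND timestamp > COALESCE(
            (SELECT deletion_time FROM collection_deleted
             WHERE user_id = ? AND collection_id = ?), 0)
        """,
        (user_id, collection_id, user_id, collection_id)).fetchall()

con.executemany("INSERT INTO bso VALUES (1, 7, ?, ?)", [("a", 100), ("b", 200)])
delete_collection(con, 1, 7, 150)
print(get_items(con, 1, 7))  # only "b" survives the logical delete
```

A background ops script can then physically purge rows older than `deletion_time` in small batches.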
Comment 3•13 years ago
"userid" and "collection_id" should both have either 0, or 1, underscore.
Suggest "deletion_time" could be called "min_ttl".
Should a PUT/POST of an object with a ttl older than "min_ttl" return success or failure to the client?
Comment 4•13 years ago
(In reply to Richard Soderberg [:atoll] from comment #3)
> Should a PUT/POST of an object with a ttl older than "min_ttl" return
> success or failure to the client?
Client-supplied TTLs are deltas, not absolute times. They're "delete this WBO at server mtime + ttl".
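That delta semantics can be shown in one line; the function name here is purely illustrative:

```python
def absolute_expiry(server_mtime, client_ttl):
    # Client-supplied TTLs are relative: the server stores an absolute
    # expiry computed from its own modification time, not the client's clock.
    return server_mtime + client_ttl

print(absolute_expiry(1_600_000_000, 3600))  # expires one hour after the write
```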
Assignee
Comment 5•13 years ago
> "userid" and "collection_id" should both have either 0, or 1, underscore.
Yes, I intended underscores throughout but apparently have fat fingers this week.
> Suggest "deletion_time" could be called "min_ttl".
>
> Should a PUT/POST of an object with a ttl older than "min_ttl" return success or failure
> to the client?
I don't think that TTLs are the right mechanism for this, since it deals with explicit deletion rather than passive expiry. It's the modification time that needs to be filtered on, like "any BSO with a last-modified time earlier than X should be considered deleted". Or do I misunderstand something about the requirements here?
I didn't realize we had a last modified column. If we do, that'd be far more appropriate.
Reporter
Updated•13 years ago
Whiteboard: [qa-] → [qa-][sync-scale]
Assignee
Comment 7•13 years ago
CC'ing Bob.
It's been a while since we considered this, so to recap: when clients do a "reset sync" to delete all their stored data, it kicks off a very long and expensive "DELETE FROM bso WHERE userid=:userid" query that actually goes through and deletes all the data. This is bad, because it takes so long that the client request can time out, and because it eats precious resources on the DB.
This bug considers an alternative approach where we don't actually delete the data, we just put some sort of marker in the DB to indicate that it was deleted. Ops can then run a background script to clean up old data at leisure. This makes the delete itself much faster, but potentially slows down reads due to a "has this data been deleted" check.
IIRC, we applied a band-aid to fix the symptoms of this problem by setting the request timeout to 5 minutes. All but the most pathological deletion operations are able to complete within this time, which in turn has decreased server load by preventing clients from retrying the operation over and over again.
So, question for ops: Is this still causing you problems in Sync1.1? Do we want to push ahead with deferred deletes for Sync2.0?
Assignee
Comment 8•13 years ago
(I ask because the existing plan for implementing this feature, whereby we leave BSO records in the database but filter them against the last-deleted-time of the collection, will be tricky to combine with the partial-bso-updates feature implemented in Bug 784599)
Updated•11 years ago
Priority: -- → P2
Assignee
Comment 9•11 years ago
Um, yeah, this is still a problem in the new system; some stats that bobm showed me today:
MariaDB [INFORMATION_SCHEMA]> select TIME, USER, STATE, INFO FROM PROCESSLIST WHERE TIME > '600';
+--------+--------+-----------+---------------------------------------------------------------------------------------------------+
| TIME | USER | STATE | INFO |
+--------+--------+-----------+---------------------------------------------------------------------------------------------------+
| 1570 | syncrw | updating | /* [queryName=DELETE_COLLECTION_ITEMS] */ DELETE FROM bso WHERE userid=1798812 AND collection=7 |
| 10553 | syncrw | updating | /* [queryName=DELETE_COLLECTION_ITEMS] */ DELETE FROM bso WHERE userid=2152243 AND collection=4 |
| 11679 | syncrw | updating | /* [queryName=DELETE_COLLECTION_ITEMS] */ DELETE FROM bso WHERE userid=1906975 AND collection=7 |
| 11838 | syncrw | updating | /* [queryName=DELETE_COLLECTION_ITEMS] */ DELETE FROM bso WHERE userid=1902467 AND collection=7 |
| 12975 | syncrw | updating | /* [queryName=DELETE_COLLECTION_ITEMS] */ DELETE FROM bso WHERE userid=2544616 AND collection=7 |
| 32462 | syncrw | updating | /* [queryName=DELETE_COLLECTION_ITEMS] */ DELETE FROM bso WHERE userid=1835080 AND collection=4 |
| 32592 | syncrw | updating | /* [queryName=DELETE_COLLECTION_ITEMS] */ DELETE FROM bso WHERE userid=1798230 AND collection=4 |
| 31346 | syncrw | updating | /* [queryName=DELETE_COLLECTION_ITEMS] */ DELETE FROM bso WHERE userid=1801140 AND collection=7 |
| 306117 | syncrw | query end | /* [queryName=PURGE_SOME_EXPIRED_ITEMS] */ DELETE FROM bso WHERE ttl < (UNIX_TIMESTAMP() - 86400) |
+--------+--------+-----------+---------------------------------------------------------------------------------------------------+
TokuDB may be the best solution to this problem, as it supposedly has orders-of-magnitude better performance on deletes. But perhaps we need to help it along a little with a code fix as well.
Assignee
Comment 10•11 years ago
> the existing plan for implementing this feature, whereby we leave BSO records in the database
> but filter them against the last-deleted-time of the collection, will be tricky to combine with
> the partial-bso-updates feature implemented in Bug 784599
Note to self: I think we can make this work, but it will be even more ugly than that code already is. The trick would be to use an IF() on the columns that are not being updated, to leave them unchanged if the row is current but to reset them to default values if the row is marked as deleted.
Like this:
INSERT INTO bso (id, payload) VALUES ("item", "data")
ON DUPLICATE KEY UPDATE
payload = VALUES(payload),
sortindex = IF(modified <= last_deleted, 0, sortindex),
ttl = IF(modified <= last_deleted, DEFAULT_TTL, ttl),
...etc...
I haven't tried it out so MySQL may have some obscure problem with it, but it seems plausible and worth trying in order to preserve the nice partial-update syntax that we currently have. But oh man, is it ever going to ugly up the code...
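The conditional-reset idea can be tried out against SQLite, which has no `IF()` but supports the same trick via `CASE` in an upsert. This is only a sketch under assumptions: a hypothetical per-row `last_deleted` column stands in for however the collection's deletion time would actually be joined in, and `DEFAULT_SORTINDEX` is an invented constant.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE bso (
  id TEXT PRIMARY KEY,
  payload TEXT,
  sortindex INTEGER DEFAULT 0,
  modified INTEGER DEFAULT 0,
  last_deleted INTEGER DEFAULT 0  -- hypothetical: collection's deletion time
);
""")

DEFAULT_SORTINDEX = 0

def upsert(con, item_id, payload, now):
    # On conflict, keep untouched columns only if the existing row post-dates
    # the last collection delete; otherwise reset them to defaults, treating
    # the logically-deleted row as if it had been freshly created.
    con.execute(
        """
        INSERT INTO bso (id, payload, modified) VALUES (?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
          payload = excluded.payload,
          modified = excluded.modified,
          sortindex = CASE WHEN bso.modified <= bso.last_deleted
                           THEN ? ELSE bso.sortindex END
        """,
        (item_id, payload, now, DEFAULT_SORTINDEX))

# A row written at t=5, then the collection logically deleted at t=10.
con.execute("INSERT INTO bso (id, payload, sortindex, modified, last_deleted) "
            "VALUES ('item', 'old', 99, 5, 10)")
upsert(con, "item", "new", 20)
row = con.execute("SELECT payload, sortindex FROM bso WHERE id='item'").fetchone()
print(row)  # sortindex reset to default: the row pre-dated the delete
```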
Reporter
Comment 11•11 years ago
I think I'd like to try Toku before we go to SQL queries that look like that.
Assignee
Comment 12•11 years ago
Worth noting: the testing in Bug 1038484 showed TokuDB greatly reducing the avg and max query time for these deletion queries, which is great news. This bug can go back into hibernation for a while, as we watch performance of TokuDB in prod...
Updated•10 years ago
Priority: P2 → P4
Assignee
Comment 13•7 years ago
I'm going to close this ancient bug, since I don't think it's causing us any particular headaches in production these days. :bobm, feel free to re-open if this still seems valuable.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Updated•2 years ago
Product: Cloud Services → Cloud Services Graveyard