Closed Bug 749924 Opened 12 years ago Closed 12 years ago

some users may have damaged data that references nonexistent collections

Categories

(Cloud Services :: Operations: Miscellaneous, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Atoll, Assigned: rfkelly)

Details

(Whiteboard: [qa+])

Attachments

(1 file)

File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 419, in _collid2name
    return collections[collection_id]
KeyError: 11

trace these and check the user data to confirm suspicion
select t1.username, t2.collectionid from (select username, collection from wbo$i group by username, collection order by username, collection) as t1 left join (select userid, collectionid from collections group by userid, collectionid order by userid, collectionid) as t2 on t1.username=t2.userid and t1.collection=t2.collectionid where t1.collection > 10 having t2.collectionid is null;

|  1 | PRIMARY     | <derived2>  | ALL   | NULL          | NULL    | NULL    | NULL |   357 | Using where              |
|  1 | PRIMARY     | <derived3>  | ALL   | NULL          | NULL    | NULL    | NULL |  9883 |                          |
|  3 | DERIVED     | collections | range | NULL          | PRIMARY | 6       | NULL | 10141 | Using index for group-by |
|  2 | DERIVED     | wbo42       | range | NULL          | PRIMARY | 6       | NULL |    12 | Using index for group-by |
identified 3 users on sync41.db.phx1, weave0, wbo58, wbo74, wbo96 that have broken data on the server and will need to be migrated.  running sitewide to collect list of userids to repair.
#!/bin/bash

xapply -P5 "./749924-tables.sh %1 2>&1" weave{0..9}

#!/bin/bash

for i in `seq 0 99`; do
	RESULT=$( sudo mysql -NBe "select t1.username, t1.collection, t2.collectionid from (select username, collection from wbo$i group by username, collection order by username, collection) as t1 left join (select userid, collectionid from collections group by userid, collectionid order by userid, collectionid) as t2 on t1.username=t2.userid and t1.collection=t2.collectionid where t1.collection > 10 having t2.collectionid is null;" $1 )
	if [ -n "$RESULT" ]; then
		echo "$RESULT" | perl -pe '$_ = "'$1.wbo$i'\t$_";'
	fi
done
1336 users affected, will find a way to migrate them.
*blink*
OS: Mac OS X → All
Hardware: x86 → All
Yeah, that's kind of a lot. And, of course, we can't use the payload for clues as to what sort of data it is. I'm actually surprised that we have that many people using non-standard collections, let alone a blank one.

We don't actually verify that the collection name was successfully entered before returning the id. However, if that insert fails, then it throws an error and the whole thing should abort. Deleting the collection wouldn't delete the value from that table until after the data had been deleted, and, again, if the first one times out, then the whole thing should abort. I don't see any way to explain this yet.
(In reply to Toby Elliott [:telliott] from comment #6)
> Yeah, that's kind of a lot. And, of course, we can't use the payload for
> clues as to what sort of data it is. I'm actually surprised that we have
> that many people using non-standard collections, let alone a blank one.

Even given the client/clients and key/keys mis-naming?

Collection id 11 is the first available custom collection id, which I would expect to be assigned to either "clients" or "keys" for pretty much all users.

How recent are the modified timestamps on these rows?  If the rows in the wbo$i table are quite old then maybe they were introduced by a bug that has long-since been fixed, and which was not observable until we cleared memcache and forced get_collection_timestamps() to hit the database.
True, this could be the client/clients bug manifesting itself.

The keys should be untranslated in memcache. If we haven't migrated all the users yet, we might consider poking at those to see if any of them have interesting keys in them.
We have a theory about what has caused this corruption.  Here's the code that runs then the client does a DELETE /storage:

       for query in ('DELETE_USER_COLLECTIONS', 'DELETE_USER_WBOS'):
           query = self._get_query(query, user_id)
           safe_execute(self._engine, query, user_id=user_id)

In other words, it first deletes the mapping of custom collection names to ids, then deletes all the wbo records.  If that second query times out (not unlikely, some users have a *lot* of rows) then the database will be left in the inconsistent state we see currently.

Swapping the order of the queries should fix the problem well enough for now.  The second query might fail and leave the user with some empty custom collections, but at least that's not violating any assumed invariants of the code.  Patch coming.
Assignee: rsoderberg → rkelly
Attachment #619479 - Flags: review?(telliott)
At an early glance, the data timestamps for this bug loosely correlate with server issues. The most recent occurrence is 3 days ago, so once we patch existing users, this patch should prevent new occurrences.
> patch existing users

Specifically, delete user data rows associated with a collection id that has no collection label row, so their clients can recover.
Comment on attachment 619479 [details] [diff] [review]
patch to delete data in consistency-preserving order

Yep, this got reversed in the transition.
Attachment #619479 - Flags: review?(telliott) → review+
Whiteboard: [qa+]
(In reply to Richard Soderberg [:atoll] from comment #11)
> At an early glance, the data timestamps for this bug loosely correlate with
> server issues. The most recent occurrence is 3 days ago, so once we patch
> existing users, this patch should prevent new occurrences.

Removed data rows for existing affected users in PHX1. 500 error rate for PHX1 flatlined at 0/sec as a result. Scanning and repairing SCL2 next.
Removed data row for 1 affected user in SCL2. Repairs complete, pending bug fix ship.
The bugfix has been shipped, and it sounds like corrupted data has been fixed.  Is there any further action to be taken on this bug?
Don't see anything further to do here, closing the bug.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
OK. So QA can verify somehow when this goes to Production.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: