Closed
Bug 749924
Opened 12 years ago
Closed 12 years ago
some users may have damaged data that references nonexistent collections
Categories
(Cloud Services :: Operations: Miscellaneous, task)
Cloud Services
Operations: Miscellaneous
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Atoll, Assigned: rfkelly)
Details
(Whiteboard: [qa+])
Attachments
(1 file)
patch (1.01 KB), telliott: review+
File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 419, in _collid2name return collections[collection_id] KeyError: 11 trace these and check the user data to confirm suspicion
select t1.username, t2.collectionid
from (select username, collection from wbo$i
      group by username, collection
      order by username, collection) as t1
left join (select userid, collectionid from collections
           group by userid, collectionid
           order by userid, collectionid) as t2
  on t1.username = t2.userid and t1.collection = t2.collectionid
where t1.collection > 10
having t2.collectionid is null;

| id | select_type | table       | type  | possible_keys | key     | key_len | ref  | rows  | Extra                    |
|  1 | PRIMARY     | <derived2>  | ALL   | NULL          | NULL    | NULL    | NULL |   357 | Using where              |
|  1 | PRIMARY     | <derived3>  | ALL   | NULL          | NULL    | NULL    | NULL |  9883 |                          |
|  3 | DERIVED     | collections | range | NULL          | PRIMARY | 6       | NULL | 10141 | Using index for group-by |
|  2 | DERIVED     | wbo42       | range | NULL          | PRIMARY | 6       | NULL |    12 | Using index for group-by |
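The anti-join above can be illustrated in plain Python. This is a hypothetical in-memory sketch, not production code, and the sample data is made up; it just shows what counts as an "orphaned" row: a `(username, collection)` pair in a wbo shard table with no matching `(userid, collectionid)` row in the collections table.

```python
# Hypothetical illustration of the anti-join query above: find
# (username, collection) pairs in a wbo shard table that have no
# matching (userid, collectionid) row in the collections table.

def find_orphans(wbo_rows, collection_rows):
    """wbo_rows: iterable of (username, collection) pairs.
    collection_rows: iterable of (userid, collectionid) pairs.
    Returns pairs with collection > 10 and no name mapping."""
    known = set(collection_rows)
    return sorted(
        (user, coll)
        for user, coll in set(wbo_rows)
        if coll > 10 and (user, coll) not in known
    )

# Made-up sample data: user 42 has rows in collection 11, but there
# is no collections entry mapping id 11 back to a name for that user.
wbo = [(42, 11), (42, 4), (7, 12)]
collections = [(42, 4), (7, 12)]
print(find_orphans(wbo, collections))  # [(42, 11)]
```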
Identified 3 users on sync41.db.phx1 (weave0: wbo58, wbo74, wbo96) that have broken data on the server and will need to be migrated. Running sitewide to collect a list of userids to repair.
#!/bin/bash
xapply -P5 "./749924-tables.sh %1 2>&1" weave{0..9}

#!/bin/bash
for i in `seq 0 99`; do
    RESULT=$( sudo mysql -NBe "select t1.username, t1.collection, t2.collectionid from (select username, collection from wbo$i group by username, collection order by username, collection) as t1 left join (select userid, collectionid from collections group by userid, collectionid order by userid, collectionid) as t2 on t1.username=t2.userid and t1.collection=t2.collectionid where t1.collection > 10 having t2.collectionid is null;" $1 )
    if [ -n "$RESULT" ]; then
        echo "$RESULT" | perl -pe '$_ = "'$1.wbo$i'\t$_";'
    fi
done
Comment 6•12 years ago
Yeah, that's kind of a lot. And, of course, we can't use the payload for clues as to what sort of data it is. I'm actually surprised that we have that many people using non-standard collections, let alone a blank one. We don't actually verify that the collection name was successfully entered before returning the id. However, if that insert fails, then it throws an error and the whole thing should abort. Deleting the collection wouldn't delete the value from that table until after the data had been deleted, and, again, if the first one times out, then the whole thing should abort. I don't see any way to explain this yet.
Assignee
Comment 7•12 years ago
(In reply to Toby Elliott [:telliott] from comment #6)
> Yeah, that's kind of a lot. And, of course, we can't use the payload for
> clues as to what sort of data it is. I'm actually surprised that we have
> that many people using non-standard collections, let alone a blank one.

Even given the client/clients and key/keys mis-naming? Collection id 11 is the first available custom collection id, which I would expect to be assigned to either "clients" or "keys" for pretty much all users.

How recent are the modified timestamps on these rows? If the rows in the wbo$i tables are quite old, then maybe they were introduced by a bug that has long since been fixed, and which was not observable until we cleared memcache and forced get_collection_timestamps() to hit the database.
Comment 8•12 years ago
True, this could be the client/clients bug manifesting itself. The keys should be untranslated in memcache. If we haven't migrated all the users yet, we might consider poking at those to see if any of them have interesting keys in them.
Assignee
Comment 9•12 years ago
We have a theory about what has caused this corruption. Here's the code that runs when the client does a DELETE /storage:

    for query in ('DELETE_USER_COLLECTIONS', 'DELETE_USER_WBOS'):
        query = self._get_query(query, user_id)
        safe_execute(self._engine, query, user_id=user_id)

In other words, it first deletes the mapping of custom collection names to ids, then deletes all the wbo records. If that second query times out (not unlikely, since some users have a *lot* of rows), then the database will be left in the inconsistent state we see currently. Swapping the order of the queries should fix the problem well enough for now. The second query might still fail and leave the user with some empty custom collections, but at least that's not violating any assumed invariants of the code. Patch coming.
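The failure mode can be demonstrated with a toy in-memory model. This is a hypothetical sketch, not the real syncstorage code: `Store`, `orphaned`, and the simulated timeout are invented here to show that deleting names first and data second can strand wbo rows, while the opposite order at worst leaves empty named collections.

```python
# Toy model of the two deletion orders. A "timeout" on the second
# step is simulated by raising after the first delete completes.

class Store:
    def __init__(self):
        self.collections = {11: "quux"}      # collection id -> custom name
        self.wbos = [(11, "payload")]        # (collection_id, data)

    def delete_collections(self):
        self.collections.clear()

    def delete_wbos(self, fail=False):
        if fail:
            raise TimeoutError("second query timed out")
        self.wbos.clear()

def orphaned(store):
    # WBO rows whose collection id has no name mapping.
    return [w for w in store.wbos if w[0] not in store.collections]

# Old order: names first, then data; the data delete times out.
s = Store()
s.delete_collections()
try:
    s.delete_wbos(fail=True)
except TimeoutError:
    pass
print(orphaned(s))   # [(11, 'payload')] -- inconsistent state

# Fixed order: data first. If the names delete then failed, we would
# only be left with empty named collections, which breaks no invariant.
s = Store()
s.delete_wbos()
s.delete_collections()
print(orphaned(s))   # []
```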
Assignee
Comment 10•12 years ago
Assignee: rsoderberg → rkelly
Attachment #619479 -
Flags: review?(telliott)
Reporter
Comment 11•12 years ago
At an early glance, the data timestamps for this bug loosely correlate with server issues. The most recent occurrence is 3 days ago, so once we patch existing users, this patch should prevent new occurrences.
Reporter
Comment 12•12 years ago
> patch existing users
Specifically, delete user data rows associated with a collection id that has no collection label row, so their clients can recover.
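The repair step described here amounts to an anti-join DELETE. The helper below is hypothetical: table and column names follow the scan queries earlier in this bug, and the real repair statements may have differed; it just renders the per-shard SQL that removes wbo rows whose collection id has no matching collections entry.

```python
# Hypothetical helper that renders a per-shard repair statement:
# delete wbo rows whose (username, collection) pair has no matching
# entry in the collections table (MySQL multi-table DELETE syntax).
# Table/column names follow the scan queries earlier in this bug.

def repair_sql(shard):
    return (
        "delete w from wbo{n} as w "
        "left join collections as c "
        "on w.username = c.userid and w.collection = c.collectionid "
        "where w.collection > 10 and c.collectionid is null;"
    ).format(n=shard)

print(repair_sql(58))
```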
Comment 13•12 years ago
Comment on attachment 619479 [details] [diff] [review]
patch to delete data in consistency-preserving order

Yep, this got reversed in the transition.
Attachment #619479 -
Flags: review?(telliott) → review+
Assignee
Comment 14•12 years ago
Committed to trunk: http://hg.mozilla.org/services/server-storage/rev/129577996b3f
Backported to 1.10 release branch: http://hg.mozilla.org/services/server-storage/rev/04a175709e3d
Updated•12 years ago
Whiteboard: [qa+]
Reporter
Comment 15•12 years ago
(In reply to Richard Soderberg [:atoll] from comment #11)
> At an early glance, the data timestamps for this bug loosely correlate with
> server issues. The most recent occurrence is 3 days ago, so once we patch
> existing users, this patch should prevent new occurrences.

Removed data rows for existing affected users in PHX1. 500 error rate for PHX1 flatlined at 0/sec as a result. Scanning and repairing SCL2 next.
Reporter
Comment 16•12 years ago
Removed data row for 1 affected user in SCL2. Repairs complete, pending bug fix ship.
Assignee
Comment 17•12 years ago
The bugfix has been shipped, and it sounds like the corrupted data has been repaired. Is there any further action to be taken on this bug?
Assignee
Comment 18•12 years ago
Don't see anything further to do here, closing the bug.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 19•12 years ago
OK. So QA can verify somehow when this goes to Production.