Closed Bug 749924 Opened 12 years ago Closed 12 years ago

some users may have damaged data that references nonexistent collections

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: Atoll, Assigned: rfkelly)

Details

(Whiteboard: [qa+])

Attachments

(1 file)

patch to delete data in consistency-preserving order (12 years ago, Ryan Kelly [:rfkelly], 1.01 KB, patch; telliott: review+)

:Atoll

Reporter

Description

12 years ago

File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 419, in _collid2name
    return collections[collection_id]
KeyError: 11

trace these and check the user data to confirm suspicion

:Atoll

Reporter

Comment 1

12 years ago

select t1.username, t2.collectionid
from (select username, collection from wbo$i group by username, collection order by username, collection) as t1
left join (select userid, collectionid from collections group by userid, collectionid order by userid, collectionid) as t2
  on t1.username=t2.userid and t1.collection=t2.collectionid
where t1.collection > 10
having t2.collectionid is null;

| id | select_type | table       | type  | possible_keys | key     | key_len | ref  | rows  | Extra                    |
|  1 | PRIMARY     | <derived2>  | ALL   | NULL          | NULL    | NULL    | NULL |   357 | Using where              |
|  1 | PRIMARY     | <derived3>  | ALL   | NULL          | NULL    | NULL    | NULL |  9883 |                          |
|  3 | DERIVED     | collections | range | NULL          | PRIMARY | 6       | NULL | 10141 | Using index for group-by |
|  2 | DERIVED     | wbo42       | range | NULL          | PRIMARY | 6       | NULL |    12 | Using index for group-by |
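
For readers who want to try the check locally, here is a minimal sketch of the same anti-join idea. It is not the production query: it assumes an in-memory SQLite database, a single hypothetical `wbo` table instead of wbo0..wbo99, and made-up rows. It also filters with `WHERE ... IS NULL` rather than the MySQL-specific `HAVING` without `GROUP BY` used above.

```python
import sqlite3

# Anti-join: data rows in a custom collection (id > 10) that have no matching
# (userid, collectionid) row in the collections table.
ORPHAN_QUERY = """
SELECT DISTINCT w.username, w.collection
FROM wbo AS w
LEFT JOIN collections AS c
  ON w.username = c.userid AND w.collection = c.collectionid
WHERE w.collection > 10 AND c.collectionid IS NULL
ORDER BY w.username, w.collection
"""


def find_orphans(conn):
    """Return (username, collection) pairs that have data but no name mapping."""
    return conn.execute(ORPHAN_QUERY).fetchall()


# Toy data (assumed for illustration): user 1 is healthy, user 2 has rows in
# collection 11 but no collections entry for it.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE collections (userid INTEGER, collectionid INTEGER, name TEXT);
    CREATE TABLE wbo (username INTEGER, collection INTEGER, id TEXT);
    INSERT INTO collections VALUES (1, 11, 'custom');
    INSERT INTO wbo VALUES (1, 11, 'a'), (1, 5, 'b'), (2, 11, 'c'), (2, 3, 'd');
""")
print(find_orphans(demo))  # [(2, 11)]
```

Rows in collections with ids of 10 or below are skipped because, as in the query above, only custom collections are expected to have a row in the collections table.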

:Atoll

Reporter

Comment 2

12 years ago

Identified 3 users on sync41.db.phx1 (database weave0; tables wbo58, wbo74, wbo96) who have broken data on the server and will need to be migrated. Running the scan sitewide to collect the list of userids to repair.

:Atoll

Reporter

Comment 3

12 years ago

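#!/bin/bash
# Driver: run the per-database scan for weave0..weave9 via xapply (-P5: up to five at a time).

xapply -P5 "./749924-tables.sh %1 2>&1" weave{0..9}

#!/bin/bash
# 749924-tables.sh: scan wbo0..wbo99 in the database named by $1 for rows whose
# custom collection id (> 10) has no matching row in the collections table.

for i in `seq 0 99`; do
	RESULT=$( sudo mysql -NBe "select t1.username, t1.collection, t2.collectionid from (select username, collection from wbo$i group by username, collection order by username, collection) as t1 left join (select userid, collectionid from collections group by userid, collectionid order by userid, collectionid) as t2 on t1.username=t2.userid and t1.collection=t2.collectionid where t1.collection > 10 having t2.collectionid is null;" $1 )
	if [ -n "$RESULT" ]; then
		# Prefix each result line with "<database>.wbo<N>" and a tab.
		echo "$RESULT" | perl -pe '$_ = "'$1.wbo$i'\t$_";'
	fi
done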

:Atoll

Reporter

Comment 4

12 years ago

1336 users affected, will find a way to migrate them.

Richard Newman [:rnewman]

Comment 5

12 years ago

*blink*

OS: Mac OS X → All

Hardware: x86 → All

Toby Elliott [:telliott]

Comment 6

12 years ago

Yeah, that's kind of a lot. And, of course, we can't use the payload for clues as to what sort of data it is. I'm actually surprised that we have that many people using non-standard collections, let alone a blank one.

We don't actually verify that the collection name was successfully entered before returning the id. However, if that insert fails, then it throws an error and the whole thing should abort. Deleting the collection wouldn't delete the value from that table until after the data had been deleted, and, again, if the first one times out, then the whole thing should abort. I don't see any way to explain this yet.

Ryan Kelly [:rfkelly]

Assignee

Comment 7

12 years ago

(In reply to Toby Elliott [:telliott] from comment #6)
> Yeah, that's kind of a lot. And, of course, we can't use the payload for
> clues as to what sort of data it is. I'm actually surprised that we have
> that many people using non-standard collections, let alone a blank one.

Even given the client/clients and key/keys mis-naming?

Collection id 11 is the first available custom collection id, which I would expect to be assigned to either "clients" or "keys" for pretty much all users.

How recent are the modified timestamps on these rows?  If the rows in the wbo$i table are quite old then maybe they were introduced by a bug that has long-since been fixed, and which was not observable until we cleared memcache and forced get_collection_timestamps() to hit the database.

Toby Elliott [:telliott]

Comment 8

12 years ago

True, this could be the client/clients bug manifesting itself.

The keys should be untranslated in memcache. If we haven't migrated all the users yet, we might consider poking at those to see if any of them have interesting keys in them.

Ryan Kelly [:rfkelly]

Assignee

Comment 9

12 years ago

We have a theory about what has caused this corruption.  Here's the code that runs when the client does a DELETE /storage:

       for query in ('DELETE_USER_COLLECTIONS', 'DELETE_USER_WBOS'):
           query = self._get_query(query, user_id)
           safe_execute(self._engine, query, user_id=user_id)

In other words, it first deletes the mapping of custom collection names to ids, then deletes all the wbo records.  If that second query times out (not unlikely, some users have a *lot* of rows) then the database will be left in the inconsistent state we see currently.

Swapping the order of the queries should fix the problem well enough for now.  The second query might fail and leave the user with some empty custom collections, but at least that's not violating any assumed invariants of the code.  Patch coming.
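
The reordering described above can be sketched roughly as follows. This is an illustration only, not the attached patch: it assumes an in-memory SQLite database, a single hypothetical `wbo` table, simplified SQL standing in for the real DELETE_USER_WBOS and DELETE_USER_COLLECTIONS queries, and a made-up `fail_on` parameter that simulates a query timing out.

```python
import sqlite3

# Simplified stand-ins for the real queries (assumed SQL, for illustration).
# Data rows go first; the name-to-id mapping goes last.
DELETE_ORDER = (
    ("DELETE_USER_WBOS", "DELETE FROM wbo WHERE username = :user_id"),
    ("DELETE_USER_COLLECTIONS", "DELETE FROM collections WHERE userid = :user_id"),
)


def delete_storage(conn, user_id, fail_on=None):
    """Run the deletes in consistency-preserving order.

    fail_on is a hypothetical hook: the named query raises TimeoutError
    instead of executing, to mimic a timeout partway through.
    """
    for name, sql in DELETE_ORDER:
        if name == fail_on:
            raise TimeoutError(name)
        conn.execute(sql, {"user_id": user_id})
        conn.commit()  # each query commits on its own, so a later failure does not undo it


def make_db():
    """Build a toy database with one user, one custom collection, one data row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE collections (userid INTEGER, collectionid INTEGER, name TEXT)")
    conn.execute("CREATE TABLE wbo (username INTEGER, collection INTEGER, id TEXT)")
    conn.execute("INSERT INTO collections VALUES (1, 11, 'custom')")
    conn.execute("INSERT INTO wbo VALUES (1, 11, 'abc')")
    conn.commit()
    return conn


def count(conn, table):
    """Count rows in one of the two toy tables."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

With this order, a failure in the second query leaves an empty custom collection behind (a collections row with no data rows), which is the harmless case. A failure in the first query leaves everything in place, so no data row ever ends up without its collection row.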

Ryan Kelly [:rfkelly]

Assignee

Comment 10

12 years ago

Attached patch: patch to delete data in consistency-preserving order

Assignee: rsoderberg → rkelly

Attachment #619479 - Flags: review?(telliott)

:Atoll

Reporter

Comment 11

12 years ago

At an early glance, the data timestamps for this bug loosely correlate with server issues. The most recent occurrence is 3 days ago, so once we patch existing users, this patch should prevent new occurrences.

:Atoll

Reporter

Comment 12

12 years ago

> patch existing users

Specifically, delete user data rows associated with a collection id that has no collection label row, so their clients can recover.
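
A minimal sketch of that repair, under loud assumptions: an in-memory SQLite database, a single hypothetical `wbo` table rather than the sharded wbo0..wbo99 tables, and invented rows. The statement actually run in production is not shown in this bug, so treat this only as an illustration of the idea.

```python
import sqlite3

# Delete data rows in custom collections (id > 10) that have no matching
# (userid, collectionid) row in the collections table.
REPAIR_QUERY = """
DELETE FROM wbo
WHERE collection > 10
  AND NOT EXISTS (
      SELECT 1 FROM collections AS c
      WHERE c.userid = wbo.username
        AND c.collectionid = wbo.collection
  )
"""


def remove_orphaned_rows(conn):
    """Delete orphaned data rows and return how many were removed."""
    cur = conn.execute(REPAIR_QUERY)
    conn.commit()
    return cur.rowcount


def make_damaged_db():
    """Toy data: user 2 has a row in collection 11 with no collections entry."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE collections (userid INTEGER, collectionid INTEGER, name TEXT);
        CREATE TABLE wbo (username INTEGER, collection INTEGER, id TEXT);
        INSERT INTO collections VALUES (1, 11, 'custom');
        INSERT INTO wbo VALUES (1, 11, 'a'), (1, 5, 'b'), (2, 11, 'c'), (2, 3, 'd');
    """)
    return conn
```

Rows in the standard collections (ids of 10 or below) are left alone, matching the `collection > 10` filter used in the detection query.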

Toby Elliott [:telliott]

Comment 13

12 years ago

Comment on attachment 619479
patch to delete data in consistency-preserving order

Yep, this got reversed in the transition.

Attachment #619479 - Flags: review?(telliott) → review+

Ryan Kelly [:rfkelly]

Assignee

Comment 14

12 years ago

Committed to trunk: http://hg.mozilla.org/services/server-storage/rev/129577996b3f
Backported to 1.10 release branch: http://hg.mozilla.org/services/server-storage/rev/04a175709e3d

James Bonacci [:jbonacci]

Updated

12 years ago

Whiteboard: [qa+]

:Atoll

Reporter

Comment 15

12 years ago

(In reply to Richard Soderberg [:atoll] from comment #11)
> At an early glance, the data timestamps for this bug loosely correlate with
> server issues. The most recent occurrence is 3 days ago, so once we patch
> existing users, this patch should prevent new occurrences.

Removed data rows for existing affected users in PHX1. 500 error rate for PHX1 flatlined at 0/sec as a result. Scanning and repairing SCL2 next.

:Atoll

Reporter

Comment 16

12 years ago

Removed data row for 1 affected user in SCL2. Repairs complete, pending bug fix ship.

Ryan Kelly [:rfkelly]

Assignee

Comment 17

12 years ago

The bugfix has been shipped, and it sounds like corrupted data has been fixed.  Is there any further action to be taken on this bug?

Ryan Kelly [:rfkelly]

Assignee

Comment 18

12 years ago

Don't see anything further to do here, closing the bug.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

James Bonacci [:jbonacci]

Comment 19

12 years ago

OK. So QA can verify somehow when this goes to Production.
