get_collection_timestamps(user_id) KeyError when a data row's collection name is missing

RESOLVED FIXED

Status

Cloud Services
Server: Sync
RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: atoll, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [qa?])

Attachments

(2 attachments)

(Reporter)

Description

6 years ago
2012-04-28 17:54:30,435 ERROR [syncserver] Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/services/util.py", line 495, in __call__
    return self.app(environ, start_response)
  File "/usr/lib/python2.6/site-packages/paste/translogger.py", line 68, in __call__
    return self.application(environ, replacement_start_response)
  File "/usr/lib/python2.6/site-packages/webob/dec.py", line 147, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.6/site-packages/webob/dec.py", line 208, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/services/baseapp.py", line 191, in __notified
    response = func(self, request)
  File "/usr/lib/python2.6/site-packages/services/baseapp.py", line 270, in __call__
    result = function(request, **params)
  File "/usr/lib/python2.6/site-packages/syncstorage/controller.py", line 106, in get_collections
    collections = storage.get_collection_timestamps(user_id)
  File "/usr/lib/python2.6/site-packages/syncstorage/storage/memcachedsql.py", line 359, in get_collection_timestamps
    self).get_collection_timestamps(user_id)
  File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 393, in get_collection_timestamps
    bigint2time(stamp)) for coll_id, stamp in res])
  File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 419, in _collid2name
    return collections[collection_id]
KeyError: 11

A data row in the wbo table references a collection that's missing from the collections table. get_collection_timestamps(user_id) logs a KeyError.

I think bug 622864 is an older/fixed instance of this.
The cause of this corrupted data is discussed in Bug 749924, along a patch that will hopefully prevent it from arising in the future.

A separate issue is: what should the server code do if it encounters such data?  One option is to ignore the orphaned rows so that get_collection_timestamps can return successfully, but I think blowing up with a KeyError might be a safer solution overall - at least it alerts us to the presence of corrupted data in the db.
Whiteboard: [qa?]
I think blowing up is a bad idea - it doesn't alert us, it just locks out the user, who otherwise has a perfectly functional sync. Either they won't notice it, or they will and their sync is broken with minimal recourse.

What it should do is to write a detailed description into the error log with the username, the collection id and any other details. That'll give us all the information that we - who do have some power to fix it - need.
(Reporter)

Comment 3

6 years ago
Crashing, albeit with user/collection, guarantees that we investigate and protects clients from adverse scenarios.

If we stop blowing up, then the user risks creating a *new* custom collection that gets assigned id 11, and then suddenly all of the broken data becomes visible to the client. If we feed the clients engine a bunch of adblock entries, does it crash gently or crash hard? 

I would rather we returned "you must first sync from scratch" to the client than either a 503 or simply ignoring the damaged rows.
That last feature is on the request list and will hopefully eventually happen. It's not an option at present, though.

With the first guarantee above, I'm comfortable crashing.
(In reply to Toby Elliott [:telliott] from comment #2)
> I think blowing up is a bad idea - it doesn't alert us, it just locks out
> the user, who otherwise has a perfectly functional sync.

As atoll points out, the issue is that the *don't* have a perfectly functional sync.  They have a sync that's in imminent danger of being filled with corrupt data.

It's also worth noting that the crash here is due to memcache being empty, so it's trying to rebuild info/collections from the database.  If memcached were full, I believe the client would happily continue syncing against the corrupted database and potentially hit the broken-data-appears-in-new-custom collection case described above.

> What it should do is to write a detailed description into the error log with
> the username, the collection id and any other details. That'll give us all
> the information that we - who do have some power to fix it - need.

So we should catch this case and crash with a detailed log rather than just a KeyError.  What's the appropriate response code to send back to the user in this case?  A 500 seems like the best option to me at this time, since this literally is "the server encountered an unexpected error".
(Reporter)

Comment 6

6 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #5)
> (In reply to Toby Elliott [:telliott] from comment #2)
> > I think blowing up is a bad idea - it doesn't alert us, it just locks out
> > the user, who otherwise has a perfectly functional sync.
> 
> As atoll points out, the issue is that the *don't* have a perfectly
> functional sync.  They have a sync that's in imminent danger of being filled
> with corrupt data.

At least until yesterday, when we removed the damaged rows and they all started syncing without errors (global rate 0/sec, as of then).
(Reporter)

Comment 7

6 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #5)
> So we should catch this case and crash with a detailed log rather than just
> a KeyError.  What's the appropriate response code to send back to the user
> in this case?  A 500 seems like the best option to me at this time, since
> this literally is "the server encountered an unexpected error".

500 is fine.
Created attachment 654111 [details] [diff] [review]
sync1.1 patch to log more details when collection id has no matching name
Attachment #654111 - Flags: review?(telliott)
Created attachment 654112 [details] [diff] [review]
sync2.0 patch to log more details when collection id has no matching name

Finally getting around to this.  I have added code to catch this KeyError and re-raise it with details about exactly what data is corrupted.

This will cause the details to show up in the error log, while re-raising the error will ensure that the usual dump-a-log-then-send-a-500-response logic will kick in.
Attachment #654112 - Flags: review?(telliott)
Attachment #654111 - Flags: review?(telliott) → review+
Attachment #654112 - Flags: review?(telliott) → review+
https://github.com/mozilla-services/server-syncstorage/commit/cffbdb5fbe174144a8329ca775168f23ed3bbaa4

http://hg.mozilla.org/services/server-storage/rev/9c91ee825035
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Needs to be Verified for Sync 1.1 and 2.0.
You need to log in before you can comment on or make changes to this bug.