Closed Bug 691315 Opened 13 years ago Closed 12 years ago

IntegrityError: Duplicate entry ___ for key PRIMARY during INSERT INTO wbo

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: Atoll, Assigned: rfkelly)

References

Details

(Whiteboard: [qa+][python regression])

Attachments

(6 files, 2 obsolete files)

I think this is a missing "on duplicate key" exception handler in the sync server.

Found in the logs, no further information available.

2011-10-03 06:10:12,312 ERROR [syncserver] acbaf5e4d13e65b77f9a24c9404cca81
2011-10-03 06:10:12,312 ERROR [syncserver] Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/services/util.py", line 495, in __call__
    return self.app(environ, start_response)
  File "/usr/lib/python2.6/site-packages/paste/translogger.py", line 68, in __call__
    return self.application(environ, replacement_start_response)
  File "/usr/lib/python2.6/site-packages/webob/dec.py", line 147, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.6/site-packages/webob/dec.py", line 208, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/services/baseapp.py", line 191, in __notified
    response = func(self, request)
  File "/usr/lib/python2.6/site-packages/services/baseapp.py", line 270, in __call__
    result = function(request, **params)
  File "/usr/lib/python2.6/site-packages/syncstorage/controller.py", line 289, in set_item
    res = storage.set_item(user_id, collection_name, item_id, **wbo)
  File "/usr/lib/python2.6/site-packages/syncstorage/storage/memcachedsql.py", line 232, in set_item
    storage_time=storage_time, **values)
  File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 609, in set_item
    return self._set_item(user_id, collection_name, item_id, **values)
  File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 593, in _set_item
    safe_execute(self._engine, query)
  File "/usr/lib/python2.6/site-packages/services/util.py", line 622, in safe_execute
    return engine.execute(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1788, in execute
    return connection.execute(statement, *multiparams, **params)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1191, in execute
    params)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement
    return self.__execute_context(context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1302, in __execute_context
    context.parameters[0], context=context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1401, in _cursor_execute
    context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1394, in _cursor_execute
    context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/default.py", line 299, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/python2.6/site-packages/pymysql/cursors.py", line 108, in execute
    self.errorhandler(self, exc, value)
  File "/usr/lib/python2.6/site-packages/pymysql/connections.py", line 185, in defaulterrorhandler
    raise errorclass, errorvalue
IntegrityError: (IntegrityError) (1062, u"Duplicate entry '1898290-11-a1267300151a5f92af84f73ea5b5907aef73de4e' for key 'PRIMARY'") 'INSERT INTO wbo90 (id, username, collection, modified, payload, payload_size, ttl) VALUES (%s, %s, %s, %s, %s, %s, %s)' (u'a1267300151a5f92af84f73ea5b5907aef73de4e', '1898290', 11, 131764741227, '{"ciphertext":"REDACTED==","hmac":"REDACTED"}', 251, 2100000000)
Yeah, the ON DUPLICATE KEY handling is present only when we POST batches, so the error occurs on a PUT of a single item in a collection.

I'm not quite sure I understand yet how we get into this state. Are you able to get the initial call log?
I think Zeus replicated the request when the sync server took too long to reply the first time, so it shouldn't be a common issue.

We should probably still handle this particular error ('Object already exists') with something other than 500 Server Error.

The HTTP/1.1 spec suggests 409 Conflict would be the appropriate response, but I don't know what the Sync API and Firefox expect.
409 would probably confuse the heck out of the frontend. The 'traditional' way we've handled this in reg-server is 400 (response 4).

However, given how obscure this case is, I'm not convinced that a 503 isn't OK here, and we should consider restoring on-duplicate-key updates to the PUT query.
(In reply to Toby Elliott [:telliott] from comment #3)
...
> and we should consider restoring on-duplicate-key-updates to the PUT query.

yes, filed bug 691409
OS: Mac OS X → All
Hardware: x86 → All
Whiteboard: [qa+]
Group: services-infra
Ryan - let's backport this one as a bugfix to 1.1 as well, since we're seeing reports of the problem.
Assignee: nobody → rkelly
Whiteboard: [qa+] → [qa+][python regression]
I'm seeing a number of these resulting in "Sync encountered an error while syncing: Unknown error. Sync will automatically retry this action."

A retry (manual or automatic) does not clear the problem, the server log shows the same IntegrityErrors for multiple tries.

This is with FF 10.0.1 (gentoo) against my own sync server; due to the problem I've just upgraded it to today's default tip - no change.

The problem MAY have started with me entering saved passwords on two different synced browsers while one of them was disconnected from the sync - but this is after-the-fact conjecture.

I can provide further information (logs etc.), just tell me what to do.
(In reply to Joachim Breuer from comment #7)
> I'm seeing a number of these resulting in "Sync encountered an error while
> syncing: Unknown error. Sync will automatically retry this action."

Hi Joachim. You'll need to file a completely new bug with information and logs attached - we won't be able to do anything about your problem report in this bug otherwise. Please file it under Mozilla Services / General. Thanks!
I'm seeing this error too. I suspect it's because the other computers in the group have identical URLs in their history. The records probably just have different timestamps. Either way, the system should either 'replace' or preserve the most recent entry.

Attaching log. I tried to only remove the user data, but if I grabbed something critical let me know.
Attached file Sync log
Identical URLs should replace. The fact that they're not being replaced is the bug here.

What would actually be hugely useful here would be for someone to look at their access logs and see if the requests that are erroring out are coming in as PUTs or POSTs. If it's POSTs, then this is something new and unknown.
My understanding is that some future version of the Sync storage server will log the request method and URI in backtraces, so that we have some hope of correlating (which we do not otherwise).
This is causing at least some portion of the data not to be synced, see Bug #731515.
@telliott:
It appears to me the errors are happening out of POSTs.

I only see GET and POST against mozsync, never a PUT.
I captured a TCP dump showing the actual HTTP requests, and stored the associated sync error logs. It appears all of the requests are GETs, POSTs and DELETEs; I didn't see any PUTs. I tried to upload a representative sample as an attachment after removing the encrypted data fields. I will hold onto the raw files for a few days, so let me know if I should go looking for something in particular...
Blocks: 727761
No longer blocks: 727761
Blocks: 727761
We can't add ON DUPLICATE KEY to this query without creating a data-loss race condition.  For example, if the PUT request included an X-If-Unmodified-Since header then ON DUPLICATE KEY will fail to respect it.

For sync2.0, we can just return a "409 Conflict" to indicate the conflicting writes.  Nice and easy.

Toby, still interested in backporting this to sync1.1?  In that case I think your suggestion of a 400[ERROR_INVALID_WRITE] is the most appropriate behaviour - it won't confuse the client or risk overwriting data on the server.
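For illustration, a minimal sketch of the difference (column names follow the INSERT in the traceback above; the query shapes and placeholder style are assumptions, not the server's actual SQL):

    # Sketch only, not the server's actual queries. Column names follow the
    # traceback above; everything else is an assumption.

    # A blind upsert: once the primary key exists, the row is overwritten
    # unconditionally, regardless of any X-If-Unmodified-Since precondition.
    BLIND_UPSERT = """
        INSERT INTO wbo90 (id, username, collection, modified,
                           payload, payload_size, ttl)
        VALUES (:id, :username, :collection, :modified,
                :payload, :payload_size, :ttl)
        ON DUPLICATE KEY UPDATE
            modified = VALUES(modified),
            payload = VALUES(payload),
            payload_size = VALUES(payload_size),
            ttl = VALUES(ttl)
    """

    # Honouring X-If-Unmodified-Since needs a conditional write instead:
    # only touch the row if it hasn't changed since the client's timestamp.
    CONDITIONAL_UPDATE = """
        UPDATE wbo90
           SET modified = :modified, payload = :payload,
               payload_size = :payload_size, ttl = :ttl
         WHERE username = :username AND collection = :collection AND id = :id
           AND modified <= :if_unmodified_since
    """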
Blocks: 740695
This patch for sync2.0 returns a "409 Conflict" response for this scenario.

It's a little complex because the storage backends don't speak HTTP, so they can't really raise the HTTPConflict error directly.  Instead, they raise a StorageConflictError and let the controller convert it into an appropriate HTTP response.

Adding this handling here also paves the way for more detailed locking/conflict-detection as described in Bug 739845 and Bug 735085.
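Roughly, the shape of that split looks like this (StorageConflictError and HTTPConflict are named above; the import path and function layout here are assumptions, not the actual patch):

    # Sketch of the pattern described above, not the actual patch.
    from webob.exc import HTTPConflict

    class StorageConflictError(Exception):
        """Raised by a storage backend when concurrent writes collide."""

    def controller_set_item(storage, user_id, collection, item_id, **bso):
        # The backend doesn't speak HTTP, so the controller translates its
        # conflict error into a 409 response.
        try:
            return storage.set_item(user_id, collection, item_id, **bso)
        except StorageConflictError:
            raise HTTPConflict()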
Attachment #610810 - Flags: review?(telliott)
No tests needed?
> No tests needed?

Writing a reliable test for this seems ugly, as it's a pretty skinny race condition - you need to get two parallel requests past the self.item_exists check before either of them has a chance to actually create the item.

I guess I could try to monkeypatch either the database or the storage object and fiddle things so that it triggers...
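For example, something along these lines might do it (a sketch only; the `storage` and `app` fixtures and the URL layout are assumptions, not the real test suite):

    # Hypothetical monkeypatch-based test; `app` is assumed to be a
    # webtest-style TestApp wrapping the server.
    def test_put_on_existing_row_returns_409(storage, app):
        # Create the item directly so the row already exists in the database.
        storage.set_item(1, "bookmarks", "abc", payload="x")

        # Force item_exists() to report False, so the next PUT takes the
        # INSERT branch and collides with the existing row -- simulating two
        # parallel requests that both passed the existence check.
        storage.item_exists = lambda *args, **kwds: False

        app.put_json("/2.0/1/storage/bookmarks/abc", {"payload": "x"},
                     status=409)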
Comment on attachment 610810 [details] [diff] [review]
patch to return 409-conflict on duplicate BSO insertion

Ah, I was thinking specifically of the problem where a second PUT fails because of a PK violation. This one is fine.
Attachment #610810 - Flags: review?(telliott) → review+
Depends on: 739845
This updated patch includes a Retry-After header in the 409 Conflict response, set to 30 seconds.  This seems more in line with the spirit of the 409 response, whereby we're supposed to include information about how to resolve the conflict.
Attachment #610810 - Attachment is obsolete: true
Attachment #611694 - Flags: review?(telliott)
Comment on attachment 611694 [details] [diff] [review]
updated patch to return 409-conflict on duplicate BSO insertion

>+# How long the client should wait before retrying a conflicting write.
>+RETRY_AFTER = 30

What is the minimum bound of Retry-After values accepted by the client, below which it will round up a server-provided value to its minimum?

>+        except StorageConflictError:
>+            raise HTTPConflict(headers={"Retry-After": str(RETRY_AFTER)})

To avoid synchronizing race conditions across clients, the server must not transmit the same Retry-After value to the clients, and the clients must not interpret the same Retry-After value from the server.

The server must random-fuzz the baseline RETRY_AFTER value returned to clients, for the same reasons that Ethernet collisions result in a random, rather than fixed, backoff.

The client must also honor the fuzzed value; if the baseline RETRY_AFTER is smaller than the minimum the clients will honor from the server, randomizing it is of little use.

This implies that we define in the spec the minimum RETRY_AFTER value that the client must honor, and/or that we define in the spec a way for clients to communicate their minimum Retry-After value moving forward, and we presume a sane default for legacy clients.  Then the base RETRY_AFTER value can be raised from an actual baseline minimum to either the client-provided or the legacy-default value, before random-fuzz.

Random-fuzz being RETRY_AFTER := MAX( MIN_RETRY, CLIENT_RETRY || LEGACY_RETRY ) * (1 + rand(1)).

f- because it's not safe to return a fixed Retry-After value: given certain pairings of a static RETRY_AFTER and a higher minimum in the client, it directly risks both ongoing client synchronization errors and forcing clients that perhaps formerly were not synchronized into lockstep.
Attachment #611694 - Flags: feedback-
Comment on attachment 611694 [details] [diff] [review]
updated patch to return 409-conflict on duplicate BSO insertion

I don't think atoll's concern is all that worrying here - this isn't meltdown prevention, and the second retry-after conflict would only come into play if 2 clients attempted to modify at the same time while another was holding a lock. However, fuzzing things up with a couple seconds is a small win and probably worth doing.

I am, however, concerned that this patch is trying to do multiple things, possibly not correctly. The core underlying error we have here is the SQLAlchemy IntegrityError, which is thrown when we get a PK violation. That's not something that's going to be fixed by any kind of retry-after - the user should just not be doing a put there. Is the idea that we're going to merge both PK violations and session-lock violations into one 409 response? If so, it seems like parts of this belong in another patch.

Certainly, as this patch currently stands, wrapping the delete statement with a check for StorageConflictError is meaningless - you can't have a PK violation on a delete. Is the idea that you're futureproofing for upcoming additional StorageConflictErrors?
Attachment #611694 - Flags: review?(telliott) → review-
(In reply to Toby Elliott [:telliott] from comment #26)
> Comment on attachment 611694 [details] [diff] [review]
> updated patch to return 409-conflict on duplicate BSO insertion
> 
> I don't think atoll's concern is all that worrying here - this isn't
> meltdown prevention, and the second retry-after conflict would only come
> into play if 2 clients attempted to modify at the same time while another
> was holding a lock. However, fuzzing things up with a couple seconds is a
> small win and probably worth doing.

Yes.  I think it's a good idea to add a tween for this, that checks for a Retry-After header in all responses and applies the fuzzing in a single location.
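Something like the following, perhaps (a sketch assuming a Pyramid-style tween interface; the names and the small 0-5 second fuzz window are assumptions, not the eventual patch):

    import random

    # Sketch of a Retry-After fuzzing tween. Assumes the header value is
    # expressed in seconds.
    def fuzz_retry_after_tween_factory(handler, registry):
        def fuzz_retry_after_tween(request):
            response = handler(request)
            retry_after = response.headers.get("Retry-After")
            if retry_after is not None and retry_after.isdigit():
                # Add a small random offset so clients that collided once
                # don't all come back at exactly the same moment.
                fuzzed = int(retry_after) + random.randint(0, 5)
                response.headers["Retry-After"] = str(fuzzed)
            return response
        return fuzz_retry_after_tween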
 
> I am, however, concerned that this patch is trying to do multiple things,
> possibly not correctly. The core underlying error we have here is the
> SQLAlchemy IntegrityError, which is thrown when we get a PK violation.
> That's not something that's going to be fixed by any kind of retry-after -
> the user should just not be doing a put there.

I think it can be resolved by retrying. The code handling a PUT request basically goes like this:

    modified = self.get_item(id)

    if modified is None:
        query = "INSERT INTO bso VALUES blah blah"
    else:
        query = "UPDATE bso SET blah blah"

    self._safe_execute(query)

You'll only get an IntegrityError here if two clients are simultaneously going down the "INSERT" branch due to a race.  One INSERT succeeds, the other gets an integrity error.

If that failing client retries the request, it should go down the "UPDATE" branch the second time around, because the item now exists.

Now, I could do this retrying myself for this case - instead of catching the StorageConflictError and returning a 409, I could catch it and re-issue the call to the database.  But it seems safer to inform the client of the conflicting writes in case they want to update their local data to match.

> Is the idea that we're going
> to merge both PK violations and session-lock violations into one 409
> response? If so, it seems like parts of this belong in another patch.
> 
> Certainly, as this patch currently stands, wrapping the delete statement
> with a check for StorageConflictError is meaningless - you can't have a PK
> violation on a delete. Is the idea that you're futureproofing for upcoming
> additional StorageConflictErrors?

Yes, this was me pre-empting the upcoming locking changes.  Happy to defer this to a separate patch.
Ah, OK, what you're trying to do here is definitely made confusing by the other pieces of the patch. This bug is actually handled by the updated PUT logic that you have above, and the parts of the patch you're adding here for the 409 are more than sufficient to close out the bug. Let's deal with session locking, etc. somewhere else.
(In reply to Toby Elliott [:telliott] from comment #28)
> Ah, OK, what you're trying to do here is definitely made confusing by the
> other pieces of the patch. This Bug is actually handled by the updated PUT
> logic that you have above,

Just to be clear, this patch doesn't change the PUT-handling logic at all; what I described above is what is already in the code. All this patch does is wrap it with the retry-after logic.

> and the parts of the patch that you're adding
> here to add a 409 is more than sufficient to close out the bug.
> Let's deal with session locking, etc somewhere else.

Agree, will trim it down to the bare necessities.
(In reply to Ryan Kelly [:rfkelly] from comment #29)
> (In reply to Toby Elliott [:telliott] from comment #28)
> > Ah, OK, what you're trying to do here is definitely made confusing by the
> > other pieces of the patch. This Bug is actually handled by the updated PUT
> > logic that you have above,
> 
> Just to be clear, this patch doesn't change the PUT-handling logic at all,
> what I described above is what is already in the code.  All this patch does
> is wrap it with the retry-after logic.
> 

Yes, the original PUT patch would have been sufficient to resolve this bug :)
`git blame` tells me the PUT code has been like that since at least 2010, well before this bug was created. It has just always had the check-then-insert race condition described in this bug, which seems to have snuck through unnoticed until now.

Anyway, enough chat, updated patch coming after the easter weekend :-)
Hmm, now I'm disturbed. Maybe this bug wasn't what we thought it was. Anyway, our best guess as to why has been invalidated, and this takes care of our second best guess, so that'll do it. Have a good weekend.
And, I just thought of another possibility - ttls.  The PUT logic uses self.item_exists() to check whether the item already exists.  The query underlying item_exists reports False for rows that are in the database but have an expired ttl.  So, using a PUT to overwrite an item with expired ttl will fail in exactly the way described in this bug - it will try to use an INSERT statement because the item doesn't appear to exist, but will fail with an IntegrityError because the (expired) row is still in the database.

Should be a simple fix, which I'll roll up with the conflict-reporting patch for good measure so it covers all the cases.
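Roughly, the idea is that the INSERT-vs-UPDATE decision has to count expired rows too, since they still occupy the primary key. A sketch of the two existence checks (table and column names follow the traceback above; the query shapes, and the assumption that ttl is stored as an absolute expiry time, are assumptions rather than the actual code):

    # item_exists() as used on read paths: expired rows are treated as absent.
    ITEM_VISIBLE = """
        SELECT 1 FROM wbo90
        WHERE username = :user_id AND collection = :collection AND id = :item_id
          AND ttl > :now
    """

    # For choosing INSERT vs UPDATE inside set_item(), drop the ttl filter so
    # an expired-but-still-present row steers the write onto the UPDATE branch.
    ROW_PRESENT_FOR_WRITE = """
        SELECT 1 FROM wbo90
        WHERE username = :user_id AND collection = :collection AND id = :item_id
    """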
(In reply to Ryan Kelly [:rfkelly] from comment #33)
> So, using a PUT to overwrite an item with expired ttl
> will fail in exactly the way described in this bug - it will try to use an
> INSERT statement because the item doesn't appear to exist, but will fail
> with an IntegrityError because the (expired) row is still in the database.

Does this affect 1.1?

I don't recall Marina seeing this in testing, but if it's a problem then Android Sync users who turn their phone off for three weeks will be getting this in the wild. They PUT their client record.
It appears to me to be much more of a corner case than originally believed. We should check, but it appears the code has been doing the right thing.
This patch fixes the issue with ttls, meaning expired items no longer block PUTs.
Attachment #613505 - Flags: review?(telliott)
This patch updates the 409-conflict handling, removing places where it's currently useless and adding the fuzzing as suggested by atoll.
Attachment #611694 - Attachment is obsolete: true
Attachment #613506 - Flags: review?(telliott)
Attachment #613505 - Flags: review?(telliott) → review+
Comment on attachment 613506 [details] [diff] [review]
updated patch to return 409-conflict on duplicate BSO insertion

Good enough for now. We should consider eventually expressing the fuzz as a percentage, rather than an absolute. If we're Retrying-After in, say, half-hour chunks, we probably want larger gaps than 5 seconds.
Attachment #613506 - Flags: review?(telliott) → review+
FYI: latest pull, make build, and make test were all successful.
Closing the bug.  Still need to backport to a hotfix release of sync1.1, but that's being tracked separately.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard