Closed Bug 691315 Opened 13 years ago Closed 12 years ago

IntegrityError: Duplicate entry ___ for key PRIMARY during INSERT INTO wbo

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: Atoll, Assigned: rfkelly)

References

Details

(Whiteboard: [qa+][python regression])

Attachments

(6 files, 2 obsolete files)

I think this is a missing "on duplicate key" exception handler in the sync server.

Found in the logs, no further information available.

2011-10-03 06:10:12,312 ERROR [syncserver] acbaf5e4d13e65b77f9a24c9404cca81
2011-10-03 06:10:12,312 ERROR [syncserver] Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/services/util.py", line 495, in __call__
    return self.app(environ, start_response)
  File "/usr/lib/python2.6/site-packages/paste/translogger.py", line 68, in __call__
    return self.application(environ, replacement_start_response)
  File "/usr/lib/python2.6/site-packages/webob/dec.py", line 147, in __call__
    resp = self.call_func(req, *args, **self.kwargs)
  File "/usr/lib/python2.6/site-packages/webob/dec.py", line 208, in call_func
    return self.func(req, *args, **kwargs)
  File "/usr/lib/python2.6/site-packages/services/baseapp.py", line 191, in __notified
    response = func(self, request)
  File "/usr/lib/python2.6/site-packages/services/baseapp.py", line 270, in __call__
    result = function(request, **params)
  File "/usr/lib/python2.6/site-packages/syncstorage/controller.py", line 289, in set_item
    res = storage.set_item(user_id, collection_name, item_id, **wbo)
  File "/usr/lib/python2.6/site-packages/syncstorage/storage/memcachedsql.py", line 232, in set_item
    storage_time=storage_time, **values)
  File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 609, in set_item
    return self._set_item(user_id, collection_name, item_id, **values)
  File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 593, in _set_item
    safe_execute(self._engine, query)
  File "/usr/lib/python2.6/site-packages/services/util.py", line 622, in safe_execute
    return engine.execute(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1788, in execute
    return connection.execute(statement, *multiparams, **params)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1191, in execute
    params)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement
    return self.__execute_context(context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1302, in __execute_context
    context.parameters[0], context=context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1401, in _cursor_execute
    context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1394, in _cursor_execute
    context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/default.py", line 299, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/python2.6/site-packages/pymysql/cursors.py", line 108, in execute
    self.errorhandler(self, exc, value)
  File "/usr/lib/python2.6/site-packages/pymysql/connections.py", line 185, in defaulterrorhandler
    raise errorclass, errorvalue
IntegrityError: (IntegrityError) (1062, u"Duplicate entry '1898290-11-a1267300151a5f92af84f73ea5b5907aef73de4e' for key 'PRIMARY'") 'INSERT INTO wbo90 (id, username, collection, modified, payload, payload_size, ttl) VALUES (%s, %s, %s, %s, %s, %s, %s)' (u'a1267300151a5f92af84f73ea5b5907aef73de4e', '1898290', 11, 131764741227, '{"ciphertext":"REDACTED==","hmac":"REDACTED"}', 251, 2100000000)
Yeah, the ON DUPLICATE KEY handling is present only when we POST batches, so the error occurs on a PUT of a single item in a collection.

I'm not quite sure I understand yet how we get into this state. Are you able to get the initial call log?
I think Zeus replicated the request when the sync server took too long to reply the first time, so it shouldn't be a common issue.

We should probably still handle this particular error ('Object already exists') with something other than 500 Server Error.

The HTTP/1.1 spec suggests 409 Conflict would be the appropriate response, but I don't know what the Sync API and Firefox expect.
409 would probably confuse the heck out of the frontend. The 'traditional' way we've handled this in reg-server is 400 (response 4).

However, given how obscure this case is, I'm not convinced that a 503 isn't OK here, and we should consider restoring on-duplicate-key updates to the PUT query.
(In reply to Toby Elliott [:telliott] from comment #3)
...
> and we should consider restoring on-duplicate-key-updates to the PUT query.

yes, filed bug 691409
OS: Mac OS X → All
Hardware: x86 → All
Whiteboard: [qa+]
Group: services-infra
Ryan - let's backport this one as a bugfix to 1.1 as well, since we're seeing reports of the problem.
Assignee: nobody → rkelly
Whiteboard: [qa+] → [qa+][python regression]
I'm seeing a number of these resulting in "Sync encountered an error while syncing: Unknown error. Sync will automatically retry this action."

A retry (manual or automatic) does not clear the problem, the server log shows the same IntegrityErrors for multiple tries.

This is with FF 10.0.1 (gentoo) against my own sync server; due to the problem I've just upgraded it to today's default tip - no change.

The problem MAY have started with me entering saved passwords on two different synced browsers while one of them was disconnected from the sync - but this is after-the-fact conjecture.

I can provide further information (logs etc.), just tell me what to do.
(In reply to Joachim Breuer from comment #7)
> I'm seeing a number of these resulting in "Sync encountered an error while
> syncing: Unknown error. Sync will automatically retry this action."

Hi Joachim. You'll need to file a completely new bug with information and logs attached - we won't be able to do anything about your problem report in this bug otherwise. Please file it under Mozilla Services / General. Thanks!
I'm seeing this error too. I suspect it's because the other computers in the group have identical URLs in their history. The records probably just have different timestamps. Either way, the system should either 'replace' or preserve the most recent entry.

Attaching log. I tried to only remove the user data, but if I grabbed something critical let me know.
Attached file Sync log
Identical URLs should replace. The fact that they're not being replaced is the bug here.

What would actually be hugely useful here would be for someone to look at their access logs and see if the requests that are erroring out are coming in as PUTs or POSTs. If it's POSTs, then this is something new and unknown.
My understanding is that some future version of the Sync storage server will log the request method and URI in backtraces, so that we have some hope of correlating (which we do not otherwise).
This is causing at least some portion of the data not to be synced, see Bug #731515.
@telliott:
It appears to me the errors are happening out of POSTs.

I only see GET and POST against mozsync, never a PUT.
I captured a TCP dump showing the actual HTTP requests, and stored the associated sync error logs. It appears all of the requests are GETs, POSTs and DELETEs; I didn't see any PUTs. I tried to upload a representative sample as an attachment after removing the encrypted data fields. I will hold onto the raw files for a few days, so let me know if I should go looking for something in particular...
Blocks: 727761
No longer blocks: 727761
Blocks: 727761
We can't add ON DUPLICATE KEY to this query without creating a data-loss race condition.  For example, if the PUT request included an X-If-Unmodified-Since header then ON DUPLICATE KEY will fail to respect it.

For sync2.0, we can just return a "409 Conflict" to indicate the conflicting writes.  Nice and easy.

Toby, still interested in backporting this to sync1.1?  In that case I think your suggestion of a 400[ERROR_INVALID_WRITE] is the most appropriate behaviour - it won't confuse the client or risk overwriting data on the server.
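For illustration, a minimal sketch of the difference (column names follow the INSERT in the traceback above; the query shapes and placeholder style are assumptions, not the server's actual SQL):

    # Sketch only, not the server's actual queries. Column names follow the
    # traceback above; everything else is an assumption.

    # A blind upsert: once the primary key exists, the row is overwritten
    # unconditionally, regardless of any X-If-Unmodified-Since precondition.
    BLIND_UPSERT = """
        INSERT INTO wbo90 (id, username, collection, modified,
                           payload, payload_size, ttl)
        VALUES (:id, :username, :collection, :modified,
                :payload, :payload_size, :ttl)
        ON DUPLICATE KEY UPDATE
            modified = VALUES(modified),
            payload = VALUES(payload),
            payload_size = VALUES(payload_size),
            ttl = VALUES(ttl)
    """

    # Honouring X-If-Unmodified-Since needs a conditional write instead:
    # only touch the row if it hasn't changed since the client's timestamp.
    CONDITIONAL_UPDATE = """
        UPDATE wbo90
           SET modified = :modified, payload = :payload,
               payload_size = :payload_size, ttl = :ttl
         WHERE username = :username AND collection = :collection AND id = :id
           AND modified <= :if_unmodified_since
    """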
Blocks: 740695
This patch for sync2.0 returns a "409 Conflict" response for this scenario.

It's a little complex because the storage backends don't speak HTTP, so they can't really raise the HTTPConflict error directly.  Instead, they raise a StorageConflictError and let the controller convert it into an appropriate HTTP response.

Adding this handling here also paves the way for more detailed locking/conflict-detection as described in Bug 739845 and Bug 735085.
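Roughly, the shape of that split looks like this (StorageConflictError and HTTPConflict are named above; the import path and function layout here are assumptions, not the actual patch):

    # Sketch of the pattern described above, not the actual patch.
    from webob.exc import HTTPConflict

    class StorageConflictError(Exception):
        """Raised by a storage backend when concurrent writes collide."""

    def controller_set_item(storage, user_id, collection, item_id, **bso):
        # The backend doesn't speak HTTP, so the controller translates its
        # conflict error into a 409 response.
        try:
            return storage.set_item(user_id, collection, item_id, **bso)
        except StorageConflictError:
            raise HTTPConflict()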
Attachment #610810 - Flags: review?(telliott)
No tests needed?
> No tests needed?

Writing a reliable test for this seems ugly, as it's a pretty skinny race condition - you need to get two parallel requests past the self.item_exists check before either of them has a chance to actually create the item.

I guess I could try to monkeypatch either the database or the storage object and fiddle things so that it triggers...
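For example, something along these lines might do it (a sketch only; the `storage` and `app` fixtures and the URL layout are assumptions, not the real test suite):

    # Hypothetical monkeypatch-based test; `app` is assumed to be a
    # webtest-style TestApp wrapping the server.
    def test_put_on_existing_row_returns_409(storage, app):
        # Create the item directly so the row already exists in the database.
        storage.set_item(1, "bookmarks", "abc", payload="x")

        # Force item_exists() to report False, so the next PUT takes the
        # INSERT branch and collides with the existing row -- simulating two
        # parallel requests that both passed the existence check.
        storage.item_exists = lambda *args, **kwds: False

        app.put_json("/2.0/1/storage/bookmarks/abc", {"payload": "x"},
                     status=409)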
Comment on attachment 610810 [details] [diff] [review]
patch to return 409-conflict on duplicate BSO insertion

Ah, I was thinking specifically of the problem where a second PUT fails because of a PK violation. This one is fine.
Attachment #610810 - Flags: review?(telliott) → review+
Depends on: 739845
This updated patch includes a Retry-After header in the 409 Conflict response, set to 30 seconds.  This seems more in line with the spirit of the 409 response, whereby we're supposed to include information about how to resolve the conflict.
Attachment #610810 - Attachment is obsolete: true
Attachment #611694 - Flags: review?(telliott)
Comment on attachment 611694 [details] [diff] [review]
updated patch to return 409-conflict on duplicate BSO insertion

>+# How long the client should wait before retrying a conflicting write.
>+RETRY_AFTER = 30

What is the minimum bound of Retry-After values accepted by the client, below which it will round up a server-provided value to its minimum?

>+        except StorageConflictError:
>+            raise HTTPConflict(headers={"Retry-After": str(RETRY_AFTER)})

To avoid synchronizing race conditions across clients, the server must not transmit the same Retry-After value to the clients, and the clients must not interpret the same Retry-After value from the server.

The server must random-fuzz the baseline RETRY_AFTER value returned to clients, for the same reasons that Ethernet collisions result in a random, rather than fixed, backoff.

The client must also honor the fuzzed value; if the baseline RETRY_AFTER is smaller than the minimum the clients will honor from the server, randomizing it is of little use.

This implies that we define in the spec the minimum RETRY_AFTER value that the client must honor, and/or that we define in the spec a way for clients to communicate their minimum Retry-After value moving forward, and we presume a sane default for legacy clients.  Then the base RETRY_AFTER value can be raised from an actual baseline minimum to either the client-provided or the legacy-default value, before random-fuzz.

Random-fuzz being RETRY_AFTER := MAX( MIN_RETRY, CLIENT_RETRY || LEGACY_RETRY ) * (1 + rand(1)).

f- because it's not safe to return a fixed Retry-After value: given certain pairings of a static RETRY_AFTER and a higher minimum in the client, it directly risks both ongoing client synchronization errors and forcing clients that perhaps formerly were not synchronized into lockstep.
Attachment #611694 - Flags: feedback-
Comment on attachment 611694 [details] [diff] [review]
updated patch to return 409-conflict on duplicate BSO insertion

I don't think atoll's concern is all that worrying here - this isn't meltdown prevention, and the second retry-after conflict would only come into play if 2 clients attempted to modify at the same time while another was holding a lock. However, fuzzing things up with a couple seconds is a small win and probably worth doing.

I am, however, concerned that this patch is trying to do multiple things, possibly not correctly. The core underlying error we have here is the SQLAlchemy IntegrityError, which is thrown when we get a PK violation. That's not something that's going to be fixed by any kind of retry-after - the user should just not be doing a put there. Is the idea that we're going to merge both PK violations and session-lock violations into one 409 response? If so, it seems like parts of this belong in another patch.

Certainly, as this patch currently stands, wrapping the delete statement with a check for StorageConflictError is meaningless - you can't have a PK violation on a delete. Is the idea that you're futureproofing for upcoming additional StorageConflictErrors?
Attachment #611694 - Flags: review?(telliott) → review-
(In reply to Toby Elliott [:telliott] from comment #26)
> Comment on attachment 611694 [details] [diff] [review]
> updated patch to return 409-conflict on duplicate BSO insertion
> 
> I don't think atoll's concern is all that worrying here - this isn't
> meltdown prevention, and the second retry-after conflict would only come
> into play if 2 clients attempted to modify at the same time while another
> was holding a lock. However, fuzzing things up with a couple seconds is a
> small win and probably worth doing.

Yes.  I think it's a good idea to add a tween for this, that checks for a Retry-After header in all responses and applies the fuzzing in a single location.
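Something like the following, perhaps (a sketch assuming a Pyramid-style tween interface; the names and the small 0-5 second fuzz window are assumptions, not the eventual patch):

    import random

    # Sketch of a Retry-After fuzzing tween. Assumes the header value is
    # expressed in seconds.
    def fuzz_retry_after_tween_factory(handler, registry):
        def fuzz_retry_after_tween(request):
            response = handler(request)
            retry_after = response.headers.get("Retry-After")
            if retry_after is not None and retry_after.isdigit():
                # Add a small random offset so clients that collided once
                # don't all come back at exactly the same moment.
                fuzzed = int(retry_after) + random.randint(0, 5)
                response.headers["Retry-After"] = str(fuzzed)
            return response
        return fuzz_retry_after_tween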
 
> I am, however, concerned that this patch is trying to do multiple things,
> possibly not correctly. The core underlying error we have here is the
> SQLAlchemy IntegrityError, which is thrown when we get a PK violation.
> That's not something that's going to be fixed by any kind of retry-after -
> the user should just not be doing a put there.

I think it can be resolved by retrying. The code handling a PUT request basically goes like this:

    modified = self.get_item(id)

    if modified is None:
        query = "INSERT INTO bso VALUES blah blah"
    else:
        query = "UPDATE bso SET blah blah"

    self._safe_execute(query)

You'll only get an IntegrityError here if two clients are simultaneously going down the "INSERT" branch due to a race.  One INSERT succeeds, the other gets an integrity error.

If that failing client retries the request, it should go down the "UPDATE" branch the second time around, because the item now exists.

Now, I could do this retrying myself for this case - instead of catching the StorageConflictError and returning a 409, I could catch it and re-issue the call to the database.  But it seems safer to inform the client of the conflicting writes in case they want to update their local data to match.

> Is the idea that we're going
> to merge both PK violations and session-lock violations into one 409
> response? If so, it seems like parts of this belong in another patch.
> 
> Certainly, as this patch currently stands, wrapping the delete statement
> with a check for StorageConflictError is meaningless - you can't have a PK
> violation on a delete. Is the idea that you're futureproofing for upcoming
> additional StorageConflictErrors?

Yes, this was me pre-empting the upcoming locking changes.  Happy to defer this to a separate patch.
Ah, OK, what you're trying to do here is definitely made confusing by the other pieces of the patch. This bug is actually handled by the updated PUT logic that you have above, and the parts of the patch you're adding here for the 409 are more than sufficient to close out the bug. Let's deal with session locking, etc. somewhere else.
(In reply to Toby Elliott [:telliott] from comment #28)
> Ah, OK, what you're trying to do here is definitely made confusing by the
> other pieces of the patch. This Bug is actually handled by the updated PUT
> logic that you have above,

Just to be clear, this patch doesn't change the PUT-handling logic at all; what I described above is what is already in the code. All this patch does is wrap it with the retry-after logic.

> and the parts of the patch that you're adding
> here to add a 409 is more than sufficient to close out the bug.
> Let's deal with session locking, etc somewhere else.

Agree, will trim it down to the bare necessities.
(In reply to Ryan Kelly [:rfkelly] from comment #29)
> (In reply to Toby Elliott [:telliott] from comment #28)
> > Ah, OK, what you're trying to do here is definitely made confusing by the
> > other pieces of the patch. This Bug is actually handled by the updated PUT
> > logic that you have above,
> 
> Just to be clear, this patch doesn't change the PUT-handling logic at all,
> what I described above is what is already in the code.  All this patch does
> is wrap it with the retry-after logic.
> 

Yes, the original PUT patch would have been sufficient to resolve this bug :)
`git blame` tells me the PUT code has been like that since at least 2010, well before this bug was created. It has just always had the check-then-insert race condition described in this bug, which seems to have snuck through unnoticed until now.

Anyway, enough chat, updated patch coming after the easter weekend :-)
Hmm, now I'm disturbed. Maybe this bug wasn't what we thought it was. Anyway, our best guess as to why has been invalidated, and this takes care of our second best guess, so that'll do it. Have a good weekend.
And, I just thought of another possibility - ttls.  The PUT logic uses self.item_exists() to check whether the item already exists.  The query underlying item_exists reports False for rows that are in the database but have an expired ttl.  So, using a PUT to overwrite an item with expired ttl will fail in exactly the way described in this bug - it will try to use an INSERT statement because the item doesn't appear to exist, but will fail with an IntegrityError because the (expired) row is still in the database.

Should be a simple fix, which I'll roll up with the conflict-reporting patch for good measure so it covers all the cases.
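Roughly, the idea is that the INSERT-vs-UPDATE decision has to count expired rows too, since they still occupy the primary key. A sketch of the two existence checks (table and column names follow the traceback above; the query shapes, and the assumption that ttl is stored as an absolute expiry time, are assumptions rather than the actual code):

    # item_exists() as used on read paths: expired rows are treated as absent.
    ITEM_VISIBLE = """
        SELECT 1 FROM wbo90
        WHERE username = :user_id AND collection = :collection AND id = :item_id
          AND ttl > :now
    """

    # For choosing INSERT vs UPDATE inside set_item(), drop the ttl filter so
    # an expired-but-still-present row steers the write onto the UPDATE branch.
    ROW_PRESENT_FOR_WRITE = """
        SELECT 1 FROM wbo90
        WHERE username = :user_id AND collection = :collection AND id = :item_id
    """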
(In reply to Ryan Kelly [:rfkelly] from comment #33)
> So, using a PUT to overwrite an item with expired ttl
> will fail in exactly the way described in this bug - it will try to use an
> INSERT statement because the item doesn't appear to exist, but will fail
> with an IntegrityError because the (expired) row is still in the database.

Does this affect 1.1?

I don't recall Marina seeing this in testing, but if it's a problem then Android Sync users who turn their phone off for three weeks will be getting this in the wild. They PUT their client record.
It appears to me to be much more of a corner case than originally believed. We should check, but it appears the code has been doing the right thing.
This patch fixes the issue with ttls, meaning expired items no longer block PUTs.
Attachment #613505 - Flags: review?(telliott)
This patch updates the 409-conflict handling, removing places where it's currently useless and adding the fuzzing as suggested by atoll.
Attachment #611694 - Attachment is obsolete: true
Attachment #613506 - Flags: review?(telliott)
Attachment #613505 - Flags: review?(telliott) → review+
Comment on attachment 613506 [details] [diff] [review]
updated patch to return 409-conflict on duplicate BSO insertion

Good enough for now. We should consider eventually expressing the fuzz as a percentage, rather than an absolute. If we're Retrying-After in, say, half-hour chunks, we probably want larger gaps than 5 seconds.
Attachment #613506 - Flags: review?(telliott) → review+
FYI: latest pull, make build, and make test were all successful.
Closing the bug.  Still need to backport to a hotfix release of sync1.1, but that's being tracked separately.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard