Closed Bug 1057892 Opened 10 years ago Closed 7 years ago

Debug apparent bug in TokuDB handling of "ON DUPLICATE KEY" updates

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rfkelly, Assigned: rfkelly)

References

Details

(Whiteboard: [qa+])

Attachments

(1 file, 2 obsolete files)

(Spinning this out from Bug 1055368 because it deserves more digging into.)

Per Bug 1055368 Comment 14, during our attempt to deploy TokuDB-backed sync nodes into production, we found that they would give mysterious "could not find record" errors after a sequence of operations like:

   INSERT ... ON DUPLICATE KEY UPDATE ...
   DELETE ...
   INSERT ... ON DUPLICATE KEY UPDATE ...

The errors would only trigger when the userids were larger than approximately 4,000,000, which :bobm points out may be some sort of bitmasking-related bug, since 2^22 = 4194304.

Let's take a bit of time to see if we can reproduce this and turn it into an upstream bug, so that we can make use of the otherwise very promising performance improvements offered by TokuDB.
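
For reference, a minimal reproduction attempt along these lines might look like the following (a sketch only: the simplified bso_test table, the connection parameters, and the column set are assumptions, not the real production schema):

    # Hypothetical reproduction sketch for the ON DUPLICATE KEY issue.
    # The table and connection details are assumptions, not the real sync schema.
    import pymysql

    conn = pymysql.connect(host="localhost", user="test", password="test", db="test")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS bso_test (
            userid INT NOT NULL,
            id VARCHAR(64) NOT NULL,
            payload TEXT,
            PRIMARY KEY (userid, id)
        ) ENGINE=TokuDB
    """)

    UPSERT = ("INSERT INTO bso_test (userid, id, payload) VALUES (%s, %s, %s) "
              "ON DUPLICATE KEY UPDATE payload = VALUES(payload)")

    # Try userids on both sides of the suspected 2**22 boundary.
    for userid in (4194303, 4194305, 6913794):
        cur.execute(UPSERT, (userid, "item1", "first write"))
        cur.execute("DELETE FROM bso_test WHERE userid = %s", (userid,))
        # The error reportedly surfaces on the re-insert after the delete.
        cur.execute(UPSERT, (userid, "item1", "second write"))
        conn.commit()

    conn.close()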
Whiteboard: [qa+]
Summary: Debug apprent bug in TokuDB handling of "ON DUPLICATE KEY" updates → Debug apparent bug in TokuDB handling of "ON DUPLICATE KEY" updates
:bobm spun up a "sync-5" TokuDB node to help this effort.  I was unable to reproduce the errors using manual SQL statements or my modified loadtest script from Bug 1055368.  It may be that we need to reach a certain data density in the table before the bug is triggered, so I'm going to beat on this poor node for a while and see if it becomes reproducible.
Nope.  The new node is impervious to my tests and apparently would have passed the deployment loadtests from 1055368 with flying colours, including combined TS+sync loadtest.

:bobm, can you please sanity-check this node to ensure that it's in the same state as the previous failing nodes?  It looks the same to me but I don't really know what to check.

:jbonacci when you get a chance, can you please loadtest sync-5 like you would for an outgoing production deploy?  Maybe your magic QA touch can re-surface this bug.
Flags: needinfo?(jbonacci)
Flags: needinfo?(bobm)
(In reply to Ryan Kelly [:rfkelly] from comment #2)

> :bobm, can you please sanity-check this node to ensure that it's in the same
> state as the previous failing nodes?  It looks the same to me but I don't
> really know what to check.

I verified everything is as it should be.  The node was built the same way as the previous lot, it has the correct TokuDB table settings in the DB (same as the others), and passes the verification tests we use to deploy.  Plus I had a look over the running processes, instance settings, and RPM versions: all check out.
Flags: needinfo?(bobm)
I need the "secret" for Sync5.
:bobm can you send that to me? Thanks.
Status: NEW → ASSIGNED
Flags: needinfo?(jbonacci)
(In reply to James Bonacci [:jbonacci] from comment #4)
> I need the "secret" for Sync5.
> :bobm can you send that to me? Thanks.

Stage secret sent.
OK, ran a 30min isolated load test against Sync5:
https://loads.services.mozilla.com/run/fb5fcbc2-24ad-404c-b845-b6fd29c2c819

Found 499s at the end of the nginx access.log file
Found 'Broken pipe' errors near the end of the sync.log file
But none of the strange 500 errors and tracebacks we saw on the previous test.  I don't know what else to try here to reproduce it.
OK, not quite.
The test environment/method is not quite the same.

1. We need Sync5 to be part of the grouping of nodes seen by TS
2. We need to run a longer test - 60+ min
3. We need to run a combined test
Did another 30min this time with all 5 sync nodes.
On Sync5:
Nginx access log: nothing unusual here: 200s, 404s, etc.

sync.log:
200s and 404s
The 'Broken pipe' errors/exceptions
The 'QueuePoolWithMaxBacklog' errors/timeouts
The 400s, 304s, 412s, 415s.

That's it. No 500s or tracebacks...
Blocks: 1055368
No longer blocks: 1055368
Blocks: 1059025
I want to mark this Resolved/Fixed.
We did chat about this in IRC and decided to move forward with a deploy of TokuDB (rather than InnoDB) to Stage and Prod.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
server-syncstorage 1.5.9 is now in Production, including some nodes with TokuDB.
Status: RESOLVED → VERIFIED
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Re-opening because we're seeing these errors in prod.
Sample of errors seen on a prod node:

sync103:
      7 (1032, u\"Can't find record in 'bso0'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     43 (1032, u\"Can't find record in 'bso10'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
      9 (1032, u\"Can't find record in 'bso11'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
      4 (1032, u\"Can't find record in 'bso12'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
     45 (1032, u\"Can't find record in 'bso13'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
     30 (1032, u\"Can't find record in 'bso14'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
      5 (1032, u\"Can't find record in 'bso15'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
      6 (1032, u\"Can't find record in 'bso16'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
      5 (1032, u\"Can't find record in 'bso17'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
     12 (1032, u\"Can't find record in 'bso18'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
      4 (1032, u\"Can't find record in 'bso19'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
     36 (1032, u\"Can't find record in 'bso1'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     12 (1032, u\"Can't find record in 'bso2'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     47 (1032, u\"Can't find record in 'bso3'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     12 (1032, u\"Can't find record in 'bso4'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     24 (1032, u\"Can't find record in 'bso5'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     13 (1032, u\"Can't find record in 'bso6'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     30 (1032, u\"Can't find record in 'bso7'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     15 (1032, u\"Can't find record in 'bso8'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     58 (1032, u\"Can't find record in 'bso9'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
      1 r) (1032, u\"Can't find record in 'bso11'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
      1 r) (1032, u\"Can't find record in 'bso9'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b

Let's try to get a summary of the userids affected on this node, and then we can check for (a) similarity between them, and (b) whether they are successfully recovering from the error.
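
As a rough sketch of that extraction (assuming the numeric userid appears in each traceback line as part of a /1.5/<userid>/storage/ path, which is an assumption about the log format):

    # Rough sketch: pull userids out of traceback lines, assuming the
    # request path appears in each line as /1.5/<userid>/storage/...
    # (the exact log format is an assumption here).
    import re

    USERID_RE = re.compile(r"/1\.5/(\d+)/storage/")

    def affected_userids(path):
        ids = set()
        with open(path) as f:
            for line in f:
                m = USERID_RE.search(line)
                if m:
                    ids.add(int(m.group(1)))
        return ids

    print(sorted(affected_userids("tracebacks.txt")))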
Bob sent me a sample of the full tracebacks from one machine.  Picking the first line from it, I see user 6913794 getting this error during a post to their bookmarks collection.  So let's dig into this user's experience of the bug.

They account for 24 lines in the file Bob sent me, across different collections.  Bob, can you keep an eye out for any further occurrences of the error for this user?  It will be very interesting to see whether and how it resolves.

(Unfortunately, because these are all POST requests, they show up in the logs as "200 OK" with errors in the body.  So it's kinda hard to see whether this user has had successful syncs since then.)

I can confirm that this user did a "DELETE /storage" request at "2014-09-11T05:55:48.000Z", just before they started showing up in tracebacks here at "2014-09-11T05:56:42.202049Z".  They also seem to have hit this error many times in quick succession after that, perhaps once for each collection that they attempted to sync.
Here's approximately the full set of affected userids that I extracted from the tracebacks (55 in total):

set([6628736, 6913794, 6585606, 6641161, 6591631, 6597137, 6638179, 6586003, 6586133, 6619175, 6604186, 6619418, 6618014, 6624544, 6593286, 6587814, 6591271, 6589608, 6585260, 6588845, 6592942, 6586418, 6598836, 6633273, 6605498, 6589371, 6640613, 6585794, 6631243, 6635853, 6618286, 6609233, 6600787, 6588885, 6589527, 6601444, 6610523, 6629285, 6585826, 1531, 6592100, 6589094, 1895, 6637290, 6597543, 6591341, 6598255, 6586480, 6612243, 6597749, 6618705, 6637431, 6631928, 1403, 6642301])
It would be great if we could confirm that at least one of these users is successfully syncing despite having triggered this error in the past.  To do so would probably require finding successful POST requests from that user in the nginx logs, then checking that they do not correspond to any "Caught operation db error" lines in the app log.
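
A rough sketch of that cross-check (the log paths, formats, and regexes here are assumptions about the deployment, not confirmed):

    # Sketch: count successful POSTs per userid in the nginx access log and
    # flag userids that still appear in "Caught operation db error" lines in
    # the app log.  Log locations and line formats are assumptions.
    import re

    POST_RE = re.compile(r'"POST /1\.5/(\d+)/storage/\S+ HTTP/1\.\d" 200 ')
    ERROR_RE = re.compile(r"Caught operation db error.*?/1\.5/(\d+)/")

    def post_counts(access_log):
        counts = {}
        with open(access_log) as f:
            for line in f:
                m = POST_RE.search(line)
                if m:
                    uid = int(m.group(1))
                    counts[uid] = counts.get(uid, 0) + 1
        return counts

    def erroring_userids(app_log):
        errs = set()
        with open(app_log) as f:
            for line in f:
                m = ERROR_RE.search(line)
                if m:
                    errs.add(int(m.group(1)))
        return errs

    posts = post_counts("access.log")
    errors = erroring_userids("sync.log")
    for uid in sorted(posts):
        status = "still erroring" if uid in errors else "recovered"
        print(uid, posts[uid], status)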
(In reply to Ryan Kelly [:rfkelly] from comment #16)
> It would be great if we could confirm that at least one of these users is
> successfully syncing despite having triggered this error in the past.

I'll have a look.
Attached file error-1410698301938.txt (obsolete) —
Wow, I think I may be hitting this myself.  I'm getting a 503 while attempting to fetch /crypto/keys - see attached client log, node 105, userid 6809503.  Bob, can you please check the logs and see if I'm showing up with a traceback?

I'll monitor the behavior of my client and see if it resolves itself...
Flags: needinfo?(bobm)
Hmm... no, actually that's just a GET, which ordinarily wouldn't show up as this bug.  I do see a previous failing POST, but IIRC the usual behavior is for GETs to continue to succeed.
No 'Caught operation db error' errors in the sync.log file over the last two days.  Here's a sample of the POSTs for some of the users in comment 15 today.

User: A
Code: 200 Count: 1
Endpoint: /storage/tabs Count: 1
User: B
Code: 200 Count: 69
Endpoint: /storage/forms Count: 4
Endpoint: /storage/clients Count: 2
Endpoint: /storage/tabs Count: 2
Endpoint: /storage/passwords Count: 2
Endpoint: /storage/history Count: 55
Endpoint: /storage/bookmarks Count: 2
Endpoint: /storage/prefs Count: 2
User: C
Code: 200 Count: 4
Endpoint: /storage/clients Count: 1
Endpoint: /storage/greasemonkey Count: 3
User: D
Code: 200 Count: 147
Endpoint: /storage/forms Count: 19
Endpoint: /storage/clients Count: 8
Endpoint: /storage/tabs Count: 56
Endpoint: /storage/bookmarks Count: 7
Endpoint: /storage/history Count: 52
Endpoint: /storage/addons Count: 1
Endpoint: /storage/prefs Count: 4
User: E
Code: 200 Count: 37
Endpoint: /storage/forms Count: 1
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 16
Endpoint: /storage/passwords Count: 2
Endpoint: /storage/history Count: 17
User: F
Code: 200 Count: 25
Endpoint: /storage/forms Count: 7
Endpoint: /storage/tabs Count: 7
Endpoint: /storage/bookmarks Count: 2
Endpoint: /storage/history Count: 9
User: G
Code: 200 Count: 36
Endpoint: /storage/forms Count: 3
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 2
Endpoint: /storage/history Count: 2
Endpoint: /storage/bookmarks Count: 26
Endpoint: /storage/addons Count: 1
Endpoint: /storage/prefs Count: 1
User: H
Code: 200 Count: 23
Endpoint: /storage/forms Count: 2
Endpoint: /storage/tabs Count: 10
Endpoint: /storage/passwords Count: 1
Endpoint: /storage/history Count: 10
User: I
Code: 200 Count: 22
Endpoint: /storage/forms Count: 1
Endpoint: /storage/clients Count: 4
Endpoint: /storage/tabs Count: 9
Endpoint: /storage/passwords Count: 1
Endpoint: /storage/history Count: 7
User: J
Code: 200 Count: 22
Endpoint: /storage/forms Count: 3
Endpoint: /storage/clients Count: 3
Endpoint: /storage/tabs Count: 6
Endpoint: /storage/passwords Count: 4
Endpoint: /storage/history Count: 5
Endpoint: /storage/bookmarks Count: 1
User: K
Code: 200 Count: 6
Endpoint: /storage/forms Count: 1
Endpoint: /storage/tabs Count: 2
Endpoint: /storage/passwords Count: 1
Endpoint: /storage/history Count: 2
User: L
Code: 200 Count: 34
Endpoint: /storage/forms Count: 5
Endpoint: /storage/greasemonkey Count: 5
Endpoint: /storage/tabs Count: 12
Endpoint: /storage/history Count: 11
Endpoint: /storage/prefs Count: 1
User: M
Code: 200 Count: 2
Endpoint: /storage/clients Count: 1
Endpoint: /storage/history Count: 1
User: N
Code: 200 Count: 24
Endpoint: /storage/forms Count: 1
Endpoint: /storage/clients Count: 4
Endpoint: /storage/tabs Count: 10
Endpoint: /storage/history Count: 9
User: O
Code: 200 Count: 6
Endpoint: /storage/forms Count: 2
Endpoint: /storage/clients Count: 2
Endpoint: /storage/tabs Count: 2
User: P
Code: 200 Count: 16
Endpoint: /storage/forms Count: 2
Endpoint: /storage/tabs Count: 7
Endpoint: /storage/history Count: 7
User: Q
Code: 200 Count: 14
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 7
Endpoint: /storage/passwords Count: 1
Endpoint: /storage/history Count: 5
User: R
Code: 200 Count: 15
Endpoint: /storage/forms Count: 3
Endpoint: /storage/tabs Count: 6
Endpoint: /storage/history Count: 6
User: S
Code: 200 Count: 28
Endpoint: /storage/forms Count: 1
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 10
Endpoint: /storage/history Count: 16
User: T
Code: 200 Count: 15
Endpoint: /storage/clients Count: 4
Endpoint: /storage/tabs Count: 11
User: U
Code: 200 Count: 7
Endpoint: /storage/forms Count: 1
Endpoint: /storage/clients Count: 2
Endpoint: /storage/tabs Count: 2
Endpoint: /storage/history Count: 2
User: V
Code: 200 Count: 11
Endpoint: /storage/forms Count: 2
Endpoint: /storage/clients Count: 2
Endpoint: /storage/tabs Count: 4
Endpoint: /storage/history Count: 3
User: X
Code: 200 Count: 8
Endpoint: /storage/forms Count: 1
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 3
Endpoint: /storage/passwords Count: 1
Endpoint: /storage/history Count: 2
User: Y
Code: 200 Count: 5
Endpoint: /storage/forms Count: 1
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 1
Endpoint: /storage/history Count: 2
User: Z
Code: 200 Count: 21
Endpoint: /storage/clients Count: 3
Endpoint: /storage/tabs Count: 5
Endpoint: /storage/history Count: 13
User: AA
Code: 200 Count: 9
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 3
Endpoint: /storage/bookmarks Count: 2
Endpoint: /storage/history Count: 3
User: BB
Code: 200 Count: 32
Endpoint: /storage/forms Count: 7
Endpoint: /storage/tabs Count: 9
Endpoint: /storage/passwords Count: 7
Endpoint: /storage/history Count: 9
User: CC
Code: 200 Count: 53
Endpoint: /storage/forms Count: 6
Endpoint: /storage/clients Count: 5
Endpoint: /storage/tabs Count: 22
Endpoint: /storage/history Count: 20
User: DD
Code: 200 Count: 12
Endpoint: /storage/forms Count: 2
Endpoint: /storage/bookmarks Count: 2
Endpoint: /storage/passwords Count: 1
Endpoint: /storage/history Count: 7
User: EE
Code: 200 Count: 3
Endpoint: /storage/forms Count: 1
Endpoint: /storage/tabs Count: 1
Endpoint: /storage/history Count: 1
User: FF
Code: 200 Count: 35
Endpoint: /storage/forms Count: 2
Endpoint: /storage/clients Count: 4
Endpoint: /storage/tabs Count: 17
Endpoint: /storage/history Count: 12
User: GG
Code: 200 Count: 72
Endpoint: /storage/forms Count: 10
Endpoint: /storage/clients Count: 2
Endpoint: /storage/tabs Count: 30
Endpoint: /storage/passwords Count: 1
Endpoint: /storage/history Count: 28
Endpoint: /storage/prefs Count: 1
User: HH
Code: 200 Count: 63
Endpoint: /storage/forms Count: 5
Endpoint: /storage/clients Count: 3
Endpoint: /storage/tabs Count: 25
Endpoint: /storage/passwords Count: 2
Endpoint: /storage/bookmarks Count: 4
Endpoint: /storage/history Count: 24
User: II
Code: 200 Count: 19
Endpoint: /storage/forms Count: 1
Endpoint: /storage/tabs Count: 7
Endpoint: /storage/history Count: 11
User: JJ
Code: 200 Count: 2
Endpoint: /storage/clients Count: 2
User: KK
Code: 200 Count: 20
Endpoint: /storage/forms Count: 3
Endpoint: /storage/clients Count: 5
Endpoint: /storage/tabs Count: 6
Endpoint: /storage/passwords Count: 2
Endpoint: /storage/history Count: 4
User: LL
Code: 200 Count: 23
Endpoint: /storage/forms Count: 2
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 11
Endpoint: /storage/history Count: 9
User: MM
Code: 200 Count: 6
Endpoint: /storage/clients Count: 2
Endpoint: /storage/tabs Count: 2
Endpoint: /storage/history Count: 2
User: NN
Code: 200 Count: 58
Endpoint: /storage/forms Count: 3
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 25
Endpoint: /storage/history Count: 28
Endpoint: /storage/addons Count: 1
User: OO
Code: 200 Count: 164
Endpoint: /storage/forms Count: 24
Endpoint: /storage/clients Count: 5
Endpoint: /storage/tabs Count: 66
Endpoint: /storage/bookmarks Count: 3
Endpoint: /storage/passwords Count: 7
Endpoint: /storage/history Count: 59
User: PP
Code: 200 Count: 32
Endpoint: /storage/forms Count: 1
Endpoint: /storage/history Count: 31
User: QQ
Code: 200 Count: 123
Endpoint: /storage/forms Count: 8
Endpoint: /storage/clients Count: 5
Endpoint: /storage/tabs Count: 56
Endpoint: /storage/history Count: 52
Endpoint: /storage/prefs Count: 2
Flags: needinfo?(bobm)
I think this means that users recover from this problem.  Anything else I should check?
Users are clearly recovering from the problem - at least in the sense that they can subsequently post data and don't have a consistently failing index or something disturbing.

The only issue I can see potentially remaining is what's causing the error in the first place and whether it's a sign of data loss.  Given that the error makes no sense in the context of the query even if we assume data loss, I don't think this is something we should be concerned about.  I'm guessing this is something deep inside the TokuDB engine that gets rolled up into a 1032 error.  They should be made aware of it, but I think we're OK for now.

Unless it starts happening a lot...
Per discussion with TokuDB support, this may be due to using read-uncommitted isolation level, which is a hangover from our InnoDB setup.  Bob has filed Bug 1068424 to investigate performance at higher levels, and we should watch out for this bug during such tests.  (And report back to Tokutek if we find anything interesting either way)
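
For illustration, raising the isolation level at the SQLAlchemy engine level would look something like this (a sketch only: the URL and surrounding options are placeholders, not our actual connector configuration):

    # Sketch only: raising the transaction isolation level when creating the
    # SQLAlchemy engine, instead of the READ UNCOMMITTED carried over from
    # the InnoDB setup.  The URL and option plumbing here are assumptions.
    from sqlalchemy import create_engine

    engine = create_engine(
        "mysql+pymysql://sync:secret@db-host/sync",
        isolation_level="READ COMMITTED",  # instead of "READ UNCOMMITTED"
        pool_recycle=3600,
    )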
Depends on: 1068424
Just to note here too: over on bug 1066493 we are hitting this problem a lot while running our TPS tests for Firefox Sync against Nightly builds.  It happens nearly daily but fluctuates for some reason; for example, we didn't see it over the weekend but got a spike again on Monday.

Is there anything we can do here to stop the tests from failing?
Bob, do we see this on any requests besides PUT to /meta/global?  Perhaps we can hack around it in the code by using a different code-path for this specific call (e.g. a plain INSERT/UPDATE pair rather than ON DUPLICATE KEY), which should be fairly low impact for just the single-item /meta/global PUT.
Flags: needinfo?(bobm)
More concretely, here's something like the workaround I'm considering - just bypass the ON DUPLICATE KEY stuff if it's a single-item PUT.  We could limit it more specifically to just /meta/global but that hardly seems worth it.
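
In rough outline, the idea looks like this (a sketch of the approach rather than the actual patch: the table/column names are simplified and the helper function is hypothetical):

    # Sketch of the workaround idea: for a single-item PUT, try a plain
    # UPDATE first and fall back to INSERT if no row was touched, instead of
    # using INSERT ... ON DUPLICATE KEY UPDATE.  Names are simplified
    # assumptions, not the real syncstorage code.
    def upsert_single_item(cursor, userid, collection, item_id, payload, modified):
        cursor.execute(
            "UPDATE bso SET payload = %s, modified = %s "
            "WHERE userid = %s AND collection = %s AND id = %s",
            (payload, modified, userid, collection, item_id),
        )
        # Note: MySQL reports matched-but-unchanged rows as 0 affected unless
        # the connection sets CLIENT_FOUND_ROWS, so a real implementation
        # would need to account for that case.
        if cursor.rowcount == 0:
            cursor.execute(
                "INSERT INTO bso (userid, collection, id, payload, modified) "
                "VALUES (%s, %s, %s, %s, %s)",
                (userid, collection, item_id, payload, modified),
            )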
Attachment #8513926 - Flags: review?(telliott)
Hmm, actually my previous comment says:

> Bob sent me a sample of the full tracebacks from one machine.  Picking the first line from it,
> I see user 6913794 getting this error during a post to their bookmarks collection.

But the posted workaround may still be a win if most of the errors are on /meta/global, since (a) this is blocking TPS, and (b) I'm pretty sure it fails with an error bar on /meta/global, but will fail silently and retry on other collections.
Attachment #8489118 - Attachment is obsolete: true
(In reply to Ryan Kelly [:rfkelly] from comment #27)
> But the posted workaround may still be a win if most of the errors are on
> /meta/global, since (a) this is blocking TPS, and (b) I'm pretty sure it
> fails with an error bar on /meta/global, but will fail silently and retry on
> other collections.

A potential list from the last 48 hours shows the following HTTP method breakdown, after removing some anomalous servers:
1. PUT      66
2. GET       7
3. DELETE    3

Note these aren't all necessarily related to this problem.  However, all of the PUT failures are to meta/global.
Flags: needinfo?(bobm)
In addition to this workaround, we could also "tag" failing requests with a special metrics entry to indicate that they trigger this bug.  This would make Bob's life easier when trying to correlate the tracebacks to the actual requests that triggered them.
We're deploying an updated TokuDB, which may be enough to perturb this weird behaviour out of existence: Bug 1089945.
Depends on: 1089945
Here's a revised patch that applies the workaround only for the "meta" collection.  It also adds a "syncstorage.storage.sql.tokudb_error" entry to the request summary JSON line, which should make it easier to detect which requests are hitting this error without :bobm having to crawl through logfile tracebacks.

I think we should target this for our next deployment and see what effect it has on the error rate.
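
The tagging part is roughly this idea (a sketch only: the request.metrics hook shown here is a hypothetical stand-in for the real logging plumbing in syncstorage):

    # Sketch of the tagging idea: when the backend catches the 1032 error,
    # mark the current request so it shows up in the summary JSON line.
    # The `request.metrics` dict is a hypothetical stand-in for the real
    # metrics hook, and the error-code check assumes a DB-API style
    # exception with the MySQL error code as args[0].
    def execute_with_tokudb_tagging(request, cursor, sql, params):
        try:
            cursor.execute(sql, params)
        except Exception as exc:
            if getattr(exc, "args", (None,))[0] == 1032:
                request.metrics["syncstorage.storage.sql.tokudb_error"] = 1
            raise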
Attachment #8513926 - Attachment is obsolete: true
Attachment #8513926 - Flags: review?(telliott)
Attachment #8514015 - Flags: review?(telliott)
Is there any chance of getting this reviewed and deployed soon?  We have to shut down the old TPS CI system soon, and the new system keeps hitting all of these failures.
Flags: needinfo?(telliott)
Comment on attachment 8514015 [details] [diff] [review]
sync-on-duplicate-key-workaround.diff

Review of attachment 8514015 [details] [diff] [review]:
-----------------------------------------------------------------

Not loving the patch in general, but if it's needed to quiet the failures, guess we'll live with it.

::: syncstorage/storage/sql/dbconnect.py
@@ +413,2 @@
>          if mysql_error_code in (1205, 1206, 1213, 1689, 1032):
> +            if mysql_error_code == 1032:

Unnecessary nesting?

@@ +749,5 @@
>          if self._connector.driver == "mysql":
> +            # We are seeing a sporadic bug in production where using
> +            # ON DUPLICATE KEY errors out.  This workaround avoids it
> +            # the common case of PUT /meta/global (collection id 6 is
> +            # the "meta" collection in production).

Not sure what this comment is trying to say. missing an "in"?
Attachment #8514015 - Flags: review?(telliott) → review+
Flags: needinfo?(telliott)
Blocks: 1094588
> unecessary nesting?

*nod* I moved it out one level of nesting and it does seem clearer.

> Not sure what this comment is trying to say. missing an "in"?

Yes, should be "...avoids it in the common case of..."

Committed with the above fixes in https://github.com/mozilla-services/server-syncstorage/commit/055de05206194117531dd19a355805af939d2ad0

Tagged for deployment in Bug 1094587, and I filed Bug 1094588 to follow up on eventual removal of this workaround.
We rolled out the workaround, but we should take care not to forget about this issue.  AFAICT our next steps are:

* Deploy updated TokuDB and see if the problem persists (Bug 1089945)
* If not, then escalate to paid TokuDB support
Adding legal bug for the TokuDB support: https://bugzilla.mozilla.org/show_bug.cgi?id=1108705

I don't hold sufficient permissions to view that bug, so I cannot add it as a blocker.
Unfortunately, the workaround from Comment 31 does not seem to have helped.  Rather than the 1032 errors, we're now seeing errors like this:

    (IntegrityError) (1062, u\"Duplicate entry \\'16603449-6-global\\' for key \\'PRIMARY\\'\")',)

I suspect it is the same underlying bug in the db, just now being triggered in a different way:

  * The workaround logic is basically "try to update the row, if it's not found then insert it"
  * It tries to update the row, but it's "not found" due to the same weirdness that triggers "can't find record" in the previous code
  * So it tries to insert the row, but can totally find it this time, so fails out with an IntegrityError

This suggests that the underlying bug is something more fundamental, rather than just a problem with the implementation of ON DUPLICATE KEY.  Bleh.

Perhaps Bug 1089945 will come to the rescue here?
Ryan, I assume that means we now have to wait for TokuDB to provide a fix, and we do not know when that will happen.  Which means our test failures will remain for a while.
(In reply to Henrik Skupin (:whimboo) from comment #38)
> Ryan, I assume that means we now have to wait for TokuDB to provide a fix,
> and we do not know when that will happen. Which means our test failures
> will remain for a while.

You could also try using the same user instead of deleting it every time, or running a dedicated testing infrastructure that isn't TokuDB backed.
> running a dedicated testing infrastructure that isn't TokuDB backed.

If this is an option for you, I'm happy to advise on and help set up such a thing.

Sadly I can't think of another option for papering over this on the server side.  Could the tests be changed to treat 503s as ignore/skip rather than as failure?  Not ideal but maybe better than persistent failures.
We cannot run with a single user, given that all the setup nodes run tests at the same time.  That means we would need some number of users, and I do not see how that scales over time.

Also, the tests were created as end-to-end tests to exercise the path a real user takes.  Setting up something local would drastically change that vision, and I don't think that is the right way to go.

Anyway, this is getting off-topic for this bug; we may just have to wait until it is fixed, given that it is a real issue we are facing here.
Possible solution here: https://mariadb.atlassian.net/browse/MDEV-7150  Will roll this out to stage.
We don't use auto-increment keys on this table (although the same underlying bug could be manifesting here of course)
Has something been changed in production lately?  I have not seen this failure for signIn() in our TPS tests for a couple of days now.
Flags: needinfo?(rfkelly)
I'll defer that question to :bobm
Flags: needinfo?(rfkelly) → needinfo?(bobm)
(In reply to Henrik Skupin (:whimboo) from comment #44)
> Has something been changed in production lately? I have not seen this failure
> for signIn() in our TPS tests for a couple of days now.

We've recently upgraded our backend database and the engine that supports it.  There were some similar, though not exact, fixes in the changelog for that release.
Flags: needinfo?(bobm)
At least it has fixed our long-outstanding problem!  I no longer see the failure reported in bug 1066493.  That is really good news!  I will close our bug and keep an eye on it.
Bob, is there any remaining work for us in this bug?
Flags: needinfo?(bobm)
None.  Just for completeness I checked the production sync logs for duplicate errors and found none.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 7 years ago
Flags: needinfo?(bobm)
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard