Closed Bug 1057892 Opened 10 years ago Closed 7 years ago

Debug apparent bug in TokuDB handling of "ON DUPLICATE KEY" updates

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

x86_64
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rfkelly, Assigned: rfkelly)

References

Details

(Whiteboard: [qa+])

Attachments

(1 file, 2 obsolete files)

(Spinning this out from Bug 1055368 because it deserves more digging into.)

Per Bug 1055368 Comment 14, during our attempt to deploy TokuDB-backed sync nodes into production, we found that they would give mysterious "could not find record" errors after a sequence of operations like:

   INSERT ... ON DUPLICATE KEY UPDATE ...
   DELETE ...
   INSERT ... ON DUPLICATE KEY UPDATE ...

The errors would only trigger when the userids were larger than approximately 4,000,000, which :bobm points out may be some sort of bitmasking-related bug, since 2^22 = 4194304.

Let's take a bit of time to see if we can reproduce this and turn it into an upstream bug, so that we can make use of the otherwise very promising performance improvements offered by TokuDB.
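
For reference, a minimal reproduction attempt along these lines might look like the following (a sketch only: the simplified bso_test table, the connection parameters, and the column set are assumptions, not the real production schema):

    # Hypothetical reproduction sketch for the ON DUPLICATE KEY issue.
    # The table and connection details are assumptions, not the real sync schema.
    import pymysql

    conn = pymysql.connect(host="localhost", user="test", password="test", db="test")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS bso_test (
            userid INT NOT NULL,
            id VARCHAR(64) NOT NULL,
            payload TEXT,
            PRIMARY KEY (userid, id)
        ) ENGINE=TokuDB
    """)

    UPSERT = ("INSERT INTO bso_test (userid, id, payload) VALUES (%s, %s, %s) "
              "ON DUPLICATE KEY UPDATE payload = VALUES(payload)")

    # Try userids on both sides of the suspected 2**22 boundary.
    for userid in (4194303, 4194305, 6913794):
        cur.execute(UPSERT, (userid, "item1", "first write"))
        cur.execute("DELETE FROM bso_test WHERE userid = %s", (userid,))
        # The error reportedly surfaces on the re-insert after the delete.
        cur.execute(UPSERT, (userid, "item1", "second write"))
        conn.commit()

    conn.close()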
Whiteboard: [qa+]
Summary: Debug apprent bug in TokuDB handling of "ON DUPLICATE KEY" updates → Debug apparent bug in TokuDB handling of "ON DUPLICATE KEY" updates
:bobm spun up a "sync-5" TokuDB node to help this effort.  I was unable to reproduce the errors using manual SQL statements or my modified loadtest script from Bug 1055368.  It may be that we need to reach a certain data density in the table before the bug is triggered, so I'm going to beat on this poor node for a while and see if it becomes reproducible.
Nope.  The new node is impervious to my tests and apparently would have passed the deployment loadtests from 1055368 with flying colours, including combined TS+sync loadtest.

:bobm, can you please sanity-check this node to ensure that it's in the same state as the previous failing nodes?  It looks the same to me but I don't really know what to check.

:jbonacci when you get a chance, can you please loadtest sync-5 like you would for an outgoing production deploy?  Maybe your magic QA touch can re-surface this bug.
Flags: needinfo?(jbonacci)
Flags: needinfo?(bobm)
(In reply to Ryan Kelly [:rfkelly] from comment #2)

> :bobm, can you please sanity-check this node to ensure that it's in the same
> state as the previous failing nodes?  It looks the same to me but I don't
> really know what to check.

I verified everything is as it should be.  The node was built the same way as the previous lot, it has the correct TokuDB table settings in the DB (same as the others), and passes the verification tests we use to deploy.  Plus I had a look over the running processes, instance settings, and RPM versions: all check out.
Flags: needinfo?(bobm)
I need the "secret" for Sync5.
:bobm can you send that to me? Thanks.
Status: NEW → ASSIGNED
Flags: needinfo?(jbonacci)
(In reply to James Bonacci [:jbonacci] from comment #4)
> I need the "secret" for Sync5.
> :bobm can you send that to me? Thanks.

Stage secret sent.
OK, ran a 30min isolated load test against Sync5:
https://loads.services.mozilla.com/run/fb5fcbc2-24ad-404c-b845-b6fd29c2c819

Found 499s at the end of the nginx access.log file
Found 'Broken pipe' errors near the end of the sync.log file
But none of the strange 500 errors and tracebacks we saw on the previous test.  I don't know what else to try here to reproduce it.
OK, not quite.
The test environment/method is not quite the same.

1. We need Sync5 to be part of the grouping of nodes seen by TS
2. We need to run a longer test - 60+ min
3. We need to run a combined test
Did another 30min this time with all 5 sync nodes.
On Sync5:
Nginx access log: nothing unusual here: 200s, 404s, etc.

sync.log:
200s and 404s
The 'Broken pipe' errors/exceptions
The 'QueuePoolWithMaxBacklog' errors/timeouts
The 400s, 304s, 412s, 415s.

That's it. No 500s or tracebacks...
Blocks: 1055368
No longer blocks: 1055368
Blocks: 1059025
I want to mark this Resolved/Fixed.
We did chat about this in IRC and decided to move forward with a deploy of TokuDB (rather than InnoDB) to Stage and Prod.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
server-syncstorage 1.5.9 is now in Production, including some nodes with TokuDB.
Status: RESOLVED → VERIFIED
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Re-opening because we're seeing these errors in prod.
Sample of errors seen on a prod node:

sync103:
      7 (1032, u\"Can't find record in 'bso0'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     43 (1032, u\"Can't find record in 'bso10'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
      9 (1032, u\"Can't find record in 'bso11'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
      4 (1032, u\"Can't find record in 'bso12'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
     45 (1032, u\"Can't find record in 'bso13'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
     30 (1032, u\"Can't find record in 'bso14'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
      5 (1032, u\"Can't find record in 'bso15'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
      6 (1032, u\"Can't find record in 'bso16'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
      5 (1032, u\"Can't find record in 'bso17'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
     12 (1032, u\"Can't find record in 'bso18'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
      4 (1032, u\"Can't find record in 'bso19'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
     36 (1032, u\"Can't find record in 'bso1'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     12 (1032, u\"Can't find record in 'bso2'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     47 (1032, u\"Can't find record in 'bso3'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     12 (1032, u\"Can't find record in 'bso4'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     24 (1032, u\"Can't find record in 'bso5'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     13 (1032, u\"Can't find record in 'bso6'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     30 (1032, u\"Can't find record in 'bso7'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     15 (1032, u\"Can't find record in 'bso8'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
     58 (1032, u\"Can't find record in 'bso9'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b
      1 r) (1032, u\"Can't find record in 'bso11'\") '/* [queryName=UPSERT_bso] */ INSERT INTO
      1 r) (1032, u\"Can't find record in 'bso9'\") '/* [queryName=UPSERT_bso] */ INSERT INTO b

Let's try to get a summary of the userids affected on this node, and then we can check for (a) similarity between them, and (b) whether they are successfully recovering from the error.
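
As a rough sketch of that extraction (assuming the numeric userid appears in each traceback line as part of a /1.5/<userid>/storage/ path, which is an assumption about the log format):

    # Rough sketch: pull userids out of traceback lines, assuming the
    # request path appears in each line as /1.5/<userid>/storage/...
    # (the exact log format is an assumption here).
    import re

    USERID_RE = re.compile(r"/1\.5/(\d+)/storage/")

    def affected_userids(path):
        ids = set()
        with open(path) as f:
            for line in f:
                m = USERID_RE.search(line)
                if m:
                    ids.add(int(m.group(1)))
        return ids

    print(sorted(affected_userids("tracebacks.txt")))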
Bob sent me a sample of the full tracebacks from one machine.  Picking the first line from it, I see user 6913794 getting this error during a post to their bookmarks collection.  So let's dig into this user's experience of the bug.

They account for 24 lines in the file Bob sent me, across different collections.  Bob, can you keep an eye out for any further occurrences of the error for this user?  It will be very interesting to see whether and how it resolves.

(Unfortunately, because these are all POST requests, they show up in the logs as "200 OK" with errors in the body.  So it's kinda hard to see whether this user has had successful syncs since then.)

I can confirm that this user did a "DELETE /storage" request at "2014-09-11T05:55:48.000Z", just before they started showing up in tracebacks here at "2014-09-11T05:56:42.202049Z".  They also seem to have hit this error many times in quick succession after that, perhaps once for each collection that they attempted to sync.
Here's approximately the full set of affected userids that I extracted from the tracebacks (55 in total):

set([6628736, 6913794, 6585606, 6641161, 6591631, 6597137, 6638179, 6586003, 6586133, 6619175, 6604186, 6619418, 6618014, 6624544, 6593286, 6587814, 6591271, 6589608, 6585260, 6588845, 6592942, 6586418, 6598836, 6633273, 6605498, 6589371, 6640613, 6585794, 6631243, 6635853, 6618286, 6609233, 6600787, 6588885, 6589527, 6601444, 6610523, 6629285, 6585826, 1531, 6592100, 6589094, 1895, 6637290, 6597543, 6591341, 6598255, 6586480, 6612243, 6597749, 6618705, 6637431, 6631928, 1403, 6642301])
It would be great if we could confirm that at least one of these users is successfully syncing despite having triggered this error in the past.  To do so would probably require finding successful POST requests from that user in the nginx logs, then checking that they do not correspond to any "Caught operation db error" lines in the app log.
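
A rough sketch of that cross-check (the log paths, formats, and regexes here are assumptions about the deployment, not confirmed):

    # Sketch: count successful POSTs per userid in the nginx access log and
    # flag userids that still appear in "Caught operation db error" lines in
    # the app log.  Log locations and line formats are assumptions.
    import re

    POST_RE = re.compile(r'"POST /1\.5/(\d+)/storage/\S+ HTTP/1\.\d" 200 ')
    ERROR_RE = re.compile(r"Caught operation db error.*?/1\.5/(\d+)/")

    def post_counts(access_log):
        counts = {}
        with open(access_log) as f:
            for line in f:
                m = POST_RE.search(line)
                if m:
                    uid = int(m.group(1))
                    counts[uid] = counts.get(uid, 0) + 1
        return counts

    def erroring_userids(app_log):
        errs = set()
        with open(app_log) as f:
            for line in f:
                m = ERROR_RE.search(line)
                if m:
                    errs.add(int(m.group(1)))
        return errs

    posts = post_counts("access.log")
    errors = erroring_userids("sync.log")
    for uid in sorted(posts):
        status = "still erroring" if uid in errors else "recovered"
        print(uid, posts[uid], status)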
(In reply to Ryan Kelly [:rfkelly] from comment #16)
> It would be great if we could confirm that at least one of these users is
> successfully syncing despite having triggered this error in the past.

I'll have a look.
Attached file error-1410698301938.txt (obsolete) —
Wow, I think I may be hitting this myself.  I'm getting a 503 while attempting to fetch /crypto/keys - see attached client log, node 105, userid 6809503.  Bob, can you please check the logs and see if I'm showing up with a traceback?

I'll monitor the behavior of my client and see if it resolves itself...
Flags: needinfo?(bobm)
Hmm... no, actually that's just a GET, which ordinarily wouldn't show up as this bug.  I do see a previous failing POST, but IIRC the usual behavior is for GETs to continue to succeed.
No 'Caught operation db error' errors in the sync.log file over the last two days.  Here's a sample of the POSTs for some of the users in comment 15 today.

User: A
Code: 200 Count: 1
Endpoint: /storage/tabs Count: 1
User: B
Code: 200 Count: 69
Endpoint: /storage/forms Count: 4
Endpoint: /storage/clients Count: 2
Endpoint: /storage/tabs Count: 2
Endpoint: /storage/passwords Count: 2
Endpoint: /storage/history Count: 55
Endpoint: /storage/bookmarks Count: 2
Endpoint: /storage/prefs Count: 2
User: C
Code: 200 Count: 4
Endpoint: /storage/clients Count: 1
Endpoint: /storage/greasemonkey Count: 3
User: D
Code: 200 Count: 147
Endpoint: /storage/forms Count: 19
Endpoint: /storage/clients Count: 8
Endpoint: /storage/tabs Count: 56
Endpoint: /storage/bookmarks Count: 7
Endpoint: /storage/history Count: 52
Endpoint: /storage/addons Count: 1
Endpoint: /storage/prefs Count: 4
User: E
Code: 200 Count: 37
Endpoint: /storage/forms Count: 1
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 16
Endpoint: /storage/passwords Count: 2
Endpoint: /storage/history Count: 17
User: F
Code: 200 Count: 25
Endpoint: /storage/forms Count: 7
Endpoint: /storage/tabs Count: 7
Endpoint: /storage/bookmarks Count: 2
Endpoint: /storage/history Count: 9
User: G
Code: 200 Count: 36
Endpoint: /storage/forms Count: 3
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 2
Endpoint: /storage/history Count: 2
Endpoint: /storage/bookmarks Count: 26
Endpoint: /storage/addons Count: 1
Endpoint: /storage/prefs Count: 1
User: H
Code: 200 Count: 23
Endpoint: /storage/forms Count: 2
Endpoint: /storage/tabs Count: 10
Endpoint: /storage/passwords Count: 1
Endpoint: /storage/history Count: 10
User: I
Code: 200 Count: 22
Endpoint: /storage/forms Count: 1
Endpoint: /storage/clients Count: 4
Endpoint: /storage/tabs Count: 9
Endpoint: /storage/passwords Count: 1
Endpoint: /storage/history Count: 7
User: J
Code: 200 Count: 22
Endpoint: /storage/forms Count: 3
Endpoint: /storage/clients Count: 3
Endpoint: /storage/tabs Count: 6
Endpoint: /storage/passwords Count: 4
Endpoint: /storage/history Count: 5
Endpoint: /storage/bookmarks Count: 1
User: K
Code: 200 Count: 6
Endpoint: /storage/forms Count: 1
Endpoint: /storage/tabs Count: 2
Endpoint: /storage/passwords Count: 1
Endpoint: /storage/history Count: 2
User: L
Code: 200 Count: 34
Endpoint: /storage/forms Count: 5
Endpoint: /storage/greasemonkey Count: 5
Endpoint: /storage/tabs Count: 12
Endpoint: /storage/history Count: 11
Endpoint: /storage/prefs Count: 1
User: M
Code: 200 Count: 2
Endpoint: /storage/clients Count: 1
Endpoint: /storage/history Count: 1
User: N
Code: 200 Count: 24
Endpoint: /storage/forms Count: 1
Endpoint: /storage/clients Count: 4
Endpoint: /storage/tabs Count: 10
Endpoint: /storage/history Count: 9
User: O
Code: 200 Count: 6
Endpoint: /storage/forms Count: 2
Endpoint: /storage/clients Count: 2
Endpoint: /storage/tabs Count: 2
User: P
Code: 200 Count: 16
Endpoint: /storage/forms Count: 2
Endpoint: /storage/tabs Count: 7
Endpoint: /storage/history Count: 7
User: Q
Code: 200 Count: 14
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 7
Endpoint: /storage/passwords Count: 1
Endpoint: /storage/history Count: 5
User: R
Code: 200 Count: 15
Endpoint: /storage/forms Count: 3
Endpoint: /storage/tabs Count: 6
Endpoint: /storage/history Count: 6
User: S
Code: 200 Count: 28
Endpoint: /storage/forms Count: 1
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 10
Endpoint: /storage/history Count: 16
User: T
Code: 200 Count: 15
Endpoint: /storage/clients Count: 4
Endpoint: /storage/tabs Count: 11
User: U
Code: 200 Count: 7
Endpoint: /storage/forms Count: 1
Endpoint: /storage/clients Count: 2
Endpoint: /storage/tabs Count: 2
Endpoint: /storage/history Count: 2
User: V
Code: 200 Count: 11
Endpoint: /storage/forms Count: 2
Endpoint: /storage/clients Count: 2
Endpoint: /storage/tabs Count: 4
Endpoint: /storage/history Count: 3
User: X
Code: 200 Count: 8
Endpoint: /storage/forms Count: 1
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 3
Endpoint: /storage/passwords Count: 1
Endpoint: /storage/history Count: 2
User: Y
Code: 200 Count: 5
Endpoint: /storage/forms Count: 1
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 1
Endpoint: /storage/history Count: 2
User: Z
Code: 200 Count: 21
Endpoint: /storage/clients Count: 3
Endpoint: /storage/tabs Count: 5
Endpoint: /storage/history Count: 13
User: AA
Code: 200 Count: 9
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 3
Endpoint: /storage/bookmarks Count: 2
Endpoint: /storage/history Count: 3
User: BB
Code: 200 Count: 32
Endpoint: /storage/forms Count: 7
Endpoint: /storage/tabs Count: 9
Endpoint: /storage/passwords Count: 7
Endpoint: /storage/history Count: 9
User: CC
Code: 200 Count: 53
Endpoint: /storage/forms Count: 6
Endpoint: /storage/clients Count: 5
Endpoint: /storage/tabs Count: 22
Endpoint: /storage/history Count: 20
User: DD
Code: 200 Count: 12
Endpoint: /storage/forms Count: 2
Endpoint: /storage/bookmarks Count: 2
Endpoint: /storage/passwords Count: 1
Endpoint: /storage/history Count: 7
User: EE
Code: 200 Count: 3
Endpoint: /storage/forms Count: 1
Endpoint: /storage/tabs Count: 1
Endpoint: /storage/history Count: 1
User: FF
Code: 200 Count: 35
Endpoint: /storage/forms Count: 2
Endpoint: /storage/clients Count: 4
Endpoint: /storage/tabs Count: 17
Endpoint: /storage/history Count: 12
User: GG
Code: 200 Count: 72
Endpoint: /storage/forms Count: 10
Endpoint: /storage/clients Count: 2
Endpoint: /storage/tabs Count: 30
Endpoint: /storage/passwords Count: 1
Endpoint: /storage/history Count: 28
Endpoint: /storage/prefs Count: 1
User: HH
Code: 200 Count: 63
Endpoint: /storage/forms Count: 5
Endpoint: /storage/clients Count: 3
Endpoint: /storage/tabs Count: 25
Endpoint: /storage/passwords Count: 2
Endpoint: /storage/bookmarks Count: 4
Endpoint: /storage/history Count: 24
User: II
Code: 200 Count: 19
Endpoint: /storage/forms Count: 1
Endpoint: /storage/tabs Count: 7
Endpoint: /storage/history Count: 11
User: JJ
Code: 200 Count: 2
Endpoint: /storage/clients Count: 2
User: KK
Code: 200 Count: 20
Endpoint: /storage/forms Count: 3
Endpoint: /storage/clients Count: 5
Endpoint: /storage/tabs Count: 6
Endpoint: /storage/passwords Count: 2
Endpoint: /storage/history Count: 4
User: LL
Code: 200 Count: 23
Endpoint: /storage/forms Count: 2
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 11
Endpoint: /storage/history Count: 9
User: MM
Code: 200 Count: 6
Endpoint: /storage/clients Count: 2
Endpoint: /storage/tabs Count: 2
Endpoint: /storage/history Count: 2
User: NN
Code: 200 Count: 58
Endpoint: /storage/forms Count: 3
Endpoint: /storage/clients Count: 1
Endpoint: /storage/tabs Count: 25
Endpoint: /storage/history Count: 28
Endpoint: /storage/addons Count: 1
User: OO
Code: 200 Count: 164
Endpoint: /storage/forms Count: 24
Endpoint: /storage/clients Count: 5
Endpoint: /storage/tabs Count: 66
Endpoint: /storage/bookmarks Count: 3
Endpoint: /storage/passwords Count: 7
Endpoint: /storage/history Count: 59
User: PP
Code: 200 Count: 32
Endpoint: /storage/forms Count: 1
Endpoint: /storage/history Count: 31
User: QQ
Code: 200 Count: 123
Endpoint: /storage/forms Count: 8
Endpoint: /storage/clients Count: 5
Endpoint: /storage/tabs Count: 56
Endpoint: /storage/history Count: 52
Endpoint: /storage/prefs Count: 2
Flags: needinfo?(bobm)
I think this means that users recover from this problem.  Anything else I should check?
Users are clearly recovering from the problem - at least in the sense that they can subsequently post data and don't have a consistently failing index or something disturbing.

The only issue I can see potentially remaining is what's causing the error in the first place and whether it's a sign of data loss.  Given that the error makes no sense in the context of the query even if we assume data loss, I don't think this is something we should be concerned about.  I'm guessing this is something deep inside the TokuDB engine that gets rolled up into a 1032 error.  They should be made aware of it, but I think we're OK for now.

Unless it starts happening a lot...
Per discussion with TokuDB support, this may be due to using read-uncommitted isolation level, which is a hangover from our InnoDB setup.  Bob has filed Bug 1068424 to investigate performance at higher levels, and we should watch out for this bug during such tests.  (And report back to Tokutek if we find anything interesting either way)
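
For illustration, raising the isolation level at the SQLAlchemy engine level would look something like this (a sketch only: the URL and surrounding options are placeholders, not our actual connector configuration):

    # Sketch only: raising the transaction isolation level when creating the
    # SQLAlchemy engine, instead of the READ UNCOMMITTED carried over from
    # the InnoDB setup.  The URL and option plumbing here are assumptions.
    from sqlalchemy import create_engine

    engine = create_engine(
        "mysql+pymysql://sync:secret@db-host/sync",
        isolation_level="READ COMMITTED",  # instead of "READ UNCOMMITTED"
        pool_recycle=3600,
    )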
Depends on: 1068424
Just to note here too: over on bug 1066493 we are hitting this problem a lot while running our TPS tests for Firefox Sync against Nightly builds.  It happens nearly daily but fluctuates for some reason; for example, we didn't see it over the weekend but got a spike again on Monday.

Is there anything we can do here to stop the tests from failing?
Bob, do we see this on any requests besides PUT to /meta/global?  Perhaps we can hack around it in the code by using a different code-path for this specific call (e.g. a plain INSERT/UPDATE pair rather than ON DUPLICATE KEY), which should be fairly low impact for just the single-item /meta/global PUT.
Flags: needinfo?(bobm)
More concretely, here's something like the workaround I'm considering - just bypass the ON DUPLICATE KEY stuff if it's a single-item PUT.  We could limit it more specifically to just /meta/global but that hardly seems worth it.
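
In rough outline, the idea looks like this (a sketch of the approach rather than the actual patch: the table/column names are simplified and the helper function is hypothetical):

    # Sketch of the workaround idea: for a single-item PUT, try a plain
    # UPDATE first and fall back to INSERT if no row was touched, instead of
    # using INSERT ... ON DUPLICATE KEY UPDATE.  Names are simplified
    # assumptions, not the real syncstorage code.
    def upsert_single_item(cursor, userid, collection, item_id, payload, modified):
        cursor.execute(
            "UPDATE bso SET payload = %s, modified = %s "
            "WHERE userid = %s AND collection = %s AND id = %s",
            (payload, modified, userid, collection, item_id),
        )
        # Note: MySQL reports matched-but-unchanged rows as 0 affected unless
        # the connection sets CLIENT_FOUND_ROWS, so a real implementation
        # would need to account for that case.
        if cursor.rowcount == 0:
            cursor.execute(
                "INSERT INTO bso (userid, collection, id, payload, modified) "
                "VALUES (%s, %s, %s, %s, %s)",
                (userid, collection, item_id, payload, modified),
            )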
Attachment #8513926 - Flags: review?(telliott)
Hmm, actually my previous comment says:

> Bob sent me a sample of the full tracebacks from one machine.  Picking the first line from it,
> I see user 6913794 getting this error during a post to their bookmarks collection.

But the posted workaround may still be a win if most of the errors are on /meta/global, since (a) this is blocking TPS, and (b) I'm pretty sure it fails with an error bar on /meta/global, but will fail silently and retry on other collections.
Attachment #8489118 - Attachment is obsolete: true
(In reply to Ryan Kelly [:rfkelly] from comment #27)
> But the posted workaround may still be a win if most of the errors are on
> /meta/global, since (a) this is blocking TPS, and (b) I'm pretty sure it
> fails with an error bar on /meta/global, but will fail silently and retry on
> other collections.

A potential list from the last 48 hours shows the following HTTP method breakdown, after removing some anomalous servers:
1. PUT      66
2. GET       7
3. DELETE    3

Note these aren't all necessarily related to this problem.  However, all of the PUT failures are to meta/global.
Flags: needinfo?(bobm)
In addition to this workaround, we could also "tag" failing requests with a special metrics entry to indicate that they trigger this bug.  This would make Bob's life easier when trying to correlate the tracebacks to the actual requests that triggered them.
We're deploying an updated TokuDB, which may be enough to perturb this weird behaviour out of existence: Bug 1089945.
Depends on: 1089945
Here's a revised patch that applies the workaround only for the "meta" collection.  It also adds a "syncstorage.storage.sql.tokudb_error" entry to the request summary JSON line, which should make it easier to detect which requests are hitting this error without :bobm having to crawl through logfile tracebacks.

I think we should target this for our next deployment and see what effect it has on the error rate.
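
The tagging part is roughly this idea (a sketch only: the request.metrics hook shown here is a hypothetical stand-in for the real logging plumbing in syncstorage):

    # Sketch of the tagging idea: when the backend catches the 1032 error,
    # mark the current request so it shows up in the summary JSON line.
    # The `request.metrics` dict is a hypothetical stand-in for the real
    # metrics hook, and the error-code check assumes a DB-API style
    # exception with the MySQL error code as args[0].
    def execute_with_tokudb_tagging(request, cursor, sql, params):
        try:
            cursor.execute(sql, params)
        except Exception as exc:
            if getattr(exc, "args", (None,))[0] == 1032:
                request.metrics["syncstorage.storage.sql.tokudb_error"] = 1
            raise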
Attachment #8513926 - Attachment is obsolete: true
Attachment #8513926 - Flags: review?(telliott)
Attachment #8514015 - Flags: review?(telliott)
Is there any chance of getting this reviewed and deployed soon?  We have to shut down the old TPS CI system soon, and the new system keeps hitting all of these failures.
Flags: needinfo?(telliott)
Comment on attachment 8514015 [details] [diff] [review]
sync-on-duplicate-key-workaround.diff

Review of attachment 8514015 [details] [diff] [review]:
-----------------------------------------------------------------

Not loving the patch in general, but if it's needed to quiet the failures, guess we'll live with it.

::: syncstorage/storage/sql/dbconnect.py
@@ +413,2 @@
>          if mysql_error_code in (1205, 1206, 1213, 1689, 1032):
> +            if mysql_error_code == 1032:

Unnecessary nesting?

@@ +749,5 @@
>          if self._connector.driver == "mysql":
> +            # We are seeing a sporadic bug in production where using
> +            # ON DUPLICATE KEY errors out.  This workaround avoids it
> +            # the common case of PUT /meta/global (collection id 6 is
> +            # the "meta" collection in production).

Not sure what this comment is trying to say. missing an "in"?
Attachment #8514015 - Flags: review?(telliott) → review+
Flags: needinfo?(telliott)
Blocks: 1094588
> unecessary nesting?

*nod* I moved it out one level of nesting and it does seem clearer.

> Not sure what this comment is trying to say. missing an "in"?

Yes, should be "...avoids it in the common case of..."

Committed with the above fixes in https://github.com/mozilla-services/server-syncstorage/commit/055de05206194117531dd19a355805af939d2ad0

Tagged for deployment in Bug 1094587, and I filed Bug 1094588 to follow up on eventual removal of this workaround.
We rolled out the workaround, but we should take care not to forget about this issue.  AFAICT our next steps are:

* Deploy updated TokuDB and see if the problem persists (Bug 1089945)
* If not, then escalate to paid TokuDB support
Adding legal bug for the TokuDB support: https://bugzilla.mozilla.org/show_bug.cgi?id=1108705

I don't hold sufficient permissions to view that bug, so I cannot add it as a blocker.
Unfortunately, the workaround from Comment 31 does not seem to have helped.  Rather than the 1032 errors, we're now seeing errors like this:

    (IntegrityError) (1062, u\"Duplicate entry \\'16603449-6-global\\' for key \\'PRIMARY\\'\")',)

I suspect it is the same underlying bug in the db, just now being triggered in a different way:

  * The workaround logic is basically "try to update the row, if it's not found then insert it"
  * It tries to update the row, but it's "not found" due to the same weirdness that triggers "can't find record" in the previous code
  * So it tries to insert the row, but can totally find it this time, so fails out with an IntegrityError

This suggests that the underlying bug is something more fundamental, rather than just a problem with the implementation of ON DUPLICATE KEY.  Bleh.

Perhaps Bug 1089945 will come to the rescue here?
Ryan, I assume that means we now have to wait for TokuDB to provide a fix, and we do not know when that will happen.  Which means our test failures will remain for a while.
(In reply to Henrik Skupin (:whimboo) from comment #38)
> Ryan, I assume that means we now have to wait for TokuDB to provide a fix,
> and we do not know when that will happen. Which means our test failures
> will remain for a while.

You could also try using the same user instead of deleting it every time, or running a dedicated testing infrastructure that isn't TokuDB backed.
> running a dedicated testing infrastructure that isn't TokuDB backed.

If this is an option for you, I'm happy to advise on and help set up such a thing.

Sadly I can't think of another option for papering over this on the server side.  Could the tests be changed to treat 503s as ignore/skip rather than as failure?  Not ideal but maybe better than persistent failures.
We cannot run with a single user, given that all the setup nodes run tests at the same time.  That means we would need some number of users, and I do not see how that scales over time.

Also, the tests were created as end-to-end tests to exercise the path a real user takes.  Setting up something local would drastically change that vision, and I don't think that is the right way to go.

Anyway, this is getting off-topic for this bug; we may just have to wait until it is fixed, given that it is a real issue we are facing here.
Possible solution here: https://mariadb.atlassian.net/browse/MDEV-7150  Will roll this out to stage.
We don't use auto-increment keys on this table (although the same underlying bug could be manifesting here of course)
Has something been changed in production lately?  I have not seen this failure for signIn() in our TPS tests for a couple of days now.
Flags: needinfo?(rfkelly)
I'll defer that question to :bobm
Flags: needinfo?(rfkelly) → needinfo?(bobm)
(In reply to Henrik Skupin (:whimboo) from comment #44)
> Has something been changed in production lately? I have not seen this failure
> for signIn() in our TPS tests for a couple of days now.

We've recently upgraded our backend database and the engine that supports it.  There were some similar, though not exact, fixes in the changelog for that release.
Flags: needinfo?(bobm)
At least it has fixed our long-outstanding problem!  I no longer see the failure reported in bug 1066493.  That is really good news!  I will close our bug and keep an eye on it.
Bob, is there any remaining work for us in this bug?
Flags: needinfo?(bobm)
None.  Just for completeness I checked the production sync logs for duplicate errors and found none.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 7 years ago
Flags: needinfo?(bobm)
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard