Closed
Bug 692544
Opened 14 years ago
Closed 13 years ago
Lock wait timeouts should not generate a 503
Categories
(Cloud Services Graveyard :: Server: Sync, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: petef, Assigned: rfkelly)
Details
(Whiteboard: [qa?])
Attachments
(2 files, 1 obsolete file)
patch, 6.07 KB (telliott: review+)
patch, 4.08 KB (telliott: review+)
When we do a node realloc, we generate a lot of 503s due to MySQL "lock wait timeout" errors. As I understand it, it's two transactions battling for locks on the same row. This seems inevitable and expected because people using Sync are likely to have more than one device, and now that things are checking in more often, they are much more likely to collide.
Is there a better way to handle this error than a 503? It's possible that better handling of the 503 is the solution to this bug.
Comment 1•14 years ago
If this is the issue, and it still seems a bit unlikely (though obviously exacerbated under high load), then there are a couple of things we could do. Unfortunately, none are easy or possible at the moment:
1) Send a random backoff with these 503s. (Requires the client to have working backoff handling.)
2) Set some kind of memcache lock. That wouldn't solve the problem, but it would send back an error much faster. We could also do something like a short-lived write IP lock for the user (say, 20 seconds): if another IP comes in for that user while the lock is alive, just turn it down (a rough sketch of this follows).
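Purely as an illustration of option 2 (not an endorsement, and not anything from the Sync codebase), a minimal sketch assuming python-memcached; LOCK_TTL, acquire_write_lock, user_id and client_ip are made-up names:

import memcache

mc = memcache.Client(["127.0.0.1:11211"])
LOCK_TTL = 20  # seconds the write lock stays alive

def acquire_write_lock(user_id, client_ip):
    """Return True if this IP may write for the user right now."""
    key = "write-lock:%s" % user_id
    # add() is atomic and fails if the key already exists,
    # so only the first writer in the window gets the lock.
    if mc.add(key, client_ip, time=LOCK_TTL):
        return True
    # If the same IP already holds the lock, let it keep writing;
    # any other IP gets turned down (e.g. with a 503 plus backoff).
    return mc.get(key) == client_ip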
I like the idea of 1) above and I don't like 2), but these are new features and, as they are not "easy or possible", they should be separate bugs so as not to delay resolving this one ASAP.
Meanwhile, we need this implemented:
0) Send a coherent error message, rather than dumping a stack trace to the application log.
It's okay if it's a 500 or a 503 or whatever.
By "coherent error message" I mean the application needs to stop backtrace-crashing and start responding with "HTTP/1.1 ### Something-Or-Other", where ### is 400, 500, 503, whatever, and also include a JSON error code where applicable.
Comment 4•14 years ago
To clarify: the desire is to stop logging the backtrace on these specific SQL errors.
What we'll need to do is add some way for the exception being caught in CatchErrorMiddleware to tell the catch handler not to log the backtrace (or put it in the hash).
While we're doing that, it might be interesting to give errors the opportunity to specify a "public statement" with a generic what-went-wrong description in cases where it would help the ops team debug, or just a reassuring message like "Servers are really busy right now, please try later".
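A rough sketch of what that could look like. This is not the actual CatchErrorMiddleware; the log_backtrace / public_message attribute names and the BackendTimeoutError class are assumptions for illustration only:

import logging
import traceback

logger = logging.getLogger("syncserver")

class BackendTimeoutError(Exception):
    """Transient backend problem, e.g. a MySQL lock wait timeout."""
    log_backtrace = False
    public_message = "Servers are really busy right now, please try later"

def catch_errors_app(app, environ, start_response):
    """WSGI-style wrapper: turn uncaught errors into a plain 503."""
    try:
        return app(environ, start_response)
    except Exception as exc:
        if getattr(exc, "log_backtrace", True):
            logger.error(traceback.format_exc())
        else:
            # Coherent one-line message instead of a full backtrace.
            logger.error("backend error: %s", exc)
        body = getattr(exc, "public_message", "server error").encode("utf-8")
        start_response("503 Service Unavailable",
                       [("Content-Type", "text/plain"),
                        ("Content-Length", str(len(body)))])
        return [body]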
Reporter
Comment 5•14 years ago
Along the same lines, we see a lot of duplicate key errors. Can't we just lie and say success when we see a duplicate key, since that means the record is already there? (A rough sketch of that idea follows the log excerpts below.)
IntegrityError: (IntegrityError) (1062, u"Duplicate entry '3851723-2-keys' for key 'PRIMARY'") 'INSERT INTO wbo23 (id, username, collection, modified, payload, payload_size, ttl) VALUES (%s, %s, %s, %s, %s, %s, %s)' (u'keys', '3851723', 2, 131797826988, '{"ciphertext":"JLeJCBYcKzAdc8WR/CCv4TuA2pe9ZyGR/SY1M7SzW7l5FWxq+f9lhN32z6jmTSYYxtuMOsAsFCuO3B/8pZ5xmeaRx7+QDfygQdGzBQT3syEs2pYGoiFvCNkAwOtkUf/oUxNkhp4f3BG8YOaoNYNDdOjbjMIvU9jHDwCsfw/4CroBNvpONcCYjki7J72tvik21yVFi7QCF+XzoxbwAqHLGw==","IV":"Nz4NzjRB2OJvHNSgbIqaPw==","hmac":"9a47de227ddd4b7165956e85daa8239aafb3bdc309565e0838845ae2c069fd4b"}', 339, 2100000000)
IntegrityError: (IntegrityError) (1062, u"Duplicate entry '2468486-11' for key 'PRIMARY'") 'INSERT INTO collections (userid, collectionid, name) VALUES (%s, %s, %s)' ('2468486', 11, u'clients')
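For what it's worth, a hedged sketch of the "treat duplicate key as success" idea; safe_insert is a made-up helper and this is not how the server actually handles it (the duplicate-key issue itself is tracked separately, see comment 6 below):

from sqlalchemy.exc import IntegrityError

MYSQL_DUPLICATE_ENTRY = 1062

def safe_insert(connection, insert_stmt, **values):
    """Execute an INSERT, treating a duplicate-key error as success."""
    try:
        connection.execute(insert_stmt, **values)
        return True
    except IntegrityError as exc:
        # exc.orig is the underlying DB-API error; for MySQL drivers its
        # args tuple is (errno, message).
        args = getattr(exc.orig, "args", ())
        if args and args[0] == MYSQL_DUPLICATE_ENTRY:
            # The row is already there, which is what the caller wanted.
            return True
        raise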
Comment 6•14 years ago
The existence of duplicate key messages is a bug and being tracked in Bug 691409.
Updated•14 years ago
Assignee: nobody → rkelly
Updated•13 years ago
Whiteboard: [qa?]
Assignee
Comment 7•13 years ago
:petef or :atoll, is this issue still occurring in production? If so, please attach an example of the stack trace that is being dumped to the application log, so I can find the exception class that isn't being caught.
Assignee
Comment 8•13 years ago
I found this in the stoken logs, which looks like the same exception:
InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction')
So I just need to adjust the backend-error-catching code to intercept InternalError as well as OperationalError. I also need to check that this won't mask legitimate program-level errors, like trying to commit a non-existent transaction.
Comment 9•13 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #8)
> I found this in the stoken logs which looks like the same exception:
>
> InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try
> restarting transaction')
>
> So I just need to adjust the backend-error-catching code to intercept
> InternalError as well as OperationalError. Also must check that this won't
> mask legitimate program-level errors like trying to commit a non-existent
> transaction.
You say "intercept" here, but in the specific case of "InternalError (1205)" we want to retry the entire transaction once.
However, if that's not feasible, then returning a non-backtrace 503 on "InternalError" would be fine.
Assignee
Comment 10•13 years ago
> You say "intercept" here, but in the specific case of "InternalError (1205)"
> we want to retry the entire transaction once.
This is doable, as we already have retry-once logic in place for connection-related errors. Patch attached.
(Looking for the error number in the string repr of the error is not great, but there's no obvious API for extracting it. Toby, if you really dislike this, I will try to dig deeper and find a better method)
> However, if that's not feasible, then returning a non-backtrace 503 on
> "InternalError" would be fine.
Currently there is no such thing as a non-backtrace 503; the app container logs a traceback for all 503s immediately before responding to the client. It's certainly possible to add a non-backtrace variety; are there any other types of error for which the backtrace should be suppressed?
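To make the intent concrete, a simplified sketch of the retry-once approach (not the attached patch itself): safe_execute and is_retryable_db_error are illustrative names, and the "1205" substring test stands in for whatever error-number check the review below settles on.

from sqlalchemy.exc import InternalError, OperationalError, TimeoutError

def is_retryable_db_error(exc):
    """Decide whether a DB error is transient and worth one retry."""
    # Connection-related problems were already retried once.
    if isinstance(exc, (OperationalError, TimeoutError)):
        return True
    # MySQL "Lock wait timeout exceeded" is error 1205 and can be retried.
    if isinstance(exc, InternalError) and "1205" in str(exc):
        return True
    return False

def safe_execute(engine, statement, *args, **kwds):
    """Execute a statement, retrying exactly once on a transient error."""
    try:
        return engine.execute(statement, *args, **kwds)
    except Exception as exc:
        if not is_retryable_db_error(exc):
            raise
        return engine.execute(statement, *args, **kwds)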
Attachment #642885 - Flags: review?(telliott)
Comment 11•13 years ago
Comment on attachment 642885 [details] [diff] [review]
patch to treat lock wait timeouts as a backend error
Review of attachment 642885 [details] [diff] [review]:
-----------------------------------------------------------------
::: services/util.py
@@ +355,5 @@
> + return True
> + # MySQL Lock Wait Timeoute errors can be safely retried.
> + # Unfortunatly there's no standard API for getting the error number.
> + error_message = str(exc)
> + if "1025" in error_message:
So, lock wait time out is Error 1205, and that's a dedicated error code:
http://dev.mysql.com/doc/refman/5.1/en/error-messages-server.html#error_er_lock_wait_timeout
Error: 1205 SQLSTATE: HY000 (ER_LOCK_WAIT_TIMEOUT)
Message: Lock wait timeout exceeded; try restarting transaction
In practice, this *should* result in the error string either starting with or containing (incl. trailing space):
"ERROR 1205 "
So, in theory (I don't know what Python does to the SQL result), error_message.startswith("ERROR 1205 ") should match here.
@@ +366,5 @@
> + """Check whether the given db error should be reported as BackendError."""
> + if isinstance(exc, (OperationalError, TimeoutError)):
> + return True
> + error_message = str(exc)
> + if "1025" in error_message:
1025 is "Error on rename of '%s' to '%s' (errno: %d)"; probably a typo on my part somewhere when filing the bug.
Attachment #642885 - Flags: review-
Comment 12•13 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #10)
> > However, if that's not feasible, then returning a non-backtrace 503 on
> > "InternalError" would be fine.
>
> Currently there is no such thing as a non-backtrace 503, the app container
> logs a traceback for all 503s immediately before responding to the client.
> It's certainly possible to add a non-backtrace variety, are there any other
> types of error for which the backtrace should be suppressed?
My misstatement, apologies. I was gesturing poorly at the difference between an unintentional backtrace and an intentional one. As long as we're *intentionally* throwing a backtrace, it's fine.
Comment 13•13 years ago
Comment on attachment 642885 [details] [diff] [review]
patch to treat lock wait timeouts as a backend error
atoll correctly pointed out the typo in the error code, but I'd take it a step further - there are other places that the string could match. At the least, it should be ' 1205 ' to ensure that it's just that number, and depending on what actually bubbles up to the client, something as direct as 'Error: 1205 ' would be best.
Attachment #642885 - Flags: review?(telliott) → review-
Comment 14•13 years ago
Please match "ERROR 1205 " with the trailing space and all-caps ERROR.
Assignee
Comment 15•13 years ago
(In reply to Toby Elliott [:telliott] from comment #13)
> atoll correctly pointed out the typo in the error code,
Oops.
> but I'd take it a step further - there are other places that the string could match. At the
> least, it should be ' 1205 ' to ensure that it's just that number, and
> depending on what actually bubbles up to the client, something as direct as
> 'Error: 1205 ' would be best.
The thing from Comment 8 is the string representation of the error I receive in Python code:
"InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction')"
It would probably be safe to match against "(1205,".
This is unfortunately an SQLAlchemy error wrapped around a PyMySQL error that has parsed info out of the MySQL error string. So I'll see if I can poke down into it for a more reliable test. There must be a get_error_code() method around here somewhere...
Assignee
Comment 16•13 years ago
OK, taking another crack at this. I found a private API method that can be used to extract the MySQL error number. This is how SQLAlchemy handles errors internally and it seems more reliable than any string-matching we might come up with.
I have also tweaked some tests to push more statements through the safe_execute function; this was useful in testing the changes against different backends.
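For illustration only (the actual patch relies on a private SQLAlchemy dialect method, which is not reproduced here), one way to reach the MySQL error number is via the wrapped DB-API exception; the helper names below are assumptions:

from sqlalchemy.exc import DBAPIError

MYSQL_LOCK_WAIT_TIMEOUT = 1205

def mysql_error_code(exc):
    """Return the MySQL errno from a wrapped SQLAlchemy error, or None."""
    if isinstance(exc, DBAPIError):
        # exc.orig is the original PyMySQL/MySQLdb error; its args tuple
        # is typically (errno, message).
        args = getattr(exc.orig, "args", ())
        if args and isinstance(args[0], int):
            return args[0]
    return None

def is_lock_wait_timeout(exc):
    return mysql_error_code(exc) == MYSQL_LOCK_WAIT_TIMEOUT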
Attachment #642885 - Attachment is obsolete: true
Attachment #648607 - Flags: review?(telliott)
Comment 17•13 years ago
Comment on attachment 648607 [details] [diff] [review]
updated patch to treat lock wait timeouts as a backend error
Looks good. Thanks for putting those db functions in a core lib.
Attachment #648607 - Flags: review?(telliott) → review+
Assignee
Comment 18•13 years ago
Attaching a second patch to forward-port these changes to sync2.0.
Duplicating this is clearly less than ideal. Once we have a consensus on moving forward with a mozsvc splitup, I will factor this layer out into a utility lib that can be shared by sync1.1 and sync2.0.
But in the meantime, I don't want to forget to do it in either place.
Attachment #651644 - Flags: review?(telliott)
Updated•13 years ago
Attachment #651644 - Flags: review?(telliott) → review+
Assignee
Comment 19•13 years ago
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•2 years ago
Product: Cloud Services → Cloud Services Graveyard