Closed
Bug 692544
Opened 14 years ago
Closed 13 years ago
Lock wait timeouts should not generate a 503
Categories
(Cloud Services Graveyard :: Server: Sync, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: petef, Assigned: rfkelly)
Details
(Whiteboard: [qa?])
Attachments
(2 files, 1 obsolete file)
patch, 6.07 KB (telliott: review+)
patch, 4.08 KB (telliott: review+)
When we do a node realloc, we generate a lot of 503s due to MySQL "lock wait timeout" errors. As I understand it, it's two transactions battling for locks on the same row. This seems inevitable and expected because people using Sync are likely to have more than one device, and now that things are checking in more often, they are much more likely to collide.
Is there a better way to handle this error than a 503? It's possible that better handling of the 503 is the solution to this bug.
Comment 1•14 years ago
If this is the issue, and it still seems a bit unlikely (though obviously exacerbated under high load), then there are a couple of things we could do. Unfortunately, none are easy or possible at the moment:
1) Send a random backoff with these 503s. (Requires the client to have working backoff handling.)
2) Set some kind of memcache lock. That wouldn't solve the problem, but it would send back an error much faster. We could also do something like a short-lived write IP lock for the user (say, 20 seconds): if another IP comes in for that user while the lock is alive, just turn it down (a rough sketch of this follows).
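Purely as an illustration of option 2 (not an endorsement, and not anything from the Sync codebase), a minimal sketch assuming python-memcached; LOCK_TTL, acquire_write_lock, user_id and client_ip are made-up names:

import memcache

mc = memcache.Client(["127.0.0.1:11211"])
LOCK_TTL = 20  # seconds the write lock stays alive

def acquire_write_lock(user_id, client_ip):
    """Return True if this IP may write for the user right now."""
    key = "write-lock:%s" % user_id
    # add() is atomic and fails if the key already exists,
    # so only the first writer in the window gets the lock.
    if mc.add(key, client_ip, time=LOCK_TTL):
        return True
    # If the same IP already holds the lock, let it keep writing;
    # any other IP gets turned down (e.g. with a 503 plus backoff).
    return mc.get(key) == client_ip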
I like the idea of 1) above and I don't like 2), but these are new features and, as they are not "easy or possible", they should be separate bugs so as not to delay resolving this one ASAP.
Meanwhile, we need this implemented:
0) Send a coherent error message, rather than dumping a stack trace to the application log.
It's okay if it's a 500 or a 503 or whatever.
By "coherent error message" I mean the application needs to stop backtrace-crashing and start responding with "HTTP/1.1 ### Something-Or-Other", where ### is 400, 500, 503, whatever, and also include a JSON error code where applicable.
Comment 4•14 years ago
To clarify: the desire is to stop logging the backtrace on these specific SQL errors.
What we'll need to do is add some way for the exception being caught in CatchErrorMiddleware to tell the catch handler not to log the backtrace (or put it in the hash).
While we're doing that, it might be interesting to give errors the opportunity to specify a "public statement" with a generic what-went-wrong description in cases where it would help the ops team debug, or just a reassuring message like "Servers are really busy right now, please try later".
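A rough sketch of what that could look like. This is not the actual CatchErrorMiddleware; the log_backtrace / public_message attribute names and the BackendTimeoutError class are assumptions for illustration only:

import logging
import traceback

logger = logging.getLogger("syncserver")

class BackendTimeoutError(Exception):
    """Transient backend problem, e.g. a MySQL lock wait timeout."""
    log_backtrace = False
    public_message = "Servers are really busy right now, please try later"

def catch_errors_app(app, environ, start_response):
    """WSGI-style wrapper: turn uncaught errors into a plain 503."""
    try:
        return app(environ, start_response)
    except Exception as exc:
        if getattr(exc, "log_backtrace", True):
            logger.error(traceback.format_exc())
        else:
            # Coherent one-line message instead of a full backtrace.
            logger.error("backend error: %s", exc)
        body = getattr(exc, "public_message", "server error").encode("utf-8")
        start_response("503 Service Unavailable",
                       [("Content-Type", "text/plain"),
                        ("Content-Length", str(len(body)))])
        return [body]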
Reporter
Comment 5•14 years ago
Along the same lines, we see a lot of duplicate key errors. Can't we just lie and say success when we see a duplicate key, since that means the record is already there? (A rough sketch of that idea follows the log excerpts below.)
IntegrityError: (IntegrityError) (1062, u"Duplicate entry '3851723-2-keys' for key 'PRIMARY'") 'INSERT INTO wbo23 (id, username, collection, modified, payload, payload_size, ttl) VALUES (%s, %s, %s, %s, %s, %s, %s)' (u'keys', '3851723', 2, 131797826988, '{"ciphertext":"JLeJCBYcKzAdc8WR/CCv4TuA2pe9ZyGR/SY1M7SzW7l5FWxq+f9lhN32z6jmTSYYxtuMOsAsFCuO3B/8pZ5xmeaRx7+QDfygQdGzBQT3syEs2pYGoiFvCNkAwOtkUf/oUxNkhp4f3BG8YOaoNYNDdOjbjMIvU9jHDwCsfw/4CroBNvpONcCYjki7J72tvik21yVFi7QCF+XzoxbwAqHLGw==","IV":"Nz4NzjRB2OJvHNSgbIqaPw==","hmac":"9a47de227ddd4b7165956e85daa8239aafb3bdc309565e0838845ae2c069fd4b"}', 339, 2100000000)
IntegrityError: (IntegrityError) (1062, u"Duplicate entry '2468486-11' for key 'PRIMARY'") 'INSERT INTO collections (userid, collectionid, name) VALUES (%s, %s, %s)' ('2468486', 11, u'clients')
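For what it's worth, a hedged sketch of the "treat duplicate key as success" idea; safe_insert is a made-up helper and this is not how the server actually handles it (the duplicate-key issue itself is tracked separately, see comment 6 below):

from sqlalchemy.exc import IntegrityError

MYSQL_DUPLICATE_ENTRY = 1062

def safe_insert(connection, insert_stmt, **values):
    """Execute an INSERT, treating a duplicate-key error as success."""
    try:
        connection.execute(insert_stmt, **values)
        return True
    except IntegrityError as exc:
        # exc.orig is the underlying DB-API error; for MySQL drivers its
        # args tuple is (errno, message).
        args = getattr(exc.orig, "args", ())
        if args and args[0] == MYSQL_DUPLICATE_ENTRY:
            # The row is already there, which is what the caller wanted.
            return True
        raise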
Comment 6•14 years ago
The existence of duplicate key messages is a bug and being tracked in Bug 691409.
Updated•14 years ago
Assignee: nobody → rkelly
Updated•13 years ago
Whiteboard: [qa?]
Assignee
Comment 7•13 years ago
:petef or :atoll, is this issue still occurring in production? If so, please attach an example of the stack trace that is being dumped to the application log, so I can find the exception class that isn't being caught.
Assignee
Comment 8•13 years ago
I found this in the stoken logs, which looks like the same exception:
InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction')
So I just need to adjust the backend-error-catching code to intercept InternalError as well as OperationalError. I also need to check that this won't mask legitimate program-level errors, like trying to commit a non-existent transaction.
Comment 9•13 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #8)
> I found this in the stoken logs which looks like the same exception:
>
> InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try
> restarting transaction')
>
> So I just need to adjust the backend-error-catching code to intercept
> InternalError as well as OperationalError. Also must check that this won't
> mask legitimate program-level errors like trying to commit a non-existent
> transaction.
You say "intercept" here, but in the specific case of "InternalError (1205)" we want to retry the entire transaction once.
However, if that's not feasible, then returning a non-backtrace 503 on "InternalError" would be fine.
Assignee
Comment 10•13 years ago
> You say "intercept" here, but in the specific case of "InternalError (1205)"
> we want to retry the entire transaction once.
This is doable, as we already have retry-once logic in place for connection-related errors. Patch attached.
(Looking for the error number in the string repr of the error is not great, but there's no obvious API for extracting it. Toby, if you really dislike this, I will try to dig deeper and find a better method)
> However, if that's not feasible, then returning a non-backtrace 503 on
> "InternalError" would be fine.
Currently there is no such thing as a non-backtrace 503; the app container logs a traceback for all 503s immediately before responding to the client. It's certainly possible to add a non-backtrace variety; are there any other types of error for which the backtrace should be suppressed?
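To make the intent concrete, a simplified sketch of the retry-once approach (not the attached patch itself): safe_execute and is_retryable_db_error are illustrative names, and the "1205" substring test stands in for whatever error-number check the review below settles on.

from sqlalchemy.exc import InternalError, OperationalError, TimeoutError

def is_retryable_db_error(exc):
    """Decide whether a DB error is transient and worth one retry."""
    # Connection-related problems were already retried once.
    if isinstance(exc, (OperationalError, TimeoutError)):
        return True
    # MySQL "Lock wait timeout exceeded" is error 1205 and can be retried.
    if isinstance(exc, InternalError) and "1205" in str(exc):
        return True
    return False

def safe_execute(engine, statement, *args, **kwds):
    """Execute a statement, retrying exactly once on a transient error."""
    try:
        return engine.execute(statement, *args, **kwds)
    except Exception as exc:
        if not is_retryable_db_error(exc):
            raise
        return engine.execute(statement, *args, **kwds)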
Attachment #642885 - Flags: review?(telliott)
Comment 11•13 years ago
Comment on attachment 642885 [details] [diff] [review]
patch to treat lock wait timeouts as a backend error
Review of attachment 642885 [details] [diff] [review]:
-----------------------------------------------------------------
::: services/util.py
@@ +355,5 @@
> + return True
> + # MySQL Lock Wait Timeoute errors can be safely retried.
> + # Unfortunatly there's no standard API for getting the error number.
> + error_message = str(exc)
> + if "1025" in error_message:
So, lock wait time out is Error 1205, and that's a dedicated error code:
http://dev.mysql.com/doc/refman/5.1/en/error-messages-server.html#error_er_lock_wait_timeout
Error: 1205 SQLSTATE: HY000 (ER_LOCK_WAIT_TIMEOUT)
Message: Lock wait timeout exceeded; try restarting transaction
In practice, this *should* result in the error string either starting with or containing (incl. trailing space):
"ERROR 1205 "
So, in theory (I don't know what Python does to the SQL result), error_message.startswith("ERROR 1205 ") should match here.
@@ +366,5 @@
> + """Check whether the given db error should be reported as BackendError."""
> + if isinstance(exc, (OperationalError, TimeoutError)):
> + return True
> + error_message = str(exc)
> + if "1025" in error_message:
1025 is "Error on rename of '%s' to '%s' (errno: %d)"; probably a typo on my part somewhere when filing the bug.
Attachment #642885 - Flags: review-
Comment 12•13 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #10)
> > However, if that's not feasible, then returning a non-backtrace 503 on
> > "InternalError" would be fine.
>
> Currently there is no such thing as a non-backtrace 503, the app container
> logs a traceback for all 503s immediately before responding to the client.
> It's certainly possible to add a non-backtrace variety, are there any other
> types of error for which the backtrace should be suppressed?
My misstatement, apologies. I was gesturing poorly at the difference between an unintentional backtrace and an intentional one. As long as we're *intentionally* throwing a backtrace, it's fine.
Comment 13•13 years ago
Comment on attachment 642885 [details] [diff] [review]
patch to treat lock wait timeouts as a backend error
atoll correctly pointed out the typo in the error code, but I'd take it a step further - there are other places that the string could match. At the least, it should be ' 1205 ' to ensure that it's just that number, and depending on what actually bubbles up to the client, something as direct as 'Error: 1205 ' would be best.
Attachment #642885 - Flags: review?(telliott) → review-
Comment 14•13 years ago
Please match "ERROR 1205 " with the trailing space and all-caps ERROR.
Assignee
Comment 15•13 years ago
(In reply to Toby Elliott [:telliott] from comment #13)
> atoll correctly pointed out the typo in the error code,
Oops.
> but I'd take it a step further - there are other places that the string could match. At the
> least, it should be ' 1205 ' to ensure that it's just that number, and
> depending on what actually bubbles up to the client, something as direct as
> 'Error: 1205 ' would be best.
The thing from Comment 8 is the string representation of the error I receive in Python code:
"InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction')"
It would probably be safe to match against "(1205,".
This is unfortunately an SQLAlchemy error wrapped around a PyMySQL error that has parsed info out of the MySQL error string. So I'll see if I can poke down into it for a more reliable test. There must be a get_error_code() method around here somewhere...
Assignee
Comment 16•13 years ago
OK, taking another crack at this. I found a private API method that can be used to extract the MySQL error number. This is how SQLAlchemy handles errors internally and it seems more reliable than any string-matching we might come up with.
I have also tweaked some tests to push more statements through the safe_execute function; this was useful in testing the changes against different backends.
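For illustration only (the actual patch relies on a private SQLAlchemy dialect method, which is not reproduced here), one way to reach the MySQL error number is via the wrapped DB-API exception; the helper names below are assumptions:

from sqlalchemy.exc import DBAPIError

MYSQL_LOCK_WAIT_TIMEOUT = 1205

def mysql_error_code(exc):
    """Return the MySQL errno from a wrapped SQLAlchemy error, or None."""
    if isinstance(exc, DBAPIError):
        # exc.orig is the original PyMySQL/MySQLdb error; its args tuple
        # is typically (errno, message).
        args = getattr(exc.orig, "args", ())
        if args and isinstance(args[0], int):
            return args[0]
    return None

def is_lock_wait_timeout(exc):
    return mysql_error_code(exc) == MYSQL_LOCK_WAIT_TIMEOUT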
Attachment #642885 - Attachment is obsolete: true
Attachment #648607 - Flags: review?(telliott)
Comment 17•13 years ago
Comment on attachment 648607 [details] [diff] [review]
updated patch to treat lock wait timeouts as a backend error
Looks good. Thanks for putting those db functions in a core lib.
Attachment #648607 - Flags: review?(telliott) → review+
Assignee
Comment 18•13 years ago
Attaching a second patch to forward-port these changes to sync2.0.
Duplicating this is clearly less than ideal. Once we have a consensus on moving forward with a mozsvc splitup, I will factor this layer out into a utility lib that can be shared by sync1.1 and sync2.0.
But in the meantime, I don't want to forget to do it in either place.
Attachment #651644 - Flags: review?(telliott)
Updated•13 years ago
Attachment #651644 - Flags: review?(telliott) → review+
Assignee
Comment 19•13 years ago
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•2 years ago
Product: Cloud Services → Cloud Services Graveyard