Closed Bug 755589 Opened 13 years ago Closed 13 years ago

database errors while loadtesting stage

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

Hardware: x86_64
OS: Linux
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Unassigned)

Details

(Whiteboard: [qa+])

AITC Stage currently has server-aitc tag rpm-1.1-1 deployed, and I can consistently produce database errors by running the loadtest script.

The most frequently-logged error is:

    TimeoutError: QueuePool limit of size 10 overflow 10 reached, connection timed out, timeout 30

Occasionally I get some of:

    OperationalError: (OperationalError) (2013, 'Lost connection to MySQL server during query')

As far as I can see the code is closing connections promptly after use, and I have been poking around in the internals of sqlalchemy to confirm this. Need some advice on what else I can do to debug this.

What's quite strange is that rpm-1.1-1 is a very small change from rpm-1.0-1 - so small that it uses exactly the same code for the storage layer. Yet rpm-1.0-1 passed through loadtest successfully. It would be interesting to re-deploy rpm-1.0-1 just to double-check that it passes under our current loadtesting setup.
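For reference, the numbers in that TimeoutError map directly onto SQLAlchemy's QueuePool settings. Here is a minimal sketch of the engine setup and the connection-handling pattern I would expect the storage layer to follow; the DSN, table name, and helper are illustrative, not the actual AITC config or code:

    from sqlalchemy import create_engine, text

    # Pool settings matching the error message:
    # "QueuePool limit of size 10 overflow 10 reached, ... timeout 30"
    engine = create_engine(
        "mysql://user:password@db-host/aitc",  # illustrative DSN, not the real one
        pool_size=10,      # base number of pooled connections
        max_overflow=10,   # extra connections allowed beyond pool_size
        pool_timeout=30,   # seconds to wait for a free connection before TimeoutError
        pool_recycle=3600, # recycle connections before MySQL's wait_timeout kills them
    )

    def count_apps(userid):
        # Checking the connection out and releasing it in a finally block
        # is what returns it to the pool promptly; if this step is skipped
        # anywhere, checkouts pile up until the pool timeout fires.
        conn = engine.connect()
        try:
            result = conn.execute(
                text("SELECT COUNT(*) FROM apps WHERE userid = :userid"),
                {"userid": userid},
            )
            return result.scalar()
        finally:
            conn.close()

If every code path through the storage layer follows that pattern, the pool should never be exhausted by 20 concurrent connections unless individual queries are genuinely slow.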
Yes indeed, duped/reviewed by QA
Whiteboard: [qa+]
MySQL shows idle connections, sure looks like a regression in the AITC/SyncStorage code.

----------------------------------------------------------
<petef> if you fire up some load, i'll poke at mysql
<rfkelly> petef: firing up now
<rfkelly> and stopped; I seem to be getting "500 Server Error" instead of a 503
<petef> i don't see any aitc connections
<petef> are you using the ini file in /etc from puppet or your own?
<rfkelly> haven't touched the ini file
<rfkelly> one sec
<petef> there, i see them now
<rfkelly> running a lighter load at the moment
<rfkelly> and no errors so far
<rfkelly> ramp it up?
<petef> yeah
<rfkelly> ok, spawning with dist.sh
<rfkelly> and it's fail city here now
<petef> i have idle connections to mysql
<petef> 15-18 seconds idle!
<rfkelly> ok, meaning some hasn't cleaned up properly?
<petef> i bet it goes up to the sqlalchemy pool timeout
<petef> yeah
<petef> it waits 10 (or whatever) seconds for a free connection in the pool, then gives up
<rfkelly> 30 seconds I think
<rfkelly> killing load
<petef> ok
<rfkelly> so, definitely looks like a resource cleanup problem in new AITC deploy
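One way to watch this from the application side while the load is running: SQLAlchemy's QueuePool can report its own checkout state, so something like the sketch below (illustrative, not code that is in the tree) could be wired into a debug thread on a webhead to confirm whether checkouts climb in step with the idle MySQL connections petef is seeing:

    import logging
    import threading
    import time

    log = logging.getLogger("aitc.pooldebug")

    def watch_pool(engine, interval=5.0):
        """Log QueuePool state every `interval` seconds from a daemon thread.

        If the checked-out count keeps climbing while request volume is flat,
        connections are being held somewhere instead of returned to the pool.
        """
        def _loop():
            while True:
                log.info("pool: %s (checked out: %d)",
                         engine.pool.status(), engine.pool.checkedout())
                time.sleep(interval)

        thread = threading.Thread(target=_loop, name="pool-watcher")
        thread.daemon = True
        thread.start()
        return thread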
We tried rolling back to the 1.0 rpms and the problem was still present. Whatever is going on here, I don't think it's due to the 1.1 code updates.

Two strange things are confounding the loadtests and making it hard to tell what's going on:

1) Something is automatically retrying requests in response to "503 Service Unavailable". This will not work with tokenserver/MAC-based authentication because each request is expected to have a unique nonce. If the request is retried, it will have a stale nonce and be rejected with a very confusing "401 Unauthorized" error.

2) Most loadtest failures are receiving a "499 Client Closed Request" error from nginx, which something else (zeus?) seems to be turning into a "500 Server Error" before it reaches the test client. nginx log lines look like this:

    10.14.214.200 - - [16/May/2012:20:09:37 -0700] "GET /1.0/790548/apps/ HTTP/1.0" 499 0 "-" "-" XFF="63.245.208.248" TIME=29.413
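Regarding point 1, here is a minimal illustration of why transparent retries break MAC-style authentication. The signing helper below is hypothetical (it is not the actual tokenserver client code); the point is just that the timestamp/nonce pair must be freshly generated for every attempt, so a retry layer that replays the original Authorization header on a 503 presents a nonce the server has already seen and gets bounced with a 401 instead of the real 503:

    import base64
    import hashlib
    import hmac
    import os
    import time

    def mac_auth_header(key_id, secret, method, uri, host, port):
        # Illustrative MAC-style signing; not the exact tokenserver/Sync scheme.
        # The crucial property: ts and nonce are freshly generated per attempt.
        ts = str(int(time.time()))
        nonce = base64.b64encode(os.urandom(8)).decode("ascii")
        base_string = "\n".join([ts, nonce, method, uri, host, str(port), "", ""])
        digest = hmac.new(secret.encode("utf-8"),
                          base_string.encode("utf-8"),
                          hashlib.sha1).digest()
        mac = base64.b64encode(digest).decode("ascii")
        return 'MAC id="%s", ts="%s", nonce="%s", mac="%s"' % (key_id, ts, nonce, mac)

Any retry logic therefore has to re-sign the request (i.e. call the builder again) rather than re-send the original bytes.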
One thought is that these 499 responses might be because we've drained things down to a single webhead, and there's so much load that it's failing out before things even get to the database. We should re-enable the whole cluster and try again.
Also, petef, the retry-on-503 thing explains the strange mix of 503s and 401s we saw while loadtesting against prod. They were all 503s in reality, but some were being turned into 401s due to this problem.
Here's another potential problem: the fix for Bug 692789 was not forward-ported to tip and so was not picked up when I split off the sync2.0 codebase. Basically, the bug is to automatically retry the connection if MySQL returns 'Lost connection to MySQL server during query'. Pete, can you confirm whether this logic should be ported to sync2.0/AITC?
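For reference, a rough outline of the kind of retry logic I have in mind; this is a sketch under my assumptions rather than the actual Bug 692789 patch. MySQL error 2013 is "Lost connection to MySQL server during query" and 2006 is "MySQL server has gone away":

    from sqlalchemy import text
    from sqlalchemy.exc import OperationalError

    # MySQL error codes that mean the connection died and a read query can be
    # retried safely on a fresh connection.
    RETRYABLE_MYSQL_ERRORS = (2006, 2013)

    def execute_with_retry(engine, statement, params=None, max_retries=1):
        """Execute a read query, retrying if the MySQL connection was lost.

        Sketch only: assumes SELECT-style statements whose rows can be
        fetched eagerly before the connection is returned to the pool.
        """
        attempt = 0
        while True:
            try:
                with engine.connect() as conn:
                    return conn.execute(statement, params or {}).fetchall()
            except OperationalError as exc:
                orig = getattr(exc, "orig", None)
                code = orig.args[0] if orig is not None and orig.args else None
                if code in RETRYABLE_MYSQL_ERRORS and attempt < max_retries:
                    attempt += 1
                    engine.dispose()  # drop stale pooled connections before retrying
                    continue
                raise

    # e.g. rows = execute_with_retry(engine, text("SELECT 1"))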
(In reply to Ryan Kelly [:rfkelly] from comment #6)
> Here's another potential problem: the fix for Bug 692789 was not
> forward-ported to tip and so was not picked up when I split off the sync2.0
> codebase. Basically, the bug is to automatically retry the connection if
> MySQL returns 'Lost connection to MySQL server during query'.
>
> Pete, can you confirm whether this logic should be ported to sync2.0/AITC?

Yes, I think we should port that over.
OK, it's pretty clear that whatever is going on here, it's present in the initial 1.0 deploy as well. There are many small improvements in trunk, let's roll stage forward to the latest code and continue working from there.

petef: do you want me to make a new tag, or are you happy to build rpms from dev? `make build_rpms CHANNEL=dev RPM_CHANNEL=dev` works cleanly for me.

Of note: server-syncstorage now contains the patch from Bug 747255, meaning quotas are now expressed in integer bytes rather than KB. The quota setting in the config file will need to be updated accordingly.
I agree - can we move back to Sync 1.1-x or whatever the version is now and continue hammering on Stage...
(In reply to Ryan Kelly [:rfkelly] from comment #8)
> OK, it's pretty clear that whatever is going on here, it's present in the
> initial 1.0 deploy as well. There are many small improvements in trunk,
> let's roll stage forward to the latest code and continue working from there.

OK.

> petef: do you want me to make a new tag, or are you happy to build rpms from
> dev? `make build_rpms CHANNEL=dev RPM_CHANNEL=dev` works cleanly for me.

That's fine for now. We can save tagging for when we think we have a ready-to-ship release.

> Of note: server-syncstorage now contains the patch from Bug 747255, meaning
> quotas are now expressed in integer bytes rather than KB. The quota setting
> in the config file will need to be updated accordingly.

Gotcha. Updated:

    -quota_size = 5120
    +quota_size = 5242880

New RPMs built, shipped to staging, and service restarted.
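(For anyone cross-checking the conversion: the old value was in KB, so 5120 KB x 1024 bytes/KB = 5,242,880 bytes, i.e. the quota is unchanged at 5 MB.)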
OK, we have tweaked some timeouts, fixed some database indexes, and tracked down a race condition that only popped up under heavy load. Funkload tests now run all green for me. Marking as fixed; :jbonacci, please try to verify by doing your own loadtest run.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
OK. On my list for today.
petef: please rebuild and deploy rpms from dev so we can verify this with all the latest fixes. I'll hold off on making the next tag until we have a clear path forward for Bug 750566.
(In reply to Ryan Kelly [:rfkelly] from comment #13)
> petef: please rebuild and deploy rpms from dev so we can verify this with
> all the latest fixes. I'll hold off on making the next tag until we have a
> clear path forward for Bug 750566.

done
Load test started around 11:30am PDT on 06/05/2012. The plan is to run this for 24 hours while inspecting the loadtest.log files and zeus logs.
Completed a 24-hour run in Stage: 6/5/2012 11:15am PDT - 6/6/2012 11:15am PDT. Errors/failures were seen in the STDOUT/STDERR of dist.sh that need to be correlated with the ~jbonacci/aitc-loadtest/loadtest.log files on the client4-client9 boxes. :petef to check zeus logs... Not sure if we will find any issues directly related to this bug, so we may be able to verify this ticket but open one or more new bugs depending on what is found in the zeus logs.
Looking at the loadtest.log files on client4 - client7, there were some runs (spikes) of the following type of failure:

    FAIL: Can't load https://stage-aitc*.services.mozilla.com/1.0/888199/apps/BLAH etc

But most of the failures look like this:

    FAIL: /1.0/aitc/1.0

Still need to correlate with zeus logs.
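If it helps with the correlation work, here is a rough sketch of how the FAIL lines could be tallied across the client logs. The glob path and the "FAIL" matching are assumptions based on the excerpts above, not the exact funkload log format:

    import collections
    import glob
    import re

    # Illustrative path; adjust to wherever dist.sh leaves the per-client logs.
    LOG_GLOB = "aitc-loadtest/loadtest.log*"

    # Collapse numeric user ids so identical failure shapes group together.
    USERID_RE = re.compile(r"/1\.0/\d+/")

    def tally_failures(pattern=LOG_GLOB):
        counts = collections.Counter()
        for path in glob.glob(pattern):
            with open(path) as fh:
                for line in fh:
                    if "FAIL" in line:
                        counts[USERID_RE.sub("/1.0/<userid>/", line.strip())] += 1
        return counts

    if __name__ == "__main__":
        for failure, count in tally_failures().most_common(20):
            print("%6d  %s" % (count, failure))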
Dev and Ops have pretty much concluded that the issues here have been resolved. Separate bugs can be opened for zeus-related issues found during this last load test.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard