Closed Bug 755589 Opened 13 years ago Closed 13 years ago

database errors while loadtesting stage

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

Hardware: x86_64
OS: Linux
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Unassigned)

Details

(Whiteboard: [qa+])

AITC Stage currently has server-aitc tag rpm-1.1-1 deployed, and I can consistently produce database errors by running the loadtest script.

The most frequently-logged error is:

    TimeoutError: QueuePool limit of size 10 overflow 10 reached, connection timed out, timeout 30

Occasionally I get some of:

    OperationalError: (OperationalError) (2013, 'Lost connection to MySQL server during query')

As far as I can see the code is closing connections promptly after use, and I have been poking around in the internals of sqlalchemy to confirm this. Need some advice on what else I can do to debug this.

What's quite strange is that rpm-1.1-1 is a very small change from rpm-1.0-1 - so small that it uses exactly the same code for the storage layer. Yet rpm-1.0-1 passed through loadtest successfully. It would be interesting to re-deploy rpm-1.0-1 just to double-check that it passes under our current loadtesting setup.
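For reference, the numbers in that TimeoutError map directly onto SQLAlchemy's QueuePool settings. Here is a minimal sketch of the engine setup and the connection-handling pattern I would expect the storage layer to follow; the DSN, table name, and helper are illustrative, not the actual AITC config or code:

    from sqlalchemy import create_engine, text

    # Pool settings matching the error message:
    # "QueuePool limit of size 10 overflow 10 reached, ... timeout 30"
    engine = create_engine(
        "mysql://user:password@db-host/aitc",  # illustrative DSN, not the real one
        pool_size=10,      # base number of pooled connections
        max_overflow=10,   # extra connections allowed beyond pool_size
        pool_timeout=30,   # seconds to wait for a free connection before TimeoutError
        pool_recycle=3600, # recycle connections before MySQL's wait_timeout kills them
    )

    def count_apps(userid):
        # Checking the connection out and releasing it in a finally block
        # is what returns it to the pool promptly; if this step is skipped
        # anywhere, checkouts pile up until the pool timeout fires.
        conn = engine.connect()
        try:
            result = conn.execute(
                text("SELECT COUNT(*) FROM apps WHERE userid = :userid"),
                {"userid": userid},
            )
            return result.scalar()
        finally:
            conn.close()

If every code path through the storage layer follows that pattern, the pool should never be exhausted by 20 concurrent connections unless individual queries are genuinely slow.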
Yes indeed, duped/reviewed by QA
Whiteboard: [qa+]
MySQL shows idle connections, sure looks like a regression in the AITC/SyncStorage code.

----------------------------------------------------------
<petef> if you fire up some load, i'll poke at mysql
<rfkelly> petef: firing up now
<rfkelly> and stopped; I seem to be getting "500 Server Error" instead of a 503
<petef> i don't see any aitc connections
<petef> are you using the ini file in /etc from puppet or your own?
<rfkelly> haven't touched the ini file
<rfkelly> one sec
<petef> there, i see them now
<rfkelly> running a lighter load at the moment
<rfkelly> and no errors so far
<rfkelly> ramp it up?
<petef> yeah
<rfkelly> ok, spawning with dist.sh
<rfkelly> and it's fail city here now
<petef> i have idle connections to mysql
<petef> 15-18 seconds idle!
<rfkelly> ok, meaning some hasn't cleaned up properly?
<petef> i bet it goes up to the sqlalchemy pool timeout
<petef> yeah
<petef> it waits 10 (or whatever) seconds for a free connection in the pool, then gives up
<rfkelly> 30 seconds I think
<rfkelly> killing load
<petef> ok
<rfkelly> so, definitely looks like a resource cleanup problem in new AITC deploy
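One way to watch this from the application side while the load is running: SQLAlchemy's QueuePool can report its own checkout state, so something like the sketch below (illustrative, not code that is in the tree) could be wired into a debug thread on a webhead to confirm whether checkouts climb in step with the idle MySQL connections petef is seeing:

    import logging
    import threading
    import time

    log = logging.getLogger("aitc.pooldebug")

    def watch_pool(engine, interval=5.0):
        """Log QueuePool state every `interval` seconds from a daemon thread.

        If the checked-out count keeps climbing while request volume is flat,
        connections are being held somewhere instead of returned to the pool.
        """
        def _loop():
            while True:
                log.info("pool: %s (checked out: %d)",
                         engine.pool.status(), engine.pool.checkedout())
                time.sleep(interval)

        thread = threading.Thread(target=_loop, name="pool-watcher")
        thread.daemon = True
        thread.start()
        return thread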
We tried rolling back to the 1.0 rpms and the problem was still present. Whatever is going on here, I don't think it's due to the 1.1 code updates.

Two strange things are confounding the loadtests and making it hard to tell what's going on:

1) Something is automatically retrying requests in response to "503 Service Unavailable". This will not work with tokenserver/MAC-based authentication because each request is expected to have a unique nonce. If the request is retried, it will have a stale nonce and be rejected with a very confusing "401 Unauthorized" error.

2) Most loadtest failures are receiving a "499 Client Closed Request" error from nginx, which something else (zeus?) seems to be turning into a "500 Server Error" before it reaches the test client. nginx log lines look like this:

    10.14.214.200 - - [16/May/2012:20:09:37 -0700] "GET /1.0/790548/apps/ HTTP/1.0" 499 0 "-" "-" XFF="63.245.208.248" TIME=29.413
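Regarding point 1, here is a minimal illustration of why transparent retries break MAC-style authentication. The signing helper below is hypothetical (it is not the actual tokenserver client code); the point is just that the timestamp/nonce pair must be freshly generated for every attempt, so a retry layer that replays the original Authorization header on a 503 presents a nonce the server has already seen and gets bounced with a 401 instead of the real 503:

    import base64
    import hashlib
    import hmac
    import os
    import time

    def mac_auth_header(key_id, secret, method, uri, host, port):
        # Illustrative MAC-style signing; not the exact tokenserver/Sync scheme.
        # The crucial property: ts and nonce are freshly generated per attempt.
        ts = str(int(time.time()))
        nonce = base64.b64encode(os.urandom(8)).decode("ascii")
        base_string = "\n".join([ts, nonce, method, uri, host, str(port), "", ""])
        digest = hmac.new(secret.encode("utf-8"),
                          base_string.encode("utf-8"),
                          hashlib.sha1).digest()
        mac = base64.b64encode(digest).decode("ascii")
        return 'MAC id="%s", ts="%s", nonce="%s", mac="%s"' % (key_id, ts, nonce, mac)

Any retry logic therefore has to re-sign the request (i.e. call the builder again) rather than re-send the original bytes.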
One thought is that these 499 responses might be because we've drained things down to a single webhead, and there's so much load that it's failing out before things even get to the database. We should re-enable the whole cluster and try again.
Also, petef, the retry-on-503 thing explains the strange mix of 503s and 401s we saw while loadtesting against prod. They were all 503s in reality, but some were being turned into 401s due to this problem.
Here's another potential problem: the fix for Bug 692789 was not forward-ported to tip and so was not picked up when I split off the sync2.0 codebase. Basically, the bug is to automatically retry the connection if MySQL returns 'Lost connection to MySQL server during query'. Pete, can you confirm whether this logic should be ported to sync2.0/AITC?
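For reference, a rough outline of the kind of retry logic I have in mind; this is a sketch under my assumptions rather than the actual Bug 692789 patch. MySQL error 2013 is "Lost connection to MySQL server during query" and 2006 is "MySQL server has gone away":

    from sqlalchemy import text
    from sqlalchemy.exc import OperationalError

    # MySQL error codes that mean the connection died and a read query can be
    # retried safely on a fresh connection.
    RETRYABLE_MYSQL_ERRORS = (2006, 2013)

    def execute_with_retry(engine, statement, params=None, max_retries=1):
        """Execute a read query, retrying if the MySQL connection was lost.

        Sketch only: assumes SELECT-style statements whose rows can be
        fetched eagerly before the connection is returned to the pool.
        """
        attempt = 0
        while True:
            try:
                with engine.connect() as conn:
                    return conn.execute(statement, params or {}).fetchall()
            except OperationalError as exc:
                orig = getattr(exc, "orig", None)
                code = orig.args[0] if orig is not None and orig.args else None
                if code in RETRYABLE_MYSQL_ERRORS and attempt < max_retries:
                    attempt += 1
                    engine.dispose()  # drop stale pooled connections before retrying
                    continue
                raise

    # e.g. rows = execute_with_retry(engine, text("SELECT 1"))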
(In reply to Ryan Kelly [:rfkelly] from comment #6)
> Here's another potential problem: the fix for Bug 692789 was not
> forward-ported to tip and so was not picked up when I split off the sync2.0
> codebase. Basically, the bug is to automatically retry the connection if
> MySQL returns 'Lost connection to MySQL server during query'.
>
> Pete, can you confirm whether this logic should be ported to sync2.0/AITC?

Yes, I think we should port that over.
OK, it's pretty clear that whatever is going on here, it's present in the initial 1.0 deploy as well. There are many small improvements in trunk, let's roll stage forward to the latest code and continue working from there.

petef: do you want me to make a new tag, or are you happy to build rpms from dev? `make build_rpms CHANNEL=dev RPM_CHANNEL=dev` works cleanly for me.

Of note: server-syncstorage now contains the patch from Bug 747255, meaning quotas are now expressed in integer bytes rather than KB. The quota setting in the config file will need to be updated accordingly.
I agree - can we move back to Sync 1.1-x or whatever the version is now and continue hammering on Stage...
(In reply to Ryan Kelly [:rfkelly] from comment #8)
> OK, it's pretty clear that whatever is going on here, it's present in the
> initial 1.0 deploy as well. There are many small improvements in trunk,
> let's roll stage forward to the latest code and continue working from there.

OK.

> petef: do you want me to make a new tag, or are you happy to build rpms from
> dev? `make build_rpms CHANNEL=dev RPM_CHANNEL=dev` works cleanly for me.

That's fine for now. We can save tagging for when we think we have a ready-to-ship release.

> Of note: server-syncstorage now contains the patch from Bug 747255, meaning
> quotas are now expressed in integer bytes rather than KB. The quota setting
> in the config file will need to be updated accordingly.

Gotcha. Updated:

    -quota_size = 5120
    +quota_size = 5242880

New RPMs built, shipped to staging, and service restarted.
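(For anyone cross-checking the conversion: the old value was in KB, so 5120 KB x 1024 bytes/KB = 5,242,880 bytes, i.e. the quota is unchanged at 5 MB.)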
OK, we have tweaked some timeouts, fixed some database indexes, and tracked down a race condition that only popped up under heavy load. Funkload tests now run all green for me. Marking as fixed; :jbonacci, please try to verify by doing your own loadtest run.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
OK. On my list for today.
petef: please rebuild and deploy rpms from dev so we can verify this with all the latest fixes. I'll hold off on making the next tag until we have a clear path forward for Bug 750566.
(In reply to Ryan Kelly [:rfkelly] from comment #13)
> petef: please rebuild and deploy rpms from dev so we can verify this with
> all the latest fixes. I'll hold off on making the next tag until we have a
> clear path forward for Bug 750566.

done
Load test started around 11:30am PDT on 06/05/2012. The plan is to run this for 24 hours while inspecting the loadtest.log files and zeus logs.
Completed a 24-hour run in Stage: 6/5/2012 11:15am PDT - 6/6/2012 11:15am PDT. Errors/failures were seen in the STDOUT/STDERR of dist.sh that need to be correlated with the ~jbonacci/aitc-loadtest/loadtest.log files on the client4-client9 boxes. :petef to check zeus logs... Not sure if we will find any issues directly related to this bug, so we may be able to verify this ticket but open one or more new bugs depending on what is found in the zeus logs.
Looking at the loadtest.log files on client4 - client7, there were some runs (spikes) of the following type of failure:

    FAIL: Can't load https://stage-aitc*.services.mozilla.com/1.0/888199/apps/BLAH etc

But most of the failures look like this:

    FAIL: /1.0/aitc/1.0

Still need to correlate with zeus logs.
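If it helps with the correlation work, here is a rough sketch of how the FAIL lines could be tallied across the client logs. The glob path and the "FAIL" matching are assumptions based on the excerpts above, not the exact funkload log format:

    import collections
    import glob
    import re

    # Illustrative path; adjust to wherever dist.sh leaves the per-client logs.
    LOG_GLOB = "aitc-loadtest/loadtest.log*"

    # Collapse numeric user ids so identical failure shapes group together.
    USERID_RE = re.compile(r"/1\.0/\d+/")

    def tally_failures(pattern=LOG_GLOB):
        counts = collections.Counter()
        for path in glob.glob(pattern):
            with open(path) as fh:
                for line in fh:
                    if "FAIL" in line:
                        counts[USERID_RE.sub("/1.0/<userid>/", line.strip())] += 1
        return counts

    if __name__ == "__main__":
        for failure, count in tally_failures().most_common(20):
            print("%6d  %s" % (count, failure))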
Dev and Ops have pretty much concluded that the issues here have been resolved. Separate bugs can be opened for zeus-related issues found during this last load test.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard