:rfkelly and I found this while running some new funkload tests in Sync Stage. After a fresh hg clone of server-storage, I started the dist.sh script with just one "worker" (one client in use). That very short load test seemed to work well, with no apparent errors or other issues. The second attempt of dist.sh, which used all available workers (client4 - client9), was not at all successful. We started to see 5xx spikes in the Pencil graphs for syncstorage on Stage and a significant number of "Failures" in the loadtest.xml files on the clientX boxes. Example:

...etc...
<response cycle="001" cvus="050" thread="049" suite="StressTest" name="test_storage_session"
          step="001" number="001" type="get" result="Failure"
          url="/1.1/cuser568797/info/collections" code="503" description=""
          time="1342049540.41" duration="0.0150330066681">
  <headers>
    <header name="retry-after" value="600" />
    <header name="date" value="Wed, 11 Jul 2012 23:32:20 GMT" />
    <header name="connection" value="close" />
    <header name="content-type" value="application/json; charset=UTF-8" />
    <header name="content-length" value="39" />
  </headers>
  <body><![CDATA[
"server issue: database is not healthy"
]]>
  </body>
</response>
<testResult cycle="001" cvus="050" thread="049" suite="StressTest" name="test_storage_session"
            time="1342049540.41" result="Failure" steps="1" duration="0.0156259536743"
            connection_duration="0.0150330066681" requests="1" pages="0" xmlrpc="0"
            redirects="0" images="0" links="0"
            traceback='Traceback (most recent call last):
  File "/home/jbonacci/syncstorage-loadtest/lib/python2.6/site-packages/funkload/FunkLoadTestCase.py", line 946, in __call__
    testMethod()
  File "/home/jbonacci/syncstorage-loadtest/stress.py", line 69, in test_storage_session
    response = self.get(url)
  File "/home/jbonacci/syncstorage-loadtest/lib/python2.6/site-packages/funkload/FunkLoadTestCase.py", line 391, in get
    method="get", load_auto_links=load_auto_links)
  File "/home/jbonacci/syncstorage-loadtest/lib/python2.6/site-packages/funkload/FunkLoadTestCase.py", line 299, in _browse
    response = self._connect(url, params, ok_codes, method, description)
  File "/home/jbonacci/syncstorage-loadtest/lib/python2.6/site-packages/funkload/FunkLoadTestCase.py", line 216, in _connect
    raise self.failureException, str(value.response)
AssertionError: /1.1/cuser568797/info/collections HTTP Response 503: Database unavailable
' />
...etc...
Leaving the worker/clientX snapshot as-is for now. The server-storage install is in /home/jbonacci on client4; it has the nohup.out log of the STDOUT/STDERR from the load test. There are also syncstorage-loadtest directories in /home/jbonacci on all clients. Each of these directories has a loadtest.log (not very useful) and at least one loadtest.xml (full of errors).
These failures all appear to be 503 responses saying:

HTTP Response 503: Database unavailable
server issue: database is not healthy

These error messages are not generated within the Python app, so I guess they come directly from Zeus. Digging down into the server-side log files to see if I can find out more.
I found 26 instances of a 503 error within syncstorage itself; the additional failures are presumably Zeus backing off clients after receiving a 503 from syncstorage. Interestingly, all failing requests are pretty much identical, differing only in the randomly-generated userid and the 'newer' parameter:

10.14.214.201 - cuser521521 [11/Jul/2012:16:29:57 -0700] "GET /1.1/cuser521521/storage/bookmarks?full=1&newer=1341780988 HTTP/1.0" 503 66 "https://stage-sync17.services.mozilla.com/1.1/cuser521521/info/collections" "FunkLoad/1.16.1" XFF="22.214.171.124" TIME=9.042

They are all to the "bookmarks" collection, and they all took just over 9 seconds to come back with an error. In the application log I have instances of this traceback:

  File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 615, in get_items
    return [WBO(line, converters) for line in res]
  File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 287, in _do_query_fetchall
    res = safe_execute(self._engine, *args, **kwds)
  File "/usr/lib/python2.6/site-packages/services/util.py", line 339, in safe_execute
    return engine.execute(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1788, in execute
    return connection.execute(statement, *multiparams, **params)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1191, in execute
    params)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement
    return self.__execute_context(context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1302, in __execute_context
    context.parameters, context=context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1401, in _cursor_execute
    context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1350, in _handle_dbapi_exception
    self._autorollback()
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1144, in _autorollback
    self._rollback_impl()
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1088, in _rollback_impl
    self._handle_dbapi_exception(e, None, None, None, None)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1085, in _rollback_impl
    self.engine.dialect.do_rollback(self.connection)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/dialects/mysql/base.py", line 1657, in do_rollback
    connection.rollback()
  File "/usr/lib/python2.6/site-packages/pymysql/connections.py", line 575, in rollback
    self.errorhandler(None, exc, value)
  File "/usr/lib/python2.6/site-packages/pymysql/connections.py", line 187, in defaulterrorhandler
    raise Error(errorclass, errorvalue)
DBAPIError: (Error) (<class 'socket.error'>, error(32, 'Broken pipe')) None None

This is the app trying to execute a SELECT query, failing, trying to execute a rollback, and then getting a socket.error on its connection to MySQL. Unfortunately, the rollback failure seems to be masking the original error. Even more unfortunately, this auto-rollback logic is down inside sqlalchemy rather than our own code.

:atoll, does the ~9 second timeout give you any clues as to what the underlying error might be?
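The "Broken pipe" during auto-rollback points at the pool handing out a connection whose underlying TCP socket had already died. For comparison, here is a minimal sketch of the standard SQLAlchemy workaround for stale pooled connections: a checkout-time ping that discards dead connections before the application touches them. The in-memory sqlite URL is a placeholder for the real MySQL one, and this is not syncstorage's actual code.

```python
from sqlalchemy import create_engine, event, exc, text

engine = create_engine("sqlite://")  # placeholder for the real MySQL URL

@event.listens_for(engine, "checkout")
def ping_connection(dbapi_conn, conn_record, conn_proxy):
    """Validate the connection on checkout; discard it if it is dead."""
    cursor = dbapi_conn.cursor()
    try:
        cursor.execute("SELECT 1")
    except Exception:
        # DisconnectionError makes the pool retry with a fresh connection
        # instead of handing the dead one to the application.
        raise exc.DisconnectionError()
    finally:
        cursor.close()

with engine.connect() as conn:
    print(conn.scalar(text("SELECT 1")))  # 1
```

With something like this in place, a dead socket would be caught (and replaced) at checkout time, rather than surfacing as a masked error mid-rollback.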
(In reply to Ryan Kelly [:rfkelly] from comment #3)
> They are all to the "bookmarks" collection

This turns out to be a red herring, caused by a bug in the loadtest script. It shuffles the list of collections in order to select a random one, but then makes its selection from the original unshuffled list. "bookmarks" is simply the collection at position zero in the list. Patch forthcoming.

> and they all took just over 9 seconds to come back with an error.

This article suggests that errors with a 9-second delay can be caused by underlying problems in the TCP connection:

http://www.mysqlperformanceblog.com/2011/04/19/mysql-connection-timeouts/

A possibility? I can't find anything we have explicitly configured with this strange timeout value.
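For reference, the arithmetic behind the linked article's TCP theory: the classic retransmission timer starts at roughly 3 seconds and doubles on each retry, so a connection whose packets get dropped twice stalls for about 3 + 6 = 9 seconds before succeeding or failing. A trivial illustration (the 3s initial RTO is the traditional value, assumed rather than measured on these hosts):

```python
def retransmit_delays(initial_rto=3.0, retries=2):
    """Successive TCP retransmission waits (doubling RTO) and their total."""
    delays, rto = [], initial_rto
    for _ in range(retries):
        delays.append(rto)
        rto *= 2  # exponential backoff on each retransmission
    return delays, sum(delays)

print(retransmit_delays())  # ([3.0, 6.0], 9.0)
```

That the failures cluster so tightly around 9 seconds is at least consistent with a two-retransmit stall rather than any application-level timeout.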
Do we have metrics about how often an existing pool connection is used vs. a new one created?
(In reply to Richard Soderberg [:atoll] from comment #5)
> Do we have metrics about how often an existing pool connection is used vs. a
> new one created?

We do not. This sounds useful, I will file a bug to add them. Hopefully SQLAlchemy provides the necessary hooks to collect this information.
I tried another funkload run this evening, and got about half an hour of success before it started giving me 503 errors. These ones are different, though; I'm seeing lines like this in the nginx access log:

10.14.214.201 - cuser920348 [13/Jul/2012:00:12:56 -0700] "POST /1.1/cuser920348/storage/bookmarks HTTP/1.0" 499 0 "https://stage-sync20.services.mozilla.com/1.1/cuser920348/info/collections" "FunkLoad/1.16.1" XFF="126.96.36.199" TIME=32.673

IIRC, the 499 response code is logged when zeus times out the connection, disconnecting from nginx and returning an HTTP 503 response to the client. The funkload log files show the following response being returned:

HTTP 503 Service Unavailable
server issue: pool exhausted

But strangely, the syncstorage app is logging this request as a success, and doing so about 30s *before* the 499-status log line shows up in nginx:

@400000004fffcac31917d20c 188.8.131.52, 10.14.214.201 - cuser920348 [13/Jul/2012:00:12:23 -0700] "POST /1.1/cuser920348/storage/bookmarks HTTP/1.0" 200 3342 "https://stage-sync20.services.mozilla.com/1.1/cuser920348/info/collections" "FunkLoad/1.16.1"

So I'm really not sure what to make of that. Will keep digging...
It may not be possible with :petef's imminent departure, but running a half-hour to an hour of grinder against current stage would be very useful here. If the grinder tests don't produce these failures, that might give us a clue about where to look next.
:petef is running funkload today 7/13/2012.
Thanks :jbonacci and :petef. Not sure if posting pencil links in bugzilla is appropriate, but these are the parameters I used to capture graphs from both tests side-by-side:

/dash/stage/syncstorage?start=2012-07-12+23%3A00%3A00&duration=16+hours&width=1000&height=400

There were no errors produced by the grinder test run. Some other points of difference were:

* the grinder test ran at ~50qps lower than the funkload tests, although it looks like they may have been asymptoting to the same value had they been left to run longer.
* the grinder tests caused much lower CPU usage than the funkload tests: 50% vs 150%.

I will continue investigating this today.
I am now getting errors like this in the application error log, from a failing INSERT statement:

(1114, u"The table 'wbo80' is full")

It's not tied to a specific 'wboX' table; inserting into any table seems to produce the same error. I don't have access to the database machines to investigate the underlying cause, but it sounds like maybe a disk space issue?
With "Table Full" issues fixed, I am able to proceed with some light load on stage. I can reproduce what appears to be the same error as seen in Comment 7, and will continue to dig into this today.
Turns out the failures are not as mysterious as Comment 7 suggests, and results are not getting lost between gunicorn and nginx - the gunicorn logs are simply recording the start time of the request rather than the end time. The failing requests are completing inside syncstorage, but *after* zeus has given up and returned a failure to the client.

So the immediate problem appears to be: some requests are taking longer than 30s to complete, and zeus is timing them out at 30s. IIRC, the prod zeus timeout is 600s; :atoll, can you please check/adjust the stage timeout to match?

Still, since I am only running the bench from a single client machine, it doesn't seem like there should be enough load for requests to slow down that badly. It's only managing ~20qps. Running the test across all 6 client machines, I am only able to push it up to ~30qps, with many 30-second-timeout failures. We used to be able to push it up to a few hundred qps without breaking a sweat.
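Since gunicorn's access log timestamps the request start, one way to make these stragglers visible at the app tier would be to log the request duration as well. A hedged sketch of a gunicorn config using its documented access-log placeholders (%(T)s is duration in seconds, %(D)s in microseconds); the exact format string here is illustrative, not our deployed config:

```python
# gunicorn.conf.py (sketch) -- log request duration so requests that finish
# after the load balancer's 30s timeout stand out in the app-tier logs.
accesslog = "-"  # write the access log to stdout
# %(t)s = start timestamp, %(T)s = duration (s), %(D)s = duration (us)
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s T=%(T)ss D=%(D)sus'
```

With durations in the log, a request logged "200" but with T=32s would immediately explain the matching 499 in nginx.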
(In reply to Ryan Kelly [:rfkelly] from comment #14)
> Running the test across all 6 client machines, I am only able to push it up
> to ~30qps with many 30-second-timeout failures. We used to be able to push
> it up to a few hundred qps without breaking a sweat.

Oops, I had the write/read ratio turned way up, which was probably contributing to this problem. QPS is a bit better now at ~150, although still not as good as it used to be, and still with plenty of timeout errors.
Per Bug 776777, we have found a difference between grinder and funkload scripts that most likely accounts for the extra load/failures here. Funkload is sending significantly more items per POST batch, leading to much higher write workload in the database. Once zeus timeouts are confirmed to be matching production, we can try another run with a reduced batch size to confirm whether this is the root of the problem.
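Assuming the fix lands in the loadtest script, the batching change amounts to capping the number of items per POST. A hypothetical sketch (the helper name and batch size are illustrative, not the actual script's API):

```python
def chunked(items, batch_size=10):
    """Yield successive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# e.g. 25 generated WBOs become three POST bodies instead of one huge batch
sizes = [len(batch) for batch in chunked(list(range(25)), batch_size=10)]
print(sizes)  # [10, 10, 5]
```

Smaller batches mean each POST holds a database write transaction for less time, which should reduce the long-tail request durations behind the timeouts.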
Adding :ckolos per his offer to help out where possible. The next step for this bug is to check whether zeus request timeouts in stage match those set in production (see Comment 14).
The timeouts in stage were set to 30 seconds. This has been modified to match the setting in production of 600 seconds.
Richard Soderberg has advised that the nginx and gunicorn timeouts be checked versus production as well.
Our various adjustments to make the funkload test script more closely match grinder seem to have paid off; we've been running load for around 20 hours now and these errors have not returned. Going to mark this fixed.
Yep. Thoroughly tested in Sync Stage.