Closed Bug 773093 Opened 12 years ago Closed 12 years ago

New Sync 1.1 Funkload test causing 503s/Assertion errors in Stage environment

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jbonacci, Assigned: rfkelly)

References

Details

(Whiteboard: [qa+])

:rfkelly and I found this while running some new funkload tests in Sync Stage.
After a fresh hg clone of server-storage, I started the dist.sh script with just one "worker" - one client in use.
The very short load test seemed to work well with no apparent errors or other issues.

The second attempt with dist.sh, using all available workers (client4 - client9), was not at all successful.

We started to see 5xx spikes in the Pencil graphs for syncstorage on Stage and a significant number of "Failures" in the loadtest.xml files on the clientX boxes. Example:

...etc...
<response cycle="001" cvus="050" thread="049" suite="StressTest" name="test_storage_session" step="001" number="001" type="get" result="Failure" url="/1.1/cuser568797/info/collections" code="503" description="" time="1342049540.41" duration="0.0150330066681">
  <headers>
    <header name="retry-after" value="600" />
    <header name="date" value="Wed, 11 Jul 2012 23:32:20 GMT" />
    <header name="connection" value="close" />
    <header name="content-type" value="application/json; charset=UTF-8" />
    <header name="content-length" value="39" />
  </headers>
  <body><![CDATA[
"server issue: database is not healthy"
]]>
  </body>
</response>
<testResult cycle="001" cvus="050" thread="049" suite="StressTest" name="test_storage_session"  time="1342049540.41" result="Failure" steps="1" duration="0.0156259536743" connection_duration="0.0150330066681" requests="1" pages="0" xmlrpc="0" redirects="0" images="0" links="0" traceback='Traceback (most recent call last):&#10;   File "/home/jbonacci/syncstorage-loadtest/lib/python2.6/site-packages/funkload/FunkLoadTestCase.py", line 946, in __call__&#10;    testMethod()&#10;   File "/home/jbonacci/syncstorage-loadtest/stress.py", line 69, in test_storage_session&#10;    response = self.get(url)&#10;   File "/home/jbonacci/syncstorage-loadtest/lib/python2.6/site-packages/funkload/FunkLoadTestCase.py", line 391, in get&#10;    method="get", load_auto_links=load_auto_links)&#10;   File "/home/jbonacci/syncstorage-loadtest/lib/python2.6/site-packages/funkload/FunkLoadTestCase.py", line 299, in _browse&#10;    response = self._connect(url, params, ok_codes, method, description)&#10;   File "/home/jbonacci/syncstorage-loadtest/lib/python2.6/site-packages/funkload/FunkLoadTestCase.py", line 216, in _connect&#10;    raise self.failureException, str(value.response)&#10; AssertionError: /1.1/cuser568797/info/collections&#10;HTTP Response 503: Database unavailable&#10;' />

...etc...
Leaving the worker/clientX snapshot as is, so:

The server-storage install is in /home/jbonacci on client4.
It has the nohup.out log of the STDOUT/STDERR from the load test.
There are also syncstorage-loadtest directories on all clients in /home/jbonacci.
Each of these directories has a loadtest.log (not very useful) and at least one loadtest.xml (full of errors).
OS: Mac OS X → All
Hardware: x86 → All
Whiteboard: [qa+]
These failures all appear to be 503 responses saying:

    HTTP Response 503: Database unavailable
    server issue: database is not healthy

These error messages are not generated within the Python app, so I guess they are coming directly from Zeus.  Digging into the server-side log files to see if I can find out more.
I found 26 instances of a 503 error within syncstorage itself.  The additional failures will be due to zeus backing off clients after receiving a 503 from syncstorage.

Interestingly, all failing requests are pretty much identical, differing only in the randomly-generated userid and 'newer' parameter:

    10.14.214.201 - cuser521521 [11/Jul/2012:16:29:57 -0700] "GET /1.1/cuser521521/storage/bookmarks?full=1&newer=1341780988 HTTP/1.0" 503 66 "https://stage-sync17.services.mozilla.com/1.1/cuser521521/info/collections" "FunkLoad/1.16.1" XFF="63.245.208.248" TIME=9.042

They are all to the "bookmarks" collection and they all took just over 9 seconds to come back with an error.

In the application log I have instances of this traceback:

  File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 615, in get_items
    return [WBO(line, converters) for line in res]
  File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 287, in _do_query_fetchall
    res = safe_execute(self._engine, *args, **kwds)
  File "/usr/lib/python2.6/site-packages/services/util.py", line 339, in safe_execute
    return engine.execute(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1788, in execute
    return connection.execute(statement, *multiparams, **params)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1191, in execute
    params)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement
    return self.__execute_context(context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1302, in __execute_context
    context.parameters[0], context=context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1401, in _cursor_execute
    context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1350, in _handle_dbapi_exception
    self._autorollback()
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1144, in _autorollback
    self._rollback_impl()
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1088, in _rollback_impl
    self._handle_dbapi_exception(e, None, None, None, None)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1085, in _rollback_impl
    self.engine.dialect.do_rollback(self.connection)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/dialects/mysql/base.py", line 1657, in do_rollback
    connection.rollback()
  File "/usr/lib/python2.6/site-packages/pymysql/connections.py", line 575, in rollback
    self.errorhandler(None, exc, value)
  File "/usr/lib/python2.6/site-packages/pymysql/connections.py", line 187, in defaulterrorhandler
    raise Error(errorclass, errorvalue)
DBAPIError: (Error) (<class 'socket.error'>, error(32, 'Broken pipe')) None None


This is the app trying to execute a SELECT query, failing, trying to execute a rollback, and then getting a socket.error on its connection to MySQL.  Unfortunately, the rollback failure seems to be masking the original error.  Even more unfortunately, this auto-rollback logic lives inside SQLAlchemy rather than in our own code.
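
For reference, the masking comes from the usual rollback-inside-except pattern; a minimal sketch of the behaviour (not the actual SQLAlchemy code) is:

    # Minimal sketch of the masking behaviour, not the actual SQLAlchemy code.
    def execute_with_autorollback(connection, statement):
        try:
            return connection.execute(statement)
        except Exception:
            # If the connection to MySQL is already dead (e.g. broken pipe),
            # this rollback raises its own error.  Python 2 has no exception
            # chaining, so the rollback error propagates and the original
            # SELECT failure is lost.
            connection.rollback()
            raise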

:atoll, does the ~9 second timeout give you any clues as to what the underlying error might be?
(In reply to Ryan Kelly [:rfkelly] from comment #3)
> 
> They are all to the "bookmarks" collection 

This turns out to be a red herring, caused by a bug in the loadtest script.  It shuffles the list of collections in order to select a random one, but then makes its selection from the original unshuffled list.  "bookmarks" is simply the collection at position zero in the list.

Patch forthcoming.
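
For illustration, the pattern described above looks something like this (hypothetical names, not the actual stress.py code):

    import random

    COLLECTIONS = ["bookmarks", "forms", "history", "passwords", "prefs", "tabs"]

    # Buggy: a copy gets shuffled, but the selection indexes the original
    # list, so the "random" pick is always position zero -- "bookmarks".
    shuffled = list(COLLECTIONS)
    random.shuffle(shuffled)
    collection = COLLECTIONS[0]

    # Fixed: select from the shuffled copy (or simply use random.choice).
    collection = shuffled[0]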

> and they all took just over 9 seconds to come back with an error.

This article suggests that errors with a 9-second delay can be caused by underlying problems in the TCP connection:

    http://www.mysqlperformanceblog.com/2011/04/19/mysql-connection-timeouts/

A possibility?  I can't find anything we have explicitly configured with this strange timeout value.
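
For what it's worth, if this does turn out to be a driver-level connection timeout, it could at least be made explicit (and tunable) where the engine is created.  A sketch only, with a placeholder URL and placeholder values rather than our actual stage configuration:

    from sqlalchemy import create_engine

    # Placeholder URL/credentials and values; not the stage config.
    engine = create_engine(
        "mysql+pymysql://sync:secret@db-host/sync",
        connect_args={"connect_timeout": 10},  # pymysql connect timeout, seconds
        pool_recycle=3600,  # recycle idle pooled connections before MySQL drops them
    )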
Assignee: nobody → rfkelly
Do we have metrics about how often an existing pool connection is used vs. a new one created?
Depends on: 773539
(In reply to Richard Soderberg [:atoll] from comment #5)
> Do we have metrics about how often an existing pool connection is used vs. a
> new one created?

We do not.  This sounds useful; I will file a bug to add them.  Hopefully SQLAlchemy provides the necessary hooks to collect this information.
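
Something along these lines should work, assuming a SQLAlchemy version with the event API (0.7+); older releases expose the same hook points via sqlalchemy.interfaces.PoolListener.  The URL below is a placeholder standing in for the app's existing engine:

    from sqlalchemy import create_engine, event

    engine = create_engine("mysql+pymysql://sync:secret@db-host/sync")
    pool_stats = {"connects": 0, "checkouts": 0}

    def on_connect(dbapi_conn, conn_record):
        # Fired when the pool opens a brand-new connection to MySQL.
        pool_stats["connects"] += 1

    def on_checkout(dbapi_conn, conn_record, conn_proxy):
        # Fired every time a connection (new or reused) is handed to the app.
        pool_stats["checkouts"] += 1

    event.listen(engine.pool, "connect", on_connect)
    event.listen(engine.pool, "checkout", on_checkout)

    # connects / checkouts gives the fraction of checkouts that paid the cost
    # of opening a fresh MySQL connection.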
I tried another funkload run this evening and got about half an hour of success before it started giving me 503 errors.  These ones are different though; I'm seeing lines like this in the nginx access log:

    10.14.214.201 - cuser920348 [13/Jul/2012:00:12:56 -0700] "POST /1.1/cuser920348/storage/bookmarks HTTP/1.0" 499 0 "https://stage-sync20.services.mozilla.com/1.1/cuser920348/info/collections" "FunkLoad/1.16.1" XFF="63.245.208.248" TIME=32.673 

IIRC, the 499 response code is logged when zeus times out the connection, disconnecting from nginx and returning an HTTP 503 response to the client.  The funkload log files show the following response being returned:

    HTTP 503 Service Unavailable
    server issue: pool exhausted

But strangely, the syncstorage app is logging this request as a success, and doing so about 30s *before* the 499-status log line shows up in nginx:

    @400000004fffcac31917d20c 63.245.208.248, 10.14.214.201 - cuser920348 [13/Jul/2012:00:12:23 -0700] "POST /1.1/cuser920348/storage/bookmarks HTTP/1.0" 200 3342 "https://stage-sync20.services.mozilla.com/1.1/cuser920348/info/collections" "FunkLoad/1.16.1"

So I'm really not sure what to make of that.  Will keep digging...
It may not be possible with :petef's imminent departure, but running a half-hour to an hour of grinder against current stage would be very useful here.  If the grinder tests don't produce these failures, that might give us a clue about where to look next.
:petef is running funkload today 7/13/2012.
Thanks :jbonacci and :petef.  Not sure if posting pencil links in bugzilla is appropriate, but these are the parameters I used to capture graphs from both tests side-by-side:

    /dash/stage/syncstorage?start=2012-07-12+23%3A00%3A00&duration=16+hours&width=1000&height=400

There were no errors produced by the grinder test run.  Some other points of difference were:

  * The grinder test ran ~50 qps lower than the funkload tests, although it looks like they might asymptote to the same value if left to run for longer.
  * The grinder tests caused much lower CPU usage than the funkload tests: 50% vs 150%.

I will continue investigating this today.
I am now getting errors like this in the application error log, from a failing INSERT statement:

  (1114, u"The table 'wbo80' is full")

It's not tied to a specific 'wboX' table; inserting into any table seems to produce the same error.  I don't have access to the database machines to investigate the underlying cause, but it sounds like it might be a disk space issue.
Filed Bug 774155 to track the above database issue.
Blocks: 770406
With "Table Full" issues fixed, I am able to proceed with some light load on stage.  

I can reproduce what appears to be the same error as seen in Comment 7, and will continue to dig into this today.
It turns out the failures are not as mysterious as Comment 7 suggests, and results are not getting lost between gunicorn and nginx - the gunicorn logs are simply recording the start time of the request rather than the end time.  The failing requests are completing inside syncstorage, but *after* zeus has given up and returned a failure to the client.

So the immediate problem appears to be:  some requests are taking longer than 30s to complete, and zeus is timing them out after 30s.

IIRC the prod zeus timeout is 600s; :atoll, can you please check/adjust the stage timeout to match?

Still, I am only running the bench from a single client machine; it doesn't seem like there should be enough load for requests to slow down that badly.  It's only managing ~20qps.

Running the test across all 6 client machines, I am only able to push it up to ~30qps with many 30-second-timeout failures.  We used to be able to push it up to a few hundred qps without breaking a sweat.
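
To double-check the timing question from Comment 7 without relying on gunicorn's access-log timestamps, a WSGI middleware along these lines could log when requests actually finish inside the app.  A sketch only, not part of the deployed syncstorage stack:

    import logging
    import time

    logger = logging.getLogger("syncstorage.timing")

    class RequestTimingMiddleware(object):
        """Log how long each request takes from the app's point of view."""

        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            start = time.time()
            try:
                return self.app(environ, start_response)
            finally:
                logger.info("%s %s finished after %.3fs",
                            environ.get("REQUEST_METHOD"),
                            environ.get("PATH_INFO"),
                            time.time() - start)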
(In reply to Ryan Kelly [:rfkelly] from comment #14)
> Running the test across all 6 client machines, I am only able to push it up
> to ~30qps with many 30-second-timeout failures.  We used to be able to push
> it up to a few hundred qps without breaking a sweat.

Oops, I had the write/read ratio turned way up, which was probably contributing to this problem.  QPS is a bit better now at ~150, although still not as good as it used to be and still with plenty of timeout errors.
Per Bug 776777, we have found a difference between grinder and funkload scripts that most likely accounts for the extra load/failures here.  Funkload is sending significantly more items per POST batch, leading to much higher write workload in the database.

Once zeus timeouts are confirmed to be matching production, we can try another run with a reduced batch size to confirm whether this is the root of the problem.
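
For illustration, "reduced batch size" could be as simple as chunking the uploads before POSTing; the cap value and the post_json() helper here are placeholders rather than the actual load-test code:

    # Hypothetical sketch: MAX_ITEMS_PER_POST and post_json() are placeholders,
    # not the actual load-test code or a known grinder value.
    MAX_ITEMS_PER_POST = 100

    def chunked(items, size):
        """Yield successive batches of at most `size` items."""
        for i in range(0, len(items), size):
            yield items[i:i + size]

    def upload_wbos(post_json, url, wbos):
        # Issue one POST per capped batch instead of one oversized POST.
        for batch in chunked(wbos, MAX_ITEMS_PER_POST):
            post_json(url, batch)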
Adding :ckolos per his offer to help out where possible.

The next step for this bug is to check whether zeus request timeouts in stage match those set in production (see Comment 14).
Blocks: 775791
Status: NEW → ASSIGNED
The timeouts in stage were set to 30 seconds.  This has been modified to match the setting in production of 600 seconds.
Richard Soderberg has advised that the nginx and gunicorn timeouts be checked versus production as well.
Our various adjustments to make the funkload test script more closely match grinder seem to have paid off; we've been running load for around 20 hours now and these errors have not returned.  Going to mark this fixed.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Yep. Thoroughly tested in Sync Stage.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard