Closed Bug 773093 Opened 12 years ago Closed 12 years ago

New Sync 1.1 Funkload test causing 503s/Assertion errors in Stage environment

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jbonacci, Assigned: rfkelly)

References

Details

(Whiteboard: [qa+])

:rfkelly and I found this while running some new funkload tests in Sync Stage.
After a fresh hg clone of server-storage, I started the dist.sh script with just one "worker" - one client in use.
The very short load test seemed to work well with no apparent errors or other issues.

The second attempt with dist.sh, using all available workers (client4 - client9), was not at all successful.

We started to see 5xx spikes in the Pencil graphs for syncstorage on Stage and a significant number of "Failures" in the loadtest.xml files on the clientX boxes. Example:

...etc...
<response cycle="001" cvus="050" thread="049" suite="StressTest" name="test_storage_session" step="001" number="001" type="get" result="Failure" url="/1.1/cuser568797/info/collections" code="503" description="" time="1342049540.41" duration="0.0150330066681">
  <headers>
    <header name="retry-after" value="600" />
    <header name="date" value="Wed, 11 Jul 2012 23:32:20 GMT" />
    <header name="connection" value="close" />
    <header name="content-type" value="application/json; charset=UTF-8" />
    <header name="content-length" value="39" />
  </headers>
  <body><![CDATA[
"server issue: database is not healthy"
]]>
  </body>
</response>
<testResult cycle="001" cvus="050" thread="049" suite="StressTest" name="test_storage_session"  time="1342049540.41" result="Failure" steps="1" duration="0.0156259536743" connection_duration="0.0150330066681" requests="1" pages="0" xmlrpc="0" redirects="0" images="0" links="0" traceback='Traceback (most recent call last):&#10;   File "/home/jbonacci/syncstorage-loadtest/lib/python2.6/site-packages/funkload/FunkLoadTestCase.py", line 946, in __call__&#10;    testMethod()&#10;   File "/home/jbonacci/syncstorage-loadtest/stress.py", line 69, in test_storage_session&#10;    response = self.get(url)&#10;   File "/home/jbonacci/syncstorage-loadtest/lib/python2.6/site-packages/funkload/FunkLoadTestCase.py", line 391, in get&#10;    method="get", load_auto_links=load_auto_links)&#10;   File "/home/jbonacci/syncstorage-loadtest/lib/python2.6/site-packages/funkload/FunkLoadTestCase.py", line 299, in _browse&#10;    response = self._connect(url, params, ok_codes, method, description)&#10;   File "/home/jbonacci/syncstorage-loadtest/lib/python2.6/site-packages/funkload/FunkLoadTestCase.py", line 216, in _connect&#10;    raise self.failureException, str(value.response)&#10; AssertionError: /1.1/cuser568797/info/collections&#10;HTTP Response 503: Database unavailable&#10;' />

...etc...
Leaving the worker/clientX snapshot as is, so:

The server-storage install is in /home/jbonacci on client4.
It has the nohup.out log of the STDOUT/STDERR from the load test.
There are also syncstorage-loadtest directories on all clients in /home/jbonacci.
Each of these directories has a loadtest.log (not very useful) and at least one loadtest.xml (full of errors).
OS: Mac OS X → All
Hardware: x86 → All
Whiteboard: [qa+]
These failures all appear to be 503 responses saying:

    HTTP Response 503: Database unavailable
    server issue: database is not healthy

These error messages are not generated within the Python app, so I guess they are coming directly from Zeus.  Digging into the server-side log files to see if I can find out more.
I found 26 instances of a 503 error within syncstorage itself.  The additional failures will be due to zeus backing off clients after receiving a 503 from syncstorage.

Interestingly, all failing requests are pretty much identical, differing only in the randomly-generated userid and 'newer' parameter:

    10.14.214.201 - cuser521521 [11/Jul/2012:16:29:57 -0700] "GET /1.1/cuser521521/storage/bookmarks?full=1&newer=1341780988 HTTP/1.0" 503 66 "https://stage-sync17.services.mozilla.com/1.1/cuser521521/info/collections" "FunkLoad/1.16.1" XFF="63.245.208.248" TIME=9.042

They are all to the "bookmarks" collection and they all took just over 9 seconds to come back with an error.

In the application log I have instances of this traceback:

  File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 615, in get_items
    return [WBO(line, converters) for line in res]
  File "/usr/lib/python2.6/site-packages/syncstorage/storage/sql.py", line 287, in _do_query_fetchall
    res = safe_execute(self._engine, *args, **kwds)
  File "/usr/lib/python2.6/site-packages/services/util.py", line 339, in safe_execute
    return engine.execute(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1788, in execute
    return connection.execute(statement, *multiparams, **params)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1191, in execute
    params)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1271, in _execute_clauseelement
    return self.__execute_context(context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1302, in __execute_context
    context.parameters[0], context=context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1401, in _cursor_execute
    context)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1350, in _handle_dbapi_exception
    self._autorollback()
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1144, in _autorollback
    self._rollback_impl()
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1088, in _rollback_impl
    self._handle_dbapi_exception(e, None, None, None, None)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 1085, in _rollback_impl
    self.engine.dialect.do_rollback(self.connection)
  File "/usr/lib/python2.6/site-packages/sqlalchemy/dialects/mysql/base.py", line 1657, in do_rollback
    connection.rollback()
  File "/usr/lib/python2.6/site-packages/pymysql/connections.py", line 575, in rollback
    self.errorhandler(None, exc, value)
  File "/usr/lib/python2.6/site-packages/pymysql/connections.py", line 187, in defaulterrorhandler
    raise Error(errorclass, errorvalue)
DBAPIError: (Error) (<class 'socket.error'>, error(32, 'Broken pipe')) None None


This is the app trying to execute a SELECT query, failing, trying to execute a rollback, and then getting a socket.error on its connection to MySQL.  Unfortunately, the rollback failure seems to be masking the original error.  Even more unfortunately, this auto-rollback logic lives inside SQLAlchemy rather than in our own code.
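
For reference, the masking comes from the usual rollback-inside-except pattern; a minimal sketch of the behaviour (not the actual SQLAlchemy code) is:

    # Minimal sketch of the masking behaviour, not the actual SQLAlchemy code.
    def execute_with_autorollback(connection, statement):
        try:
            return connection.execute(statement)
        except Exception:
            # If the connection to MySQL is already dead (e.g. broken pipe),
            # this rollback raises its own error.  Python 2 has no exception
            # chaining, so the rollback error propagates and the original
            # SELECT failure is lost.
            connection.rollback()
            raise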

:atoll, does the ~9 second timeout give you any clues as to what the underlying error might be?
(In reply to Ryan Kelly [:rfkelly] from comment #3)
> 
> They are all to the "bookmarks" collection 

This turns out to be a red herring, caused by a bug in the loadtest script.  It shuffles the list of collections in order to select a random one, but then makes its selection from the original unshuffled list.  "bookmarks" is simply the collection at position zero in the list.

Patch forthcoming.
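
For illustration, the pattern described above looks something like this (hypothetical names, not the actual stress.py code):

    import random

    COLLECTIONS = ["bookmarks", "forms", "history", "passwords", "prefs", "tabs"]

    # Buggy: a copy gets shuffled, but the selection indexes the original
    # list, so the "random" pick is always position zero -- "bookmarks".
    shuffled = list(COLLECTIONS)
    random.shuffle(shuffled)
    collection = COLLECTIONS[0]

    # Fixed: select from the shuffled copy (or simply use random.choice).
    collection = shuffled[0]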

> and they all took just over 9 seconds to come back with an error.

This article suggests that errors with a 9-second delay can be caused by underlying problems in the TCP connection:

    http://www.mysqlperformanceblog.com/2011/04/19/mysql-connection-timeouts/

A possibility?  I can't find anything we have explicitly configured with this strange timeout value.
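
For what it's worth, if this does turn out to be a driver-level connection timeout, it could at least be made explicit (and tunable) where the engine is created.  A sketch only, with a placeholder URL and placeholder values rather than our actual stage configuration:

    from sqlalchemy import create_engine

    # Placeholder URL/credentials and values; not the stage config.
    engine = create_engine(
        "mysql+pymysql://sync:secret@db-host/sync",
        connect_args={"connect_timeout": 10},  # pymysql connect timeout, seconds
        pool_recycle=3600,  # recycle idle pooled connections before MySQL drops them
    )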
Assignee: nobody → rfkelly
Do we have metrics about how often an existing pool connection is used vs. a new one created?
Depends on: 773539
(In reply to Richard Soderberg [:atoll] from comment #5)
> Do we have metrics about how often an existing pool connection is used vs. a
> new one created?

We do not.  This sounds useful; I will file a bug to add them.  Hopefully SQLAlchemy provides the necessary hooks to collect this information.
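
Something along these lines should work, assuming a SQLAlchemy version with the event API (0.7+); older releases expose the same hook points via sqlalchemy.interfaces.PoolListener.  The URL below is a placeholder standing in for the app's existing engine:

    from sqlalchemy import create_engine, event

    engine = create_engine("mysql+pymysql://sync:secret@db-host/sync")
    pool_stats = {"connects": 0, "checkouts": 0}

    def on_connect(dbapi_conn, conn_record):
        # Fired when the pool opens a brand-new connection to MySQL.
        pool_stats["connects"] += 1

    def on_checkout(dbapi_conn, conn_record, conn_proxy):
        # Fired every time a connection (new or reused) is handed to the app.
        pool_stats["checkouts"] += 1

    event.listen(engine.pool, "connect", on_connect)
    event.listen(engine.pool, "checkout", on_checkout)

    # connects / checkouts gives the fraction of checkouts that paid the cost
    # of opening a fresh MySQL connection.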
I tried another funkload run this evening and got about half an hour of success before it started giving me 503 errors.  These ones are different though; I'm seeing lines like this in the nginx access log:

    10.14.214.201 - cuser920348 [13/Jul/2012:00:12:56 -0700] "POST /1.1/cuser920348/storage/bookmarks HTTP/1.0" 499 0 "https://stage-sync20.services.mozilla.com/1.1/cuser920348/info/collections" "FunkLoad/1.16.1" XFF="63.245.208.248" TIME=32.673 

IIRC, the 499 response code is logged when zeus times out the connection, disconnecting from nginx and returning an HTTP 503 response to the client.  The funkload log files show the following response being returned:

    HTTP 503 Service Unavailable
    server issue: pool exhausted

But strangely, the syncstorage app is logging this request as a success, and doing so about 30s *before* the 499-status log line shows up in nginx:

    @400000004fffcac31917d20c 63.245.208.248, 10.14.214.201 - cuser920348 [13/Jul/2012:00:12:23 -0700] "POST /1.1/cuser920348/storage/bookmarks HTTP/1.0" 200 3342 "https://stage-sync20.services.mozilla.com/1.1/cuser920348/info/collections" "FunkLoad/1.16.1"

So I'm really not sure what to make of that.  Will keep digging...
It may not be possible with :petef's imminent departure, but running a half-hour to an hour of grinder against current stage would be very useful here.  If the grinder tests don't produce these failures, that might give us a clue about where to look next.
:petef is running funkload today 7/13/2012.
Thanks :jbonacci and :petef.  Not sure if posting pencil links in bugzilla is appropriate, but these are the parameters I used to capture graphs from both tests side-by-side:

    /dash/stage/syncstorage?start=2012-07-12+23%3A00%3A00&duration=16+hours&width=1000&height=400

There were no errors produced by the grinder test run.  Some other points of difference were:

  * The grinder test ran ~50 qps lower than the funkload tests, although it looks like they might asymptote to the same value if left to run for longer.
  * The grinder tests caused much lower CPU usage than the funkload tests: 50% vs 150%.

I will continue investigating this today.
I am now getting errors like this in the application error log, from a failing INSERT statement:

  (1114, u"The table 'wbo80' is full")

It's not tied to a specific 'wboX' table; inserting into any table seems to produce the same error.  I don't have access to the database machines to investigate the underlying cause, but it sounds like it might be a disk space issue.
Filed Bug 774155 to track the above database issue.
Blocks: 770406
With "Table Full" issues fixed, I am able to proceed with some light load on stage.  

I can reproduce what appears to be the same error as seen in Comment 7, and will continue to dig into this today.
It turns out the failures are not as mysterious as Comment 7 suggests, and results are not getting lost between gunicorn and nginx - the gunicorn logs are simply recording the start time of the request rather than the end time.  The failing requests are completing inside syncstorage, but *after* zeus has given up and returned a failure to the client.

So the immediate problem appears to be:  some requests are taking longer than 30s to complete, and zeus is timing them out after 30s.

IIRC the prod zeus timeout is 600s; :atoll, can you please check/adjust the stage timeout to match?

Still, I am only running the bench from a single client machine; it doesn't seem like there should be enough load for requests to slow down that badly.  It's only managing ~20qps.

Running the test across all 6 client machines, I am only able to push it up to ~30qps with many 30-second-timeout failures.  We used to be able to push it up to a few hundred qps without breaking a sweat.
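
To double-check the timing question from Comment 7 without relying on gunicorn's access-log timestamps, a WSGI middleware along these lines could log when requests actually finish inside the app.  A sketch only, not part of the deployed syncstorage stack:

    import logging
    import time

    logger = logging.getLogger("syncstorage.timing")

    class RequestTimingMiddleware(object):
        """Log how long each request takes from the app's point of view."""

        def __init__(self, app):
            self.app = app

        def __call__(self, environ, start_response):
            start = time.time()
            try:
                return self.app(environ, start_response)
            finally:
                logger.info("%s %s finished after %.3fs",
                            environ.get("REQUEST_METHOD"),
                            environ.get("PATH_INFO"),
                            time.time() - start)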
(In reply to Ryan Kelly [:rfkelly] from comment #14)
> Running the test across all 6 client machines, I am only able to push it up
> to ~30qps with many 30-second-timeout failures.  We used to be able to push
> it up to a few hundred qps without breaking a sweat.

Oops, I had the write/read ratio turned way up, which was probably contributing to this problem.  QPS is a bit better now at ~150, although still not as good as it used to be and still with plenty of timeout errors.
Per Bug 776777, we have found a difference between grinder and funkload scripts that most likely accounts for the extra load/failures here.  Funkload is sending significantly more items per POST batch, leading to much higher write workload in the database.

Once zeus timeouts are confirmed to be matching production, we can try another run with a reduced batch size to confirm whether this is the root of the problem.
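
For illustration, "reduced batch size" could be as simple as chunking the uploads before POSTing; the cap value and the post_json() helper here are placeholders rather than the actual load-test code:

    # Hypothetical sketch: MAX_ITEMS_PER_POST and post_json() are placeholders,
    # not the actual load-test code or a known grinder value.
    MAX_ITEMS_PER_POST = 100

    def chunked(items, size):
        """Yield successive batches of at most `size` items."""
        for i in range(0, len(items), size):
            yield items[i:i + size]

    def upload_wbos(post_json, url, wbos):
        # Issue one POST per capped batch instead of one oversized POST.
        for batch in chunked(wbos, MAX_ITEMS_PER_POST):
            post_json(url, batch)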
Adding :ckolos per his offer to help out where possible.

The next step for this bug is to check whether zeus request timeouts in stage match those set in production (see Comment 14).
Blocks: 775791
Status: NEW → ASSIGNED
The timeouts in stage were set to 30 seconds.  This has been modified to match the setting in production of 600 seconds.
Richard Soderberg has advised that the nginx and gunicorn timeouts be checked versus production as well.
Our various adjustments to make the funkload test script more closely match grinder seem to have paid off; we've been running load for around 20 hours now and these errors have not returned.  Going to mark this fixed.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Yep. Thoroughly tested in Sync Stage.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard