Closed Bug 854073 (Opened 12 years ago, Closed 12 years ago)

sync: server-storage stage deploy: server_storage -> 1.15-7, server_core -> 2.12-7

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: bobm)

Details

(Whiteboard: [qa+])

Please deploy server-storage 1.15-7 and server-core 2.12-7 to the sync server stage environment.

Build command:

make build PYPI=http://pypi.build.mtv1.svc.mozilla.com/simple PYPIEXTRAS=http://pypi.build.mtv1.svc.mozilla.com/extras PYPISTRICT=1 SERVER_STORAGE=rpm-1.15-7 SERVER_CORE=rpm-2.12-7 CHANNEL=prod RPM_CHANNEL=prod build_rpms

Ops Notes:

* Please adjust the pool size and backlog settings as follows (with different numbers if you think appropriate!):

pool_size = 12
pool_max_overflow = 0
pool_max_backlog = 6

* This release includes lots of logging and metrics around connection handling, per Bug 773541. These metlog counters/timers may be of interest for graphing etc.:

syncstorage.storage.sql.pool.get: timer for getting a connection from the pool
services.util.safe_execute.retry: counter for queries retried due to connection error
syncstorage.storage.sql.pool.backlog_exceeded: counter for connections refused immediately due to the max_backlog setting
syncstorage.storage.sql.pool.timeout: counter for connections refused due to timeout while waiting in the pool
syncstorage.storage.sql.pool.new_connection: counter for new connections to the database
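For anyone unfamiliar with these settings, here is a minimal sketch of the intended semantics of pool_size / pool_max_overflow / pool_max_backlog. This is not the server-storage implementation (that lives in server-core on top of SQLAlchemy's pool); the IllustrativePool class and its error strings below are hypothetical, purely to show how the three numbers interact:

import sqlite3
import threading
from contextlib import contextmanager

class IllustrativePool:
    """Toy pool: pool_size + max_overflow caps live connections,
    max_backlog caps how many callers may wait for one."""

    def __init__(self, connect, pool_size=12, max_overflow=0,
                 max_backlog=6, timeout=30):
        self._connect = connect
        self._timeout = timeout
        self._slots = threading.BoundedSemaphore(pool_size + max_overflow)
        self._backlog = threading.BoundedSemaphore(max_backlog)

    @contextmanager
    def connection(self):
        # A caller occupies a backlog slot only while waiting for a free
        # connection slot; extra waiters are refused immediately
        # (cf. the backlog_exceeded counter above).
        if not self._backlog.acquire(blocking=False):
            raise RuntimeError("backlog_exceeded")
        try:
            got_slot = self._slots.acquire(timeout=self._timeout)
        finally:
            self._backlog.release()
        if not got_slot:
            raise RuntimeError("pool timeout")  # cf. the pool.timeout counter
        conn = self._connect()                  # cf. pool.new_connection
        try:
            yield conn
        finally:
            conn.close()
            self._slots.release()

# Example usage with an in-memory SQLite database:
pool = IllustrativePool(lambda: sqlite3.connect(":memory:"),
                        pool_size=12, max_overflow=0, max_backlog=6)
with pool.connection() as conn:
    conn.execute("SELECT 1")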
This has been deployed to the staging environment. The Services Operations team is addressing some operational issues in the staging environment; the load and functional tests will be run once that work has been completed.
The staging environment has been repaired. The graphs have been added to the "syncstorage details" dashboard on the staging pencil instance. The deployment is now ready for QA.
I am looking over the Pencil dashboard and see the following additions under dashboard :: sync :: stage :: syncstorage details:
connection pool times
connection pool failures
Whiteboard: [qa+]
Overnight load test in Stage followed by the usual assortment of Sync bashing manual tests.
Status: NEW → ASSIGNED
Forgot to add: the weekend load test started this morning at approx. 10:30am PDT. The load test is running from Client11, using Client12 - Client16 to drive load. So far the failure rate (per ~/syncstorage-loadtest/loadtest.log) is about 2 per hour.
:bobm and :rfkelly: I killed the test at 9:48pm on Saturday. We started to see 503s in Stage on Friday around 8pm (about 10 hours after I started the test), then a bigger, consistent run of 503s starting around 11pm on Friday. So we didn't even make it 12 hours before the test started generating failures.

The test was run from Client11. The console log is saved here: /home/jbonacci/server-storage/loadtest/LoadSet1.log

The test was run against Client12 - Client16. The loadtest logs on those boxes are in: /home/jbonacci/syncstorage-loadtest/loadtest.log

In addition, we will probably want to inspect the public zeus logs. :bobm, we will need you to check the logs on the webheads.
Apparently 2 DBs went down in Stage: sync3 and sync8. I will work with :rfkelly to rig the Stage load test to hit only Sync1 - Sync2 and Sync4 - Sync7.
Restarted after getting the "all clear" from Services Ops. I immediately started seeing 503s in Stage, primarily from the following sync nodes: stage-sync2* and stage-sync7*. The load test has been killed.
Filed Bug 857323 to report the issue that caused the problems noted in Comment 8. Basically, stale pool connections to the back-end databases are not reaped.
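For reference, stock SQLAlchemy has a pool_recycle option that proactively replaces connections older than a given age, which is the usual guard against connections killed off by MySQL's wait_timeout. How the server-storage pool should handle this is being tracked in Bug 857323; the snippet below is only a generic SQLAlchemy illustration, with a hypothetical DSN:

from sqlalchemy import create_engine, text

# Generic SQLAlchemy illustration (not the server-storage config).
# pool_recycle=N closes and replaces any pooled connection that has been
# open longer than N seconds, so a connection dropped by the MySQL server
# is not handed back to the application.
engine = create_engine(
    "mysql://user:password@sync8.db.example/weave",  # hypothetical DSN
    pool_size=12,
    max_overflow=0,
    pool_recycle=3600,
    pool_timeout=30,
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))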
Sync 1.15-5 was deployed to stage to ascertain whether the problem reported in Bug 857323 exists in the production code. It does. Rolling 1.15-7 back into stage to complete the QA.
The staging environment has been upgraded to 1.15-7. The backlog option has been made functionally unlimited in the current configuration. As I was unable to locate the setting in the SQLAlchemy documentation, and was therefore unsure whether a setting of 0 or -1 would suffice to remove the limit, it has been set to 2048 to match the default backlog that Gunicorn passes to listen(), and because that seems reasonably high.

It is important to note that these are unrelated kinds of backlog: the listen() backlog applies to incoming TCP connections that have not yet been handled with accept(), while the pool_max_backlog setting applies to callers waiting to check a database connection out of the pool.
> The backlog option has been made functionally unlimited in the current
> configuration. As I was unable to locate the setting in the SQLAlchemy
> documentation, and was therefore unsure whether a setting of 0 or -1 would
> suffice to remove the limit

This setting is our own invention, in a layer on top of the standard SQLAlchemy pool class. Setting it to -1 should disable the limit completely. (Or you can just remove pool_max_backlog from the config file entirely.)
Staging has now been set to -1.
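To illustrate the -1 behavior described above (again, just a sketch rather than the actual server-core pool code; the make_backlog_limiter helper and its names are hypothetical):

import threading

def make_backlog_limiter(max_backlog):
    """Return (try_acquire, release) callables that bound the number of
    waiters, or no-ops when the limit is disabled (max_backlog < 0 or
    the option is absent)."""
    if max_backlog is None or max_backlog < 0:
        return (lambda: True), (lambda: None)
    sem = threading.BoundedSemaphore(max_backlog)
    return (lambda: sem.acquire(blocking=False)), sem.release

# With the stage setting of -1, callers are never refused for backlog reasons.
try_acquire, release = make_backlog_limiter(-1)
assert try_acquire()
release()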
Related tickets (that are not dependencies or blockers):
Bug 853270 - Enable pool_max_backlog in stage/prod sync config
Bug 857323 - Sync server SQL Alchemy connection pool fails to recycle stale DB connections.

Started another load test around 3:30pm PDT April 4. After 16 hours, we have only seen 12 errors/failures. All look something like this:

<DATE STAMP> <TIME STAMP> INFO test_storage_session:0:50:5: FAIL: https://stage-syncXX.services.mozilla.com/1.1/cuserXXXXXX/storage/meta/global error("Can't load https://stage-syncXX.services.mozilla.com/1.1/cuserXXXXXX/storage/meta/global.",)

Will continue this through 24 hours of load and summarize after that. Manual testing is in progress...
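For the 24-hour summary, something like the following can bucket those FAIL lines by hour. It assumes the loadtest.log lines really do begin with separate date and time fields, as in the sample above, so treat it as a sketch rather than a tested script:

from collections import Counter

failures_per_hour = Counter()
# Assumes each failure line starts with "<date> <time>" and contains "FAIL:".
with open("/home/jbonacci/syncstorage-loadtest/loadtest.log") as log:
    for line in log:
        if "FAIL:" in line:
            date, time_of_day = line.split()[:2]
            failures_per_hour[(date, time_of_day[:2])] += 1  # bucket by hour

for (date, hour), count in sorted(failures_per_hour.items()):
    print(f"{date} {hour}:00  {count} failure(s)")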
Sync-Storage-related manual testing has been completed without error or incident. The load test will run till about 4pm PDT.
The load test completed its 24-hour run with no significant issues that I could find. There was a database failure (sync8.db) right in the last 30 - 45 minutes of the test. I assume this is merely coincidental with the load test rather than caused by it, but I need :bobm to investigate.

We had 15 non-503-related errors, broken down roughly as follows:
6 /storage/meta/global errors
4 /info/collections errors
1 /storage/bookmarks error
3 /storage/history errors
1 /storage/forms error
I saw a few stray 503s in the /var/log/zeus/stage-sync.services.mozilla.com.access_2013-04-05-* files on the public zeus servers.

I saw errors of this type in the application.log files on the webheads:

(OperationalError) (2003, "Can't connect to MySQL server on 'sync8.db.BLAH' (110)") None None

and also these:

[2013/Apr/05:15:32:07 -0700] Uncaught exception while processing request: ...etc... Error(<class 'socket.error'>, error(110, 'Connection timed out'))

The times of these errors appear to align with our 503s (and our sync8.db issues). So I think we are good, but I would like :rfkelly to OK all the issues I found and :bobm to verify that the sync8.db issue was not caused by the load test. With all that, QA can then sign off.
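For cross-checking timestamps, a rough scan like the one below can pull the 503 entries out of those access logs. It assumes an Apache-style combined log format on the zeus side, which is an assumption rather than something verified here:

import glob
import re

# Matches '[timestamp] "request" 503 ' in Apache-style access log lines.
pattern = re.compile(r'\[([^\]]+)\] "[^"]*" 503 ')

for path in sorted(glob.glob(
        "/var/log/zeus/stage-sync.services.mozilla.com.access_2013-04-05-*")):
    with open(path) as log:
        for line in log:
            match = pattern.search(line)
            if match:
                print(path, match.group(1))  # timestamp of each 503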
I couldn't find any reason why sync8.db.scl2.stage shut down; the host was completely unavailable, just an empty black screen on the OOB console. The logs show the following as the last activity before it fell offline:

Apr 5 15:16:11 sync8 audit-cef: CEF:0|Unix|auditd|1|EXECVE|Unix Exec|3|end=1365200171 fname=(null) dhost=sync8.db.scl2.stage.svc.mozilla.com suser=root suid=0 dproc=/bin/cat msg=gid\=0 euid\=0 suid\=0 fsuid\=0 egid\=0 sgid\=0 fsgid\=0 ses\=4294967295 cwd\="/var/services/innodb-collect" inode\=00:00 dev\=(null) mode\=(null) ouid\=(null) ogid\=(null) rdev\=(null) inode\=00:00 dev\=(null) mode\=(null) ouid\=(null) ogid\=(null) rdev\=(null) cn1Label=auid cn1=4294967295 cs1Label=Command cs1=cat /tmp/tmp.PxIljkU0hb cs2Label=Truncated cs2=No cs3Label=AuditKey cs3=exec cs4Label=TTY cs4=(none) cs5Label=ParentProcess cs5=innodb-collect. cs6Label=MsgTruncated cs6=No

Apr 5 15:16:11 sync8 audit-cef: CEF:0|Unix|auditd|1|EXECVE|Unix Exec|3|end=1365200171 fname=(null) dhost=sync8.db.scl2.stage.svc.mozilla.com suser=root suid=0 dproc=/bin/cat msg=gid\=0 euid\=0 suid\=0 fsuid\=0 egid\=0 sgid\=0 fsgid\=0 ses\=4294967295 cwd\="/var/services/innodb-collect" inode\=00:00 dev\=(null) mode\=(null) ouid\=(null) ogid\=(null) rdev\=(null) inode\=00:00 dev\=(null) mode\=(null) ouid\=(null) ogid\=(null) rdev\=(null) cn1Label=auid cn1=4294967295 cs1Label=Command cs1=cat cs2Label=Truncated cs2=No cs3Label=AuditKey cs3=exec cs4Label=TTY cs4=(none) cs5Label=ParentProcess cs5=innodb-collect. cs6Label=MsgTruncated cs6=No

Apr 5 16:05:42 sync8 audit-cef: audisp-cef loaded
We've happily ignored "Can't load <blah>" errors in the past, as long as they're rare, because they seem to be some sort of client-side networking issue. If we're confident that the logged 503s are due to an issue with sync8.db, then I'm happy to give this the OK.
As far as I can tell, all 503s were related to the DB issue. QA signs off on this testing in Stage. Let's move this and the Heka stuff to Production.

REF:
Bug 853270 - Enable pool_max_backlog in stage/prod sync config
Bug 857323 - Sync server SQL Alchemy connection pool fails to recycle stale DB connections.
Bug 844366 - sync: deploy hekad-0.1.1-2 and metlog-py-0.10.0-1 to syncstorage prod
The Stage testing has completed successfully.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED