Closed Bug 854073 (Opened 12 years ago, Closed 12 years ago)

sync: server-storage stage deploy: server_storage -> 1.15-7, server_core -> 2.12-7

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: bobm)

Details

(Whiteboard: [qa+])

Please deploy server-storage 1.15-7 and server-core 2.12-7 to the sync server stage environment.

Build command:

make build PYPI=http://pypi.build.mtv1.svc.mozilla.com/simple PYPIEXTRAS=http://pypi.build.mtv1.svc.mozilla.com/extras PYPISTRICT=1 SERVER_STORAGE=rpm-1.15-7 SERVER_CORE=rpm-2.12-7 CHANNEL=prod RPM_CHANNEL=prod build_rpms

Ops Notes:

* Please adjust the pool size and backlog settings as follows (with different numbers if you think appropriate!):

pool_size = 12
pool_max_overflow = 0
pool_max_backlog = 6

* This release includes lots of logging and metrics around connection handling, per Bug 773541. These metlog counters/timers may be of interest for graphing etc.:

syncstorage.storage.sql.pool.get: timer for getting a connection from the pool
services.util.safe_execute.retry: counter for queries retried due to connection error
syncstorage.storage.sql.pool.backlog_exceeded: counter for connections refused immediately due to the max_backlog setting
syncstorage.storage.sql.pool.timeout: counter for connections refused due to timeout while waiting in the pool
syncstorage.storage.sql.pool.new_connection: counter for new connections to the database
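For anyone unfamiliar with these settings, here is a minimal sketch of the intended semantics of pool_size / pool_max_overflow / pool_max_backlog. This is not the server-storage implementation (that lives in server-core on top of SQLAlchemy's pool); the IllustrativePool class and its error strings below are hypothetical, purely to show how the three numbers interact:

import sqlite3
import threading
from contextlib import contextmanager

class IllustrativePool:
    """Toy pool: pool_size + max_overflow caps live connections,
    max_backlog caps how many callers may wait for one."""

    def __init__(self, connect, pool_size=12, max_overflow=0,
                 max_backlog=6, timeout=30):
        self._connect = connect
        self._timeout = timeout
        self._slots = threading.BoundedSemaphore(pool_size + max_overflow)
        self._backlog = threading.BoundedSemaphore(max_backlog)

    @contextmanager
    def connection(self):
        # A caller occupies a backlog slot only while waiting for a free
        # connection slot; extra waiters are refused immediately
        # (cf. the backlog_exceeded counter above).
        if not self._backlog.acquire(blocking=False):
            raise RuntimeError("backlog_exceeded")
        try:
            got_slot = self._slots.acquire(timeout=self._timeout)
        finally:
            self._backlog.release()
        if not got_slot:
            raise RuntimeError("pool timeout")  # cf. the pool.timeout counter
        conn = self._connect()                  # cf. pool.new_connection
        try:
            yield conn
        finally:
            conn.close()
            self._slots.release()

# Example usage with an in-memory SQLite database:
pool = IllustrativePool(lambda: sqlite3.connect(":memory:"),
                        pool_size=12, max_overflow=0, max_backlog=6)
with pool.connection() as conn:
    conn.execute("SELECT 1")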
This has been deployed to the staging environment. The Services Operations team is addressing some operational issues in the staging environment; the load and functional tests will be run once that work has been completed.
The staging environment has been repaired. The graphs have been added to the "syncstorage details" dashboard on the staging pencil instance. The deployment is now ready for QA.
I am looking over the Pencil dashboard and see the following additions under dashboard :: sync :: stage :: syncstorage details:
connection pool times
connection pool failures
Whiteboard: [qa+]
Overnight load test in Stage followed by the usual assortment of Sync bashing manual tests.
Status: NEW → ASSIGNED
Forgot to add: the weekend load test started this morning at approx. 10:30am PDT. The load test is running from Client11, using Client12 - Client16 to drive load. So far the failure rate (per ~/syncstorage-loadtest/loadtest.log) is about 2 per hour.
:bobm and :rfkelly: I killed the test at 9:48pm on Saturday. We started to see 503s in Stage on Friday around 8pm (about 10 hours after I started the test), then a bigger, consistent run of 503s starting around 11pm on Friday. So we didn't even make it 12 hours before the test started generating failures.

The test was run from Client11. The console log is saved here: /home/jbonacci/server-storage/loadtest/LoadSet1.log

The test was run against Client12 - Client16. The loadtest logs on those boxes are in: /home/jbonacci/syncstorage-loadtest/loadtest.log

In addition, we will probably want to inspect the public zeus logs. :bobm, we will need you to check the logs on the webheads.
Apparently 2 DBs went down in Stage: sync3 and sync8. I will work with :rfkelly to rig the Stage load test to hit only Sync1 - Sync2 and Sync4 - Sync7.
Restarted after getting the "all clear" from Services Ops. I immediately started seeing 503s in Stage, primarily from the following sync nodes: stage-sync2* and stage-sync7*. The load test has been killed.
Filed Bug 857323 to report the issue that caused the problems noted in Comment 8. Basically, stale pool connections to the back-end databases are not reaped.
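For reference, stock SQLAlchemy has a pool_recycle option that proactively replaces connections older than a given age, which is the usual guard against connections killed off by MySQL's wait_timeout. How the server-storage pool should handle this is being tracked in Bug 857323; the snippet below is only a generic SQLAlchemy illustration, with a hypothetical DSN:

from sqlalchemy import create_engine, text

# Generic SQLAlchemy illustration (not the server-storage config).
# pool_recycle=N closes and replaces any pooled connection that has been
# open longer than N seconds, so a connection dropped by the MySQL server
# is not handed back to the application.
engine = create_engine(
    "mysql://user:password@sync8.db.example/weave",  # hypothetical DSN
    pool_size=12,
    max_overflow=0,
    pool_recycle=3600,
    pool_timeout=30,
)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))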
Sync 1.15-5 was deployed to stage to ascertain whether the problem reported in Bug 857323 exists in the production code. It does. Rolling 1.15-7 back into stage to complete the QA.
The staging environment has been upgraded to 1.15-7. The backlog option has been made functionally unlimited in the current configuration. As I was unable to locate the setting in the SQLAlchemy documentation, and was therefore unsure whether a setting of 0 or -1 would suffice to remove the limit, it has been set to 2048 to match the default backlog that Gunicorn passes to listen(), and because that seems reasonably high.

It is important to note that these are unrelated kinds of backlog: the listen() backlog applies to incoming TCP connections that have not yet been handled with accept(), while the pool_max_backlog setting applies to callers waiting to check a database connection out of the pool.
> The backlog option has been made functionally unlimited in the current
> configuration. As I was unable to locate the setting in the SQLAlchemy
> documentation, and was therefore unsure whether a setting of 0 or -1 would
> suffice to remove the limit

This setting is our own invention, in a layer on top of the standard SQLAlchemy pool class. Setting it to -1 should disable the limit completely. (Or you can just remove pool_max_backlog from the config file entirely.)
Staging has now been set to -1.
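To illustrate the -1 behavior described above (again, just a sketch rather than the actual server-core pool code; the make_backlog_limiter helper and its names are hypothetical):

import threading

def make_backlog_limiter(max_backlog):
    """Return (try_acquire, release) callables that bound the number of
    waiters, or no-ops when the limit is disabled (max_backlog < 0 or
    the option is absent)."""
    if max_backlog is None or max_backlog < 0:
        return (lambda: True), (lambda: None)
    sem = threading.BoundedSemaphore(max_backlog)
    return (lambda: sem.acquire(blocking=False)), sem.release

# With the stage setting of -1, callers are never refused for backlog reasons.
try_acquire, release = make_backlog_limiter(-1)
assert try_acquire()
release()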
Related tickets (that are not dependencies or blockers):
Bug 853270 - Enable pool_max_backlog in stage/prod sync config
Bug 857323 - Sync server SQL Alchemy connection pool fails to recycle stale DB connections.

Started another load test around 3:30pm PDT April 4. After 16 hours, we have only seen 12 errors/failures. All look something like this:

<DATE STAMP> <TIME STAMP> INFO test_storage_session:0:50:5: FAIL: https://stage-syncXX.services.mozilla.com/1.1/cuserXXXXXX/storage/meta/global error("Can't load https://stage-syncXX.services.mozilla.com/1.1/cuserXXXXXX/storage/meta/global.",)

Will continue this through 24 hours of load and summarize after that. Manual testing is in progress...
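For the 24-hour summary, something like the following can bucket those FAIL lines by hour. It assumes the loadtest.log lines really do begin with separate date and time fields, as in the sample above, so treat it as a sketch rather than a tested script:

from collections import Counter

failures_per_hour = Counter()
# Assumes each failure line starts with "<date> <time>" and contains "FAIL:".
with open("/home/jbonacci/syncstorage-loadtest/loadtest.log") as log:
    for line in log:
        if "FAIL:" in line:
            date, time_of_day = line.split()[:2]
            failures_per_hour[(date, time_of_day[:2])] += 1  # bucket by hour

for (date, hour), count in sorted(failures_per_hour.items()):
    print(f"{date} {hour}:00  {count} failure(s)")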
Sync-Storage-related manual testing has been completed without error or incident. The load test will run till about 4pm PDT.
The load test completed its 24-hour run with no significant issues that I could find. There was a database failure (sync8.db) right in the last 30 - 45 minutes of the test. I assume this is merely coincidental with the load test rather than caused by it, but I need :bobm to investigate.

We had 15 non-503-related errors, broken down roughly as follows:
6 /storage/meta/global errors
4 /info/collections errors
1 /storage/bookmarks error
3 /storage/history errors
1 /storage/forms error
I saw a few stray 503s in the /var/log/zeus/stage-sync.services.mozilla.com.access_2013-04-05-* files on the public zeus servers.

I saw errors of this type in the application.log files on the webheads:

(OperationalError) (2003, "Can't connect to MySQL server on 'sync8.db.BLAH' (110)") None None

and also these:

[2013/Apr/05:15:32:07 -0700] Uncaught exception while processing request: ...etc... Error(<class 'socket.error'>, error(110, 'Connection timed out'))

The times of these errors appear to align with our 503s (and our sync8.db issues). So I think we are good, but I would like :rfkelly to OK all the issues I found and :bobm to verify that the sync8.db issue was not caused by the load test. With all that, QA can then sign off.
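For cross-checking timestamps, a rough scan like the one below can pull the 503 entries out of those access logs. It assumes an Apache-style combined log format on the zeus side, which is an assumption rather than something verified here:

import glob
import re

# Matches '[timestamp] "request" 503 ' in Apache-style access log lines.
pattern = re.compile(r'\[([^\]]+)\] "[^"]*" 503 ')

for path in sorted(glob.glob(
        "/var/log/zeus/stage-sync.services.mozilla.com.access_2013-04-05-*")):
    with open(path) as log:
        for line in log:
            match = pattern.search(line)
            if match:
                print(path, match.group(1))  # timestamp of each 503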
I couldn't find any reason why sync8.db.scl2.stage shut down; the host was completely unavailable, just an empty black screen on the OOB console. The logs show the following as the last activity before it fell offline:

Apr 5 15:16:11 sync8 audit-cef: CEF:0|Unix|auditd|1|EXECVE|Unix Exec|3|end=1365200171 fname=(null) dhost=sync8.db.scl2.stage.svc.mozilla.com suser=root suid=0 dproc=/bin/cat msg=gid\=0 euid\=0 suid\=0 fsuid\=0 egid\=0 sgid\=0 fsgid\=0 ses\=4294967295 cwd\="/var/services/innodb-collect" inode\=00:00 dev\=(null) mode\=(null) ouid\=(null) ogid\=(null) rdev\=(null) inode\=00:00 dev\=(null) mode\=(null) ouid\=(null) ogid\=(null) rdev\=(null) cn1Label=auid cn1=4294967295 cs1Label=Command cs1=cat /tmp/tmp.PxIljkU0hb cs2Label=Truncated cs2=No cs3Label=AuditKey cs3=exec cs4Label=TTY cs4=(none) cs5Label=ParentProcess cs5=innodb-collect. cs6Label=MsgTruncated cs6=No

Apr 5 15:16:11 sync8 audit-cef: CEF:0|Unix|auditd|1|EXECVE|Unix Exec|3|end=1365200171 fname=(null) dhost=sync8.db.scl2.stage.svc.mozilla.com suser=root suid=0 dproc=/bin/cat msg=gid\=0 euid\=0 suid\=0 fsuid\=0 egid\=0 sgid\=0 fsgid\=0 ses\=4294967295 cwd\="/var/services/innodb-collect" inode\=00:00 dev\=(null) mode\=(null) ouid\=(null) ogid\=(null) rdev\=(null) inode\=00:00 dev\=(null) mode\=(null) ouid\=(null) ogid\=(null) rdev\=(null) cn1Label=auid cn1=4294967295 cs1Label=Command cs1=cat cs2Label=Truncated cs2=No cs3Label=AuditKey cs3=exec cs4Label=TTY cs4=(none) cs5Label=ParentProcess cs5=innodb-collect. cs6Label=MsgTruncated cs6=No

Apr 5 16:05:42 sync8 audit-cef: audisp-cef loaded
We've happily ignored "Can't load <blah>" errors in the past, as long as they're rare, because they seem to be some sort of client-side networking issue. If we're confident that the logged 503s are due to an issue with sync8.db, then I'm happy to give this the OK.
As far as I can tell, all 503s were related to the DB issue. QA signs off on this testing in Stage. Let's move this and the Heka stuff to Production.

REF:
Bug 853270 - Enable pool_max_backlog in stage/prod sync config
Bug 857323 - Sync server SQL Alchemy connection pool fails to recycle stale DB connections.
Bug 844366 - sync: deploy hekad-0.1.1-2 and metlog-py-0.10.0-1 to syncstorage prod
The Stage testing has completed successfully.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED