We're going to push a metlog-enabled sync1.1. \o/ It's a bit of a bumpy road though, since we haven't deployed from tip in a good long while, so here's a bug to track it.

Basic plan from server-code viewpoint:

1) Merge fixes from server-core branch "2.7" and server-storage branch "1.10-release" into their respective default branches, to make sure tip is the most current codebase.

2) Add in any high-priority operational fixes like Bug 756653.

3) Get this tested and deployed with LDAP connection pooling switched *off*. Toby and I went through the changes and there don't appear to be any other high-risk changes, so this will hopefully be a smooth and simple deployment (yeah right..?)

4) Merge LDAP pooling fixes from server-core branch "2.6.1" into default branch.

5) Get this tested and deployed with LDAP connection pooling switched *on*. This will require extensive testing because it's code we haven't previously been using in production. If this doesn't work out, we have the option of switching it back off and pushing ahead.

6) Merge server-core "metrics.logging" branch into default, and make the necessary changes to server-storage.

7) Deploy with metlog for much win.
Given the number of releases we plan, I think it might be worth backporting Bug 731511 to sync1.1 to provide an additional layer of QA to the mix.
Step (1) is done, with server-core and server-storage tip now referring to the merge commits. Nice and painless, and it passes `make test`. I now need to work with :jbonacci to install these on a dev box and do some preliminary testing.
(In reply to Ryan Kelly [:rfkelly] from comment #0)
> 3) Get this tested and deployed with LDAP connection pooling
> switched *off*.

I don't feel comfortable turning off ldap connection pooling in prod. It overwhelms the LDAP slaves, the load balancers, and the firewalls when we are making that many new TCP connections (and we run into ephemeral port problems on the load balancer if there is a small blip and things end up in CLOSE_WAIT/TIME_WAIT and eat up ports).
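To illustrate why pooling matters here (a toy sketch only, not the actual server-core LDAP pool): with a pool, the number of TCP connections opened is bounded by the pool size rather than by the request rate, which is what keeps the SYN rate and TIME_WAIT churn described above away from the slaves, load balancers, and firewalls.

```python
import queue

class ToyConnectionPool:
    """Illustration only: all connects happen up front, so the
    connection count is bounded by pool size, not request rate."""

    def __init__(self, connect, size=4):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(connect())  # open connections once, at startup

    def acquire(self):
        # Blocks until a connection is free; never opens a new one.
        return self._pool.get()

    def release(self, conn):
        self._pool.put(conn)
```

With pooling off, every request pays a fresh connect() (a new SYN and, eventually, a TIME_WAIT socket); with it on, thousands of requests reuse the same handful of connections.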
Hmm, LDAP pooling is on? I thought we got blocked on testing and never released it. Did that actually happen?
Speaking of LDAP, we will need to switch the auth backend over to the new-style code from services.user. With luck we can just pinch the [auth] section from the production config of some other product, since everything else is already running with a new-style backend.
(In reply to Toby Elliott [:telliott] from comment #4)
> Hmm, LDAP pooling is on? I thought we got blocked on testing and never
> released it. Did that actually happen?

0 fetep-x201(...dmins/puppet/weave/files/etc) % ack -aw ldap_use_pool sync.*
sync.storage.phx1/sync.conf
636:ldap_use_pool = true
sync.storage.scl2/sync.conf
1355:ldap_use_pool = true
sync.storage.stage/sync.conf
198:ldap_use_pool = true

I don't know if ldap pooling has been tested with the new backend, though. It's easy enough to see if this is working in stage (tcpdump for SYNs from webheads to ldap VIP on 389).
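For an offline double-check of those configs, something like this would confirm the flag (a sketch; the section layout of the real sync.conf is an assumption here, so treat the function as illustrative):

```python
from configparser import ConfigParser

def ldap_pooling_enabled(path):
    """Return True if any section of an ini-style config sets
    ldap_use_pool = true. The real sync.conf layout may differ;
    this only shows the shape of the check."""
    cfg = ConfigParser()
    cfg.read(path)
    for section in cfg.sections():
        if cfg.has_option(section, "ldap_use_pool"):
            return cfg.getboolean(section, "ldap_use_pool")
    return False
```

The tcpdump check remains the authoritative one, since it observes what the webheads actually do rather than what the config says.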
Marking dependency on Bug 718739 - this should be fixed already in server-core, but we'll confirm that as part of this deployment.
OK, per :petef's comments, let's remove step (3) above. We can still build and test the results of the initial merge without sending them out to stage, as a sanity-check that we haven't broken anything.
Steps (1), (2) and (4) are now completed, skipping over step (3). All production fixes have been merged back into their respective default branches and are passing basic tests. I have tagged SERVER_CORE=rpm-2.9-1 and SERVER_STORAGE=rpm-1.12-1.

Next step is to install a fresh build of server-full onto qa1 and test it there. Running `make build` should build the latest tags by default. For completeness, here is the output I expect from the build and test run:

$ cd server-full
$
$ make build
...lots of noise elided...
$
$ (cd deps/server-core && hg summary)
parent: 940:23b6307f1e43 rpm-2.9-1
 Fix test error string to match previous commit.
branch: default
commit: (clean)
update: 1 new changesets (update)
$
$ (cd deps/server-storage && hg summary)
parent: 631:147dd35b5cec rpm-1.12-1
 Add some release notes for 1.12
branch: default
commit: (clean)
update: 1 new changesets (update)
$
$ make test
bin/nosetests -s --with-xunit deps/server-core/services/tests deps/server-reg/syncreg/tests deps/server-storage/syncstorage/tests
...............................................................................S.S..S.....................................F..........................................................
======================================================================
FAIL: test_password_reset_direct (syncreg.tests.functional.test_user.TestUser)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rfk/TEMP/server-full/deps/server-reg/syncreg/tests/functional/test_user.py", line 121, in test_password_reset_direct
    self.assertTrue(e.args.endswith(str(ERROR_INVALID_USER)))
AssertionError: False is not true
----------------------------------------------------------------------
Ran 181 tests in 165.642s
FAILED (SKIP=3, failures=1)
make: *** [test] Error 1

This single test failure is due to a bug in the latest release of server-reg (Bug 755464, fixed in trunk).
I think it's simpler to ignore it than to try to piggy-back a new server-reg release on this deployment.
Got this on my list for today (5/30/2012)...
My build was good. My test results match yours. Bringing up the sync server behind a web server for testing... http://qa1.mtv1.dev.svc.mozilla.com/ Will sanity test the custom host across various devices, OS, accounts, etc.
I am signing off on the initial tests of Sync 1.1 (no metlog) on qa1. Next step is to get a load run on Sync stage after upgrading to 1.1 code. This will be a "sanity" load test before adding metlog support.
I have built rpms with the following command and attempted to deploy them to sync1.web.mtv1.dev for testing:

make build_rpms CHANNEL=prod RPM_CHANNEL=prod PYPI=http://pypi.build.mtv1.svc.mozilla.com/simple PYPIEXTRAS=http://pypi.build.mtv1.svc.mozilla.com/extras PYPISTRICT=1 SERVER_CORE=rpm-2.9-1 SERVER_STORAGE=rpm-1.12-1

However, I get a hard failure inside the memcache client library when running the tests:

$ nosetests -xs syncstorage
.................................python26: libmemcached/io.cc:356: memcached_return_t memcached_io_read(memcached_server_st*, void*, size_t, ssize_t*): Assertion `0' failed.
Aborted

I guess this is due to sync1.dev being a CentOS machine while the rpms were built on r6-build. Is there a dev machine matching r6-build on which I can sanity-check the RPMs before we try to push them out to stage?
(In reply to Ryan Kelly [:rfkelly] from comment #14)
> I guess this is due to sync1.dev being a CentOS machine while the rpms were
> built on r6-build. Is there a dev machine matching r6-build on which I can
> sanity-check the RPMs before we try to push them out to stage?

centos5.build.mtv1.svc.m.c, next door to r6.
(In reply to Ryan Kelly [:rfkelly] from comment #14)
> $ nosetests -xs syncstorage
> .................................python26:
> libmemcached/io.cc:356: memcached_return_t
> memcached_io_read(memcached_server_st*, void*, size_t, ssize_t*): Assertion
> `0' failed.
> Aborted

Hmm, same result with properly-built CentOS rpms. Looks like this might just be a bug in libmemcached: https://bugs.launchpad.net/libmemcached/+bug/810482
Curious: if I stop couchbase on sync1.dev and replace it with regular memcached then all the tests pass. Restoring couchbase causes the failure to re-appear. Possibly some incompatibility between libmemcached and couchbase on these machines?
(In reply to Ryan Kelly [:rfkelly] from comment #17)
> Curious: if I stop couchbase on sync1.dev and replace it with regular
> memcached then all the tests pass. Restoring couchbase causes the failure
> to re-appear. Possibly some incompatibility between libmemcached and
> couchbase on these machines?

Does this happen on an r6 webhead, too? sync4.web.mtv1.dev maybe?
OK, turns out that couchbase runs on a non-standard port on the dev machines, so the failure on sync1 is probably libmemcached mishandling whatever was answering on the standard port. Configuring the tests to use port 11222 makes all tests pass on both sync1 and sync4. I think this is ready to move into stage; I will file a separate bug with the details.
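A quick way to sanity-check which port actually speaks the memcached text protocol, before a client library crashes on it (a sketch, not part of the deployment tooling; host names below are illustrative):

```python
import socket

def speaks_memcached(host, port, timeout=2.0):
    """Send the text-protocol 'version' command and check for the
    'VERSION x.y.z' reply that a memcached-compatible listener gives.
    Anything else on the port, or nothing at all, returns False."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"version\r\n")
            reply = s.recv(128)
    except OSError:
        return False
    return reply.startswith(b"VERSION")
```

Comparing, say, `speaks_memcached("sync1", 11211)` against `speaks_memcached("sync1", 11222)` would have pinpointed the port mix-up without tripping the assertion inside libmemcached.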
The standard Couchbase-replacing-memcached port is 11222/tcp. In theory, we can permit 11211 to continue working, but in practice it ends up (correctly) exposing configs using the older port.
Loadtesting of the pre-metlog code base in Bug 761068 has completed with no issues found, so I'm going to push ahead and tag a metlog-based release. It will be the 1.13 release series and I'll link the bug here when it's ready.
Oh yeah, this finally happened :-)