[meta] build and deploy a metlog-enabled sync1.1 server

Status: RESOLVED FIXED
Product: Cloud Services
Component: Server: Sync
Opened: 6 years ago
Last modified: 5 years ago

People: (Reporter: rfkelly, Assigned: rfkelly)

Firefox Tracking Flags: (Not tracked)

Whiteboard: [qa+]
(Assignee)

Description

6 years ago
We're going to push a metlog-enabled sync1.1. \o/.  It's a bit of a bumpy road though, since we haven't deployed from tip in a good long while, so here's a bug to track it.  Basic plan, from the server-code point of view:

   1)  Merge fixes from server-core branch "2.7" and server-storage
       branch "1.10-release" into their respective default branches, to
       make sure tip is the most current codebase (a merge sketch
       follows this list).

   2)  Add in any high-priority operational fixes like Bug 756653.

   3)  Get this tested and deployed with LDAP connection pooling
       switched *off*.  Toby and I went through the changes and
       there don't appear to be any other high-risk changes, so this
       will hopefully be a smooth and simple deployment (yeah right..?)

   4)  Merge LDAP pooling fixes from server-core branch "2.6.1" into
       default branch.

   5)  Get this tested and deployed with LDAP connection pooling
       switched *on*.  This will require extensive testing because
       it's code we haven't previously been using in production.

       If this doesn't work out, we have the option of switching it
       back off and pushing ahead.

   6)  Merge server-core "metrics.logging" branch into default, and
       make the necessary changes to server-storage.

   7)  Deploy with metlog for much win.
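
For reference, a minimal sketch of the step (1) merge, assuming the usual hg workflow (exact revisions, paths and commit messages will differ):

    # Merge the server-core "2.7" release branch back into default.
    cd server-core
    hg pull
    hg update default
    hg merge 2.7
    hg commit -m "Merge 2.7 release branch into default"
    make test    # sanity-check before pushing
    hg push

    # Repeat for server-storage with the "1.10-release" branch.
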
(Assignee)

Comment 1

6 years ago
Given the number of releases we plan, I think it might be worth backporting Bug 731511 to sync1.1 to provide an additional layer of QA to the mix.
(Assignee)

Comment 2

6 years ago
Step (1) is done, with server-core and server-storage tip now referring to the merge commits.  Nice and painless, and it passes `make test`.  I now need to work with :jbonacci to install these on a dev box and do some preliminary testing.
Comment 3

(In reply to Ryan Kelly [:rfkelly] from comment #0)
>    3)  Get this tested and deployed with LDAP connection pooling
>        switched *off*.

I don't feel comfortable turning off LDAP connection pooling in prod. It overwhelms the LDAP slaves, the load balancers, and the firewalls when we make that many new TCP connections. We also run into ephemeral-port problems on the load balancer if there is a small blip and connections pile up in CLOSE_WAIT/TIME_WAIT and eat up ports.
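
A rough way to watch for that ephemeral-port pressure (purely a sketch; run it on the load balancer or a webhead, and the :389 match assumes plain LDAP):

    # Count connections to LDAP sitting in CLOSE_WAIT/TIME_WAIT.
    netstat -ant | awk '/:389 / && /(CLOSE_WAIT|TIME_WAIT)/' | wc -l
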
Comment 4

Hmm, LDAP pooling is on? I thought we got blocked on testing and never released it. Did that actually happen?
(Assignee)

Comment 5

6 years ago
Speaking of LDAP, we will need to switch the auth backend over to the new-style code from services.user.  With luck we can just pinch the [auth] section from the production config of some other product, since everything else is already running with a new-style backend.
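
If it helps, one low-tech way to lift that section is something like the following (the config path is a placeholder; it prints from the [auth] header through the next section header):

    sed -n '/^\[auth\]/,/^\[/p' /path/to/other-product/production.conf
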
Comment 6

(In reply to Toby Elliott [:telliott] from comment #4)
> Hmm, LDAP pooling is on? I thought we got blocked on testing and never
> released it. Did that actually happen?

0 fetep-x201(...dmins/puppet/weave/files/etc) % ack -aw ldap_use_pool sync.*
sync.storage.phx1/sync.conf
636:ldap_use_pool = true

sync.storage.scl2/sync.conf
1355:ldap_use_pool = true

sync.storage.stage/sync.conf
198:ldap_use_pool = true

I don't know if LDAP pooling has been tested with the new backend, though. It's easy enough to check whether it's working in stage (tcpdump for SYNs from the webheads to the LDAP VIP on port 389).
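
For the record, that tcpdump check might look roughly like this (interface name and VIP are placeholders):

    # Watch for new TCP connections (SYNs) from a webhead to the LDAP VIP.
    LDAP_VIP=10.0.0.1   # substitute the real LDAP VIP
    tcpdump -ni eth0 "tcp[tcpflags] & tcp-syn != 0 and dst host $LDAP_VIP and dst port 389"
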
(Assignee)

Comment 7

6 years ago
Marking dependency on Bug 718739 - this should be fixed already in server-core, but we'll confirm that as part of this deployment.
Depends on: 718739
(Assignee)

Comment 8

6 years ago
OK, per :petef's comments, let's remove step (3) above.  We can still build and test the results of the initial merge without sending them out to stage, as a sanity check that we haven't broken anything.
Whiteboard: [qa+]
(Assignee)

Comment 9

6 years ago
Linking to the metlog+sync loadtesting metabug.
Depends on: 724726, 729159
(Assignee)

Comment 10

6 years ago
Steps (1), (2) and (4) are now completed, skipping over step (3).  All production fixes have been merged back into their respective default branches and are passing basic tests.  I have tagged SERVER_CORE=rpm-2.9-1 and SERVER_STORAGE=rpm-1.12-1.

Next step is to install a fresh build of server-full onto qa1 and test it there.  Running `make build` should build the latest tags by default.  For completeness, here is the output I expect from the build and test run:

$ cd server-full
$
$ make build
 ...lots of noise elided...
$
$ (cd deps/server-core && hg summary)
parent: 940:23b6307f1e43 rpm-2.9-1
 Fix test error string to match previous commit.
branch: default
commit: (clean)
update: 1 new changesets (update)
$
$ (cd deps/server-storage && hg summary)
parent: 631:147dd35b5cec rpm-1.12-1
 Add some release notes for 1.12
branch: default
commit: (clean)
update: 1 new changesets (update)
$
$ make test
bin/nosetests -s --with-xunit deps/server-core/services/tests deps/server-reg/syncreg/tests deps/server-storage/syncstorage/tests
...............................................................................S.S..S.....................................F..........................................................
======================================================================
FAIL: test_password_reset_direct (syncreg.tests.functional.test_user.TestUser)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rfk/TEMP/server-full/deps/server-reg/syncreg/tests/functional/test_user.py", line 121, in test_password_reset_direct
    self.assertTrue(e.args[0].endswith(str(ERROR_INVALID_USER)))
AssertionError: False is not true

----------------------------------------------------------------------
Ran 181 tests in 165.642s

FAILED (SKIP=3, failures=1)
make: *** [test] Error 1


This single test failure is due to a bug in the latest release of server-reg (Bug 755464, fixed in trunk).  I think it's simpler to ignore it than to try to piggy-back a new server-reg release onto this deployment.
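
If that known failure gets noisy during repeated runs, one way to skip just that test locally is nose's --exclude option (a convenience only, not a substitute for the real fix):

    bin/nosetests -s --exclude=test_password_reset_direct \
        deps/server-core/services/tests \
        deps/server-reg/syncreg/tests \
        deps/server-storage/syncstorage/tests
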
Comment 11

Got this on my list for today (5/30/2012)...
Comment 12

My build was good.
My test results match yours.
Bringing up the sync server behind a web server for testing...
http://qa1.mtv1.dev.svc.mozilla.com/

Will sanity test the custom host across various devices, OS, accounts, etc.
Status: NEW → ASSIGNED
Comment 13

I am signing off on the initial tests of Sync 1.1 (no metlog) on qa1.
Next step is to get a load run on Sync stage after upgrading to 1.1 code.
This will be a "sanity" load test before adding metlog support.
(Assignee)

Comment 14

6 years ago
I have built rpms with the following command and attempted to deploy them to sync1.web.mtv1.dev for testing:

    make build_rpms CHANNEL=prod RPM_CHANNEL=prod PYPI=http://pypi.build.mtv1.svc.mozilla.com/simple PYPIEXTRAS=http://pypi.build.mtv1.svc.mozilla.com/extras PYPISTRICT=1 SERVER_CORE=rpm-2.9-1 SERVER_STORAGE=rpm-1.12-1

However, I get a hard failure inside the memcache client library when running the tests:

    $ nosetests -xs syncstorage.................................python26: libmemcached/io.cc:356: memcached_return_t memcached_io_read(memcached_server_st*, void*, size_t, ssize_t*): Assertion `0' failed.
    Aborted

I guess this is due to sync1.dev being a CentOS machine while the rpms were built on r6-build.  Is there a dev machine matching r6-build on which I can sanity-check the RPMs before we try to push them out to stage?
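
A quick sanity check before chasing this further, just to confirm the build host and the target box really do differ (generic commands, nothing bug-specific):

    cat /etc/redhat-release
    rpm -qa | grep -i memcached
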
Comment 15

(In reply to Ryan Kelly [:rfkelly] from comment #14)
> I guess this is due to sync1.dev being a CentOS machine while the rpms were
> built on r6-build.  Is there a dev machine matching r6-build on which I can
> sanity-check the RPMs before we try to push them out to stage?

centos5.build.mtv1.svc.m.c, next door to r6.
(Assignee)

Comment 16

6 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #14)
>
> 
>     $ nosetests -xs syncstorage.................................python26:
> libmemcached/io.cc:356: memcached_return_t
> memcached_io_read(memcached_server_st*, void*, size_t, ssize_t*): Assertion
> `0' failed.
>     Aborted

Hmm, same result with properly-built CentOS rpms.  Looks like this might just be a bug in libmemcached:  https://bugs.launchpad.net/libmemcached/+bug/810482
(Assignee)

Comment 17

6 years ago
Curious:  if I stop couchbase on sync1.dev and replace it with regular memcached then all the tests pass.  Restoring couchbase causes the failure to re-appear.  Possibly some incompatibility between libmemcached and couchbase on these machines?
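
For anyone reproducing this, the swap was roughly the following (service names are assumptions and may differ on the dev boxes):

    sudo /etc/init.d/couchbase-server stop
    sudo /etc/init.d/memcached start
    nosetests -xs syncstorage    # passes with plain memcached
    sudo /etc/init.d/memcached stop
    sudo /etc/init.d/couchbase-server start
    nosetests -xs syncstorage    # fails again with couchbase back in place
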
Comment 18

(In reply to Ryan Kelly [:rfkelly] from comment #17)
> Curious:  if I stop couchbase on sync1.dev and replace it with regular
> memcached then all the tests pass.  Restoring couchbase causes the failure
> to re-appear.  Possibly some incompatibility between libmemcached and
> couchbase on these machines?

Does this happen on an r6 webhead, too? sync4.web.mtv1.dev maybe?
(Assignee)

Comment 19

6 years ago
OK, turns out that couchbase runs on a non-standard port on the dev machines, so the failure on sync1 is probably due to it badly handling garbage on the port.  Configuring it to use port 11222 makes all tests pass on both sync1 and sync4.

I think this is ready to move into stage, will file a separate bug with the details.
Comment 20

The standard Couchbase-replacing-memcached port is 11222/tcp. In theory, we can permit 11211 to continue working, but in practice it ends up (correctly) exposing configs using the older port.
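
An easy way to confirm which cache port a given box is actually listening on (purely illustrative; expect 11222 for the Couchbase setup described above):

    netstat -lnt | grep -E ':(11211|11222)'
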
(Assignee)

Updated

6 years ago
Depends on: 761068
(Assignee)

Comment 21

5 years ago
Loadtesting of the pre-metlog code base in Bug 761068 has completed with no issues found, so I'm going to push ahead and tag a metlog-based release.  It will be the 1.13 release series and I'll link the bug here when it's ready.
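
The tagging itself should just be the usual dance, sketched here with an assumed tag name following the earlier rpm-X.Y-Z convention (the real tag may differ):

    cd server-storage
    hg update default
    hg tag rpm-1.13-1
    hg push
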
(Assignee)

Comment 22

5 years ago
Sync1.1+metlog tagged and Bug 770406 opened.
Depends on: 770406
Depends on: 775791
(Assignee)

Comment 23

5 years ago
Oh yeah, this finally happened :-)
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED