Closed Bug 1520091 Opened 6 years ago Closed 5 years ago

Please deploy and loadtest a syncstorage spanner node in stage

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rfkelly, Assigned: bobm)

As discussed in the app-services/ops catchup earlier today, we'd like to do an initial loadtest of the new syncstorage spanner backend in stage. I expect the procedure will go like this:

  • :benbangert merges [1] when he's happy with any final nits etc, and makes a new tag
  • :bobm figures out how to deploy an instance of it in stage
  • :bobm and :rbillings run the loadtests against it in "single-node" mode, that is, the mode where you specify the token signing secret directly in the URL hash fragment and avoid hitting the stage tokenserver.

ni? Ben for step (1) of the procedure.

[1] https://github.com/mozilla-services/server-syncstorage/pull/95
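For anyone unfamiliar with "single-node" mode, here is a minimal sketch of how a client could separate the node URL from the signing secret carried in the hash fragment. The function name `split_server_url` and the example host are hypothetical; the real test and loadtest code in server-syncstorage may parse the URL differently.

```python
from urllib.parse import urldefrag


def split_server_url(server_url):
    """Split 'https://host#secret' into (base_url, secret).

    Hypothetical helper: the fragment is never sent to the server,
    so it is a convenient place to pass the token signing secret
    to a test client.
    """
    base_url, secret = urldefrag(server_url)
    if not secret:
        raise ValueError("no secret found in the URL hash fragment")
    return base_url, secret


# Placeholder host and secret, not real values.
base_url, secret = split_server_url("https://sync.example.com#not-a-real-secret")
print(base_url)
print(secret)
```

With the secret in hand, the client can mint its own auth tokens and talk to the storage node directly, which is what lets the loadtest bypass the stage tokenserver.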

Flags: needinfo?(bbangert)
Assignee: nobody → bobm
Status: NEW → ASSIGNED

The PR has been merged.

:bobm, if you need any help/info for deploying, let me know. The table DSL is in queries_spanner at the top, and can be included in the Google Console when creating the database on the Spanner instance.

Assignee: bobm → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(bbangert)
Assignee: nobody → bobm
Status: NEW → ASSIGNED

I have begun creating the GCP project for this. I expect we should be ready to test by mid next week.

Great, how's that going?

Flags: needinfo?(bobm)

(In reply to Ben Bangert [:benbangert] from comment #3)

After many trials and tribulations, we are now getting the following traceback after starting the container:
Exception KeyError: KeyError(XXXXXXXXXXXXXXX,) in <module 'threading' from '/usr/local/lib/python2.7/threading.pyc'> ignored

The env is as follows:
GEVENT_MAX_BLOCKING_TIME=0
HOST=127.0.0.1
PORT=8000
SYNC_SETTINGS_FILE=/app/sync.ini
WEB_CONCURRENCY=1

Flags: needinfo?(bobm) → needinfo?(bbangert)

(In reply to Bob Micheletto [:bobm] from comment #4)

Ben's reply, as transcribed and paraphrased from IRC: gevent can conflict with some C extensions, and he believes one is in use. Scaling Spanner will require creating more processes and/or nodes. The worker class should be changed.

I have created a PR to make that configurable via the WORKER_CLASS environment variable.
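As a rough illustration of what "configurable via WORKER_CLASS" could look like, the sketch below builds a gunicorn command line from the environment shown in comment 4. It is not the code from the PR: the function name, the default of `sync`, and the use of `--paste` with SYNC_SETTINGS_FILE are assumptions made for the example.

```python
import os


def gunicorn_args(env=None):
    """Build a gunicorn argv from environment variables.

    Hypothetical sketch: defaults mirror the env listed in comment 4,
    plus an assumed default worker class of 'sync'.
    """
    env = os.environ if env is None else env
    host = env.get("HOST", "127.0.0.1")
    port = env.get("PORT", "8000")
    return [
        "gunicorn",
        "--bind", "%s:%s" % (host, port),
        "--workers", env.get("WEB_CONCURRENCY", "1"),
        # The knob this comment is about: swap gevent for another worker type.
        "--worker-class", env.get("WORKER_CLASS", "sync"),
        # Assumption: the app is loaded from a PasteDeploy ini file.
        "--paste", env.get("SYNC_SETTINGS_FILE", "/app/sync.ini"),
    ]


print(" ".join(gunicorn_args({"WORKER_CLASS": "sync", "WEB_CONCURRENCY": "4"})))
```

Keeping the worker class in the environment means the container image stays the same while ops try different worker types and process counts.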

Flags: needinfo?(bbangert)

The application has been deployed to stage, but quick verification runs (./bin/python ./syncstorage/tests/functional/test_storage.py https://stage.sync.nonprod.cloudops.mozgcp.net#SYNC_SECRET) show that it is not working, with errors like the following:

  code:  999   
  hostname:  "sync-stage-sync-app-1-canary-68fd5c5dd7-ldfdf"   
  method:  "DELETE"   
  metrics_device_id:  null   
  metrics_uid:  null   
  name:  "mozsvc.metrics"   
  op:  "mozsvc.metrics"   
  path:  "https://stage.sync.nonprod.cloudops.mozgcp.net/1.5/38862"   
  pid:  13

From #services-private scrollback, it looks like this GCP node is in the live node rotation in stage, meaning it is getting hit for things like TPS. To have that work correctly we will need to deploy an updated stage tokenserver with support for passing through raw FxA uid params, so I will tag a new release for that here:

https://github.com/mozilla-services/tokenserver/pull/127

Bug 1528821 is the tokenserver deployment request. Alternatively, we could remove the GCP node from the tokenserver's live rotation so that ordinary users of sync on stage do not see it by default.

Depends on: 1528821

Alternatively, we could remove the GCP node from the tokenserver's live rotation so that ordinary
users of sync on stage do not see it by default.

From related conversations in Slack, I think this is probably the best option. Can we update the GCP node's row in the tokenserver db to mark it as downed or similar, so that ordinary stage users will not get node-assigned to this untested node?

Flags: needinfo?(bobm)

(In reply to Ryan Kelly [:rfkelly] from comment #9)

Alternatively, we could remove the GCP node from the tokenserver's live rotation so that ordinary
users of sync on stage do not see it by default.

I've marked it down and in backoff mode:

| id | service | node                                           | available | current_load | capacity | downed | backoff |
+----+---------+------------------------------------------------+-----------+--------------+----------+--------+---------+
| 73 |       1 | https://stage.sync.nonprod.cloudops.mozgcp.net |   4999950 |           50 |  5000000 |      1 |       1 |

And just in case, I've unassigned any users that were on that node:

UPDATE users SET replaced_at = (UNIX_TIMESTAMP() * 1000) WHERE nodeid='73';
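To make the effect of those two changes concrete, here is a small self-contained sketch using an in-memory SQLite database. The two-table schema is a cut-down, hypothetical stand-in for the tokenserver tables, the second node is invented for contrast, and the timestamp is computed in Python because UNIX_TIMESTAMP() is MySQL-specific.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical versions of the tokenserver tables.
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, node TEXT,
                        downed INTEGER, backoff INTEGER);
    CREATE TABLE users (uid INTEGER PRIMARY KEY, nodeid INTEGER,
                        replaced_at INTEGER);
    INSERT INTO nodes VALUES (72, 'https://other-node.example.com', 0, 0);
    INSERT INTO nodes VALUES (73, 'https://stage.sync.nonprod.cloudops.mozgcp.net', 0, 0);
    INSERT INTO users VALUES (1, 72, NULL);
    INSERT INTO users VALUES (2, 73, NULL);
""")


def retire_node(conn, node_id, now_ms=None):
    """Mark a node downed/backoff and retire its user assignments."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    with conn:  # commit both statements together
        conn.execute(
            "UPDATE nodes SET downed = 1, backoff = 1 WHERE id = ?",
            (node_id,))
        conn.execute(
            "UPDATE users SET replaced_at = ? WHERE nodeid = ?",
            (now_ms, node_id))


retire_node(conn, 73)
active = conn.execute(
    "SELECT uid FROM users WHERE replaced_at IS NULL").fetchall()
print(active)
```

After the call, only the user on the other node still has a live assignment; anyone who was on node 73 would need a fresh assignment on their next token request.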

Flags: needinfo?(bobm)

The verification script run against our GCP node is not completing successfully. It fails on the bad cache test. Note that no memcached collections are configured on this node.

test_accessing_info_collections_with_an_expired_token (syncstorage.tests.functional.support.LiveTestCases) ... skipped ''
test_alternative_formats (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_app_newlines_when_payloads_contain_newlines (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_bad_cache (syncstorage.tests.functional.support.LiveTestCases) ... ERROR
test_batch_empty_commit (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_id_is_correctly_scoped_to_a_collection (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_id_is_correctly_scoped_to_a_user (syncstorage.tests.functional.support.LiveTestCases) ... skipped 'Skipped when testing a live server'
test_batch_partial_update (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_size_limits (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_ttl_is_based_on_commit_timestamp (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_ttl_update (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_uploads_properly_update_info_collections (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_with_failing_bsos (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_with_immediate_commit (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batches (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_bulk_update_of_ttls_without_sending_data (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_collection_usage (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_create_bso_with_null_ttl (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_delete_collection_items (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_delete_item (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_delete_storage (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_get_collection (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_get_collection_count (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_get_collection_ttl (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_get_info_collections (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_get_item (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_guid_deletion (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_handling_of_invalid_bso_fields (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_handling_of_invalid_json_in_bso_uploads (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_if_modified_since_on_info_views (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_ifunmodifiedsince (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_meta_global_sanity (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_multi_item_post_limits (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_overquota (syncstorage.tests.functional.support.LiveTestCases) ... skipped ''
test_pagination_with_newer_and_sort_by_oldest (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_pagination_with_older_and_sort_by_newest (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_quota (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_rejection_of_known_bad_payloads (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_set_collection (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_set_collection_input_formats (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_set_collection_with_if_modified_since (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_set_item (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_set_item_input_formats (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_specifying_ids_with_percent_encoded_query_string (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_strict_newer (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_strict_older (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_that_404_responses_have_a_json_body (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_that_batch_deletes_are_limited_to_max_number_of_ids (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_that_batch_gets_are_limited_to_max_number_of_ids (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_that_expired_items_can_be_overwritten_via_PUT (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_that_internal_server_fields_are_not_echoed (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_that_negative_integer_fields_are_not_accepted (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_that_we_dont_resurrect_committed_batches (syncstorage.tests.functional.support.LiveTestCases) ... skipped 'failed to trigger re-use of batchid'
test_that_x_last_modified_is_sent_for_all_get_requests (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_timestamp_numbers_are_decimals (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_update_of_ttl_without_sending_data (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_users_with_the_same_batch_id_get_separate_data (syncstorage.tests.functional.support.LiveTestCases) ... skipped 'Skipped when testing a live server'
test_we_dont_need_no_stinkin_batches (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_weird_args (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_x_timestamp_header (syncstorage.tests.functional.support.LiveTestCases) ... skipped ''

======================================================================
ERROR: test_bad_cache (syncstorage.tests.functional.support.LiveTestCases)

Traceback (most recent call last):
File "./syncstorage/tests/functional/test_storage.py", line 131, in test_bad_cache
resp = self.app.get(self.root + '/info/collections')
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 322, in get
expect_errors=expect_errors)
File "/app/syncstorage/tests/functional/support.py", line 44, in new_do_request
return orig_do_request(req, *args, **kwds)
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 631, in do_request
self._check_status(status, res)
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 663, in _check_status
res)
webtest.app.AppError: Bad response: 503 Service Temporarily Unavailable (not 200 OK or 3xx redirect for https://stage.sync.nonprod.cloudops.mozgcp.net/1.5/55859/info/collections)
'<html>\n<head><title>503 Service Unavailable</title></head>\n<body bgcolor="white">\n<center><h1>503 Service Cloudy: Try again later. </h1></center>\n</body>\n</html>\n'


Ran 60 tests in 260.693s

FAILED (errors=1, skipped=6)
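With 60 result lines per run, it can help to tally them mechanically when comparing runs. The following is a small sketch, assuming the verbose unittest line format shown above; the helper name `tally` is made up for this example.

```python
import re
from collections import Counter

# Matches lines such as:
#   test_bad_cache (pkg.module.Class) ... ERROR
RESULT_LINE = re.compile(
    r"^(test\w+) \(([\w.]+)\) \.\.\. (ok|ERROR|FAIL|skipped)\b")


def tally(output):
    """Return (counts per status, names of failing/erroring tests)."""
    counts = Counter()
    problems = []
    for line in output.splitlines():
        match = RESULT_LINE.match(line.strip())
        if not match:
            continue  # tracebacks, separators, summary lines
        name, _cls, status = match.groups()
        counts[status] += 1
        if status in ("ERROR", "FAIL"):
            problems.append(name)
    return counts, problems


sample = """\
test_alternative_formats (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_bad_cache (syncstorage.tests.functional.support.LiveTestCases) ... ERROR
test_overquota (syncstorage.tests.functional.support.LiveTestCases) ... skipped ''
"""
counts, problems = tally(sample)
print(dict(counts), problems)
```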

(In reply to Bob Micheletto [:bobm] from comment #11)

Here's the associated Gunicorn traceback:

error: "TypeError("'Type' object is not iterable",)"
message: "Error handling request /1.5/55859/info/collections"
name: "gunicorn.error"
op: "gunicorn.error"
pid: 16
traceback: "Uncaught exception:
File "/usr/local/lib/python2.7/site-packages/gunicorn/workers/sync.py", line 135, in handle
self.handle_request(listener, req, client, addr)
File "/usr/local/lib/python2.7/site-packages/gunicorn/workers/sync.py", line 176, in handle_request
respiter = self.wsgi(environ, resp.start_response)
File "/usr/local/lib/python2.7/site-packages/pyramid/router.py", line 270, in __call__
response = self.execution_policy(environ, self)
File "/usr/local/lib/python2.7/site-packages/pyramid/router.py", line 278, in default_execution_policy
return request.invoke_exception_view(reraise=True)
File "/usr/local/lib/python2.7/site-packages/pyramid/view.py", line 755, in invoke_exception_view
reraise_(*exc_info)
File "/usr/local/lib/python2.7/site-packages/pyramid/router.py", line 276, in default_execution_policy
return router.invoke_request(request)
File "/usr/local/lib/python2.7/site-packages/pyramid/router.py", line 249, in invoke_request
response = handle_request(request)
File "/app/syncstorage/tweens.py", line 128, in convert_non_json_responses_tween
response = handler(request)
File "/app/syncstorage/tweens.py", line 104, in convert_cornice_errors_to_respcodes_tween
response = handler(request)
File "/app/syncstorage/tweens.py", line 58, in set_default_accept_header_tween
return handler(request)
File "/app/syncstorage/tweens.py", line 30, in set_x_timestamp_header_tween
response = handler(request)
File "/usr/local/lib/python2.7/site-packages/mozsvc/tweens.py", line 94, in fuzz_backoff_headers_tween
response = handler(request)
File "/usr/local/lib/python2.7/site-packages/mozsvc/tweens.py", line 59, in log_uncaught_exceptions_tween
return handler(request)
File "/usr/local/lib/python2.7/site-packages/mozsvc/tweens.py", line 26, in catch_backend_errors_tween
return handler(request)
File "/usr/local/lib/python2.7/site-packages/pyramid/tweens.py", line 41, in excview_tween
response = _error_handler(request, exc)
File "/usr/local/lib/python2.7/site-packages/pyramid/tweens.py", line 16, in _error_handler
reraise(*exc_info)
File "/usr/local/lib/python2.7/site-packages/pyramid/tweens.py", line 39, in excview_tween
response = handler(request)
File "/usr/local/lib/python2.7/site-packages/pyramid/router.py", line 156, in handle_request
view_name
File "/usr/local/lib/python2.7/site-packages/pyramid/view.py", line 642, in _call_view
response = view_callable(context, request)
File "/usr/local/lib/python2.7/site-packages/pyramid/config/views.py", line 181, in __call__
return view(context, request)
File "/usr/local/lib/python2.7/site-packages/pyramid/viewderivers.py", line 390, in attr_view
return view(context, request)
File "/usr/local/lib/python2.7/site-packages/pyramid/viewderivers.py", line 368, in predicate_wrapper
return view(context, request)
File "/usr/local/lib/python2.7/site-packages/pyramid/viewderivers.py", line 301, in secured_view
return view(context, request)
File "/usr/local/lib/python2.7/site-packages/pyramid/viewderivers.py", line 439, in rendered_view
result = view(context, request)
File "/usr/local/lib/python2.7/site-packages/pyramid/viewderivers.py", line 148, in requestonly_view
response = view(request)
File "/usr/local/lib/python2.7/site-packages/cornice/service.py", line 514, in wrapper
response = view_(request)
File "/app/syncstorage/views/util.py", line 54, in wrapper
return decorator_func(target_func, *args, **kwds)
File "/app/syncstorage/views/decorators.py", line 40, in convert_storage_errors
return viewfunc(request)
File "/app/syncstorage/views/util.py", line 54, in wrapper
return decorator_func(target_func, *args, **kwds)
File "/app/syncstorage/views/decorators.py", line 71, in sleep_and_retry_on_conflict
return viewfunc(request)
File "/app/syncstorage/views/util.py", line 54, in wrapper
return decorator_func(target_func, *args, **kwds)
File "/app/syncstorage/views/decorators.py", line 183, in with_collection_lock
return viewfunc(request)
File "/app/syncstorage/views/util.py", line 54, in wrapper
return decorator_func(target_func, *args, **kwds)
File "/app/syncstorage/views/decorators.py", line 164, in check_precondition_headers
return viewfunc(request)
File "/app/syncstorage/views/util.py", line 54, in wrapper
return decorator_func(target_func, *args, **kwds)
File "/app/syncstorage/views/decorators.py", line 94, in check_storage_quota
return viewfunc(request)
File "/app/syncstorage/views/__init__.py", line 170, in get_info_timestamps
timestamps = storage.get_collection_timestamps(request.user)
File "/app/syncstorage/storage/spanner.py", line 77, in with_session_wrapper
return func(self, self._tldata.session, *args, **kwds)
File "/app/syncstorage/storage/spanner.py", line 221, in get_collection_timestamps
res = self._map_collection_names(res)
File "/app/syncstorage/storage/spanner.py", line 940, in _map_collection_names
names = self._load_collection_names(collection_ids)
File "/app/syncstorage/storage/spanner.py", line 921, in _load_collection_names
for id, name in uncached_names:
File "/usr/local/lib/python2.7/site-packages/google/cloud/spanner_v1/streamed.py", line 143, in __iter__
self._consume_next()
File "/usr/local/lib/python2.7/site-packages/google/cloud/spanner_v1/streamed.py", line 116, in _consume_next
response = six.next(self._response_iterator)
File "/usr/local/lib/python2.7/site-packages/google/cloud/spanner_v1/snapshot.py", line 42, in _restart_on_unavailable
iterator = restart()
File "/usr/local/lib/python2.7/site-packages/google/cloud/spanner_v1/gapic/spanner_client.py", line 787, in execute_streaming_sql
seqno=seqno,
File "/usr/local/lib/python2.7/site-packages/google/protobuf/internal/python_message.py", line 496, in init
for key in field_value:
<type 'exceptions.TypeError'>
TypeError("'Type' object is not iterable",)

Bob, I've just merged a fix for the test_bad_cache failure into master. Support for a dockerflow /lbheartbeat endpoint was also recently added.

(In reply to Philip Jenvey [:pjenvey] from comment #13)

Bob, I've just merged a fix for the test_bad_cache failure into master. Support for a dockerflow /lbheartbeat endpoint was also recently added.

We've switched to the lbheartbeat endpoint for the health checks (liveness and readiness Kubernetes probes). And it's working.
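For reference, a probe of this kind boils down to "GET the heartbeat path and expect a 200". Below is a minimal sketch in Python of that check, not the Kubernetes probe configuration itself. The function names are hypothetical; the default path follows the Dockerflow convention of `/__lbheartbeat__`.

```python
import urllib.error
import urllib.request


def heartbeat_url(base_url, path="/__lbheartbeat__"):
    """Join the node URL and the heartbeat path without doubling slashes."""
    return base_url.rstrip("/") + path


def is_alive(base_url, path="/__lbheartbeat__", timeout=5):
    """Return True only if the heartbeat endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(heartbeat_url(base_url, path),
                                    timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection errors, timeouts, and non-2xx responses
        # such as the 503s seen elsewhere in this bug.
        return False


print(heartbeat_url("https://stage.sync.nonprod.cloudops.mozgcp.net/"))
```

A check like this only says the web process is up and answering; it does not exercise Spanner, which is why the functional tests can still fail while the probes pass.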

However, I'm running the functional tests in the latest container and still getting that error, and a few other errors. Is there a specific tag I should be using?

Here are the tests that failed:
test_bad_cache (syncstorage.tests.functional.support.LiveTestCases) ... ERROR
test_batch_uploads_properly_update_info_collections (syncstorage.tests.functional.support.LiveTestCases) ... ERROR
test_get_collection_count (syncstorage.tests.functional.support.LiveTestCases) ... ERROR
test_if_modified_since_on_info_views (syncstorage.tests.functional.support.LiveTestCases) ... ERROR

And the associated tracebacks:

======================================================================
ERROR: test_bad_cache (syncstorage.tests.functional.support.LiveTestCases)

Traceback (most recent call last):
File "./syncstorage/tests/functional/test_storage.py", line 131, in test_bad_cache
resp = self.app.get(self.root + '/info/collections')
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 322, in get
expect_errors=expect_errors)
File "/app/syncstorage/tests/functional/support.py", line 44, in new_do_request
return orig_do_request(req, *args, **kwds)
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 631, in do_request
self._check_status(status, res)
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 663, in _check_status
res)
webtest.app.AppError: Bad response: 503 Service Temporarily Unavailable (not 200 OK or 3xx redirect for https://stage.sync.nonprod.cloudops.mozgcp.net/1.5/56890/info/collections)
'<html>\n<head><title>503 Service Unavailable</title></head>\n<body bgcolor="white">\n<center><h1>503 Service Cloudy: Try again later. </h1></center>\n</body>\n</html>\n'

======================================================================
ERROR: test_batch_uploads_properly_update_info_collections (syncstorage.tests.functional.support.LiveTestCases)

Traceback (most recent call last):
File "./syncstorage/tests/functional/test_storage.py", line 1772, in test_batch_uploads_properly_update_info_collections
resp = self.app.get(self.root + '/info/collections')
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 322, in get
expect_errors=expect_errors)
File "/app/syncstorage/tests/functional/support.py", line 44, in new_do_request
return orig_do_request(req, *args, **kwds)
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 631, in do_request
self._check_status(status, res)
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 663, in _check_status
res)
webtest.app.AppError: Bad response: 503 Service Temporarily Unavailable (not 200 OK or 3xx redirect for https://stage.sync.nonprod.cloudops.mozgcp.net/1.5/80621/info/collections)
'<html>\n<head><title>503 Service Unavailable</title></head>\n<body bgcolor="white">\n<center><h1>503 Service Cloudy: Try again later. </h1></center>\n</body>\n</html>\n'

======================================================================
ERROR: test_get_collection_count (syncstorage.tests.functional.support.LiveTestCases)

Traceback (most recent call last):
File "./syncstorage/tests/functional/test_storage.py", line 110, in test_get_collection_count
resp = self.app.get(self.root + '/info/collection_counts')
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 322, in get
expect_errors=expect_errors)
File "/app/syncstorage/tests/functional/support.py", line 44, in new_do_request
return orig_do_request(req, *args, **kwds)
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 631, in do_request
self._check_status(status, res)
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 663, in _check_status
res)
webtest.app.AppError: Bad response: 503 Service Temporarily Unavailable (not 200 OK or 3xx redirect for https://stage.sync.nonprod.cloudops.mozgcp.net/1.5/13360/info/collection_counts)
'<html>\n<head><title>503 Service Unavailable</title></head>\n<body bgcolor="white">\n<center><h1>503 Service Cloudy: Try again later. </h1></center>\n</body>\n</html>\n'

======================================================================
ERROR: test_if_modified_since_on_info_views (syncstorage.tests.functional.support.LiveTestCases)

Traceback (most recent call last):
File "./syncstorage/tests/functional/test_storage.py", line 1141, in test_if_modified_since_on_info_views
self.app.get(self.root + view, headers=headers, status=200)
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 322, in get
expect_errors=expect_errors)
File "/app/syncstorage/tests/functional/support.py", line 44, in new_do_request
return orig_do_request(req, *args, **kwds)
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 631, in do_request
self._check_status(status, res)
File "/usr/local/lib/python2.7/site-packages/webtest/app.py", line 666, in _check_status
"Bad response: %s (not %s)", res_status, status)
webtest.app.AppError: Bad response: 503 Service Temporarily Unavailable (not 200)

Are you deploying from master? The current commit should be 1fb73a5998f06020ca7954192666460c5dc98120.

However, I'm running the functional tests in the latest container and still getting that error,

I believe we should be able to pull the docker container and run these same tests locally, right? Could you please provide the command you're running and we can have a shot at reproducing locally.

Flags: needinfo?(bobm)

(In reply to Ryan Kelly [:rfkelly] from comment #15)

Are you deploying from master? The current commit should be 1fb73a5998f06020ca7954192666460c5dc98120.

I didn't see tags cut along with the latest changes (most recent tag is v1.6.14), so we've been using the latest container from dockerhub.

I believe we should be able to pull the docker container and run these same tests locally, right? Could you please provide the command you're running and we can have a shot at reproducing locally.

$ docker pull mozilla/server-syncstorage:latest
$ docker run -it --entrypoint /bin/sh mozilla/server-syncstorage
~ $ /usr/local/bin/python ./syncstorage/tests/functional/test_storage.py https://stage.sync.nonprod.cloudops.mozgcp.net#SECRET
Flags: needinfo?(bobm)

As mentioned on IRC, we should use tags to coordinate this. I've pushed a "dev-spanner-0.1.2" tag to try it out; hopefully it will build in CI etc. (I didn't bother actually updating the version number in setup.py or anything like that, but I can if you feel like there's value in it).

(In reply to Ryan Kelly [:rfkelly] from comment #17)

As mentioned on IRC, we should use tags to coordinate this. I've pushed a "dev-spanner-0.1.2" tag to try it out; hopefully it will build in CI etc. (I didn't bother actually updating the version number in setup.py or anything like that, but I can if you feel like there's value in it).

Success!

test_accessing_info_collections_with_an_expired_token (syncstorage.tests.functional.support.LiveTestCases) ... skipped ''
test_alternative_formats (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_app_newlines_when_payloads_contain_newlines (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_bad_cache (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_empty_commit (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_id_is_correctly_scoped_to_a_collection (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_id_is_correctly_scoped_to_a_user (syncstorage.tests.functional.support.LiveTestCases) ... skipped 'Skipped when testing a live server'
test_batch_partial_update (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_size_limits (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_ttl_is_based_on_commit_timestamp (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_ttl_update (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_uploads_properly_update_info_collections (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_with_failing_bsos (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batch_with_immediate_commit (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_batches (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_bulk_update_of_ttls_without_sending_data (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_collection_usage (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_create_bso_with_null_ttl (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_delete_collection_items (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_delete_item (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_delete_storage (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_get_collection (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_get_collection_count (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_get_collection_ttl (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_get_info_collections (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_get_item (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_guid_deletion (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_handling_of_invalid_bso_fields (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_handling_of_invalid_json_in_bso_uploads (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_if_modified_since_on_info_views (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_ifunmodifiedsince (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_meta_global_sanity (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_multi_item_post_limits (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_overquota (syncstorage.tests.functional.support.LiveTestCases) ... skipped ''
test_pagination_with_newer_and_sort_by_oldest (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_pagination_with_older_and_sort_by_newest (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_quota (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_rejection_of_known_bad_payloads (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_set_collection (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_set_collection_input_formats (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_set_collection_with_if_modified_since (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_set_item (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_set_item_input_formats (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_specifying_ids_with_percent_encoded_query_string (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_strict_newer (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_strict_older (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_that_404_responses_have_a_json_body (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_that_batch_deletes_are_limited_to_max_number_of_ids (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_that_batch_gets_are_limited_to_max_number_of_ids (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_that_expired_items_can_be_overwritten_via_PUT (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_that_internal_server_fields_are_not_echoed (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_that_negative_integer_fields_are_not_accepted (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_that_we_dont_resurrect_committed_batches (syncstorage.tests.functional.support.LiveTestCases) ... skipped 'failed to trigger re-use of batchid'
test_that_x_last_modified_is_sent_for_all_get_requests (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_timestamp_numbers_are_decimals (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_update_of_ttl_without_sending_data (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_users_with_the_same_batch_id_get_separate_data (syncstorage.tests.functional.support.LiveTestCases) ... skipped 'Skipped when testing a live server'
test_we_dont_need_no_stinkin_batches (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_weird_args (syncstorage.tests.functional.support.LiveTestCases) ... ok
test_x_timestamp_header (syncstorage.tests.functional.support.LiveTestCases) ... skipped ''


Ran 60 tests in 279.528s

OK (skipped=6)

I think this is ready for load testing now.

Flags: needinfo?(rbillings)

Ops and QA have made several unsuccessful attempts to establish a running load test against this installation. Here are some example errors and tracebacks:

From the load test:

$ make bench SERVER_URL=https://SERVER#SECRET
./bin/loads-runner --config=./config/bench.ini --server-url=https://SERVER#SECRET stress.StressTest.test_storage_session
./server-syncstorage/loadtest/lib/python2.7/site-packages/zmq/green/eventloop/ioloop.py:1: VisibleDeprecationWarning: zmq.eventloop.minitornado is deprecated in pyzmq 14.0 and will be removed.
Install tornado itself to use zmq with the tornado IOLoop.

from zmq.eventloop.ioloop import *
[===================================================================================] 100%

Duration: 10.10 seconds
Hits: 1426
Started: 2019-02-27 17:15:46.520335
Approximate Average RPS: 141
Average request time: 0.14s
Opened web sockets: 0
Bytes received via web sockets : 0

Success: 0
Errors: 0
Failures: 1426

1 occurrences of:
AssertionError: False is not true Traceback:
File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
testMethod()
File "stress.py", line 132, in test_storage_session
self.assertTrue(response.status_code in (200, 404))
File "/usr/lib64/python2.7/unittest/case.py", line 462, in assertTrue
raise self.failureException(msg)
1 occurrences of:
AssertionError: False is not true Traceback:
File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
testMethod()
File "stress.py", line 132, in test_storage_session
self.assertTrue(response.status_code in (200, 404))
File "/usr/lib64/python2.7/unittest/case.py", line 462, in assertTrue
raise self.failureException(msg)
1 occurrences of:
AssertionError: False is not true Traceback:
File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
testMethod()
File "stress.py", line 132, in test_storage_session
self.assertTrue(response.status_code in (200, 404))
File "/usr/lib64/python2.7/unittest/case.py", line 462, in assertTrue
raise self.failureException(msg)
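
The three identical `False is not true` failures above do not show which status code the server returned. A minimal sketch of an assertion that carries the status code in its message follows; the `StatusCheck` class and `check_status` helper are hypothetical and are not part of stress.py:

```python
import unittest

# Status codes the original assertion in stress.py accepts.
ALLOWED_STATUSES = (200, 404)


class StatusCheck(unittest.TestCase):
    """Hypothetical helper, not taken from stress.py."""

    def check_status(self, status_code):
        # assertIn reports the offending value, unlike assertTrue(x in y),
        # which only says "False is not true".
        self.assertIn(
            status_code,
            ALLOWED_STATUSES,
            "unexpected HTTP status %r" % (status_code,),
        )

    def runTest(self):
        # Present only so the class can be instantiated directly.
        pass
```

With a check like this, a 401 or 503 from the server would appear in the loads-runner failure summary instead of a bare `AssertionError`.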

Flags: needinfo?(rbillings)

From the server:

error: "KeyError('fxa_uid',)"
hostname: "sync-stage-sync-app-1-546f9f96dd-5w2jg"
message: "Error handling request /1.5/804108/info/collections"
name: "gunicorn.error"
op: "gunicorn.error"
pid: 34114
traceback: "Uncaught exception:
File "/usr/local/lib/python2.7/site-packages/gunicorn/workers/sync.py", line 135, in handle
self.handle_request(listener, req, client, addr)
File "/usr/local/lib/python2.7/site-packages/gunicorn/workers/sync.py", line 176, in handle_request
respiter = self.wsgi(environ, resp.start_response)
File "/usr/local/lib/python2.7/site-packages/pyramid/router.py", line 270, in __call__
response = self.execution_policy(environ, self)
File "/usr/local/lib/python2.7/site-packages/pyramid/router.py", line 278, in default_execution_policy
return request.invoke_exception_view(reraise=True)
File "/usr/local/lib/python2.7/site-packages/pyramid/view.py", line 755, in invoke_exception_view
reraise_(*exc_info)
File "/usr/local/lib/python2.7/site-packages/pyramid/router.py", line 276, in default_execution_policy
return router.invoke_request(request)
File "/usr/local/lib/python2.7/site-packages/pyramid/router.py", line 249, in invoke_request
response = handle_request(request)
File "/app/syncstorage/tweens.py", line 128, in convert_non_json_responses_tween
response = handler(request)
File "/app/syncstorage/tweens.py", line 104, in convert_cornice_errors_to_respcodes_tween
response = handler(request)
File "/app/syncstorage/tweens.py", line 58, in set_default_accept_header_tween
return handler(request)
File "/app/syncstorage/tweens.py", line 30, in set_x_timestamp_header_tween
response = handler(request)
File "/usr/local/lib/python2.7/site-packages/mozsvc/tweens.py", line 94, in fuzz_backoff_headers_tween
response = handler(request)
File "/usr/local/lib/python2.7/site-packages/mozsvc/tweens.py", line 59, in log_uncaught_exceptions_tween
return handler(request)
File "/usr/local/lib/python2.7/site-packages/mozsvc/tweens.py", line 26, in catch_backend_errors_tween
return handler(request)
File "/usr/local/lib/python2.7/site-packages/pyramid/tweens.py", line 41, in excview_tween
response = _error_handler(request, exc)
File "/usr/local/lib/python2.7/site-packages/pyramid/tweens.py", line 16, in _error_handler
reraise(*exc_info)
File "/usr/local/lib/python2.7/site-packages/pyramid/tweens.py", line 39, in excview_tween
response = handler(request)
File "/usr/local/lib/python2.7/site-packages/pyramid/router.py", line 156, in handle_request
view_name
File "/usr/local/lib/python2.7/site-packages/pyramid/view.py", line 642, in _call_view
response = view_callable(context, request)
File "/usr/local/lib/python2.7/site-packages/pyramid/config/views.py", line 181, in __call__
return view(context, request)
File "/usr/local/lib/python2.7/site-packages/pyramid/viewderivers.py", line 390, in attr_view
return view(context, request)
File "/usr/local/lib/python2.7/site-packages/pyramid/viewderivers.py", line 368, in predicate_wrapper
return view(context, request)
File "/usr/local/lib/python2.7/site-packages/pyramid/viewderivers.py", line 301, in secured_view
return view(context, request)
File "/usr/local/lib/python2.7/site-packages/pyramid/viewderivers.py", line 439, in rendered_view
result = view(context, request)
File "/usr/local/lib/python2.7/site-packages/pyramid/viewderivers.py", line 148, in requestonly_view
response = view(request)
File "/usr/local/lib/python2.7/site-packages/cornice/service.py", line 514, in wrapper
response = view_(request)
File "/app/syncstorage/views/util.py", line 54, in wrapper
return decorator_func(target_func, *args, **kwds)
File "/app/syncstorage/views/decorators.py", line 40, in convert_storage_errors
return viewfunc(request)
File "/app/syncstorage/views/util.py", line 54, in wrapper
return decorator_func(target_func, *args, **kwds)
File "/app/syncstorage/views/decorators.py", line 71, in sleep_and_retry_on_conflict
return viewfunc(request)
File "/app/syncstorage/views/util.py", line 54, in wrapper
return decorator_func(target_func, *args, **kwds)
File "/app/syncstorage/views/decorators.py", line 183, in with_collection_lock
return viewfunc(request)
File "/app/syncstorage/views/util.py", line 54, in wrapper
return decorator_func(target_func, *args, **kwds)
File "/app/syncstorage/views/decorators.py", line 164, in check_precondition_headers
return viewfunc(request)
File "/app/syncstorage/views/util.py", line 54, in wrapper
return decorator_func(target_func, *args, **kwds)
File "/app/syncstorage/views/decorators.py", line 94, in check_storage_quota
return viewfunc(request)
File "/app/syncstorage/views/__init__.py", line 189, in get_info_timestamps
timestamps = storage.get_collection_timestamps(request.user)
File "/app/syncstorage/storage/spanner.py", line 77, in with_session_wrapper
return func(self, self._tldata.session, *args, **kwds)
File "/app/syncstorage/storage/spanner.py", line 215, in get_collection_timestamps
userid = user_key(user)
File "/app/syncstorage/storage/spanner.py", line 89, in user_key
return ":".join([user["fxa_uid"], user["fxa_kid"]])
<type 'exceptions.KeyError'>
KeyError('fxa_uid',)
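
The last frames show `user_key` indexing `user["fxa_uid"]` on a user dict that has no such key. A minimal sketch of that failure follows. The `user_key` body is copied from the traceback line; `missing_user_fields` and the sample dicts are hypothetical and only illustrate how the missing fields could be reported:

```python
def user_key(user):
    # Same expression as spanner.py line 89 in the traceback above.
    return ":".join([user["fxa_uid"], user["fxa_kid"]])


def missing_user_fields(user, required=("fxa_uid", "fxa_kid")):
    # Hypothetical helper: list the required fields absent from the user dict.
    return [name for name in required if name not in user]


# Assumed shape of a user dict that carries the FxA fields.
complete_user = {"uid": 804108, "fxa_uid": "abc", "fxa_kid": "123"}
# Assumed shape of a user dict without them, as the KeyError suggests.
incomplete_user = {"uid": 804108}

print(user_key(complete_user))
print(missing_user_fields(incomplete_user))
```

Calling `user_key(incomplete_user)` raises `KeyError: 'fxa_uid'`, matching the error logged by the server.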

The load test attempts above were run from the loadtest directory that was formerly part of the server-syncstorage git repository. The correct load tests live in the syncstorage-loadtest git repository:

$ git clone git@github.com:mozilla-services/syncstorage-loadtest.git
$ cd syncstorage-loadtest
$ git checkout configurable-server-url
$ cat > Dockerfile << EOF
FROM python:3.5-alpine

WORKDIR /app

ENTRYPOINT ["/bin/sh"]

# install / cache dependencies first
COPY requirements.txt /app/requirements.txt

RUN apk add --update build-base ca-certificates git && \
    pip install -r requirements.txt

# Copy in the whole app after dependencies have been installed & cached
COPY . /app
EOF
$ docker build --no-cache -t syncloadtest:url -f Dockerfile .
$ docker run -it --entrypoint /bin/sh syncloadtest:url
/app # export SERVER_URL=https://SERVERURL#SECRET
/app # molotov --max-runs 5 -cxv loadtest.py
**** Molotov v1.7. Happy breaking! ****
Preparing 1 worker...
OK
SUCCESSES: 5 | FAILURES: 0 | WORKERS: 1
*** Bye ***

agent: "Python/3.5 aiohttp/3.5.4"
code: 200
method: "POST"
metrics_device_id: null
name: "mozsvc.metrics"
op: "mozsvc.metrics"
path: "https://stage.sync.nonprod.cloudops.mozgcp.net/1.5/545490/storage/clients"
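
The `SERVER_URL` value used above packs the token signing secret into the URL hash fragment. A minimal sketch of separating the two parts with the standard library follows; `split_server_url` is a hypothetical name, and how loadtest.py itself parses the value is not shown in this bug:

```python
from urllib.parse import urldefrag


def split_server_url(server_url):
    # urldefrag returns the URL without its fragment, plus the fragment text.
    url, secret = urldefrag(server_url)
    return url, secret


# Placeholder host and secret, not real values.
url, secret = split_server_url("https://example.com#SECRET")
print(url)
print(secret)
```

Because the fragment is never sent to the server in an HTTP request, the secret stays on the client side and is used only to mint tokens locally.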

Load tests are now running successfully on the Google Spanner node. The next step is to create a dashboard for results and re-run the tests.

(In reply to Rebecca Billings [:rbillings] from comment #23)

Load tests are now running successfully on the Google Spanner node. The next step is to create a dashboard for results and re-run the tests.

Dashboard created. Please re-run load tests.

Flags: needinfo?(rbillings)
Flags: needinfo?(rbillings)

Load tests have been re-run. Work is still in progress as we learn how to ramp up workers to better match regular Sync nodes.

The dev-spanner-0.1.12 tag has been deployed, and the BsoLastModified index has been modified. :rbillings, can you run another load test against this?

Flags: needinfo?(rbillings)

This morning's test run: bin/molotov -p 4 -w 200 -d 3900
SUCCESSES: 578882 | FAILURES: 7239 | WORKERS: 024
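
For reference, a rough sketch of the failure rate and throughput implied by these counts. It assumes `-d 3900` is the run duration in seconds and that the counters cover the whole run across all processes; `summarize_run` is a hypothetical helper, not part of molotov:

```python
def summarize_run(successes, failures, duration_s):
    # Total scenario runs reported by the load test.
    total = successes + failures
    return {
        "total": total,
        "failure_rate": failures / total,
        "runs_per_second": total / duration_s,
    }


summary = summarize_run(successes=578882, failures=7239, duration_s=3900)
print("total runs: %d" % summary["total"])
print("failure rate: %.2f%%" % (100 * summary["failure_rate"]))
print("runs per second: %.1f" % summary["runs_per_second"])
```

Under those assumptions the run works out to roughly 1.2% failures at about 150 scenario runs per second.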

Flags: needinfo?(rbillings)

Closing this bug.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED