Closed Bug 1059025 Opened 11 years ago Closed 11 years ago

Please deploy server-syncstorage 1.5.8 with TokuDB to stage

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Unassigned)

References

Details

(Whiteboard: [qa+])

Let's take another run at deploying some TokuDB nodes, after aborting the previous attempt in Bug 1055368. Our attempts to reproduce the errors from that bug have failed. Maybe it has magically fixed itself and we can get these nodes through to prod :-)

:bobm, please do a clean deploy of TokuDB nodes to stage, running the latest production-ready tag of syncstorage (1.5.8).

:jbonacci, let's give this the full "taking it through to production" QA treatment and we'll see what shows up this time around...
Depends on: 1057892
If Stage survives the TokuDB onslaught, I can open a companion Prod ticket...
Status: NEW → ASSIGNED
Whiteboard: [qa+]
(In reply to James Bonacci [:jbonacci] from comment #1)
> If Stage survives the TokuDB onslaught, I can open a companion Prod ticket...

Deployed to Stage.
Verified four new instances:
ec2-174-129-114-111
ec2-54-224-103-67
ec2-54-83-130-1
ec2-107-22-37-43

Running this code: server-syncstorage 1.5.8-1.el6 x86_64 41925251

This time, I will skip the quick tests (because why MOAR 400s, 412s, 415s, and 304s...). Moving to 30min parallel load tests.
Started with the most basic "load test": running 'make test' on each host. That was successful. Backtracked and ran the quick (remote integration) tests on each host and on TS Stage. These were all successful - saw the usual 304s, 400s, 412s, and 415s go by in each access.log file. Starting two 30min load tests now. Will run the other two a bit later, then move on to a 30-minute and a 60-minute combined load test.
The first two load tests look good. Will check the logs then start two more...
The nginx access log on Sync1 shows three 503s and some 406s. Everything else is expected. All three 503s look like this:
"DELETE /1.5/17766 HTTP/1.1" 503 162 598 "-" 0.033 0.033 "503"
The nginx access log on Sync2 has no 503s, just some 406s. Everything else is expected.
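For reference, one way to tally the status codes out of an access log is something like this (a minimal sketch; the log path and the position of the status field are assumptions based on the sample line above, not the actual Stage config):

# tally_nginx_status.py -- count response codes in an nginx access log
import re
from collections import Counter

ACCESS_LOG = "/var/log/nginx/access.log"  # assumed path, adjust for Stage

# Grab the status code that immediately follows the quoted request line.
STATUS_RE = re.compile(r'"[A-Z]+ \S+ HTTP/1\.\d" (\d{3})')

counts = Counter()
with open(ACCESS_LOG) as log:
    for line in log:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1

# Print the highest codes first so 503s/406s stand out.
for status in sorted(counts, reverse=True):
    print("%s: %d" % (status, counts[status]))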
The sync.err logs of course look like this:
Exception KeyError: KeyError(31735472,) in <module 'threading' from '/usr/lib64/python2.6/threading.pyc'> ignored

sync.log has 200s, 404s, the 'Broken pipe' exceptions, 400s, 304s, 415s, 412s, QueuePool connection timeouts, the 503s, and the 406s.
OK, looks like the 503s and traceback are from before the load tests. So I think we are good for Sync1 and Sync2 on the standalone tests.
OK, results are as expected on Sync3 and Sync4. Moving on to a single 60min combined load test. Will pick up the results and logs in the morning...
Last night / earlier this morning the load test was able to reproduce the 1032 error. Although this version of MariaDB was compiled without the debug option, TokuDB has its own debug option, so that was enabled on the Sync1 server. Further load testing did not reproduce the bug. TokuDB debugging (enabled by setting the global variable tokudb_debug to a value greater than 0) has now been activated on all Stage Sync nodes. :jbonacci Please do what you can to reproduce this bug.
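For the record, enabling that looks something like the following (a minimal sketch, assuming pymysql; the connection parameters are placeholders, and the same statement can just as easily be run from the mysql client):

# enable_tokudb_debug.py -- flip the tokudb_debug global on a node
import pymysql

# Placeholder credentials, not the real Stage settings.
conn = pymysql.connect(host="localhost", user="root", password="secret")
try:
    with conn.cursor() as cur:
        # Any value greater than 0 enables TokuDB debug tracing; the value
        # is a bitmask selecting which subsystems get traced.
        cur.execute("SET GLOBAL tokudb_debug = 1")
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'tokudb_debug'")
        print(cur.fetchone())
finally:
    conn.close()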
Today's 30min combined load test:
https://loads.services.mozilla.com/run/867e014d-53f6-4a69-846c-70c9d9fea306
Shows nothing in the dashboard. Going to up the users to 40 (keeping agents at 5) and try again. After this I will check all the logs...
Not seeing any errors in the dashboard for the second round:
https://loads.services.mozilla.com/run/a27b45dd-7b16-4536-bade-9fcba791888d
Checking logs on Sync1 - Sync4.
Double and triple checked. There are no new types of errors in the nginx logs. There are no new errors in the sync logs. All the 503s and traceback are from about 18 hours ago or longer. So the combined load tests and logs look clean to me.
Started another round from the beginning:
1. Remote integration tests on all four nodes
2. Parallel load tests on two nodes per run (30min each parallel run)
3. Combined load test (60min)
Remote integration tests were good.

Parallel load tests are done:
30min Sync1: https://loads.services.mozilla.com/run/ec71d407-5478-4447-b113-6b93f8d3281c
30min Sync2: https://loads.services.mozilla.com/run/a611ea3b-9184-4ea2-8363-6f67ef986872
30min Sync3: https://loads.services.mozilla.com/run/c7ec5ffd-6c62-4f0a-818e-f3a4187f4001
30min Sync4: https://loads.services.mozilla.com/run/473fb28c-0065-451d-bedb-0ef120d2ba45

One 503 that's not a 500 in disguise:
54.245.44.231 [2014-08-29T22:45:33+00:00] "DELETE /1.5/64721/storage/prefs HTTP/1.1" 503 162 594 "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 4.204 4.204 "503"

1 occurrences: False is not true
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "stress.py", line 212, in test_storage_session
    self.assertTrue(response.status_code in (200, 204))
  File "/usr/lib/python2.7/unittest/case.py", line 425, in assertTrue
    raise self.failureException(msg)

I am going to try to duplicate that 503 on Sync4...
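One way to go about that would be along these lines (a hedged sketch only: the base URL, uid, and Hawk credentials are placeholders, and the real stress.py run gets its auth token from the tokenserver, which is omitted here):

# repeat_prefs_delete.py -- hammer the DELETE that returned the stray 503
from collections import Counter
import requests

BASE_URL = "https://sync4.stage.example.com"  # placeholder node URL
USER_ID = "64721"                             # uid from the 503 log line above
HEADERS = {"Authorization": "Hawk ..."}       # placeholder credentials

counts = Counter()
for _ in range(100):
    resp = requests.delete(
        "%s/1.5/%s/storage/prefs" % (BASE_URL, USER_ID),
        headers=HEADERS,
        timeout=10,
    )
    counts[resp.status_code] += 1

print(counts)  # looking for any 503s mixed in with the expected responses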
Combined 60min: https://loads.services.mozilla.com/run/6ae2a1ff-d91d-40fd-a7ca-3cf902c94344

1 occurrences: 503 != 200
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "stress.py", line 142, in test_storage_session
    self.assertEqual(response.status_code, 200)
  File "/usr/lib/python2.7/unittest/case.py", line 516, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/usr/lib/python2.7/unittest/case.py", line 509, in _baseAssertEqual
    raise self.failureException(msg)

Nothing turned up in the nginx access logs, though. The sync.log files on some nodes are showing the InternalError/BSO exceptions.

Looking for 1032 on Sync1:
{"error": "InternalError('(InternalError) (1032, u\"Can\\'t find record in \\'bso... ...etc... ... (1032, u\"Can\\'t find record in \\'bso15\\'\")',)\n", "time": "2014-08-28T05:37:14.352929Z", "v": 1, "message": "Error handling request", "hostname": "ip-10-158-9-49", "pid": 2693, "op": "gunicorn.error", "name": "gunicorn.error"}
Nothing newer than 8/28 on Sync1.

Looking for 1032 on Sync2:
{"error": "InternalError('(InternalError) (1032, u\"Can\\'t find record in \\'bso... ...etc... ... (1032, u\"Can\\'t find record in \\'bso16\\'\")',)\n", "time": "2014-08-30T00:11:52.874159Z", "v": 1, "message": "Error handling request", "hostname": "ip-10-233-0-31", "pid": 11073, "op": "gunicorn.error", "name": "gunicorn.error"}
This is from the very latest 60min combined load test.

Looking for 1032 on Sync3:
{"error": "InternalError('(InternalError) (1032, u\"Can\\'t find record in \\'bso... ...etc... (1032, u\"Can\\'t find record in \\'bso15\\'\")',)\n", "time": "2014-08-28T05:37:14.352929Z", "v": 1, "message": "Error handling request", "hostname": "ip-10-158-9-49", "pid": 2693, "op": "gunicorn.error", "name": "gunicorn.error"}
So, again, nothing newer than 8/28 on Sync3.

Looking for 1032 on Sync4:
None of these on Sync4.
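For reference, one way to pull those 1032 entries out of sync.log would be something like this (a minimal sketch; the log path and the per-line JSON shape are assumptions based on the entries pasted above):

# find_1032.py -- scan a sync.log of JSON lines for TokuDB 1032 errors
import json

SYNC_LOG = "/var/log/sync/sync.log"  # assumed path, adjust per node

with open(SYNC_LOG) as log:
    for line in log:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip any non-JSON lines
        error = record.get("error", "")
        if "(1032," in error:
            # Enough to see when and where each error happened.
            print("%s %s %s" % (record.get("time"),
                                record.get("hostname"),
                                error[:80]))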
After discussion with Bob and Toby, I've added the TokuDB-bug error code to the list of retryable errors (Bug 1060153) and we're going to deploy it into prod and see what happens. The risk of anything catastrophic seems very small. James, is there anything more we can do with this particular bug or do you want to just close it out?
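For context, treating 1032 as retryable amounts to something along these lines (an illustrative sketch only, assuming SQLAlchemy; the actual change is the one tracked in Bug 1060153, not this snippet, and the helper name here is made up):

# retry_1032.py -- sketch of retrying queries that hit the TokuDB 1032 error
import time
from sqlalchemy.exc import DBAPIError

RETRYABLE_ERRORS = (1032,)  # MySQL "Can't find record" as raised by TokuDB

def execute_with_retry(conn, query, attempts=3, delay=0.1):
    """Run a query, retrying when the server reports a retryable error code."""
    for attempt in range(attempts):
        try:
            return conn.execute(query)
        except DBAPIError as exc:
            code = exc.orig.args[0] if exc.orig.args else None
            if code in RETRYABLE_ERRORS and attempt + 1 < attempts:
                time.sleep(delay)
                continue
            raise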
Closing it out.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
We can open more specific bugs after 1.5.9 gets into Production.
Status: RESOLVED → VERIFIED