Closed Bug 1059025 Opened 11 years ago Closed 11 years ago

Please deploy server-syncstorage 1.5.8 with TokuDB to stage

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Unassigned)

References

Details

(Whiteboard: [qa+])

Let's take another run at deploying some TokuDB nodes, after aborting the previous attempt in Bug 1055368. Our attempts to reproduce the errors from that bug have failed. Maybe it has magically fixed itself and we can get these nodes through to prod :-)

:bobm, please do a clean deploy of TokuDB nodes to stage, running the latest production-ready tag of syncstorage (1.5.8).

:jbonacci, let's give this the full "taking it through to production" QA treatment and we'll see what shows up this time around...
Depends on: 1057892
If Stage survives the TokuDB onslaught, I can open a companion Prod ticket...
Status: NEW → ASSIGNED
Whiteboard: [qa+]
(In reply to James Bonacci [:jbonacci] from comment #1)
> If Stage survives the TokuDB onslaught, I can open a companion Prod ticket...

Deployed to Stage.
Verified four new instances:
ec2-174-129-114-111
ec2-54-224-103-67
ec2-54-83-130-1
ec2-107-22-37-43

Running this code: server-syncstorage 1.5.8-1.el6 x86_64 41925251

This time, I will skip the quick tests (because why MOAR 400s, 412s, 415s, and 304s...). Moving to 30min parallel load tests.
Started with the most basic "load test": running 'make test' on each host. That was successful. Backtracked and ran the quick (remote integration) tests on each host and on TS Stage. These were all successful - saw the usual 304s, 400s, 412s, and 415s go by in each access.log file. Starting two 30min load tests now. Will run the other two a bit later, then move on to a 30-minute and a 60-minute combined load test.
The first two load tests look good. Will check the logs then start two more...
The nginx access log on Sync1 shows three 503s and some 406s. Everything else is expected. All three 503s look like this:
"DELETE /1.5/17766 HTTP/1.1" 503 162 598 "-" 0.033 0.033 "503"
The nginx access log on Sync2 has no 503s, just some 406s. Everything else is expected.
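For reference, one way to tally the status codes out of an access log is something like this (a minimal sketch; the log path and the position of the status field are assumptions based on the sample line above, not the actual Stage config):

# tally_nginx_status.py -- count response codes in an nginx access log
import re
from collections import Counter

ACCESS_LOG = "/var/log/nginx/access.log"  # assumed path, adjust for Stage

# Grab the status code that immediately follows the quoted request line.
STATUS_RE = re.compile(r'"[A-Z]+ \S+ HTTP/1\.\d" (\d{3})')

counts = Counter()
with open(ACCESS_LOG) as log:
    for line in log:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1

# Print the highest codes first so 503s/406s stand out.
for status in sorted(counts, reverse=True):
    print("%s: %d" % (status, counts[status]))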
The sync.err logs of course look like this:
Exception KeyError: KeyError(31735472,) in <module 'threading' from '/usr/lib64/python2.6/threading.pyc'> ignored

sync.log has 200s, 404s, the 'Broken pipe' exceptions, 400s, 304s, 415s, 412s, QueuePool connection timeouts, the 503s, and the 406s.
OK, looks like the 503s and traceback are from before the load tests. So I think we are good for Sync1 and Sync2 on the standalone tests.
OK, results are as expected on Sync3 and Sync4. Moving on to a single 60min combined load test. Will pick up the results and logs in the morning...
Last night / earlier this morning the load test was able to reproduce the 1032 error. Although this version of MariaDB was compiled without the debug option, TokuDB has its own debug option, so that was enabled on the Sync1 server. Further load testing did not reproduce the bug. TokuDB debugging (enabled by setting the global variable tokudb_debug to a value greater than 0) has now been activated on all Stage Sync nodes. :jbonacci Please do what you can to reproduce this bug.
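For the record, enabling that looks something like the following (a minimal sketch, assuming pymysql; the connection parameters are placeholders, and the same statement can just as easily be run from the mysql client):

# enable_tokudb_debug.py -- flip the tokudb_debug global on a node
import pymysql

# Placeholder credentials, not the real Stage settings.
conn = pymysql.connect(host="localhost", user="root", password="secret")
try:
    with conn.cursor() as cur:
        # Any value greater than 0 enables TokuDB debug tracing; the value
        # is a bitmask selecting which subsystems get traced.
        cur.execute("SET GLOBAL tokudb_debug = 1")
        cur.execute("SHOW GLOBAL VARIABLES LIKE 'tokudb_debug'")
        print(cur.fetchone())
finally:
    conn.close()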
Today's 30min combined load test:
https://loads.services.mozilla.com/run/867e014d-53f6-4a69-846c-70c9d9fea306
Shows nothing in the dashboard. Going to up the users to 40 (keeping agents at 5) and try again. After this I will check all the logs...
Not seeing any errors in the dashboard for the second round:
https://loads.services.mozilla.com/run/a27b45dd-7b16-4536-bade-9fcba791888d
Checking logs on Sync1 - Sync4.
Double and triple checked. There are no new types of errors in the nginx logs. There are no new errors in the sync logs. All the 503s and traceback are from about 18 hours ago or longer. So the combined load tests and logs look clean to me.
Started another round from the beginning:
1. Remote integration tests on all four nodes
2. Parallel load tests on two nodes per run (30min each parallel run)
3. Combined load test (60min)
Remote integration tests were good.

Parallel load tests are done:
30min Sync1: https://loads.services.mozilla.com/run/ec71d407-5478-4447-b113-6b93f8d3281c
30min Sync2: https://loads.services.mozilla.com/run/a611ea3b-9184-4ea2-8363-6f67ef986872
30min Sync3: https://loads.services.mozilla.com/run/c7ec5ffd-6c62-4f0a-818e-f3a4187f4001
30min Sync4: https://loads.services.mozilla.com/run/473fb28c-0065-451d-bedb-0ef120d2ba45

One 503 that's not a 500 in disguise:
54.245.44.231 [2014-08-29T22:45:33+00:00] "DELETE /1.5/64721/storage/prefs HTTP/1.1" 503 162 594 "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 4.204 4.204 "503"

1 occurrences: False is not true
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "stress.py", line 212, in test_storage_session
    self.assertTrue(response.status_code in (200, 204))
  File "/usr/lib/python2.7/unittest/case.py", line 425, in assertTrue
    raise self.failureException(msg)

I am going to try to duplicate that 503 on Sync4...
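One way to go about that would be along these lines (a hedged sketch only: the base URL, uid, and Hawk credentials are placeholders, and the real stress.py run gets its auth token from the tokenserver, which is omitted here):

# repeat_prefs_delete.py -- hammer the DELETE that returned the stray 503
from collections import Counter
import requests

BASE_URL = "https://sync4.stage.example.com"  # placeholder node URL
USER_ID = "64721"                             # uid from the 503 log line above
HEADERS = {"Authorization": "Hawk ..."}       # placeholder credentials

counts = Counter()
for _ in range(100):
    resp = requests.delete(
        "%s/1.5/%s/storage/prefs" % (BASE_URL, USER_ID),
        headers=HEADERS,
        timeout=10,
    )
    counts[resp.status_code] += 1

print(counts)  # looking for any 503s mixed in with the expected responses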
Combined 60min: https://loads.services.mozilla.com/run/6ae2a1ff-d91d-40fd-a7ca-3cf902c94344

1 occurrences: 503 != 200
  File "/usr/lib/python2.7/unittest/case.py", line 332, in run
    testMethod()
  File "stress.py", line 142, in test_storage_session
    self.assertEqual(response.status_code, 200)
  File "/usr/lib/python2.7/unittest/case.py", line 516, in assertEqual
    assertion_func(first, second, msg=msg)
  File "/usr/lib/python2.7/unittest/case.py", line 509, in _baseAssertEqual
    raise self.failureException(msg)

Nothing turned up in the nginx access logs, though. The sync.log files on some nodes are showing the InternalError/BSO exceptions.

Looking for 1032 on Sync1:
{"error": "InternalError('(InternalError) (1032, u\"Can\\'t find record in \\'bso... ...etc... ... (1032, u\"Can\\'t find record in \\'bso15\\'\")',)\n", "time": "2014-08-28T05:37:14.352929Z", "v": 1, "message": "Error handling request", "hostname": "ip-10-158-9-49", "pid": 2693, "op": "gunicorn.error", "name": "gunicorn.error"}
Nothing newer than 8/28 on Sync1.

Looking for 1032 on Sync2:
{"error": "InternalError('(InternalError) (1032, u\"Can\\'t find record in \\'bso... ...etc... ... (1032, u\"Can\\'t find record in \\'bso16\\'\")',)\n", "time": "2014-08-30T00:11:52.874159Z", "v": 1, "message": "Error handling request", "hostname": "ip-10-233-0-31", "pid": 11073, "op": "gunicorn.error", "name": "gunicorn.error"}
This is from the very latest 60min combined load test.

Looking for 1032 on Sync3:
{"error": "InternalError('(InternalError) (1032, u\"Can\\'t find record in \\'bso... ...etc... (1032, u\"Can\\'t find record in \\'bso15\\'\")',)\n", "time": "2014-08-28T05:37:14.352929Z", "v": 1, "message": "Error handling request", "hostname": "ip-10-158-9-49", "pid": 2693, "op": "gunicorn.error", "name": "gunicorn.error"}
So, again, nothing newer than 8/28 on Sync3.

Looking for 1032 on Sync4:
None of these on Sync4.
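For reference, one way to pull those 1032 entries out of sync.log would be something like this (a minimal sketch; the log path and the per-line JSON shape are assumptions based on the entries pasted above):

# find_1032.py -- scan a sync.log of JSON lines for TokuDB 1032 errors
import json

SYNC_LOG = "/var/log/sync/sync.log"  # assumed path, adjust per node

with open(SYNC_LOG) as log:
    for line in log:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # skip any non-JSON lines
        error = record.get("error", "")
        if "(1032," in error:
            # Enough to see when and where each error happened.
            print("%s %s %s" % (record.get("time"),
                                record.get("hostname"),
                                error[:80]))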
After discussion with Bob and Toby, I've added the TokuDB-bug error code to the list of retryable errors (Bug 1060153) and we're going to deploy it into prod and see what happens. The risk of anything catastrophic seems very small. James, is there anything more we can do with this particular bug or do you want to just close it out?
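For context, treating 1032 as retryable amounts to something along these lines (an illustrative sketch only, assuming SQLAlchemy; the actual change is the one tracked in Bug 1060153, not this snippet, and the helper name here is made up):

# retry_1032.py -- sketch of retrying queries that hit the TokuDB 1032 error
import time
from sqlalchemy.exc import DBAPIError

RETRYABLE_ERRORS = (1032,)  # MySQL "Can't find record" as raised by TokuDB

def execute_with_retry(conn, query, attempts=3, delay=0.1):
    """Run a query, retrying when the server reports a retryable error code."""
    for attempt in range(attempts):
        try:
            return conn.execute(query)
        except DBAPIError as exc:
            code = exc.orig.args[0] if exc.orig.args else None
            if code in RETRYABLE_ERRORS and attempt + 1 < attempts:
                time.sleep(delay)
                continue
            raise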
Closing it out.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
We can open more specific bugs after 1.5.9 gets into Production.
Status: RESOLVED → VERIFIED