Closed
Bug 1059025
Opened 11 years ago
Closed 11 years ago
Please deploy server-syncstorage 1.5.8 with TokuDB to stage
Categories
(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)
Cloud Services
Operations: Deployment Requests - DEPRECATED
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: rfkelly, Unassigned)
References
Details
(Whiteboard: [qa+])
Let's take another run at deploying some TokuDB nodes, after aborting the previous attempt in Bug 1055368. Our attempts to reproduce the errors from that bug have failed. Maybe it has magically fixed itself and we can get these nodes through to prod :-)
:bobm, please do a clean deploy of TokuDB nodes to stage, running the latest production-ready tag of syncstorage (1.5.8).
:jbonacci, let's give this the full "taking it through to production" QA treatment and we'll see what shows up this time around...
Comment 1•11 years ago
If Stage survives the TokuDB onslaught, I can open a companion Prod ticket...
Status: NEW → ASSIGNED
Whiteboard: [qa+]
Comment 2•11 years ago
(In reply to James Bonacci [:jbonacci] from comment #1)
> If Stage survives the TokuDB onslaught, I can open a companion Prod
> ticket...
Deployed to Stage.
Comment 3•11 years ago
Verified four new instances:
ec2-174-129-114-111
ec2-54-224-103-67
ec2-54-83-130-1
ec2-107-22-37-43
Running this code:
server-syncstorage 1.5.8-1.el6 x86_64 41925251
This time, I will skip the quick tests (because why MOAR 400s, 412s, 415s, and 304s....)
Moving to 30min parallel load tests.
Comment 4•11 years ago
Started with the most basic "load test" - running 'make test' on each host. That was successful.
Backtracked and ran the quick (remote integration) tests on each host and on TS Stage.
These were all successful - saw the usual 304s, 400s, 412s, and 415s go by in each access.log file.
Starting two 30min load tests now.
Will run the other two a bit later, then move on to a 30-minute and a 60-minute combined load test.
Comment 5•11 years ago
The first two load tests look good. Will check the logs then start two more...
Comment 6•11 years ago
The nginx access log shows 3 503s on Sync1, and some 406s. Everything else is expected.
All 3 503s look like this:
"DELETE /1.5/17766 HTTP/1.1" 503 162 598 "-" 0.033 0.033 "503"
The nginx access log has no 503s on Sync2, just some 406s. Everything else is expected.
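For reference, a quick filter like the following pulls those 503 responses out of an access log. This is just an illustrative sketch: the sample file and its two lines mirror the entry quoted above, and the real log path on the nodes is not shown here.

```shell
# Build a tiny sample access log mirroring the format quoted above,
# then count the 503 responses in it. On a real node you would point
# grep at the actual nginx access.log instead of this sample file.
cat <<'EOF' > /tmp/access.log.sample
"DELETE /1.5/17766 HTTP/1.1" 503 162 598 "-" 0.033 0.033 "503"
"GET /1.5/17766/info/collections HTTP/1.1" 200 163 598 "-" 0.010 0.010 "200"
EOF

# The status code follows the closing quote of the request line,
# so '" 503 ' matches 503 responses without false positives.
grep -c '" 503 ' /tmp/access.log.sample
```

The quoted-space pattern avoids matching 503s that appear elsewhere on the line, such as the trailing upstream-status field.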
Comment 7•11 years ago
sync.err logs of course look like this:
Exception KeyError: KeyError(31735472,) in <module 'threading' from '/usr/lib64/python2.6/threading.pyc'> ignored
sync.log has 200s, 404s, the 'Broken pipe' exceptions, 400s, 304s, 415s, 412s, QueuePool connection timeouts, the 503s, and the 406s.
Comment 8•11 years ago
OK, looks like the 503s and traceback are from before the load tests.
So, I think we are good for Sync1 and Sync2 on the standalone tests.
Comment 9•11 years ago
OK, results are as expected on Sync3 and Sync4.
Moving on to a single 60min combined load test.
Will pick up the results and logs in the morning...
Comment 10•11 years ago
Overnight / earlier this morning, the load test was able to reproduce the 1032 error. Although this build of MariaDB was compiled without the debug option, TokuDB has its own debug option, so it was enabled on the Sync1 server. Further load testing did not reproduce the bug.
TokuDB debug (enabled by setting the global variable tokudb_debug to something greater than 0) has been activated on all stage Sync nodes.
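A sketch of how that might be rolled out across the stage nodes. tokudb_debug is a real TokuDB server variable (a debug bitmask; any non-zero value enables the corresponding tracing), but the host names and the commented-out mysql invocation below are illustrative placeholders, not the actual stage inventory.

```shell
# Hypothetical roll-out loop: print (and, uncommented, run) the
# SET GLOBAL statement against each stage Sync node. Host names
# are placeholders for the four stage instances.
for host in sync1 sync2 sync3 sync4; do
  echo "-- $host: SET GLOBAL tokudb_debug = 1;"
  # mysql -h "$host" -e "SET GLOBAL tokudb_debug = 1;"
done
```

Note that SET GLOBAL only affects the running server; to survive a restart the setting would also need to go into the my.cnf on each node.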
:jbonacci Please do what you can to reproduce this bug.
Comment 11•11 years ago
Today's 30min combined load test:
https://loads.services.mozilla.com/run/867e014d-53f6-4a69-846c-70c9d9fea306
Shows nothing in the dashboard.
Going to up the users to 40 (keeping agents at 5) and try again.
After this I will check all the logs...
Comment 12•11 years ago
Not seeing any errors in the dashboard for the second round:
https://loads.services.mozilla.com/run/a27b45dd-7b16-4536-bade-9fcba791888d
Checking logs on Sync1 - Sync4
Comment 13•11 years ago
Double and triple checked.
There are no new types of errors in the nginx logs.
There are no new errors in the sync logs. All the 503s and traceback are from about 18 hours ago or longer.
So the combined load tests and logs look clean to me.
Comment 14•11 years ago
Started another round from the beginning:
1. Remote integration tests on all four nodes
2. Parallel load tests on two nodes/run (30min each parallel run)
3. Combined load test (60min)
Comment 15•11 years ago
Remote integration tests were good.
Parallel load tests are done:
30min Sync1:
https://loads.services.mozilla.com/run/ec71d407-5478-4447-b113-6b93f8d3281c
30min Sync2:
https://loads.services.mozilla.com/run/a611ea3b-9184-4ea2-8363-6f67ef986872
30min Sync3:
https://loads.services.mozilla.com/run/c7ec5ffd-6c62-4f0a-818e-f3a4187f4001
30min Sync4:
https://loads.services.mozilla.com/run/473fb28c-0065-451d-bedb-0ef120d2ba45
One 503 that's not a 500 in disguise:
54.245.44.231 [2014-08-29T22:45:33+00:00] "DELETE /1.5/64721/storage/prefs HTTP/1.1" 503 162 594 "python-requests/2.2.1 CPython/2.7.3 Linux/3.5.0-23-generic" 4.204 4.204 "503"
1 occurrence:
False is not true
File "/usr/lib/python2.7/unittest/case.py", line 332, in run
testMethod()
File "stress.py", line 212, in test_storage_session
self.assertTrue(response.status_code in (200, 204))
File "/usr/lib/python2.7/unittest/case.py", line 425, in assertTrue
raise self.failureException(msg)
I am going to try to duplicate that 503 on Sync4...
Comment 16•11 years ago
Second round on Sync4 was clean:
https://loads.services.mozilla.com/run/95d1ba05-5be0-4e2b-9303-fbbe91c33a8e
Comment 17•11 years ago
Combined 60min:
https://loads.services.mozilla.com/run/6ae2a1ff-d91d-40fd-a7ca-3cf902c94344
1 occurrence:
503 != 200
File "/usr/lib/python2.7/unittest/case.py", line 332, in run
testMethod()
File "stress.py", line 142, in test_storage_session
self.assertEqual(response.status_code, 200)
File "/usr/lib/python2.7/unittest/case.py", line 516, in assertEqual
assertion_func(first, second, msg=msg)
File "/usr/lib/python2.7/unittest/case.py", line 509, in _baseAssertEqual
raise self.failureException(msg)
Nothing turned up in the nginx access logs, though.
The sync.log files on some nodes are showing the InternalError/BSO exceptions.
Looking for 1032 on Sync1:
{"error": "InternalError('(InternalError) (1032, u\"Can\\'t find record in \\'bso...
...etc...
... (1032, u\"Can\\'t find record in \\'bso15\\'\")',)\n", "time": "2014-08-28T05:37:14.352929Z", "v": 1, "message": "Error handling request", "hostname": "ip-10-158-9-49", "pid": 2693, "op": "gunicorn.error", "name": "gunicorn.error"}
Nothing newer than 8/28 on Sync1
Looking for 1032 on Sync2:
{"error": "InternalError('(InternalError) (1032, u\"Can\\'t find record in \\'bso...
...etc...
... (1032, u\"Can\\'t find record in \\'bso16\\'\")',)\n", "time": "2014-08-30T00:11:52.874159Z", "v": 1, "message": "Error handling request", "hostname": "ip-10-233-0-31", "pid": 11073, "op": "gunicorn.error", "name": "gunicorn.error"}
This is the very latest 60min combined load test.
Looking for 1032 on Sync3:
{"error": "InternalError('(InternalError) (1032, u\"Can\\'t find record in \\'bso...
...etc...
(1032, u\"Can\\'t find record in \\'bso15\\'\")',)\n", "time": "2014-08-28T05:37:14.352929Z", "v": 1, "message": "Error handling request", "hostname": "ip-10-158-9-49", "pid": 2693, "op": "gunicorn.error", "name": "gunicorn.error"}
So, again, nothing newer than 8/28 on Sync3
Looking for 1032 on Sync4:
None of these on Sync4
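The 1032 hunt above can be done with a one-line filter over the JSON-lines sync.log, pulling the timestamp of each matching entry. The sample log below is illustrative, loosely mirroring the entries quoted above; on a real node you would grep the actual sync.log path instead.

```shell
# Build a small JSON-lines sample resembling the sync.log entries
# quoted above, then extract the timestamps of 1032 errors only.
cat <<'EOF' > /tmp/sync.log.sample
{"error": "InternalError) (1032, Can't find record in 'bso16')", "time": "2014-08-30T00:11:52.874159Z", "op": "gunicorn.error"}
{"error": "BackendError(timeout)", "time": "2014-08-30T00:12:00.000000Z", "op": "gunicorn.error"}
EOF

# First grep keeps only lines mentioning MySQL error 1032;
# second grep extracts the "time" field from each survivor.
grep '(1032,' /tmp/sync.log.sample | grep -o '"time": "[^"]*"'
```

Sorting that output makes the "nothing newer than 8/28" check a matter of looking at the last line.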
Reporter
Comment 18•11 years ago
After discussion with Bob and Toby, I've added the TokuDB-bug error code to the list of retryable errors (Bug 1060153) and we're going to deploy it into prod and see what happens. The risk of anything catastrophic seems very small.
James, is there anything more we can do with this particular bug or do you want to just close it out?
Comment 19•11 years ago
Closing it out.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 20•11 years ago
We will open more specific bugs after 1.5.9 gets into Production.
Status: RESOLVED → VERIFIED