Closed
Bug 751171
Opened 12 years ago
Closed 12 years ago
build distributed ES cluster
Categories
(Cloud Services :: Operations: Marketplace, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cshields, Assigned: dmaher)
I'm not sure if ES has the concept of "local vs remote", but if it does we should make sure that it understands the difference between the 4 nodes that will be in SCL3 and the 4 nodes in PHX1. The 4 nodes in SCL3 are already kickstarted: elasticsearch[1-4].amopod1.addons.scl3. The 4 matching nodes in PHX1 have hit a delay in racking that would delay this proof of concept too much, so we are going to use some lesser-RAM nodes to test with for now, and then swap those out for the prod nodes in a few weeks. Filed bug 751169 to have those done.
Updated•12 years ago (Assignee)
Severity: normal → major
Status: NEW → ASSIGNED
Comment 1•12 years ago (Assignee)
WIP.
Comment 2•12 years ago (Assignee)
Preliminary tests of site-awareness are positive; a test index spread across 2 sites of 4 nodes each was balanced correctly. Further tests of shard/replica combinations are forthcoming, and following that: performance testing.
Comment 3•12 years ago (Assignee)
The cluster is up, with four nodes each at SCL3 and PHX1. ES will automatically spread the shards across the two sites, and given an appropriate combination of shards and replicas, it is possible to create indices with different levels of redundancy with respect to data centre and node loss.
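For reference, a minimal sketch of the shard allocation awareness settings involved, assuming each node is tagged with a "datacenter" attribute (the attribute name and values here are illustrative, not necessarily what is deployed on these hosts):

# elasticsearch.yml on an SCL3 node (a PHX1 node would set "phx1")
node.datacenter: scl3
cluster.routing.allocation.awareness.attributes: datacenter
# forced awareness: never place all copies of a shard in a single site,
# even while the other site is entirely down
cluster.routing.allocation.awareness.force.datacenter.values: scl3,phx1

With something like that in place, primaries and replicas of a given shard end up split across the two sites, which is the behaviour described above.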
Comment 4•12 years ago (Assignee)
06:22:10 <@cshields> can you sync up with jason and oremj - get the amo stage environment to use this new cluster for testing
Comment 5•12 years ago
I think we need to raise the open file limit for the elasticsearch instances on these hosts:

java.io.IOException: Too many open files
        at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:163)
        at org.elasticsearch.common.netty.channel.socket.nio.NioServerSocketPipelineSink$Boss.run(NioServerSocketPipelineSink.java:227)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:679)

[root@elasticsearch-tmp2.addons.phx1 elasticsearch]# cat /proc/8322/limits | grep file
Max file size             unlimited            unlimited            bytes
Max core file size        0                    unlimited            bytes
Max open files            1024                 1024                 files
Max file locks            unlimited            unlimited            locks

http://www.elasticsearch.org/tutorials/2011/04/06/too-many-open-files.html
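For the record, a sketch of one way to raise that limit (the file locations and the 65535 value are assumptions; the right place depends on how the ES init scripts are set up on these hosts):

# /etc/security/limits.conf, assuming ES runs as the "elasticsearch" user
elasticsearch  soft  nofile  65535
elasticsearch  hard  nofile  65535

# or, in the init script, before the JVM is launched:
ulimit -n 65535

# then verify against the running process:
cat /proc/$(pgrep -f elasticsearch | head -1)/limits | grep 'Max open files'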
Comment 6•12 years ago (Assignee)
Indeed - I've not actually adjusted any of the system settings yet, as I was strictly interested in the mechanics of a multi-homed ES cluster last week. This week I'm going to tune things properly, then set up a testing programme to put a few different shard / replica combinations through some automated testing. Once that's done, the same tests will be run again with the cluster in various states of degradation. I'm not going to reinvent the wheel, as we've already done a fair amount of testing[1], but there are some scenarios that require additional study.

Once the multi-homed model has been validated we can extend AMO staging and see how it fares against that use case. I'd also be highly interested in extending some nodes into the Amazon cloud, just to see if and how that element can come into play - but this latter task isn't a top priority (afaik).

[1] https://mana.mozilla.org/wiki/display/websites/Elastic+Search#ElasticSearch-Testing
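As an illustration of what the testing programme will exercise, creating an index with an explicit shard / replica combination looks roughly like this (the index names and the numbers are example values, not the actual test profiles):

# 5 primaries, 1 replica of each - a default-style layout
curl -XPUT 'http://localhost:9200/test_default/' -d '{
  "settings" : { "number_of_shards" : 5, "number_of_replicas" : 1 }
}'

# 4 primaries, 3 replicas each - four copies of every shard, spread across both sites
curl -XPUT 'http://localhost:9200/test_critical/' -d '{
  "settings" : { "number_of_shards" : 4, "number_of_replicas" : 3 }
}'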
Comment 7•12 years ago
Okay great. I was trying to hook up AMO staging for testing earlier today, however I was blocked by the error in comment 5 when trying to create indexes for BAMO and AMO.
Comment 8•12 years ago (Assignee)
Automated tests are currently underway. I'm going to let them run through the night - these are just the baseline tests, so the results won't be particularly interesting. Tomorrow I'll initiate the degradation testing, which should produce much more enlightening results. When (if?) we're satisfied with that, we'll roll AMO staging out; target early next week (11 June 2012).
Comment 9•12 years ago (Assignee)
After roughly 24 hours of automated testing, the averages are as follows:

default:
  _all.primaries.docs.count : 2444
  _all.primaries.indexing.index_time_in_millis : 8793
  _all.primaries.store.size_in_bytes : 142516776
  _all.total.docs.count : 25933
  _all.total.indexing.index_time_in_millis : 24295
  _all.total.store.size_in_bytes : 218378509

critical:
  _all.primaries.docs.count : 4266
  _all.primaries.indexing.index_time_in_millis : 8623
  _all.primaries.store.size_in_bytes : 135713329
  _all.total.docs.count : 5550
  _all.total.indexing.index_time_in_millis : 80195
  _all.total.store.size_in_bytes : 175593968

bigtime:
  _all.primaries.docs.count : 2745
  _all.primaries.indexing.index_time_in_millis : 15879
  _all.primaries.store.size_in_bytes : 100333238
  _all.total.docs.count : 4480
  _all.total.indexing.index_time_in_millis : 148960
  _all.total.store.size_in_bytes : 143752669
Comment 10•12 years ago (Assignee)
Please ignore the statistics in comment #9; I made an error in the portion of my testing script that creates the indexes, and as a result, they were not reliably being created with the desired shard / replica combinations. I fixed the error and re-ran the tests this week-end, and the results are actually quite interesting:

default:
  _all.primaries.docs.count : 2025
  _all.primaries.indexing.index_time_in_millis : 58847
  _all.primaries.store.size_in_bytes : 114995479
  _all.total.docs.count : 9894
  _all.total.indexing.index_time_in_millis : 28532
  _all.total.store.size_in_bytes : 99228557

critical:
  _all.primaries.docs.count : 2631
  _all.primaries.indexing.index_time_in_millis : 30927
  _all.primaries.store.size_in_bytes : 141211684
  _all.total.docs.count : 41183
  _all.total.indexing.index_time_in_millis : 36273
  _all.total.store.size_in_bytes : 757185744

bigtime:
  _all.primaries.docs.count : 11716
  _all.primaries.indexing.index_time_in_millis : 18094
  _all.primaries.store.size_in_bytes : 75169101
  _all.total.docs.count : 17566
  _all.total.indexing.index_time_in_millis : 94147
  _all.total.store.size_in_bytes : 155106212

Interestingly, the "critical" combination was able to index the largest number of documents during each test run.
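For context, those figures are the sort of thing reported by the index stats API; the per-test numbers above correspond to the _all fields in a response from something like:

# cluster-wide index statistics; the _all.primaries.* and _all.total.* figures
# are at the top of the response, per-index breakdowns appear under "indices"
curl -s 'http://localhost:9200/_stats?pretty=true'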
Comment 11•12 years ago (Assignee)
Reducing priority since we need to deal with getting new hardware up for the existing single-homed clusters. Related:

* bug 754547
* bug 763975
* bug 771532
* bug 763923
Severity: major → minor
Status: ASSIGNED → NEW
Comment 12•12 years ago
At this point do we know what it will take to build a distributed ES cluster? If we do, I'd like to close this bug out and open bugs for individual clusters, as needed.
Comment 13•12 years ago (Assignee)
Yes, we know what it will take, and we can build them out as necessary. Closing this bug.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•10 years ago
Component: Server Operations: AMO Operations → Operations: Marketplace
Product: mozilla.org → Mozilla Services