Closed
Bug 907673
Opened 11 years ago
Closed 10 years ago
[Socorro] elasticsearch production cluster is nearly full
Categories
(Infrastructure & Operations Graveyard :: WebOps: Socorro, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: adrian, Assigned: nmaul)
References
Details
> [agaudebert@socorro-es1.webapp.phx1 ~]$ df -h
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/mapper/vg00-root 2.0T  1.8T   39G  98% /
> tmpfs                  21G     0   21G   0% /dev/shm
> /dev/sda1             504M   89M  390M  19% /boot
>
> [agaudebert@socorro-es2.webapp.phx1 ~]$ df -h
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/mapper/vg00-root 2.0T  1.6T  319G  84% /
> tmpfs                  21G     0   21G   0% /dev/shm
> /dev/sda1             504M   89M  390M  19% /boot
>
> [agaudebert@socorro-es3.webapp.phx1 ~]$ df -h
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/mapper/vg00-root 2.0T  1.6T  244G  87% /
> tmpfs                  21G     0   21G   0% /dev/shm
> /dev/sda1             504M   68M  411M  15% /boot

We need to do something about that pretty quickly. The original plan was to add new machines to the cluster when needed; can we do that now and add 2 or 4 new machines?

At the moment, that cluster stores ~8 months of data. We would like to be able to store 1 year (though we might want to change that number if needed), and we have plans to add even more data there. I thus think that adding another 4 machines wouldn't be too much.

Thoughts? What can we do here?
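A quick way to watch for this situation is to flag any filesystem above a usage threshold from `df -P` output. This is a sketch only: the 90% threshold is an assumption, and the sample input here approximates socorro-es1's figures above; in practice you would feed it `ssh $host df -P /` for each node.

```shell
#!/bin/sh
# Sketch: flag filesystems above a usage threshold from POSIX `df -P` output.
# The sample line below approximates socorro-es1's numbers quoted in the bug.
threshold=90

df_output='Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/mapper/vg00-root 2147483648 1929096212 40894564 98% /'

echo "$df_output" | awk -v t="$threshold" '
    NR > 1 {
        sub(/%/, "", $5)              # strip the % sign from the Capacity column
        if ($5 + 0 >= t)
            print $1, "is at", $5 "% usage"
    }'
```

Run against all five hosts in a `for` loop (as done later in this bug), this prints one warning line per nearly-full node and nothing otherwise.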
Comment 1•11 years ago
There are a couple of things you could do here.

The most straightforward action (as you've identified) is to add two more nodes, which will give you another 4TB of HDD to play with - this will more than likely get you the 12 months of retention you desire. Once the machines are racked, configured, and fired up, there will be no additional administration to do.

You may also wish to examine your sharding strategy. Currently, each index is split into five shards, with each shard adding overhead to the whole. Reducing the number of shards (via re-indexing) will reduce the overhead. This is a relatively easy task, since it's just a matter of writing a small script to automate the re-indexing process; however, the amount of space you recuperate may only be in the tens of GB.

Finally, you may wish to examine your schema. I haven't looked at your data at all, but it is possible that there are ways to structure your data in order to reduce duplication (and increase usability). This is likely a non-trivial undertaking, and depending on the data, you may see anything from "no gain" to "enormous gain" - results may vary. :)
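The re-indexing option above could be sketched as follows. The index names, alias, and `localhost:9200` endpoint are hypothetical, and since ES 0.20 has no server-side reindex API, the document-copy step (scan/scroll plus `_bulk`) is only indicated in a comment.

```shell
#!/bin/sh
# Sketch: replace a 5-shard index with a 2-shard one, then swap an alias.
# All names and the endpoint are illustrative, not the actual Socorro setup.
ES=http://localhost:9200

# 1. Create the replacement index with fewer primary shards.
curl -XPUT "$ES/crashes_201308_v2" -d '{
    "settings": { "number_of_shards": 2, "number_of_replicas": 1 }
}'

# 2. Copy documents from the old index into the new one, e.g. with a small
#    script driving the scan/scroll search and the _bulk API (not shown).

# 3. Atomically repoint the read alias, then the old index can be deleted.
curl -XPOST "$ES/_aliases" -d '{
    "actions": [
        { "remove": { "index": "crashes_201308",    "alias": "crashes_read" } },
        { "add":    { "index": "crashes_201308_v2", "alias": "crashes_read" } }
    ]
}'
```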
Comment 2•11 years ago
I can start the process of getting new hardware ordered if that's what we want to do, but I don't know enough about ES to add anything to what :phrawzty wrote above. Do you want to try some tuning first, or should I get the ball rolling on hardware?
Flags: needinfo?(adrian)
Reporter
Comment 3•11 years ago
We will want to see if we can tune it, but given what we are planning to add in the near future, we will definitely need more boxes. I am not sure about the number, though: do we want to build a 5-box or a 7-box cluster? Do we want to add 2 boxes now and 2 more later if necessary?
Flags: needinfo?(adrian)
Comment 4•11 years ago
(In reply to Adrian Gaudebert [:adrian] from comment #3)
> in the near future, we will definitely need more boxes. I am not sure on the
> number though: do we want to build a 5-boxes or a 7-boxes cluster? Do we
> want to add 2 boxes now and add 2 more later if necessary?

tl;dr: add two more boxes.

Your current HDD usage is 5281370980 KB (1929096212 + 1636189288 + 1716085480, respectively) out of 6207058296 KB total, or 85% usage. This is, by your estimation, roughly eight months of data, with your target being a full year. Four more months will therefore weigh approximately 2640685490 KB (half of the current total), bringing you up to a total storage requirement of roughly 7922056470 KB. Since each node provides 2069019432 KB of HDD, a single additional node of this type would meet your storage requirements; however, since odd numbers of nodes are recommended for small clusters, you'll want to bump that up to two additional nodes. This means that you'll have 10345097160 KB available, which is substantially more than you need (roughly 30% headroom, as it goes).

As an aside, you may have noticed that the nodes are not evenly balanced in terms of space usage. In ES 0.20.x there is no native mechanism for taking anything other than global shard count into account when assigning shards to nodes; however, there is a plugin called equilibrium [1] that purports to do shard balancing based on disk-space criteria. This is something we can look into if necessary.

[1] https://github.com/sonian/elasticsearch-equilibrium
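The capacity estimate above can be redone as a short shell calculation. It is a sketch only: it assumes data accrues at a constant monthly rate and uses the per-node KB figures quoted in this bug.

```shell
#!/bin/sh
# Sketch of the capacity arithmetic, assuming linear growth.
# All KB figures are the ones quoted in this bug.
used=$(( 1929096212 + 1636189288 + 1716085480 ))  # KB used across the 3 nodes
months=8                                          # approximate retention so far
per_node=2069019432                               # usable KB per node
have=$(( per_node * 3 ))                          # current cluster capacity

# Linear extrapolation to a full year of retention.
need=$(( used * 12 / months ))

# How many extra nodes of the same size would be required?
extra=0
while [ $(( have + extra * per_node )) -lt "$need" ]; do
    extra=$(( extra + 1 ))
done

echo "need ${need} KB, have ${have} KB, extra nodes required: ${extra}"
```

This confirms the conclusion above: arithmetically one extra node suffices, and the second is added only to keep an odd node count.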
Comment 5•11 years ago
I vote 4 because more is always better ;) but if two is faster, get two.
Comment 6•11 years ago
I'll file bugs for additional hardware
Assignee: server-ops-webops → bburton
Status: NEW → ASSIGNED
Priority: -- → P3
Comment 7•11 years ago
Financial review still in progress. :adrian did some data pruning, and I added socorro-es[1-2].dev.webapp.phx1 to the prod cluster. The cluster is rebalancing and data usage is already down to better levels.

bburton@althalus [03:01:56] [~]
-> % for i in {1..2}; do ssh socorro-es$i.dev.webapp.phx1.mozilla.com df -h /; done
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-root 2.0T   94G  1.8T   5% /
Connection to socorro-es1.dev.webapp.phx1.mozilla.com closed.
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-root 2.0T   85G  1.8T   5% /
Connection to socorro-es2.dev.webapp.phx1.mozilla.com closed.

-> % for i in {1..3}; do ssh socorro-es$i.webapp.phx1.mozilla.com df -h /; done
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-root 2.0T  1.4T  448G  77% /
Connection to socorro-es1.webapp.phx1.mozilla.com closed.
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-root 2.0T  1.4T  532G  72% /
Connection to socorro-es2.webapp.phx1.mozilla.com closed.
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg00-root 2.0T  1.4T  453G  76% /
Connection to socorro-es3.webapp.phx1.mozilla.com closed.
Comment 8•11 years ago
:laura / :adrian, let's see how the cluster balances out in the morning, but we may be OK to start doing some % of raw crashes now. We can discuss in the meeting on Wed if you want.
Comment 9•11 years ago
$ for i in {1..2}; do ssh socorro-es$i.dev.webapp.phx1.mozilla.com df -h / | grep G; done
2.0T  1.1T  825G  56% /
2.0T  1.1T  828G  56% /

$ for i in {1..3}; do ssh socorro-es$i.webapp.phx1.mozilla.com df -h / | grep G; done
2.0T  1.1T  809G  57% /
2.0T  987G  886G  53% /
2.0T  1.1T  785G  59% /
Updated•10 years ago
Assignee: bburton → server-ops-webops
Assignee
Comment 11•10 years ago
After the new nodes are added to the cluster, we'll almost certainly have to do some work to clear this:

1) If there are more shards than nodes in the current indexes, ES will automatically move them. That is almost certainly not the case... we'll probably need to move shards manually. If we can reindex/reshard faster, that'd work too, but I suspect not.

2) We should change the automation that rotates the indexes and creates the new ones, so that it creates indexes with a shard count matching the *new* number of ES nodes.
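For point 2, one hedged approach is an index template, so that newly rotated indexes automatically pick up the new shard count. The template name, index-name pattern, endpoint, and shard count below are assumptions for illustration, not the actual Socorro configuration.

```shell
#!/bin/sh
# Sketch: an index template applying a shard count matching the grown
# (5-node) cluster to every index whose name matches the pattern.
# Names, pattern, and endpoint are hypothetical.
curl -XPUT "http://localhost:9200/_template/socorro_shards" -d '{
    "template": "socorro_*",
    "settings": { "number_of_shards": 5, "number_of_replicas": 1 }
}'
```

With this in place the rotation automation needs no per-index settings; any index it creates under the matching name picks up the template's shard count.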
Assignee: server-ops-webops → nmaul
Assignee | ||
Comment 12•10 years ago
I was wrong... #1 fixed itself, and after consulting with :phrawzty we're not too keen on doing #2 without some statistical basis for doing so. All done here!
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•8 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard