Boost performance for nodes/cluster

Hi,

I've run into some performance problems with Elasticsearch. We are still at the development/test stage.
We use the "kopf" plugin, but it doesn't work properly anymore, and some API calls like "_cat/nodes" run into a timeout.

Our setup is a cluster with 3 dedicated master nodes, 4 client nodes (2 for imports and 2 for searches) and 9 data nodes.
Data node hardware:
8 CPUs, 8-16 GB RAM, and SSDs (10 GB each)

We have already migrated 2 years of customer documents and have about 170 million documents and about 1.5 TB of used space.

We store our documents per customer and time, e.g. costumer_2015_05 for one index with one shard (12 indices per year and customer), and set an alias for every customer for better search performance. Our reasoning was that there is no difference between one index with 12 shards per year and 12 indices with one shard each (either way, 12 Lucene instances).
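For reference, each new monthly index gets added to that customer's alias, roughly like this (the alias name here is just a placeholder for our real naming scheme):

POST /_aliases
{
  "actions" : [
    { "add" : { "index" : "costumer_2015_05", "alias" : "costumer_search_alias" } }
  ]
}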
The _cluster/health API call looks fine, too:

{
"cluster_name" : "our clustername",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 17,
"number_of_data_nodes" : 9,
"active_primary_shards" : 9418,
"active_shards" : 18836,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

Now the questions:
How can we improve our cluster and avoid timeouts with API calls?
Should we upgrade our memory to 16 or 32 GB for better performance?
Is there another free plugin that works better than kopf?

We'd like to stay with our architecture of one index per month and customer, since the alias searches have worked well.

Regards
Andi

My first thought is that you have way too many shards. Nearly 20,000 total shards is overkill. I have seen cases where an Elasticsearch cluster's problems were drastically improved just by reducing the number of shards in the cluster, and that's what your case looks like.

I know you mentioned that you'd like to keep the index setup the same. One option to reduce the number of indices would be to keep a single yearly index for all customers, add a field with the customer ID, and then create filtered aliases on that field, as explained in the documentation on aliases. 170 million documents is not nearly big enough to warrant so many indices; that number could easily fit in a single index if that's a concern.
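A filtered alias along those lines would look roughly like this (the index name, alias name, and customer_id field are placeholders, not your actual mapping):

POST /_aliases
{
  "actions" : [
    {
      "add" : {
        "index" : "documents_2015",
        "alias" : "costumer_12345",
        "filter" : { "term" : { "customer_id" : "12345" } }
      }
    }
  ]
}

Searches against costumer_12345 then only hit that customer's documents within the shared yearly index.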

@tylerjl

First of all, thank you for the fast response.

I thought a "kagillion" shards were OK if the lookups are small (Shay Banon). I thought we had solved that problem with aliases on the indices, since we only search one specific customer with every request.

I understand your advice, but then I don't know how many shards we should use for the one index. Would 20 be okay? (We have 9 data nodes and could add some in the future.)

If we changed the structure to one index per year, how could we transfer our data into the new structure without deleting everything and migrating from scratch? Is there good documentation or are there scripts that we could use?

And my last question: what do you think about the memory? At the moment it looks like we are running out of memory. Should we add more memory, or would the new setup help and reduce memory usage?

Regards
Andi

I'd point out that in the presentation you're referring to, Shay uses the example of 300 shards in a 50 node cluster. We're talking about 20,000 shards distributed amongst 9 nodes, roughly 370 times as many shards per node. Once you're dealing with so many shards, the benefits of distributing the data get pitted against the cost of coordinating between all of them, and that cost becomes significant. I can't guarantee that reducing the shards will fix it, but there are definitely cases where I've seen it resolve problems. :smile:

As far as shard counts go, overallocating can give you room to grow, yes. If the primary index storage is 1.5 TB, it may be beneficial to split that among more than just a few indices to keep individual shards from becoming too large. Alternatively, you could keep monthly indices with fewer shards; if you mostly query smaller time ranges, that avoids hitting as many shards as querying a yearly index would.
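As a sketch of that second option (the template name, index pattern, and shard count are just examples to adjust to your data volume), an index template can enforce a lower shard count for every new monthly index:

PUT /_template/costumer_monthly
{
  "template" : "costumer_*",
  "settings" : {
    "number_of_shards" : 2,
    "number_of_replicas" : 1
  }
}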

To move documents between indices, you could use the Logstash elasticsearch input or stream2es; either gives you the flexibility to pull all of the documents out of an index.
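A minimal Logstash sketch for that kind of reindex might look like the following; the hosts, index names, and some option names vary between Logstash versions, so treat it as a starting point rather than a drop-in config:

input {
  elasticsearch {
    hosts => ["localhost:9200"]      # source cluster (placeholder)
    index => "costumer_2015_05"      # old monthly index to drain
    docinfo => true                  # keeps _id etc. in @metadata, if your plugin version supports it
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]      # target cluster (placeholder)
    index => "documents_2015"        # new yearly index
    document_id => "%{[@metadata][_id]}"   # preserve the original document IDs
  }
}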

For memory, based on the document count of 170M, that should easily fit into what you have available. However, the figure of 1.5 TB of data suggests you may have very large documents. I'd suggest confirming that number is accurate (i.e. make sure it's not counting replicas; Google around for some queries to look at sizing). Also, how is the data being used? Some non-optimal querying can hurt memory; for example, performing aggregations on analyzed fields can quickly fill up the JVM heap.
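One quick way to sanity-check the size is the _cat/indices API, which shows primary-only storage (pri.store.size) next to the total including replicas (store.size); localhost:9200 is a placeholder here:

curl 'localhost:9200/_cat/indices?v&h=index,docs.count,pri.store.size,store.size'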

Yes, I agree, that's way too many shards for that number of nodes in the cluster. We once had 3 nodes with 900 shards, which killed the cluster, and we went all the way down to 10 shards. The data grows gradually every second, and as the index sizes grow we tune the settings along the way.

@Jason_Wee
Can you tell us the size of your shards and the number of documents? We are curious about the maximum shard size that still gives request results in under a second.
Do you use only one index?

Hello @andimak, each shard averages around 75 GB and close to 300 million documents. That's the heaviest index in the cluster, but we have other indices as well.
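If you want to check this on your own cluster, _cat/shards lists per-shard document counts and on-disk sizes (the host and index name are placeholders):

curl 'localhost:9200/_cat/shards/costumer_2015_05?v&h=index,shard,prirep,docs,store'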