I have been using ES for quite some time, but I am now facing a huge performance issue when searching. I have one index with 5 shards on 5 nodes, each with 32 GB RAM and a 16-core CPU, on SAN storage. The index is around 1.8 TB and holds approximately 480 million documents. We are required to serve 30 search threads at a time, and for each request our code fires 7 queries in threads (one for each of our 7 different pools), i.e. 30 * 7 = 210 requests at a time. When we execute a single pool we get good performance, but when we fire the queries for all pools concurrently they become slow. We debugged this and found that every pool's queries now take a long time, although they were very fast when run sequentially. Please help us improve the performance of these queries.
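To put the concurrency figures above in one place, here is a rough back-of-the-envelope sketch. It assumes every query is sent without routing and therefore touches all 5 shards, which the post does not state, so treat the shard-level numbers as an upper bound rather than a measurement.

```python
# Figures taken from the post above.
threads = 30          # concurrent client requests
pools = 7             # queries fired per request (one per pool)
shards = 5            # primary shards in the index
nodes = 5
cores_per_node = 16

concurrent_queries = threads * pools            # 210 search requests at once

# Assumption: no routing, so each search fans out to every shard.
shard_level_searches = concurrent_queries * shards

total_cores = nodes * cores_per_node
searches_per_core = shard_level_searches / total_cores

print(concurrent_queries, shard_level_searches, total_cores, searches_per_core)
```

If the queries use routing or hit only some shards, the shard-level figure would be lower.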
I'd try to keep every shard under 20 GB, so you should probably increase the number of shards.
Obviously this is something you should test, because it depends on your data and searches.
This rule of thumb would give you 90 primary shards (1.8 TB / 20 GB). With 16 cores per node, 5 to 6 nodes should be fine.
This does not include replicas. If you are using the default (one replica), you have to allocate 180 shards. If so, I'd probably increase (double?) the number of nodes.
But you have to test all that!
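The arithmetic behind those shard counts can be sketched as follows. This assumes 1.8 TB is treated as 1800 GB and that the 20 GB per shard target is only a rule of thumb; the function name is made up for illustration.

```python
import math

def primary_shards(index_size_gb, target_shard_gb=20):
    """Number of primary shards needed to keep each shard at or under the target size."""
    return math.ceil(index_size_gb / target_shard_gb)

index_size_gb = 1800                      # 1.8 TB, taken as 1800 GB
primaries = primary_shards(index_size_gb)

# With one replica per primary (the Elasticsearch default), total shards double.
replicas_per_primary = 1
total_shards = primaries * (1 + replicas_per_primary)

print(primaries, total_shards)
```

Changing `target_shard_gb` shows how sensitive the shard count is to the chosen target size.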
Note that increasing the number of shards requires reindexing your data.
Something important you did not mention:
- elasticsearch version
- java version
Thanks for your prompt reply. We are using ES version 1.7.4 with Java 1.8. We can't increase the number of shards now, as it has already taken 2 months to load all the data: we extracted our Oracle DB tables and inserted them into the ES index.
We have used the following settings on each node:
sysctl -w vm.max_map_count=1048576
ulimit -l unlimited
ulimit -n 200000
We are not using any replicas...
If you have disk space available, you can create a new index with the new shard count and reindex from the old index into it. There are already tools to help with this, such as Logstash or the reindex helper in the Python Elasticsearch client.
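As a rough illustration of the Python route, here is a minimal sketch using the `elasticsearch` client's `helpers.reindex`. The host and index names are placeholders, the shard count of 90 comes from the rule of thumb earlier in the thread, and it assumes a client version compatible with your cluster. Note that this helper copies only the documents, not the mappings, so any custom mapping would have to be added to the new index first.

```python
def new_index_body(num_shards, num_replicas=0):
    """Settings body for the new index (no replicas, matching the current setup)."""
    return {
        "settings": {
            "number_of_shards": num_shards,
            "number_of_replicas": num_replicas,
        }
    }

def reindex_to_new_index(hosts, old_index, new_index, num_shards):
    # Imported here so the settings helper above works without the client installed.
    from elasticsearch import Elasticsearch, helpers

    client = Elasticsearch(hosts)
    # Create the target index with the new shard count before copying data.
    client.indices.create(index=new_index, body=new_index_body(num_shards))
    # Scrolls over the old index and bulk-indexes the documents into the new one.
    return helpers.reindex(client, source_index=old_index, target_index=new_index)

# Example call (hypothetical host and index names):
# reindex_to_new_index(["localhost:9200"], "old_index", "new_index", 90)
```

Because the old index stays in place until the copy finishes, searches can keep running against it in the meantime.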
I need to understand why I should create a new index with more shards. Will that improve concurrent search performance? We have 5 nodes and we created 5 shards. Load average and CPU utilization are not a problem for us; every node is at 90% load average and 90% CPU utilization.