We are using Elasticsearch 2.4 to index and search web pages, and recently it started to get slow.
Currently we have 5 nodes (4 data nodes) with:
16 GB of RAM allocated to elastic (out of 32)
8 cores
300 GB HDD (two nodes with SSD, two nodes with mechanical)
We are about to replace the nodes with new machines.
We are looking at getting 5 used ProLiant DL360e Gen8 machines with the following spec:
2x Xeon E5-2450L 8-Core 1.8 GHz
64 GB RAM (allocate 32 GB to elastic)
500 GB SSD (6G SATA)
Our index has 12 shards, 1 replica, and a refresh interval of 5 seconds. Index size is ~500 GB (including replicas) over 45M documents. Average document size is 6 KB.
The mapping contains mostly not analyzed fields and a few fields with analyzed text in which we search with several match_phrase queries (which got slow).
We are continuously updating and indexing documents using bulk at a rate of about 15k/minute.
What would your recommendations be for our setup, considering that we expect to get more data (1-2M documents per day) and also need better search speed?
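To put the expected growth in terms of disk, here is a rough back-of-the-envelope sketch using only the figures quoted above (~500 GB including replicas, 45M documents, 1-2M documents per day). It assumes every incoming document is new rather than an update, and that new documents take about the same space on disk as existing ones; both are assumptions, so treat the result as an order of magnitude only. The function name is made up for illustration.

```python
# Figures taken from the post above.
INDEX_BYTES = 500e9   # ~500 GB on disk, replicas included
DOC_COUNT = 45e6      # 45M documents

# Average on-disk footprint per document, replica included.
BYTES_PER_DOC = INDEX_BYTES / DOC_COUNT

def daily_growth_gb(docs_per_day):
    """Estimated disk growth in GB/day, assuming all documents are new."""
    return docs_per_day * BYTES_PER_DOC / 1e9

for docs_per_day in (1e6, 2e6):
    print(f"{docs_per_day:,.0f} docs/day -> ~{daily_growth_gb(docs_per_day):.1f} GB/day")
```

Under those assumptions this works out to roughly 11 to 22 GB of additional disk per day across the cluster, which is worth comparing against the total capacity of the new SSDs.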
In addition to better hardware, I'd also recommend upgrading to 5.4 if this is an option, which should have better performance on range queries than 2.4.
We will definitely review and adopt as much as possible from your link.
About hardware: it states that fast SSDs, fast CPUs and plenty of memory for the filesystem cache are important.
Is there anything particular that is more important than others?
From what we understood, 32GB is the maximum we could allocate to Elasticsearch.
Would there be a big difference in allocating for example 64GB to OS/filesystem instead of 32GB? (buying 96GB instead of 64GB)
For CPU, is clock speed more important than core count?
It depends. If you have plenty of filesystem cache, then the performance of your disks is less of an issue. Fast CPUs are a must, however.
You should iterate recursively over all files in your data directory and add up the sizes of all files that do not have an fdt extension. If the total is less than the size of your FS cache, then adding more memory will barely help. Otherwise it might help, but I can't quantify by how much.
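The check described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name is made up, and the data directory path depends on your installation (it is whatever `path.data` points to on each node).

```python
import os

def size_excluding_fdt(data_dir):
    """Return the total size in bytes of all files under data_dir
    that do not have an .fdt extension."""
    total = 0
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.endswith(".fdt"):
                continue  # skip .fdt files, as suggested above
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A file can disappear mid-walk (e.g. after a segment merge).
                pass
    return total
```

Run it on each data node, for example `print(size_excluding_fdt("/var/lib/elasticsearch") / 1024**3, "GiB")` (the path here is an assumption), and compare the result with the RAM left over for the OS after the Elasticsearch heap is subtracted.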
The former (clock speed) will help with latency and the latter (core count) will help with throughput.