Recommendation/Tips for insert heavy deployment (500K/sec)?

Hi, we are working on a large-scale Elasticsearch deployment. It's expected to handle 500K log inserts/sec, with each log about 200 bytes. Search queries are expected to run at about 2K/sec. We have created a 10-node cluster. Our machines have 64GB RAM, 20 cores, and 50TB of 7.2K RPM HDDs. Any recommendations/tips?
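For context, a quick back-of-envelope calculation of what those numbers imply for raw ingest volume (before replicas and indexing overhead):

```python
# Back-of-envelope ingest estimate for the numbers above.
docs_per_sec = 500_000
doc_bytes = 200

raw_mb_per_sec = docs_per_sec * doc_bytes / 1e6   # raw ingest bandwidth in MB/s
raw_tb_per_day = raw_mb_per_sec * 86_400 / 1e6    # raw volume per day in TB

print(f"{raw_mb_per_sec:.0f} MB/s, {raw_tb_per_day:.2f} TB/day")
```

So roughly 100 MB/s and ~8.6 TB/day raw, and the on-disk footprint will be larger once replication and index structures are factored in.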

I'm no expert on raw indexing performance, but I'm pretty sure SSDs are way,
way faster. You might want to try some nodes with SSDs. It's possible to
build a setup where new data is indexed onto the SSDs, and when the day
is done that index is optimized and pushed over to nodes with spinning disks.
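As a sketch, that daily rollover could look something like the following, assuming the nodes are tagged with a custom attribute (here `disk: ssd` / `disk: hdd`, names of my choosing) and a daily index like `logs-2015.06.01` (hypothetical). On recent Elasticsearch versions the optimize call is `_forcemerge`:

```
# 1. Merge yesterday's index down to one segment (it is now effectively read-only).
curl -XPOST 'localhost:9200/logs-2015.06.01/_forcemerge?max_num_segments=1'

# 2. Ask Elasticsearch to relocate its shards to the HDD-backed nodes.
curl -XPUT 'localhost:9200/logs-2015.06.01/_settings' -d '
{ "index.routing.allocation.require.disk": "hdd" }'
```

The shard relocation happens in the background; the index stays searchable throughout.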


+1 on SSDs as Nik mentioned. The Elasticsearch documentation's page on tuning for indexing speed also contains useful information.

Apart from the high indexing rate, which as mentioned will benefit from SSDs since it is very I/O intensive, you also have a relatively high search rate, at least compared to most typical logging use cases I come across. Depending on the nature of the data and searches, and on your latency requirements, you will likely need to tune the cluster for a balance of indexing and querying rather than for pure indexing performance.
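A few index settings commonly traded off for heavy ingest are the refresh interval, replica count, and translog durability. A sketch of what that might look like (values are illustrative, not a recommendation for your workload; async translog durability accepts losing the last few seconds of data on a crash):

```json
{
  "index": {
    "refresh_interval": "30s",
    "number_of_replicas": 1,
    "translog.durability": "async"
  }
}
```

Note the tension with your search rate: a longer `refresh_interval` helps indexing throughput but delays when new documents become visible to those 2K searches/sec, so you will want to measure where the acceptable balance lies.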

You may also benefit from a tiered structure: nodes with SSDs dedicated to indexing and to serving searches on the most recent data, and other nodes backed by HDDs that only handle longer-term storage of older indices, which are effectively read-only and searched less frequently. The configuration of these nodes and the ratio between them will, however, depend on the expected retention period and on what your search patterns look like.
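Such a hot/cold split can be set up with shard allocation filtering. A minimal sketch, assuming an attribute named `disk` (the attribute name is arbitrary; on 5.x+ it is declared under the `node.attr.` prefix, on earlier versions directly as `node.disk`):

```yaml
# elasticsearch.yml on the SSD-backed ingest/search nodes:
node.attr.disk: ssd

# elasticsearch.yml on the HDD-backed storage nodes:
node.attr.disk: hdd
```

New daily indices would then be created with `"index.routing.allocation.require.disk": "ssd"` (e.g. via an index template) so they land on the SSD tier, and that setting is flipped to `"hdd"` once an index ages out of the hot window.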