I have a lot of data sitting on a Hadoop cluster (12 nodes). I have Logstash installed on all 12 Hadoop data nodes, and I point these at my 4-node ES cluster, so each ES node receives data from 3 Hadoop data nodes. Logstash pushes the data using the bulk API and the ES nodes index away.
This is up and running, and I get about 8,000 docs/sec indexing per ES node, so about 32,000 docs/sec in total.
My index has 4 shards, one per node, but it is already getting rather large. I currently create one index per month.
The question is: should I switch to daily indices, which would give me roughly 30 times as many indices and 30 times as many shards? What are the considerations?
logstash-2015.01 476GB 483 million docs (4 shards)
logstash-2015.02 589GB 459 million docs (4 shards)
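To put those numbers in terms of shards, here is a rough sketch of the arithmetic in Python. It assumes the data is spread evenly across shards and across the days of each month, and that a daily index would keep the same 4 shards; the `MonthlyIndex` class and its method names are made up for illustration and are not part of any Elasticsearch or Logstash API.

```python
from dataclasses import dataclass


@dataclass
class MonthlyIndex:
    """Figures for one monthly index, taken from the listing above."""
    name: str
    size_gb: float
    docs: int
    shards: int
    days: int  # calendar days in that month (assumption: even daily volume)

    def gb_per_shard(self) -> float:
        # Current shard size with one index per month.
        return self.size_gb / self.shards

    def daily_gb_per_shard(self) -> float:
        # Estimated shard size if the same data were split into daily
        # indices that keep the same shard count (an assumption).
        return self.size_gb / self.days / self.shards

    def daily_shards_per_month(self) -> int:
        # Total primary shards created per month with daily indices.
        return self.days * self.shards


indices = [
    MonthlyIndex("logstash-2015.01", 476, 483_000_000, 4, 31),
    MonthlyIndex("logstash-2015.02", 589, 459_000_000, 4, 28),
]

for idx in indices:
    print(
        f"{idx.name}: {idx.gb_per_shard():.1f} GB/shard monthly, "
        f"~{idx.daily_gb_per_shard():.2f} GB/shard daily, "
        f"{idx.daily_shards_per_month()} shards/month with daily indices"
    )
```

Under these assumptions the monthly shards are well over 100 GB each, while daily shards would be only a few GB each, at the cost of more than a hundred shards per month.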
This is a test environment and the search load is very low. It is used for analytics, not high-volume search. Indexing performance is key.