Just how big should an index be allowed to be?

(Paul) #1

I have a lot of data sitting on a Hadoop cluster (12 nodes). I have Logstash installed on all 12 Hadoop data nodes, pointed at my 4-node ES cluster, so there are 3 Hadoop data nodes per ES node. Logstash pushes the data using the bulk API and the ES nodes index away.
This is up and running and I get about 8,000 docs/sec indexing per ES node, so about 32,000 docs/sec across the cluster.
My index has 4 shards, one per node, but it is already getting rather large. I am doing it on a monthly basis.

The question is, should I limit the size of the index to daily and then get 30 times as many indexes and 30 times as many shards? What are the considerations?

Current sizing:
logstash-2015.01 476GB 483 million docs (4 shards)
logstash-2015.02 589GB 459 million docs (4 shards)
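
To make the trade-off concrete, here is a rough sketch of the per-shard figures these numbers imply, and of what a switch to daily indexes would look like. The helper names `per_shard` and `daily_estimate` are made up for illustration, and the sketch assumes data is spread evenly across shards and days, and that a month is about 30 days.

```python
# Figures copied from the sizing list above; 4 shards per index.
INDEXES = {
    "logstash-2015.01": {"size_gb": 476, "docs": 483_000_000, "shards": 4},
    "logstash-2015.02": {"size_gb": 589, "docs": 459_000_000, "shards": 4},
}


def per_shard(size_gb, docs, shards):
    """Return (GB per shard, docs per shard), assuming an even spread."""
    return size_gb / shards, docs / shards


def daily_estimate(size_gb, shards, days=30):
    """Return (shards per month, GB per shard) if the index were split by day.

    Assumes the same shard count per daily index and an even daily volume.
    """
    total_shards = shards * days
    return total_shards, size_gb / total_shards


for name, stats in INDEXES.items():
    gb, docs = per_shard(**stats)
    n_shards, daily_gb = daily_estimate(stats["size_gb"], stats["shards"])
    print(f"{name}: monthly {gb:.2f} GB/shard, {docs / 1e6:.2f}M docs/shard")
    print(f"{name}: daily   {n_shards} shards/month, ~{daily_gb:.2f} GB/shard")
```

So the monthly indexes currently sit at roughly 119 GB and 147 GB per shard, while daily indexes with the same shard count would mean about 120 shards per month at only a few GB each.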

This is a test environment and the search load is very, very low. It is used for analytics and not high volume. Indexing performance is key.


(Adrien Grand) #2

The only actual limit is that a shard cannot hold more than 2 billion documents. In practice, users tend to enforce smaller shards in order to have better search performance (assuming you have enough nodes to distribute the load). Your current sizing looks reasonable to me.
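
As a quick sanity check, the sketch below estimates how close the posted indexes are to that limit. It uses the round figure of 2 billion rather than the exact constant, assumes documents are spread evenly across the 4 shards, and `shard_fill_ratio` is a hypothetical helper name.

```python
# Round figure from the sentence above, not the exact internal constant.
SHARD_DOC_LIMIT = 2_000_000_000


def shard_fill_ratio(total_docs, shards, limit=SHARD_DOC_LIMIT):
    """Fraction of the per-shard document limit in use, assuming an even spread."""
    return (total_docs / shards) / limit


# Document counts taken from the original post.
for name, docs in [
    ("logstash-2015.01", 483_000_000),
    ("logstash-2015.02", 459_000_000),
]:
    print(f"{name}: {shard_fill_ratio(docs, 4):.1%} of the per-shard limit")
```

Both monthly indexes come out at roughly 6% of the limit per shard, so the hard limit is a long way off and the choice between monthly and daily indexes is mostly about shard size and shard count.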
