How to scale daily indexing


(Grant U) #1

I know the default in Logstash is to create a new index daily. I've also seen in different places people recommend not running more than a couple shards per node, because each shard is a Lucene instance under the surface. If you're creating a new index daily, my understanding is you're creating a couple shards daily.

What am I missing? It doesn't seem possible that people are adding new nodes for every new day's worth of data.


(Magnus Bäck) #2

I know the default in Logstash is to create a new index daily. I've also seen in different places people recommend not running more than a couple shards per node, because each shard is a Lucene instance under the surface.

In theory they may be right, but big indexes with a large number of shards spanning longer periods of time are inflexible from a scaling perspective (you can't change the number of shards in an existing index without reindexing) and you risk ending up with very large shards. To avoid excessive recovery times you'll want to keep your shards to at most a few tens of gigabytes.

They could also have meant no more than a couple of shards per index but having multiple indexes. A node can easily host hundreds of shards.

If you're creating a new index daily, my understanding is you're creating a couple shards daily.

What am I missing? It doesn't seem possible that people are adding new nodes for every new day's worth of data.

No, of course not. Daily indexes are fine, but you may want to reduce the number of shards per index from the default of five. Each shard carries a constant cost so you don't want too many of them, but you also want data to be distributed across nodes. What's optimal depends on how many nodes you have, how many documents per day, the size of the documents, how many days' worth of indexes you're going to keep online, what type of queries are made, how frequently queries span over multiple days ... you get the idea.
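One common way to change the per-index shard count for daily indexes is an index template that matches the Logstash index name pattern. A minimal sketch, assuming the classic `_template` API of Elasticsearch from the Logstash 1.x/2.x era; the host, template name, and shard/replica counts are illustrative, not recommendations:

```shell
# Hypothetical template: every new logstash-* index gets 2 primary
# shards and 1 replica instead of the old default of 5 primaries.
curl -XPUT 'http://localhost:9200/_template/logstash_shards' -d '{
  "template": "logstash-*",
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}'
```

The template only affects indexes created after it is installed; existing daily indexes keep the shard count they were created with.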


(Spuder) #3

It is ok to have more than a couple of shards per node.

In my infrastructure, I have 3 Elasticsearch nodes, 1 Logstash index per day, 5 shards per index, with 1 replica (pretty much the defaults).

I use Curator to optimize each index after 1 day, close indexes more than 30 days old, and delete indexes more than 90 days old.
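That retention policy could be run from cron along these lines. This is a sketch assuming the Curator 3.x command-line syntax; the host name, index prefix, and exact flags are assumptions and differ between Curator versions:

```shell
# Optimize (force-merge) yesterday's index, close indexes older than
# 30 days, and delete indexes older than 90 days. All selectors are
# illustrative and match daily logstash-YYYY.MM.DD index names.
curator --host localhost optimize indices --older-than 1 \
    --time-unit days --timestring '%Y.%m.%d' --prefix logstash-
curator --host localhost close indices --older-than 30 \
    --time-unit days --timestring '%Y.%m.%d' --prefix logstash-
curator --host localhost delete indices --older-than 90 \
    --time-unit days --timestring '%Y.%m.%d' --prefix logstash-
```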

This means that there are 300 shards and 30 indexes open at any one time. I don't do a lot of queries, but it performs well enough. I may drop the shards from 5 to 3.
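The 300-shard figure follows directly from the settings above: 30 open daily indexes, each with 5 primary shards, each primary carrying 1 replica.

```shell
# 30 open indexes x 5 primaries x 2 copies (primary + 1 replica)
indexes=30
primaries=5
copies=2
echo $((indexes * primaries * copies))   # prints 300
```

Dropping from 5 primaries to 3 would cut that to 180 open shards, or 60 per node on a 3-node cluster.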

One guideline I've heard from others who use ELK is to keep each shard to a maximum of a few tens of GB.
