I have a 10 node es cluster plus 3 masters. We are running 2.0. 8CPU, 64G Memory Virtual environment. We have about a billion records in the cluster. We are loading in the low hundred million per day and keep about 5 days of data. Each day is it's own index with about 250 million recs. We create an alias for last 5 days. We have about 4 million or so unique groups (group_id) for this entire 5 day set which we use to "route" records. 90% of our queries utilize the groupID so this is a good key/route for us. The problem is that our queries still take about 4-5 seconds to run. The average record result set is about 250 records. We have 100 shards per index so the 5day has 500 shards.
I was thinking about increasing the shard number to 200 per index so that each shard gets cut in half...thinking that with the routing it might reduce the response time. We have 10 additional nodes that we could add to the cluster which we are planning on doing which should help.
However, I Just wanted to check on the best design for what were doing. Is there a way to "tune" the shard routing table? Could their be contention here...or is it more likely that with just 250 million records, data is the problem. It seems as though if we route successfully and go to a single shard that this shouldn't take 5 seconds to return. Would it make sense to increase shards to like 500 per index with the assumption that most/all queries are "route" based so that this would help because shards are alot smaller. The problem here might be queries without a route could really pound the cluster because of the number of shards.
Any help/thoughts would be appreciated.