I'm currently in the process of designing/deploying a cluster which needs to handle >2B docs/day and keep those docs for at least 3 months. This means an overall capacity of ~180B docs, all of which must be searchable at any moment. They are all relatively small docs (100M of them comes to around 35GB on disk). What are the main things I should be considering to optimise for search speed?
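To put rough numbers on the storage side: 2B docs/day × 90 days ≈ 180B docs, and at 35GB per 100M docs that works out to 180B ÷ 100M × 35GB ≈ 63TB of primary data, so ~126TB on disk once every shard has 1 replica.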
Currently I am planning on using a rolling index which cycles once per day and is split into 6 shards. Once the index is a day old (i.e. ingest has rolled over onto a new index) it will transition to the warm phase, which will shrink it to 3 shards and force-merge each shard down to 1 segment. After 2-4 weeks (not sure yet) it will then transition to cold and be moved to cheaper machines with HDDs. All shards have 1 replica. I will also use 6 high-performance machines for hot/warm indices and 3-6 slower machines for cold indices.
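Expressed as an ILM policy, the plan looks roughly like this (a sketch, not my final config: the 14d cold trigger stands in for the undecided 2-4 weeks, and the `data` node attribute assumes I've tagged the HDD machines with it):

```
PUT _ilm/policy/player-logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "1d",
        "actions": {
          "shrink": { "number_of_shards": 3 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "14d",
        "actions": {
          "allocate": { "require": { "data": "cold" } }
        }
      }
    }
  }
}
```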
As far as I can tell this is reasonable, but is there anything that may end up being a pain to deal with? Is there anything I could do to improve search performance further? It's hard to find any information on how much impact the number of shards/replicas has on search performance, or on whether it's better to have more or fewer shards (there seems to be conflicting information).
The ideal shard size will depend on the data as well as how you are querying it. Are you searching for documents or performing aggregations? Are the queries likely to primarily focus on the most recent data? How many concurrent queries do you need to serve? What are the latency requirements? Is it ok if it takes longer to query older data?
Our data is time-based logs (currently generating ~120M/day) of player activity in our game. We perform both large aggregations and lookups of a specific player's logs in a given time frame. We don't need to serve many concurrent queries and we have no hard latency requirements; we would just prefer faster queries. And yes, it's ok and expected that querying older data takes longer.
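For concreteness, a player lookup is just a couple of filters along these lines (index and field names here are illustrative, not our real ones):

```
GET player-logs-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "player_id": "abc123" } },
        { "range": { "@timestamp": { "gte": "now-7d", "lte": "now" } } }
      ]
    }
  }
}
```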
At the moment finding logs for a specific player is quite fast (usually within 100-200ms) but large aggregations can take quite a long time. I'm planning on using a rollup index (or a continuous transform? Not sure what the difference is tbh) to summarise the data we often aggregate on, but my initial tests don't show the huge speed-up I was expecting. I might just be doing it wrong as I'm still experimenting atm.
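For what it's worth, the continuous transform I've been experimenting with is shaped roughly like this (field/index names are placeholders, and the aggregation is just an example of the kind of summary we'd build):

```
PUT _transform/player_activity_daily
{
  "source": { "index": "player-logs-*" },
  "dest": { "index": "player-activity-daily" },
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  },
  "pivot": {
    "group_by": {
      "player_id": { "terms": { "field": "player_id" } },
      "day": { "date_histogram": { "field": "@timestamp", "calendar_interval": "1d" } }
    },
    "aggregations": {
      "events": { "value_count": { "field": "event.id" } }
    }
  }
}
```

Then `POST _transform/player_activity_daily/_start` runs it continuously, and the big aggregations would hit the summary index (one doc per player per day) instead of the raw logs.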
More generally, what I'm wondering is how splitting into shards and using replica shards affects performance. Will replica shards improve search performance? Is it faster to keep everything in one shard or to use many shards to distribute the search load? If this depends on the scenario, what factors affect it?
In any case our current performance is entirely acceptable; I'm more just curious what factors affect performance here.