I set up Elasticsearch to create a new index every day, but when I first started the stack I "synchronized" a lot of past log data right away, so the first index is a LOT larger than the ones that follow it (50 GB instead of 1 GB). Is this going to make my searches a lot slower? I'm worried that a single worker thread will be assigned to search the whole thing by itself and will take much longer than the threads searching the other indices. Would it make a difference if I reindexed the data so that the initial upload was spread over several indices, or is this not even a factor?
Each query runs single-threaded against each shard, but multiple queries and multiple shards can be processed in parallel. The size of a shard therefore does affect query latency, which is why we generally recommend benchmarking to find the ideal shard size for your data and queries.
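To make the reasoning concrete, here is a rough toy model in Python, not a description of how Elasticsearch actually schedules work. It assumes that search time per shard grows linearly with shard size and that idle threads pick up the largest remaining shard first. The function name, the `seconds_per_gb` cost, and the thread count are all hypothetical values chosen for illustration.

```python
import heapq


def estimated_query_latency(shard_sizes_gb, seconds_per_gb, worker_threads):
    """Toy estimate of one query's latency across several shards.

    Assumptions (not taken from Elasticsearch itself): each shard is
    searched by exactly one thread, cost is linear in shard size, and
    the largest remaining shard always goes to the first free thread.
    """
    # Each heap entry is the time at which a worker thread becomes free.
    workers = [0.0] * worker_threads
    heapq.heapify(workers)
    for size in sorted(shard_sizes_gb, reverse=True):
        free_at = heapq.heappop(workers)
        heapq.heappush(workers, free_at + size * seconds_per_gb)
    # The query is done when the last thread finishes its last shard.
    return max(workers)


# One 50 GB shard plus ten 1 GB shards, versus the same 60 GB in 5 GB shards.
uneven = [50] + [1] * 10
even = [5] * 12

print(estimated_query_latency(uneven, seconds_per_gb=0.1, worker_threads=4))
print(estimated_query_latency(even, seconds_per_gb=0.1, worker_threads=4))
```

In this model the uneven layout is limited by the single 50 GB shard, because no other thread can help with it, while the even layout spreads the same amount of data across all threads. Real latencies depend on the query, caching, and hardware, so this is only a way to think about the effect; benchmarking is still the way to pick a shard size.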
To follow up on this issue, it does seem that removing the one large shard stabilized the Elasticsearch node. After increasing the number of shards, I ran into an additional issue: the search worker thread queue ran out of space. Increasing the search queue size from 1000 to 5000 resolved this, but searches now take slightly longer.
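For anyone hitting the same thing, this back-of-the-envelope sketch in Python shows why more shards can fill the search queue. It assumes, as a simplification, that every search fans out into one task per shard on a single node and that all tasks arrive at once. The function name, the number of concurrent searches, the shard count, and the thread count are hypothetical; only the queue sizes 1000 and 5000 come from my setup.

```python
def rejected_shard_tasks(concurrent_searches, shards_per_search,
                         search_threads, queue_size):
    """Rough count of shard-level search tasks that would not fit.

    Simplifying assumption: all tasks arrive at the same moment, so the
    node can hold one task per thread plus queue_size waiting tasks.
    """
    tasks = concurrent_searches * shards_per_search
    capacity = search_threads + queue_size
    return max(0, tasks - capacity)


# Hypothetical load: 40 searches at once, each touching 30 shards.
for queue_size in (1000, 5000):
    rejected = rejected_shard_tasks(
        concurrent_searches=40,
        shards_per_search=30,
        search_threads=13,
        queue_size=queue_size,
    )
    print(queue_size, rejected)
```

With the larger queue nothing is rejected in this example, but the extra tasks still have to wait for a free thread, which fits with the slightly longer searches I saw.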