Hello
I have a 2-node ES cluster: one master-eligible data node and one non-master ("slave") data node. There are 3 indexes, each with 5 shards, 0 replicas, and about 300 million documents. Each node has a 2-core CPU and 32 GB RAM, with 20 GB allocated to Elasticsearch.
Indexing happens via the bulk API: 3,000 documents every 2 minutes, with a forced refresh.
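For reference, the bulk request looks roughly like this (index, type, and field names here are placeholders; `?refresh=true` is what forces the refresh of the affected shards once the bulk completes):

```
POST _bulk?refresh=true
{ "index": { "_index": "my-index", "_type": "doc", "_id": "1" } }
{ "some_field": "value 1" }
{ "index": { "_index": "my-index", "_type": "doc", "_id": "2" } }
{ "some_field": "value 2" }
```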
My question is: is there a document limit per index that I could hit in my case?
An aggregation that took 20 seconds when the index held 150 million documents now takes 600 seconds.
Searches take a long time as well.
The documents have nested fields, and some fields use the keyword analyzer. Could that be the bottleneck?
Thanks a lot for your opinion and ideas!
There is a limit of roughly 2 billion documents per shard (2^31 - 1), which is a hard Lucene limit; note that nested documents count toward it, since each nested object is stored as a separate Lucene document.
However, when doing aggs over an increasing amount of data, things will slow down; how much depends on what your queries are doing.
Also, the more you load into ES, the more resources it uses, which is where scaling horizontally is good, as it gives you more resources to work with.
Hello Mark, thank you for the information. At least I know the limit now :-).
I have a fairly heavy aggregation; there is an example below.
I have 5 shards in the index: shards 0, 1, and 4 on the master, shards 2 and 3 on the slave. If I add 3 nodes, so there are 5 nodes with 1 shard each, will search get faster? Queries? Aggregations? If yes, why? I have read that Lucene searches shards one by one, linearly, so how would adding nodes increase search speed?
Is there any correlation between the number of documents in an index and the number of shards? Approximately how many documents should one shard hold?
As for the refresh operation: I have two types in the index. The main type holds the 300 million documents; the other is a UI type. There is a big inflow of UI-type documents, but almost all of them are identical, so before indexing I need to check that the same document is not already in the index. The actual index operation happens quite rarely, but I need a refresh to be sure the last added UI-type document is searchable. Does the refresh operation apply to the whole index, not only the type? Maybe it is better to move the UI type to another index?
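For context, a sketch of the alternatives I am considering (index and type names are placeholders; `?refresh=wait_for` assumes Elasticsearch 5.x or later): either wait for the document to become visible on the write itself, or relax the index-wide refresh interval.

```
PUT ui-index/ui/1?refresh=wait_for
{ "ui_field": "value" }

PUT ui-index/_settings
{ "index": { "refresh_interval": "30s" } }
```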
Query and aggregation performance will vary with the size of the shards. The relationship between shard size and query performance will depend on the type of data you are indexing as well as the type and complexity of queries you are running. In order to determine the optimal shard size, it is recommended that you create an index with a single shard and no replicas and then index batches of documents into it. After each batch, run your queries and aggregations and keep track of the response times. These should increase with the size of the shard and will give you a good idea of how large shards you should have in order to meet your latency requirements. Based on this and the expected data volumes, you can determine how many shards an index should have.
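A sketch of that benchmarking setup (index name, field name, and the aggregation are placeholders, not a recommendation for your mapping): create a single-shard index with no replicas, then after each indexed batch re-run your queries and record the `took` field (milliseconds) from each response.

```
PUT benchmark-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  }
}

GET benchmark-index/_search
{
  "size": 0,
  "aggs": {
    "by_field": {
      "terms": { "field": "some_keyword_field" }
    }
  }
}
```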
Thank you a lot, Christian! I will try to measure it the way you mentioned.
There are 5 shards for 300 million documents, so approximately 60 million documents per shard.
If I make an index with 10 shards, each shard will handle 30 million documents.
Right now, with 2 nodes (master holds 3 shards, slave holds 2), search is slow.
If I have 2 nodes with 5 shards each, will search get faster since the shards are smaller? Or do I need to add nodes?
I am trying to understand how ES performs search. Sorry if my questions seem stupid.