We are doing a project that uses Elasticsearch as the search engine for a large amount of text content. The goal is to offer full-text search that retrieves the top 10 documents matching a user-supplied phrase from 30M documents, with an index size of around 50GB. Is it possible, in theory, to achieve 100ms per query using whatever features Elasticsearch offers, such as pre-loading so that almost all data access is in memory, or a cluster so that each query is split across multiple servers, with each server searching < 10M documents?
Initial testing on a 4GB trial cloud server gave us 3s per query, which is far too slow.
Does anyone have a use case like this? What is the best performance you have achieved?
I think you need to find the typical limit of a single shard: how many documents it can hold before queries exceed your latency target.
Then double both the number of shards and the number of documents and test again.
Then increase the number of shards once more and test again.
You should be able to determine the right number of shards for your use case.
Note that you need to run some typical queries multiple times for this.
ESRally might help you automate these tests.
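For reference, a Rally run against an already-running cluster might look something like this (a sketch: the track path and host are placeholders, and `--pipeline=benchmark-only` tells Rally to benchmark the existing cluster rather than provision one itself):

```shell
# Hypothetical invocation: benchmark an existing cluster with a custom track
# (your own index and query definitions) instead of a built-in one.
esrally race --track-path=./my-track --target-hosts=localhost:9200 --pipeline=benchmark-only
```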
BTW, always check the "took" time in the Elasticsearch response. Maybe you are measuring something else (like the network), which makes you think that Elasticsearch is slow.
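To illustrate the point, here is a small sketch that pulls "took" (the server-side query time in milliseconds) out of a search response and compares it with the wall-clock time measured on the client; the response body and the 3000ms figure are made-up illustrative values:

```python
import json

def server_took_ms(response_body: str) -> int:
    """Extract the server-side query time (ms) from an Elasticsearch search response."""
    return json.loads(response_body)["took"]

# Hypothetical trimmed search response for illustration:
sample = '{"took": 42, "timed_out": false, "hits": {"total": {"value": 10}}}'

wall_clock_ms = 3000  # what the client measured end to end (example number)
overhead_ms = wall_clock_ms - server_took_ms(sample)
# Everything outside "took" is network, serialization, and client overhead:
print(overhead_ms)  # → 2958
```

If "took" is small but your measured latency is large, the bottleneck is not Elasticsearch itself.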
Thanks very much for the heads-up. If I understand correctly, performance is basically determined by the shard: the number of documents it holds determines how fast a query can be executed. We will carry out the tests as you suggested to see how we can achieve our goal.
BTW, the 4GB trial cloud server was set up with 5 shards, and "took" showed ~3s on the 30M-document index we created. I guess that is normal given the available RAM and the number of documents per shard?
The content in "field_1" is quite long, normally at least 100 generic English words. "field_2" is quite short: a single numeric token converted from an integer.
I have generated multiple test cases to evaluate the actual "match_phrase" query on the 30M documents (with nested fields, approximately 100M documents in the index). The fastest performance I have achieved so far on a single 128GB server is 20ms - 30ms average per query, with the Java heap set to 64GB, the index data files pre-loaded, and the data stored in 20 shards. (Query phrases are randomly generated, so there are no extreme cases hitting thousands of matches; just a pure single "match_phrase" query.)
Is that normal? Can I do any better? Since all the data is in memory with the above settings, I guess that is the practical limit for an FTS query?
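For context, a query of the kind described, sent to `GET /my_index/_search` (index name is a placeholder; "field_1" is from the earlier post), would look roughly like:

```json
{
  "size": 10,
  "query": {
    "match_phrase": {
      "field_1": "the quick brown fox"
    }
  }
}
```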
Set the number of shards equal to the number of cores/threads, so all cores work in parallel on each query with minimal overhead.
Set the heap to 30GB (less overhead: below ~32GB the JVM can still use compressed object pointers).
Sort the index by "field_2"; this lets Elasticsearch skip certain segments entirely (at the cost of slower writes).
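Index sorting has to be configured at index creation time. As a sketch, creating the index with `PUT /my_index` (placeholder name) would include settings along these lines:

```json
{
  "settings": {
    "index": {
      "sort.field": "field_2",
      "sort.order": "asc"
    }
  }
}
```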
Also, look at how many documents were matched! If there were many, Elasticsearch still has to score them all, even with the data in RAM, and that takes time. So if your queries match a lot of documents, you need a way to terminate early.
Can you sort by another field ?
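One way to cut off that scoring work (a sketch; the cutoff value is arbitrary and the field names are from the earlier posts) is the "terminate_after" search parameter, which stops collecting on each shard after the given number of documents, combined with "track_total_hits": false so the engine does not keep counting matches it will never return:

```json
{
  "size": 10,
  "terminate_after": 10000,
  "track_total_hits": false,
  "query": {
    "match_phrase": {
      "field_1": "example phrase"
    }
  }
}
```

Note that with "terminate_after" the reported hit counts become lower bounds rather than exact totals.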