I have an index with roughly 60 million documents, and more are being added daily (about half a million every 2-3 months).
The field being searched is a text field that can contain a large amount of text (anywhere from 3-4 pages up to 200 pages), so there can be a lot of terms to match across shards. The search against this index also uses highlighting to show the terms matched from the query string.
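For reference, the search is essentially of the following shape. This is only a minimal sketch using the Python client; the index name ("documents"), field name ("content"), and query text are placeholders, not the real ones.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Placeholder index/field names; the real mapping uses a single large text field.
response = es.search(
    index="documents",
    body={
        "query": {"match": {"content": "terms from the query string"}},
        "highlight": {"fields": {"content": {}}},
    },
)

# Each hit carries the highlighted fragments showing where the terms matched.
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit.get("highlight", {}).get("content", []))
```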
My question is: what is the first thing I should do to speed up searches on this index, and where should I go from there? Even if it means sacrificing indexing speed, that doesn't really affect my workflow. The main thing I need is faster search.
At the moment the index has only 5 primary shards and I don't really know how many shards I should reindex it with. Maybe one shard per million documents?
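To make the reindexing idea concrete, this is roughly what I have in mind with the Python client. The index names and the shard count of 15 are only placeholders; the primary shard count can only be set when an index is created, which is why a new index plus a reindex is needed at all.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create the new index with more primary shards (the count here is a placeholder).
es.indices.create(
    index="documents_v2",
    body={"settings": {"number_of_shards": 15, "number_of_replicas": 2}},
)

# Copy everything over; wait_for_completion=False returns a task ID to poll,
# which is useful for a 60-million-document reindex.
task = es.reindex(
    body={"source": {"index": "documents"}, "dest": {"index": "documents_v2"}},
    wait_for_completion=False,
)
print(task["task"])
```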
I also have 3 data nodes hosting the data, and the index is replicated on each node (1 primary and 2 replicas). Would I see an increase in search speed if I added more data nodes and replicated the shards onto them? Or is this level of redundancy enough, with anything beyond it giving diminishing returns?
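On the replica side, my understanding is that the replica count (unlike the primary shard count) can be changed on a live index, so adding a node would just mean something like the sketch below. The index name is a placeholder, and the bump to 3 replicas is a hypothetical for a fourth data node.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Raise the replica count so each additional data node can hold a full copy.
es.indices.put_settings(
    index="documents",
    body={"index": {"number_of_replicas": 3}},
)
```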
The data nodes are currently running on 2 CPUs and 8 GB of RAM. I've heard that RAM is usually the main bottleneck for Elasticsearch searches, but it seems that highlighting taxes the CPU more than the RAM (this is based on my own observations; anyone who can prove it wrong is more than welcome and would actually help me).
I don't think I/O is an issue here either, as 1500 IOPS seems to be enough at the moment (again, anyone who can dispute this is more than welcome).
I just need to know which path to take first, as I don't have a lot of financial resources to work with, but if I can gather enough evidence I might be able to change that.
If more information is needed I will gladly provide it, but I thought these details were the most relevant to the situation.
Highlighting of large documents can be slow and CPU intensive. Try running queries with and without highlighting to see what the difference is. You can also use the Profile API, which will show you what makes the query slow.
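Something along these lines, as a rough sketch with the Python client (index and field names are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
query = {"match": {"content": "terms from the query string"}}  # placeholder names

# Run the same query with and without highlighting and compare the "took" times.
with_hl = es.search(
    index="documents",
    body={"query": query, "highlight": {"fields": {"content": {}}}},
)
without_hl = es.search(index="documents", body={"query": query})
print("took with highlighting:   ", with_hl["took"], "ms")
print("took without highlighting:", without_hl["took"], "ms")

# The Profile API adds a per-shard timing breakdown to the response.
profiled = es.search(index="documents", body={"query": query, "profile": True})
print(profiled["profile"]["shards"][0]["searches"][0]["query"])
```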
What do you mean exactly by moving to a time-based index? In this case I can't really narrow down my searches by time.
As far as I understood, you're saying I could do something like "get me the documents from the last 2-3 days that contain the text I'm searching for".
If that is the case, then I can't really implement that approach here. It's not that the documents don't have timestamps; it's that I actually have to search all of them.
As for the hardware, I already suspected it might just boil down to that in the end. I'll see what kind of improvements I can make.
Yeah, it's just more data being added to a sort of corpus. The timestamp doesn't matter and the searches aren't based on it.
The query itself is anywhere from 4-5 words up to a maximum of 30 words, searched in documents that contain anywhere from 500 words up to 300 or 500 thousand words, and the highlighting is used to display where in the documents the query text was found.
It seems that hardware is mostly the issue in this case for the moment.
I'll follow up here when I've arrived at a solution that satisfies my needs and describe what I did to get there (so that someone with the same problem can use it as a reference).
I have an idea that could help me make a better decision, but I'm not quite sure if it's recommended.
At the moment I have 3 data nodes and 3 master nodes. From what I've read, master nodes should become dedicated master nodes in larger clusters, or in cases where indexing and query operations are so heavy that a dedicated master node is a must in order to handle a data node going down.
The cluster at the moment is not that index-heavy (in terms of indexing operations per second) nor is it very read-heavy (in terms of searches per second).
I'm planning to make the 3 data nodes also master-eligible and use the money saved to put the data nodes on better hardware.
What do you think? Is this something I'll probably run into issues with?
I'm aware that when demand increases I'll have no choice but to use dedicated master nodes, but for the moment money is tight and I have to make some sacrifices somewhere.
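For concreteness, what I'm considering is running all three nodes with both roles, roughly like this in elasticsearch.yml (assuming a version recent enough to use node.roles; older versions use node.master and node.data instead):

```yaml
# elasticsearch.yml on each of the 3 nodes
node.roles: [ master, data ]
```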
I applied the strategy of keeping only 3 nodes but with better hardware, and it looks like there was no escaping the fact that hardware was the issue in this case.
Even creating more primary shards to decrease the size of each shard did not help increase search speed.
We decided to go with 3 nodes that are both master and data nodes, but with 8 CPUs and 16 GB of RAM. IOPS was also increased to 5000, but the main thing that made searches faster was the additional CPU and RAM. The higher IOPS helped with smaller but more frequent searches (more searches per second).
Performance increased 4x (I guess it scaled exactly with the increase in CPU and RAM in this case).
However, CPU usage doesn't seem to be as high as RAM usage. I guess the way to go from now on is to increase the RAM and keep the CPU cores at 8. This assumption may change in the future, but it's what we've found based on our current use case.
I might update this again in the future, but this forum doesn't seem like the right place to post updates. Any suggestions for where I could post updates other than here? I don't know where the appropriate place would be for this kind of "journal" (for lack of a better word).
You're definitely welcome to add to this topic! We do have it set to auto-close after 28 days, but you can also create a new topic and link to this one, and the forums will help connect it all.