We have recently been seeing high latency in our Elasticsearch cluster, especially during ingestion/indexing, even though CPU utilisation during these periods is below 30%. We run a batch job that calls the ES update API, so each document is either updated (if it already exists) or inserted (if it does not).
In our ingestion jobs, around 60-70% of the documents are updates rather than inserts.
We house 99% of our data in a single index containing 1.7B documents, around 4.5 TB.
The index has 30 shards (15 primaries, 15 replicas), each with an average size of ~130 GB, and a refresh interval of 1 second.
We are already working on getting the shard size down to around 40-50 GB, and on either increasing the refresh interval or disabling refresh during ingestion.
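As a rough back-of-the-envelope sketch of what that shard target implies, assuming the ~130 GB average applies to each of the 15 primaries and that the data volume stays constant (the function name is purely illustrative):

```python
import math

def primaries_needed(current_primaries: int, avg_shard_gb: float, target_shard_gb: float) -> int:
    """Estimate how many primary shards keep each shard at or under the target size."""
    total_primary_gb = current_primaries * avg_shard_gb
    return math.ceil(total_primary_gb / target_shard_gb)

# 15 primaries at ~130 GB each is roughly 1950 GB of primary data.
print(primaries_needed(15, 130, 50))  # 39 primaries for ~50 GB shards
print(primaries_needed(15, 130, 40))  # 49 primaries for ~40 GB shards
```

So the 40-50 GB target works out to somewhere around 39 to 49 primaries under these assumptions; the right number also depends on node count and growth.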
I would like to know what else we can do to improve latency (especially during bulk indexing). I have the following questions:
Which is the more expensive operation: inserting a new document or updating an existing one? I ask because an update in ES makes a copy of the document with the updated fields and deletes the old document, and most of our ingestion consists of document updates.
If we update a set of documents that share a common field while also querying that same set of documents (based on the common field), will that increase the latency of those queries?
Since we have only one index housing a large number of documents, and ingestion runs against it almost all the time, would breaking it down into smaller indices improve performance?
We are also looking at an architecture that separates the read index from the write index: we would switch the read alias to the new index once ingestion is done. Would this help, given that we would then be querying an index that is not being updated during ingestion?
What type of storage are you using? What does iostat -x look like on the node when you experience slowness?
If you are specifying the document ID when you index, Elasticsearch first needs to look it up, which is basically the same work as for an update. If you let Elasticsearch assign the document ID, inserts are more efficient, but you may then not know the document ID.
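To make the two cases concrete, here is a minimal sketch of the request body for the `_bulk` API in each style. It only builds the NDJSON text and does not send anything. The helper name, index name and sample fields are hypothetical, and the use of `doc_as_upsert` is an assumption about how the "update or insert" behaviour is achieved:

```python
import json

def bulk_lines(index, docs, use_client_ids=True):
    """Build an NDJSON body for the _bulk API.

    docs is an iterable of (doc_id, source_dict) pairs.
    """
    lines = []
    for doc_id, source in docs:
        if use_client_ids:
            # update with doc_as_upsert: merge fields if the document exists,
            # insert it otherwise. Requires an ID lookup per document.
            lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
            lines.append(json.dumps({"doc": source, "doc_as_upsert": True}))
        else:
            # index action without _id: Elasticsearch generates the ID,
            # so no lookup of an existing document is needed.
            lines.append(json.dumps({"index": {"_index": index}}))
            lines.append(json.dumps(source))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"

body = bulk_lines("docs-v1", [("a1", {"price": 10})])
print(body)
```

The second style is only an option when nothing downstream needs to address documents by a known ID.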
I am not sure I understand what you are asking. Querying and indexing at the same time consumes more resources, and I suspect you are hitting a resource bottleneck (this is often storage performance).
It may, as it could make merging easier and increase parallelism.
I am not sure how you are going to make this work. Will you clone the index, apply the changes and then switch the alias? Whether this helps or not will likely depend on what the bottleneck is. It may help with caching, as the read index does not change, but on the other hand you will have more data on disk, so the page cache hit rate may suffer. I suspect you will need to test.
I didn't understand the response. As I said, in every ingestion job we call the update API with a batch size of 10,000 (yes, we pass _id). About 70% of those documents already exist, so their fields get updated, and the other 30% are inserted.
My question is: would it be better to just insert all the documents into a separate index? Would this help with latency during ingestion, since we would no longer be updating documents that are being queried at the same time?
Basically, we generate a new version of the documents every week, which is indexed and used for that week (last week's documents are no longer needed at all). Week on week there is a 70% overlap between the documents, which is why we used the update API. Could we instead create a new index each time and index the new version of the documents there, rather than updating most of the documents in the same index? (I understand this means more disk usage, since we are creating the documents again in the new index, but that is acceptable if we get better search latency during ingestion.)
OK. If you are periodically indexing/updating all documents (basically rebuilding the whole index), versioning the index makes sense. You search version N while building version N+1 in the background, then switch the alias once the build has completed and delete version N. This should be more efficient, and it also allows you to designate some nodes in the cluster for indexing (building index version N+1) and others for serving queries against version N. The query nodes would then not be affected by indexing at all.
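A minimal sketch of the two request bodies this approach relies on: an atomic alias swap via `POST /_aliases`, and index-level shard allocation filtering to keep the new index on designated nodes. The alias, index and attribute names are hypothetical, and the allocation setting assumes the indexing nodes were started with a matching custom attribute (for example `node.attr.workload: indexing`):

```python
import json

def alias_swap_body(alias, old_index, new_index):
    """Body for POST /_aliases: moves the alias in a single atomic request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

def pin_to_nodes_settings(attr, value):
    """Index settings requiring shards to sit on nodes with node.attr.<attr> = value."""
    return {"index.routing.allocation.require." + attr: value}

# Build version N+1 on the indexing nodes (settings applied to the new index) ...
print(json.dumps(pin_to_nodes_settings("workload", "indexing")))
# ... then, once the build is complete, point the read alias at it.
print(json.dumps(alias_swap_body("docs-read", "docs-v1", "docs-v2"), indent=2))
```

Before the swap, the allocation requirement on the new index would need to be changed so its shards relocate to the query nodes; how long that takes depends on shard size and network throughput.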
Can you also tell us whether an update is more expensive than an insert? Since an update is just an atomic delete and create, can we say that updating a document is at least as expensive as inserting one? This would help us a lot, because in that case we can move to full reindexing (that is, indexing into a new index). Even if we update in the same index, we do not get the storage back until an underlying merge happens, and we have enough space available on our data nodes.
If you index into a new index and set the document ID client side, it will often be fast at first and then gradually slow down as the index and the number of segments grow. The reason is that for each indexing operation Elasticsearch needs to search the existing segments to see whether it is an update or not. I wrote a blog post related to this a long time ago, but I believe it is still mostly valid and could be useful.
We are also thinking of increasing the refresh_interval or disabling refresh during the ingestion jobs, since we don't need the documents to be searchable right away. I know this will increase indexing throughput, but will it also help with updates, that is, when we use the update API and most of the documents are updates rather than inserts?
Increasing the refresh interval will result in larger segments being created and less merging. I would recommend doing this, but I am not sure exactly how much impact it will have. I would index into a new index and swap it with the old one once done, as this allows you to perform the indexing on dedicated data nodes if necessary.
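For reference, a small sketch of the settings bodies involved, sent with `PUT /<index>/_settings`. A `refresh_interval` of `-1` disables periodic refresh; the `1s` restore value matches the interval mentioned earlier in the thread. The helper name is illustrative only:

```python
import json

def refresh_settings(interval):
    """Body for PUT /<index>/_settings that changes the refresh interval."""
    return {"index": {"refresh_interval": interval}}

# Before the ingestion job: stop periodic refreshes on the index being built.
disable_refresh = refresh_settings("-1")
# After the job (and before switching the alias): restore the normal interval.
restore_refresh = refresh_settings("1s")

print(json.dumps(disable_refresh))
print(json.dumps(restore_refresh))
```

With refresh disabled, newly indexed documents are not visible to searches until the interval is restored or a refresh is triggered explicitly, which fits the build-then-swap approach where nothing queries the new index during the build.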