Hi, we are currently working with a single-node ES instance, to which our application connects using ESJavaClient. Recently we re-indexed all our indexes after enabling search on a time-based field. Post migration, we have been observing a sudden increase in indexing latency, specifically for the newly created indexes. The average indexing time for the new indexes is consistently greater than 300ms, compared to an average of 10ms for the other indexes.
Some basic debugging details:-
We tried deleting the whole index and letting a new one get created with no initial data. Still, the average time to push any document into it was over 250ms, sometimes even reaching 1000ms, whereas older indexes at the same time were updated in under 10ms.
There is no relation between the size of an index and the delay, as indexes 20 times the size are not facing these delays.
There has been an overall increase in indexing time, but it is not extremely large.
RAM usage is very high for the single node, usually above 95%.
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1 11 96 34 21.43 22.86 23.39 cdfhilmrstw * socrates
Checked the hot threads output too, and it usually shows 100% CPU utilization in the write and flush threads.
100.0% [cpu=99.7%, other=0.3%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[socrates08][write][T#18]'
100.0% [cpu=99.0%, other=1.0%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[socrates08][flush][T#8]'
The core usage and memory usage are well below the provided resources.
Extra details on implementation and memory allocation:-
Maximum number of indexing requests that can be made at a time: 6
Provided number of CPU cores: 10 (increased from 5 before the migration)
Provided memory to the ES process: 80GB (increased from 64GB before the migration)
Heap memory allocated: 30GB
Number of primary shards allocated to each index: 1
Number of replica shards allocated to each index: 0
We understand that there are some basic optimizations we can make, like moving to a cluster or using bulk indexing much more efficiently. However, given the nature of the issue, we wanted to know whether providing some extra resources, or making some other change, could fix the issue for now. Please let us know if any other detail is required for further debugging.
Are you indexing using bulk requests? If so, what size are you using for your bulk requests?
Are you using dynamic mappings? If this is the case it is possible that smaller indices will result in more mapping updates, which need to be persisted to the cluster state.
RAM usage should approach 100% as the page cache fills up; this is normal and not a problem.
What kind of disks are you using? Elasticsearch is very disk intensive, so having slow disks (HDD) might heavily impact performance. If I remember correctly, you should be able to see this as the system spending a lot of time in I/O wait (top on a Linux machine will show you this, for example).
Are you indexing using bulk requests? If so, what size are you using for your bulk requests?
No, we are sending individual document requests. The mappings are the same for all indexes.
Are you using dynamic mappings? If this is the case it is possible that smaller indices will result in more mapping updates, which need to be persisted to the cluster state.
No, the mappings are defined when the index is created, and any dynamically added new field is by default not indexed.
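For reference, the way we set this up from the Java client looks roughly like the sketch below. This assumes the newer Elasticsearch Java API client (co.elastic.clients), and the field names and types are placeholders rather than our real schema.

```java
import java.io.IOException;

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch._types.mapping.DynamicMapping;

public class IndexSetup {
    // Create the index with an explicit mapping up front; fields added
    // dynamically later are kept in _source but not indexed (dynamic: false).
    public static void createIndex(ElasticsearchClient esClient, String indexName) throws IOException {
        esClient.indices().create(c -> c
            .index(indexName)
            .mappings(m -> m
                .dynamic(DynamicMapping.False)
                .properties("created_at", p -> p.date(d -> d))
                .properties("status", p -> p.keyword(k -> k))
                .properties("body", p -> p.text(t -> t))
            )
        );
    }
}
```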
We are using GPFS partitions for the storage of our ES data. It's not as fast as SSDs, but it performs better than HDDs.
While going through your suggestions, I also noticed that the index with the highest delay also has an absurdly high average size for its updates. I presume this could be a probable reason.
Average update size and indexing time per index:
index1: 79,208 bytes / 283 ms
index2: 5,170 bytes / 14 ms
index3: 6,295 bytes / 13 ms
Will check the slow logs for the index to see if I can spot any discrepancy.
Indexing and updating individual documents adds a lot of overhead compared to using bulk requests as each request must be committed to disk. This in my experience increases the impact of slow storage, so I would recommend switching to bulk requests.
Check await and other I/O stats on the nodes to see if the storage is the bottleneck.
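If you do switch, a bulk call with the Java API client could look roughly like the sketch below. This assumes the newer Elasticsearch Java API client (co.elastic.clients); MyDoc is a placeholder document type, and the batch size is something you would need to tune.

```java
import java.io.IOException;
import java.util.List;

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch.core.BulkRequest;
import co.elastic.clients.elasticsearch.core.BulkResponse;

public class BulkIndexer {
    // Placeholder document type; the real application documents would go here.
    public record MyDoc(String id, String body) {}

    // Send a batch of documents in a single bulk call instead of one
    // index request per document.
    public static void indexBatch(ElasticsearchClient esClient,
                                  String indexName,
                                  List<MyDoc> docs) throws IOException {
        BulkRequest.Builder br = new BulkRequest.Builder();
        for (MyDoc doc : docs) {
            br.operations(op -> op
                .index(idx -> idx
                    .index(indexName)
                    .id(doc.id())      // external id, as in the current setup
                    .document(doc)
                )
            );
        }
        BulkResponse response = esClient.bulk(br.build());
        if (response.errors()) {
            // Inspect response.items() for the individual failures.
        }
    }
}
```

A common starting point is a few hundred to a few thousand documents (or a few MB) per bulk request, then adjusting based on the latency you observe.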
Indexing and updating individual documents adds a lot of overhead compared to using bulk requests as each request must be committed to disk.
Yeah, that makes sense and we are planning to migrate to bulk updates. However, the fact that the indexing time is high only for certain indexes suggests there could be some other cause as well, since the file storage and await values are similar for all indexes.
The correlation between update size and latency seems quite clear. Is there any difference in document size (not just the size of updates) or mappings, e.g. the types of fields and features used (nested documents, parent-child, vectors etc.)?
Is the indexing rate the same for the indices you are comparing? As you are indexing/updating with an external ID, might a more frequently updated index be cached to a greater extent?
Is there any difference in document size (not just the size of updates) or mappings, e.g. the types of fields and features used (nested documents, parent-child, vectors etc.)?
There is no difference there; also, after making some modifications to the data we are pushing, the document sizes for all indexes have become similar. However, it seems that the internal merging of segments is being triggered very frequently for index1, leading to the delay in indexing speed. I verified this by checking that slow-log entries with 500ms+ update durations appear more often when the merge thread for index1 shows up in hot_threads.
Is there any recommended setting for segments? For now, we have kept the default segment settings for all indexes.
If merging slows down indexing it sounds to me like the storage you are using may provide inadequate performance. Have you looked at await and I/O statistics, e.g. using iostat -x?
Moreover, the delay has recently increased for all indexes. Of course it's extremely high for that one particular index, but it's pretty high for the other indexes too.
I tried making some minor modifications to the merge-related settings of the indexes; the kind of change we experimented with is sketched below.
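As one example of the kind of per-index dynamic settings change involved (the exact merge settings we tried are not shown here), a longer refresh_interval reduces merge pressure by producing fewer, larger segments. The sketch below assumes the newer Elasticsearch Java API client, and the 30s value is purely illustrative.

```java
import java.io.IOException;

import co.elastic.clients.elasticsearch.ElasticsearchClient;

public class IndexTuning {
    // Lengthen the refresh interval on the problematic index so that fewer,
    // larger segments are produced, which in turn means fewer merges.
    public static void relaxRefreshInterval(ElasticsearchClient esClient, String indexName) throws IOException {
        esClient.indices().putSettings(r -> r
            .index(indexName)
            .settings(s -> s
                .refreshInterval(t -> t.time("30s"))
            )
        );
    }
}
```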
As you are performing updates or indexing with an external document id, each indexing operation will need one or more reads and a write. The values for r_await are quite high and could be affecting indexing latencies if the files to be read are not cached.
I would recommend switching to bulk requests and see what difference that makes.