Occasional spikes in Elasticsearch indexing time

Hi, we are currently working with a single-node ES instance, to which our application connects using ESJavaClient. Recently we re-indexed all our indexes after enabling search on a time-based field. Since the migration, we have been observing a sudden increase in indexing time, specifically for the newly created indexes. The average indexing time for the new indexes is consistently greater than 300ms, compared to an average of around 10ms for the other indexes.

Some basic debugging details:

  • We tried deleting the whole index and letting a new one get created with no initial data. Even then, the average time to index a document into it was over 250ms, sometimes reaching 1000ms, whereas older indexes were updated in under 10ms at the same time.
  • There is no correlation between index size and the delay; indexes 20 times the size are not facing these delays.
  • There has been an overall increase in indexing time across the board, but it is not extremely large.
  • RAM usage is very high for the single node, usually above 95%.
ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
127.0.0.1           11          96  34   21.43   22.86    23.39 cdfhilmrstw *      socrates
  • We also checked the hot threads output, and it usually shows 100% CPU utilization in the write and flush threads (the API calls behind both outputs are shown right after this list).
100.0% [cpu=99.7%, other=0.3%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[socrates08][write][T#18]'
 100.0% [cpu=99.0%, other=1.0%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[socrates08][flush][T#8]'
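
For reference, both outputs above come from the standard node monitoring APIs:

# node overview (heap, RAM, CPU, load averages)
GET _cat/nodes?v

# per-thread CPU breakdown
GET _nodes/hot_threads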

Extra details on implementation and memory allocation:

  • Maximum number of concurrent indexing requests from the application: 6
  • Provided number of CPU cores: 10 (increased from 5 before the migration)
  • Provided memory to the ES process: 80GB (increased from 64GB before the migration)
  • Heap memory allocated: 30GB
  • Number of primary shards allocated to each index: 1
  • Number of replica shards allocated to each index: 0 (see the settings sketch after this list)
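
For completeness, each index is created with a single primary shard and no replicas, i.e. settings of roughly this form (shown here for index1):

PUT /index1
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  }
}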

We understand there are some basic optimizations we can make, such as moving to a cluster or using bulk indexing more efficiently. However, given the nature of the issue, we wanted to know whether adding resources or making some other change could fix it for now. Please let us know if any other details would help with further debugging.

Thanks,
Priyansh Maheshwari

Are you indexing using bulk requests? If so, what size are you using for your bulk requests?

Are you using dynamic mappings? If this is the case it is possible that smaller indices will result in more mapping updates, which need to be persisted to the cluster state.

RAM usage approaching 100% is expected as the page cache fills up - it is normal and not a problem by itself.

Your load average is very high. Do you by any chance have very slow storage, e.g. HDD? For optimal indexing speed it is recommended to use local SSDs (or storage with the same level of performance).

What kind of disks are you using? Elasticsearch is very disk intensive, so slow disks (HDD) might heavily impact performance. If I remember correctly, you should be able to see this as the system spending a lot of time waiting for I/O (top on a Linux machine will show this, for example).

Thanks for the quick response.

Are you indexing using bulk requests? If so, what size are you using for your bulk requests?

No, we are sending individual document requests. The mappings are the same for all indexes.
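
Each write from the application ends up as a single-document request of roughly this shape (field names are placeholders):

PUT /index1/_doc/external-doc-id
{
  "timestamp": "2023-01-01T00:00:00Z",
  "field1": "value1"
}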

Are you using dynamic mappings? If this is the case it is possible that smaller indices will result in more mapping updates, which need to be persisted to the cluster state.

No, the mappings are defined when the index is created, and any new dynamic field is not indexed by default.
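
Concretely, each index is created with an explicit mapping and dynamic fields left unindexed, roughly along these lines (field names are placeholders). With "dynamic": false, unexpected fields are kept in _source but do not trigger mapping updates:

PUT /index1
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "timestamp": { "type": "date" },
      "field1":    { "type": "keyword" }
    }
  }
}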

Your load average is very high. Do you by any chance have very slow storage, e.g. HDD? For optimal indexing speed it is recommended to use local SSDs (or storage with the same level of performance).

We are using GPFS partitions to store our ES data. It is not as fast as SSD, but it performs better than HDD.

While going through your suggestions, I also noticed that the index with the highest delay also has an unusually high average update size. I presume this could be a contributing factor.

Average Index Update Sizes

  • index1: 79,208 bytes
  • index2: 5,170 bytes
  • index3: 6,295 bytes

Average Indexing Times

  • index1: 283 milliseconds
  • index2: 14 milliseconds
  • index3: 13 milliseconds

I will check the slow logs for that index and see if I can spot any discrepancy.

Indexing and updating individual documents adds a lot of overhead compared to using bulk requests, as each request must be committed to disk. In my experience this amplifies the impact of slow storage, so I would recommend switching to bulk requests.
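
For illustration, a single bulk request batches many of those operations into one round trip and one commit; a few hundred to a few thousand documents per request is a common starting point (index names, IDs and values below are only an example):

POST /_bulk
{ "index": { "_index": "index1", "_id": "id-1" } }
{ "field1": "value1" }
{ "index": { "_index": "index1", "_id": "id-2" } }
{ "field1": "value2" }
{ "update": { "_index": "index1", "_id": "id-3" } }
{ "doc": { "field1": "value3" } }

The Java client exposes an equivalent bulk API, so the application does not need to build this payload by hand.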

Check await and other I/O stats on the nodes to see if the storage is the bottleneck.

Indexing and updating individual documents adds a lot of overhead compared to using bulk requests, as each request must be committed to disk.

Yeah, that makes sense, and we are planning to migrate to bulk updates. However, the fact that the indexing time is high only for certain indexes suggests there could be another cause as well, since the storage and await values are similar for all of them.

The correlation between update size and latency seems quite clear. Is there any difference in document size (not just the size of updates) or mappings, e.g. the type of fields and features (nested documents, parent-child, vectors etc.) used?

Is the indexing rate the same for the indices you are comparing? As you are indexing/updating with an external ID, might a more frequently updated index be cached to a greater extent?

Is there any difference in document size (not just the size of updates) or mappings, e.g. the type of fields and features (nested documents, parent-child, vectors etc.) used?

There is no difference there; in fact, after making some modifications to the data we are pushing, the document sizes for all indexes have become similar. However, it seems that internal segment merging is being triggered very frequently for index1, which is slowing down indexing. We verified this by checking that slow-log entries with 500ms+ durations show up more often while the merge thread for index1 is visible in hot_threads.

Is there any recommended setting for segments? For now, we have kept the default segment settings for all indexes.

If merging slows down indexing, it sounds to me like the storage you are using may provide inadequate performance. Have you looked at await and other I/O statistics, e.g. using iostat -x?


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.07    0.01    1.77    0.10    0.00   95.05

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
sda              8.46   14.41    890.47    215.79     2.27     3.18  21.15  18.06   16.02    0.32   0.14   105.26    14.97   0.16   0.37
dm-0             8.37   17.23    890.29    215.76     0.00     0.00   0.00   0.00   22.79    1.08   0.21   106.42    12.52   0.14   0.35

Going through the stats, I believe the disks are not the bottleneck causing the delay, as they are mostly idle.

Moreover, the delay has recently increased for all indexes. Of course it is extremely high for that one particular index, but it is fairly high for the other indexes as well.

I tried making some minor modifications to the indexes' merge settings:

{
  "index": {
    "refresh_interval": "3s",
    "merge": {
      "policy": {
        "max_merge_at_once": "20",
        "segments_per_tier": "20",
        "max_merged_segment": "10gb"
      },
      "scheduler": {
        "max_thread_count": "1"
      }
    }
  }
}

I also tried setting the flush operation to async, but none of these changes led to any noticeable improvement in performance.
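
For reference, the async change was along the lines of the translog durability setting (async fsyncs the translog in the background instead of on every request, trading a small durability window for latency):

PUT /index1/_settings
{
  "index": {
    "translog": {
      "durability": "async"
    }
  }
}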

As you are performing updates or indexing with an external document ID, each indexing operation will need one or more reads in addition to a write. The r_await values are quite high and could be affecting indexing latencies if the files to be read are not cached.
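
To make the read/write split concrete: with an explicit ID Elasticsearch has to check whether that ID already exists (and, for an update, fetch the current document) before writing, whereas with auto-generated IDs that lookup can be skipped. Compare the two forms below; the second is only an option for documents that are genuinely new, since updates must reference an existing ID.

# explicit external ID: lookup + write
PUT /index1/_doc/external-doc-id
{ "field1": "value1" }

# auto-generated ID: write only
POST /index1/_doc
{ "field1": "value1" }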

I would recommend switching to bulk requests and see what difference that makes.