Performance issues when indexing data

Hi there,
In our application we decided to use Elasticsearch to create a daily snapshot of some critical application data for visualizations.

We have about 700K documents that will be inserted into one of our indices on a daily basis.

We wrote a Talend job to retrieve the data from the line-of-business system and use curl inside Talend to do bulk inserts of documents into Elasticsearch.

The current issue we are facing is that the time it takes to index all 700K documents is very unpredictable: anywhere from 40 minutes to 14 hours.

System information:

Elasticsearch deployed in Azure Kubernetes Service.
Nodepool made up of 3 nodes of VM type Standard_B12ms [12 vCPUs and 48 GiB memory].
K8s resource: StatefulSet, 3-node cluster, version 8.1.0 Docker image
Pod CPU limit: 8 CPUs
Pod memory limit: 16Gi
JVM settings: -Xms8g -Xmx8g

The Talend job runs as a CronJob in the same cluster.

Indexing information:
bulk indexing using curl from Java code [Talend]
refresh interval changed to 30 seconds
Number of shards: 2
Replicas: 0
5 requests at a time with 20 documents in each bulk request [the documents are somewhat large] (when we had 100 documents in each bulk request it was failing, so we are sticking to 20 documents)
indices.memory.index_buffer_size is the default 10%, which should be 10% of 8Gi (the JVM allocation)... based on the documentation that seems sufficient.
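For concreteness, here is a minimal sketch of what one of those bulk requests looks like on the wire, assuming a hypothetical index name `daily-snapshot`, made-up field names, and a cluster reachable at localhost:9200:

```shell
#!/bin/sh
# Sketch of a bulk indexing request; index name, IDs, and fields are
# placeholders, not values from this thread.
ES_URL="http://localhost:9200"
INDEX="daily-snapshot"

# Build the NDJSON bulk body: one action line followed by one document
# line per record; the body must end with a trailing newline.
cat > bulk.ndjson <<'EOF'
{"index":{"_index":"daily-snapshot","_id":"doc-1"}}
{"status":"active","amount":42,"created":"2022-04-01T00:00:00Z"}
{"index":{"_index":"daily-snapshot","_id":"doc-2"}}
{"status":"closed","amount":7,"created":"2022-04-01T00:00:00Z"}
EOF

# Send it with curl; newline-delimited JSON requires this content type.
# curl -s -H 'Content-Type: application/x-ndjson' \
#      -XPOST "$ES_URL/_bulk" --data-binary @bulk.ndjson

# The 30-second refresh interval mentioned above can be set the same way:
# curl -s -H 'Content-Type: application/json' \
#      -XPUT "$ES_URL/$INDEX/_settings" \
#      -d '{"index":{"refresh_interval":"30s"}}'
```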

Based on the screenshots below you can see that after 280K documents the indexing rate slows down. I am not sure what is going on.

Any input and insight is greatly appreciated.
Thank you

Index health and metrics:

There are 2 shards sitting on 2 different nodes:

Node 2:

Indexing in Elasticsearch can be CPU as well as disk I/O intensive. You did not mention what type of storage you are using, so that is something to look at, especially as indexing speed slows down as shards get larger and there likely is more merging activity. Disk I/O can also be increased if you are setting your own document IDs as each indexing operation in effect becomes an update.
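To illustrate the point about document IDs, here is a sketch of the two bulk action shapes, with a hypothetical index name and ID. Supplying your own `_id` forces Elasticsearch to check whether the document already exists before writing, while auto-generated IDs let it skip that lookup:

```shell
#!/bin/sh
# With an explicit "_id", each index action is effectively an upsert:
# Elasticsearch must first look up whether that ID already exists.
cat > with-id.ndjson <<'EOF'
{"index":{"_index":"daily-snapshot","_id":"order-1001"}}
{"status":"active"}
EOF

# With auto-generated IDs the existence check is skipped, which is
# cheaper in terms of disk I/O at indexing time.
cat > auto-id.ndjson <<'EOF'
{"index":{"_index":"daily-snapshot"}}
{"status":"active"}
EOF
```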

CPU usage can be quite high if you have complex mappings and large documents. It seems that the instance types you are using are burstable, so it may be worthwhile checking whether you are running out of credits and this is slowing down indexing over time.

Given you are running on Kubernetes and there may be other applications affecting resource availability, I am not sure how best to read the monitoring stats.

Good questions:
My storage is Azure premium disk P15:
Size: 256 GiB
Disk tier: P15
Provisioned IOPS: 1100
Provisioned throughput: 125 MB/s
Max burst IOPS: 3500
Max burst throughput: 170 MB/s

We are setting our own document IDs, as there are some critical pieces of data; it should allow us to update the document if need be.

I did validate the credits... I have reservations on the compute so I don't run out of burstable credits. I will double-check this.

Lastly, a very valid concern about other applications. Elasticsearch is sitting on its own dedicated nodepool. There is no other application deployed with it. Even Kibana is removed from that nodepool.

Our documents are large and nested up to 3 levels, and the mapping is more complex than simple metric logging data.

Just to give you an idea:
there are

  • 2 floats
  • 37 longs
  • 19 dates
  • 9 booleans
  • 66 fields mapped as both text and keyword

Each document size is approximately 1 kilobyte I think :slight_smile:
603,000 documents divided over 2 shards is taking 706 MB of space. I get confused about how to calculate the number of records to size based on shards.

If your average document size is 1 kB I do not see why small bulk requests of only 20 documents would be required. As far as I remember a good bulk request size is a few hundred kB up to a few MB. If you do have nested documents with multiple levels and that field count, 1 kB sounds very small though...
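Working those numbers through: at roughly 1 kB per document, the suggested request size range translates into hundreds to thousands of documents per bulk request rather than 20. A quick sketch of the arithmetic (the bounds chosen here are illustrative):

```shell
#!/bin/sh
# Back-of-the-envelope bulk sizing, assuming ~1 kB per document.
DOC_SIZE_KB=1
TARGET_MIN_KB=500      # lower end of "a few hundred kB"
TARGET_MAX_KB=5000     # upper end of "a few MB"

MIN_DOCS=$((TARGET_MIN_KB / DOC_SIZE_KB))
MAX_DOCS=$((TARGET_MAX_KB / DOC_SIZE_KB))
echo "suggested docs per bulk request: $MIN_DOCS-$MAX_DOCS"
# prints: suggested docs per bulk request: 500-5000
```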

As you only have 2 primary shards, only 2 of the nodes in your cluster will be indexing data as you have no replicas configured.

In order to ensure your indexing process is not the bottleneck I would also recommend trying to index the data set using e.g. Logstash and verifying that the limit is still there. Logstash uses persistent connections and reasonably large bulk sizes and would therefore be a good point of reference.

So I hear 3 recommendations:

  • verify burstable credits with Azure
  • increase bulk request size
  • try Logstash

Can you tell me a little bit more about Logstash? My current ETL process is a Talend job that's reading data from SQL Server... creating bulk index files and using curl requests to the Elasticsearch API.

I have never used Logstash... may I get a little more guidance?

I am not necessarily suggesting implementing Logstash as a permanent solution, but rather try loading the data with it once to see if the cluster behaves the same way. This would check whether the cluster is indeed the limiting factor.

If you have the data in file format, e.g. serialised JSON events (one per line), it should be relatively easy to test.
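As a starting point, here is a minimal Logstash pipeline sketch for that kind of one-off test, reading a newline-delimited JSON file and bulk-indexing it. The file path, host, and index name are placeholders, not values from this thread:

```conf
input {
  file {
    path => "/data/snapshot.ndjson"   # hypothetical path, one JSON doc per line
    start_position => "beginning"
    sincedb_path => "/dev/null"       # re-read the file on every run (test only)
    codec => "json"
  }
}
output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]  # hypothetical service address
    index => "daily-snapshot"               # hypothetical index name
  }
}
```

The elasticsearch output batches events into bulk requests on its own, so this also gives you a reference point for sensible bulk sizing.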


This issue is resolved. I increased the bulk request size from 20 to 100 and that helped fix the issue.

Thank you so much for your help! :slight_smile:


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.