Pointers to Improve indexing performance?

Hi There,

We are trying to index documents from Logstash 2.4.0 into Elasticsearch 2.4.0 and are currently facing an indexing performance issue: 700 MB of data takes around 30 minutes.

We have a 10-node cluster (3 D, 3 M, 2 Cl) on Azure VMs, each with 56 GB of memory (28 GB JVM) and 8 cores. Our index has 3 replicas and is 2 TB in size.

We have an index refresh interval of 1 hour. Our Elasticsearch config is shown below. Please give us some pointers if we are missing any settings that would help indexing.

http.enabled: false

threadpool.bulk.size: 8
threadpool.bulk.queue_size: 1000
bootstrap.memory_lock: true
indices.memory.index_buffer_size : 50%
indices.requests.cache.size: 5%
indices.queries.cache.size: 15%
indices.store.throttle.max_bytes_per_sec : 2gb

Note: We enabled memory lock on Windows, but that didn't help; swap memory is still being used, as we can see in the ElasticHQ plugin.

How many primary shards along with these three replicas?
Do you require the index refresh interval to be set at all?

Are you indexing new documents and/or updating existing ones? How large are your bulk requests? How many concurrent indexing threads/processes are you using? Are you allowing Elasticsearch to set the document ID or do you handle this in your application?

@JKhondhu 20 primary shards. We set the refresh interval to 1 hour because we heard that leaving it at the default of 1 second degrades indexing performance.

We are indexing new documents. As we are indexing with Logstash and haven't changed the batch size, it is 125, so the bulk requests are also of size 125.

We are using 8 workers, and we are setting the dataID in Logstash.

Here is the Logstash config file; we run Logstash with 8 workers.

input {
  file {
    path => ["G:/CustomEvent/*.json"]
    start_position => "beginning"
    codec => json
    ignore_older => 0
    sincedb_path => "G:/Sincedb/CustomEvent/customevent.sincedb"
  }
}

filter {
  grok {
    match => {
      path => "%{GREEDYDATA:folderpath}/%{GREEDYDATA:filename}.json"
    }
  }
}

output {
  elasticsearch {
    action => "index"
    hosts => ["10.158.36.209"]
    codec => json
    index => "customevent_catchup012017"
    document_type => "dailyaggregate"
    document_id => "%{dataid}"
    workers => 8
  }
  file {
    codec => line {
      format => ""
    }
    path => ["G:/CustomEvent/%{filename}.txt"]
  }
}

As you are explicitly setting the document ID instead of letting Elasticsearch assign it, each index operation is effectively treated as an update, because Elasticsearch must check whether the document already exists. Depending on how you create this identifier, this can have a significant impact on indexing performance. If it is chosen poorly, e.g. a random UUID, indexing performance will degrade as the shards grow in size and more segments need to be checked for every document indexed.
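If you do not rely on the ID to deduplicate events, one way to test this would be to drop `document_id` from the output so that Elasticsearch generates the IDs itself. The following is only a sketch based on the config posted above. It assumes that re-reading a file and getting duplicate documents is acceptable in your setup, which may not be the case.

```
output {
  elasticsearch {
    action => "index"
    hosts => ["10.158.36.209"]
    index => "customevent_catchup012017"
    document_type => "dailyaggregate"
    # document_id is deliberately omitted here (assumption: dataid is not
    # needed for deduplication), so Elasticsearch auto-generates the ID
    # and can skip the "does this document already exist?" lookup.
    workers => 8
  }
}
```

Comparing throughput with and without `document_id` on the same sample of data should show how much of the slowdown comes from the ID lookup.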

As you have immutable documents, you will probably benefit from switching to time-based indices, as this makes it easier to control the shard size and prevents it from continuously increasing over time. It also makes managing data retention much easier and more efficient, as whole indices can simply be dropped rather than having to delete individual documents.
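As a rough sketch of what a time-based index could look like in the Logstash output: the index name can contain a date pattern that is filled in from each event's `@timestamp`. The `customevent-` prefix and the monthly granularity are only illustrative assumptions. This also assumes `@timestamp` holds the event time, which may require a `date` filter if your JSON carries its own timestamp field.

```
output {
  elasticsearch {
    hosts => ["10.158.36.209"]
    # Hypothetical index name: one index per month, e.g. customevent-2017.01.
    # %{+YYYY.MM} is expanded from the event's @timestamp field.
    index => "customevent-%{+YYYY.MM}"
    document_type => "dailyaggregate"
    workers => 8
  }
}
```

With this layout, old data can be removed by deleting a whole monthly index, and the granularity (daily, weekly, monthly) can be chosen to keep each shard at a manageable size.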
