Pointers to improve indexing performance?


(Reddaiah nethi) #1

Hi there,

We are trying to index documents using Logstash 2.4.0 into Elasticsearch 2.4.0, and we are currently facing an indexing performance issue: 700 MB of data takes around 30 minutes.

We have a 10 node cluster (3 data, 3 master, 2 client) of Azure VMs with 56 GB of memory (28 GB JVM heap) and 8 cores each. Our index has 3 replicas and is about 2 TB in size.

We have an index refresh interval of 1 hour, and the Elasticsearch config looks like the one below. Please give us some pointers if we are missing any settings that would help indexing.

http.enabled: false

threadpool.bulk.size: 8
threadpool.bulk.queue_size: 1000
bootstrap.memory_lock: true
indices.memory.index_buffer_size: 50%
indices.requests.cache.size: 5%
indices.queries.cache.size: 15%
indices.store.throttle.max_bytes_per_sec: 2gb

Note: We enabled memory lock on Windows, but that didn't help; swap is still being used, as we can see in the ElasticHQ plugin.


(Jymit Singh Khondhu) #2

How many primary shards along with these three replicas?
Do you require the index refresh interval to be set at all?


(Christian Dahlqvist) #3

Are you indexing new documents and/or updating existing ones? How large are your bulk requests? How many concurrent indexing threads/processes are you using? Are you allowing Elasticsearch to set the document ID or do you handle this in your application?


(Reddaiah nethi) #4

@JKhondhu 20 primary shards. We set the refresh interval to 1 hour because we heard that the default of 1 second degrades indexing performance.
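If the catch-up load is a one-off, a further step along the same line (a sketch, not something proposed in this thread) is to disable refresh entirely during the bulk load and restore it afterwards through the index settings API; the index name here is the one used for the catch-up load:

```
PUT /customevent_catchup012017/_settings
{ "index": { "refresh_interval": "-1" } }

... run the bulk load ...

PUT /customevent_catchup012017/_settings
{ "index": { "refresh_interval": "1s" } }
```

With refresh disabled, no new segments are made searchable until the setting is restored, so this only makes sense when nobody needs to search the data while it is loading.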


(Reddaiah nethi) #5

We are indexing new documents. Since we index using Logstash and haven't changed the batch size, bulk requests contain the default 125 documents.

We are using 8 workers, and we set the document ID from a dataid field in Logstash.

Here is the Logstash config file; we run Logstash with 8 workers:

input {
  file {
    path => ["G:/CustomEvent/*.json"]
    start_position => "beginning"
    codec => json
    ignore_older => 0
    sincedb_path => "G:/Sincedb/CustomEvent/customevent.sincedb"
  }
}

filter {
  grok {
    match => {
      path => "%{GREEDYDATA:folderpath}/%{GREEDYDATA:filename}.json"
    }
  }
}

output {
  elasticsearch {
    action => "index"
    hosts => ["10.158.36.209"]
    codec => json
    index => "customevent_catchup012017"
    document_type => "dailyaggregate"
    document_id => "%{dataid}"
    workers => 8
  }
  file {
    codec => line {
      format => ""
    }
    path => ["G:/CustomEvent/%(unknown).txt"]
  }
}
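One knob worth checking against the 125-document default mentioned above (an assumption on our side, not something verified in this thread): Logstash 2.x exposes the pipeline batch size as a command-line flag, so each bulk request can be made to carry more documents when Logstash is started, roughly like this:

```
# Sketch: start Logstash 2.4 with 8 pipeline workers and a larger
# batch size, so each bulk request to Elasticsearch is bigger.
bin/logstash -w 8 -b 1000 -f logstash.conf
```

Larger batches usually mean fewer, bigger bulk requests; the sweet spot depends on document size and should be found by testing rather than guessed.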


(Christian Dahlqvist) #6

As you are explicitly setting the document ID instead of letting Elasticsearch assign it, each index operation is treated as a potential update, as Elasticsearch must check whether the document already exists. Depending on how you create this identifier, it can have a significant impact on indexing performance. If chosen poorly, e.g. a random UUID, indexing performance will degrade as the shards grow in size and more segments need to be checked for every document indexed.
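In practical terms, letting Elasticsearch auto-generate IDs just means dropping the document_id line from the elasticsearch output shown earlier in the thread (a sketch; only appropriate if duplicate documents from a replayed file are acceptable or handled some other way):

```
output {
  elasticsearch {
    action => "index"
    hosts => ["10.158.36.209"]
    index => "customevent_catchup012017"
    document_type => "dailyaggregate"
    # no document_id here: Elasticsearch assigns an auto-generated ID,
    # so no existence check is needed per document
    workers => 8
  }
}
```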

As you have immutable documents you will probably benefit from switching to time-based indices as this makes it easier to control the shard size and prevent it from continuously increasing over time. It also makes managing retention of data much easier and efficient as whole indices can simply be dropped rather than having to delete individual documents.
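Time-based indices can be produced directly from Logstash with a date pattern in the index name (a sketch assuming the events carry a usable @timestamp; the index prefix is our own placeholder):

```
output {
  elasticsearch {
    hosts => ["10.158.36.209"]
    # one index per day; the pattern expands from each event's @timestamp
    index => "customevent-%{+YYYY.MM.dd}"
    document_type => "dailyaggregate"
  }
}
```

Retention then becomes dropping whole daily indices, which is far cheaper than deleting individual documents from one large index.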


(system) #7

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.