Indexing performance degradation over time

Hi everyone, I'm running ES 5.5.0 on Amazon EC2 (18 data nodes with 8 CPUs and 32 GB RAM each, 3 master nodes, 1 TB EBS GP2 SSD volumes at 3000 IOPS) and I have an issue with indexing performance (segment merging time).
A bit about the workflow: I'm indexing simple JSON documents (about 1 KB each), no updates, no deletes, no parent-child, etc.; the refresh interval is 30 seconds.

I'm indexing into daily indices, which are separated by tenant (each has from 2 to 15 shards, depending on the planned data volume).
Per day, per tenant, there are roughly 5 million events per shard.
Closer to the end of each day indexing degrades because segment merging takes a long time.
Currently I'm looking at changing the maximum segment size, since 5 GB looks too big to me, and merging segments like 4.8 GB + 100 KB doesn't seem fast.
The issue happens when one of the data nodes is doing a merge for 3-9 seconds and threads are left waiting to write data to it.
Or, if you have any other ideas or proposals, I'd be glad to hear them.
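To make this concrete, here is a minimal sketch of what I'm considering, lowering index.merge.policy.max_merged_segment below its 5 GB default (assuming the elasticsearch-py client; the host, index name and target size are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# Lower the merge policy's maximum segment size so merges like
# "4.8 GB + 100 KB" into a near-5 GB segment stop happening.
es.indices.put_settings(
    index="tenant-a-2017.11.14",  # placeholder daily index name
    body={"index.merge.policy.max_merged_segment": "2gb"},  # placeholder target size
)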

Here is a screenshot from cluster monitoring. The indexing rate should currently be about 20k events/s, but as you can see the indexing speed is far below 20k and the nodes are basically doing nothing; only one or two nodes are doing some slow merges. The huge spikes in indexing time and merging time line up exactly with the moments when indexing throughput is at its lowest.


Are there duplicates in your data, i.e. two JSON documents with the same primary ID?

Can you provide us an overview of the cluster by providing the full output of the cluster stats API? What does heap usage and GC look like over time? Is there anything in the logs that indicate why it is slowing down? Are you using daily indices? Do you see any correlation between slowdown and creation of new daily indices?
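If it is easier, the full output can be pulled with something like this minimal sketch (assuming the elasticsearch-py client; the host is a placeholder):

import json
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host
print(json.dumps(es.cluster.stats(), indent=2))  # same data as GET _cluster/stats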

No duplicates; we use our own custom IDs, and documents only get overwritten when the app crashed and we didn't manage to commit that the bulk had been fully processed.

If you are using custom IDs you should read this blog post. When you use a custom ID, Elasticsearch needs to check whether the document already exists for every insert, which tends to get slower as shards get larger. If you can use an efficient identifier as described in the blog post, you can reduce this effect, but it will most likely still not be as fast as allowing Elasticsearch to assign IDs automatically.
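To illustrate the difference, here is a minimal sketch using the elasticsearch-py bulk helpers (index, type and field names are placeholders): supplying _id forces the per-document existence check, while omitting it lets Elasticsearch auto-generate an ID and append the document directly.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])  # placeholder host
events = [{"our_primary_id": "evt-1", "msg": "example"}]  # placeholder data

def actions(events, use_custom_id):
    for ev in events:
        action = {
            "_index": "tenant-a-2017.11.14",  # placeholder daily index
            "_type": "event",                 # placeholder type (ES 5.x)
            "_source": ev,
        }
        if use_custom_id:
            # Forces an existence check for this ID on every insert,
            # which gets more expensive as the shard grows.
            action["_id"] = ev["our_primary_id"]
        # Without "_id", Elasticsearch generates one and skips the lookup.
        yield action

helpers.bulk(es, actions(events, use_custom_id=False))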

1 Like

I'll provide the full cluster stats output a bit later; right now the load is pretty low, so indexing works well.
I do see a correlation between new indices and old ones: new indices process data much faster. As for the creation of new indices, I pre-create the indices for the next day using a separate app. It was hard for the ES cluster to create lots of indices and ingest data at the same time, so I had a 5-15 minute slowdown before I switched to the pre-creation approach.
I checked hot threads during a slowdown: one or two machines are doing merges, and at the same time those machines have 100-200 bulk requests queued while the others are doing nothing.

The reason we use our own custom IDs is to prevent data duplication.
Our workflow: read data from Kafka, push it to ES, then commit the offset back to Kafka.
If our app crashes after pushing to ES but before committing to Kafka, we will reprocess a couple of thousand messages and index them twice.
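Roughly, the loop looks like the following simplified sketch (assuming the kafka-python and elasticsearch-py clients; topic, broker, index and field names are placeholders). The offset is committed only after the bulk succeeds, so a crash in between just replays messages, and the deterministic _id turns the replay into harmless overwrites.

import json
from elasticsearch import Elasticsearch, helpers
from kafka import KafkaConsumer  # assumption: kafka-python client

es = Elasticsearch(["http://localhost:9200"])  # placeholder host
consumer = KafkaConsumer(
    "events",                             # placeholder topic
    bootstrap_servers="localhost:9092",   # placeholder broker
    enable_auto_commit=False,
)

while True:
    batch = consumer.poll(timeout_ms=1000)
    actions = []
    for records in batch.values():
        for record in records:
            doc = json.loads(record.value)
            actions.append({
                "_index": "tenant-a-2017.11.14",  # placeholder daily index
                "_type": "event",
                "_id": doc["our_primary_id"],     # deterministic ID = dedup on replay
                "_source": doc,
            })
    if actions:
        helpers.bulk(es, actions)
    # Commit only after ES accepted the bulk; a crash before this line means
    # the same messages are re-read and simply overwrite themselves.
    consumer.commit()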

{

"_nodes" : {
"total" : 21,
"successful" : 21,
"failed" : 0
},
"cluster_name" : "stage-escluster",
"timestamp" : 1510664066932,
"status" : "green",
"indices" : {
"count" : 1080,
"shards" : {
"total" : 6974,
"primaries" : 3487,
"replication" : 1.0,
"index" : {
"shards" : {
"min" : 2,
"max" : 60,
"avg" : 6.457407407407407
},
"primaries" : {
"min" : 1,
"max" : 30,
"avg" : 3.2287037037037036
},
"replication" : {
"min" : 1.0,
"max" : 1.0,
"avg" : 1.0
}
}
},
"docs" : {
"count" : 5130705993,
"deleted" : 1996
},
"store" : {
"size_in_bytes" : 14100406765796,
"throttle_time_in_millis" : 0
},
"fielddata" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"query_cache" : {
"memory_size_in_bytes" : 0,
"total_count" : 0,
"hit_count" : 0,
"miss_count" : 0,
"cache_size" : 0,
"cache_count" : 0,
"evictions" : 0
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 118255,
"memory_in_bytes" : 26172275408,
"terms_memory_in_bytes" : 19511538176,
"stored_fields_memory_in_bytes" : 4999585136,
"term_vectors_memory_in_bytes" : 0,
"norms_memory_in_bytes" : 15136384,
"points_memory_in_bytes" : 1058100340,
"doc_values_memory_in_bytes" : 587915372,
"index_writer_memory_in_bytes" : 7610750628,
"version_map_memory_in_bytes" : 85053111,
"fixed_bit_set_memory_in_bytes" : 0,
"max_unsafe_auto_id_timestamp" : -1,
"file_sizes" : { }
}
},
"nodes" : {
"count" : {
"total" : 21,
"data" : 18,
"coordinating_only" : 0,
"master" : 3,
"ingest" : 21
},
"versions" : [
"5.5.0"
],
"os" : {
"available_processors" : 156,
"allocated_processors" : 156,
"names" : [
{
"name" : "Linux",
"count" : 21
}
],
"mem" : {
"total_in_bytes" : 648554434560,
"free_in_bytes" : 15388495872,
"used_in_bytes" : 633165938688,
"free_percent" : 2,
"used_percent" : 98
}
},
"process" : {
"cpu" : {
"percent" : 1187
},
"open_file_descriptors" : {
"min" : 977,
"max" : 1745,
"avg" : 1618
}
},
"jvm" : {
"max_uptime_in_millis" : 1532237262,
"versions" : [
{
"version" : "1.8.0_31",
"vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
"vm_version" : "25.31-b07",
"vm_vendor" : "Oracle Corporation",
"count" : 21
}
],
"mem" : {
"heap_used_in_bytes" : 164761887000,
"heap_max_in_bytes" : 333647708160
},
"threads" : 2443
},
"fs" : {
"total_in_bytes" : 39368577171456,
"free_in_bytes" : 10708432044032,
"available_in_bytes" : 8729318752256
},
"plugins" : [
{
"name" : "discovery-ec2",
"version" : "5.5.0",
"description" : "The EC2 discovery plugin allows to use AWS API for the unicast discovery mechanism.",
"classname" : "org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin",
"has_native_controller" : false
},
{
"name" : "x-pack",
"version" : "5.5.0",
"description" : "Elasticsearch Expanded Pack Plugin",
"classname" : "org.elasticsearch.xpack.XPackPlugin",
"has_native_controller" : true
}
],
"network_types" : {
"transport_types" : {
"netty4" : 21
},
"http_types" : {
"netty4" : 21
}
}
}
}

That is an excellent reason to use custom IDs. You may get improved performance if you can make sure they are efficient though.

Thanks, I'll test today replacing the custom ID with a timestamp plus the values of some of our fields.
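Something like this is what I have in mind (just a rough sketch; the fields are placeholders): putting a timestamp first keeps the IDs roughly ordered with a shared prefix, which should be friendlier to Lucene than a fully random ID.

import time

def make_doc_id(event):
    # Placeholder fields; the idea is a leading, roughly increasing timestamp
    # component so consecutive IDs share a common ordered prefix.
    ts = int(event.get("timestamp_ms", time.time() * 1000))
    return "%013d-%s-%s" % (ts, event["tenant"], event["sequence"])

print(make_doc_id({"timestamp_ms": 1510664066932, "tenant": "t42", "sequence": 7}))
# -> 1510664066932-t42-7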

A bit more detail about the data flow:
Our app has 10 threads reading from 40 Kafka partitions. Each batch from Kafka contains data for multiple indices, for example 1000 messages for one index, 100 for another, 30 for a third, and so on; we pack it all into one bulk request and send it to ES.
My other thought is to send the data to ES not as one bulk request mixing several indices, but as separate bulk requests, one per index.
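For the per-index idea, I mean something along the lines of this rough sketch (placeholder names, assuming the elasticsearch-py bulk helpers): group the batch by target index and send one bulk per index instead of one mixed bulk.

from collections import defaultdict
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

def send_per_index(actions):
    # Split one mixed batch into per-index buckets and bulk each bucket
    # separately, instead of one bulk mixing docs for different indices.
    by_index = defaultdict(list)
    for action in actions:
        by_index[action["_index"]].append(action)
    for index_name, index_actions in by_index.items():
        helpers.bulk(es, index_actions)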

The problem is still present: closer to the end of the day, when the indices become bigger, merging takes a lot of time and slows indexing down significantly.

Is that with a Lucene-friendly document ID? Are you grouping the outputs per index?

I tried switching to the ES auto-generated IDs, but the behaviour is still the same.

If you have not already, it may be worthwhile trying to reduce the number of merging threads to 1 through the index.merge.scheduler.max_thread_count setting.
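Applying that would look something like the following sketch (assuming the elasticsearch-py client; the host and index pattern are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# One merge thread per shard; on constrained EBS volumes this can reduce
# contention between merging and indexing threads.
es.indices.put_settings(
    index="tenant-*",  # placeholder index pattern
    body={"index.merge.scheduler.max_thread_count": 1},
)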

Hi Andriy,
It might help to disable refresh interval during ingestion and set replicas to 0.
Of course, in case you don't need to search while importing documents.
Take a look at my post here as well: Elasticsearch indexing performance: throttle merging
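For completeness, that would be something along these lines (a sketch with placeholder names, only applicable when search during ingestion is not needed):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

es.indices.put_settings(
    index="tenant-a-2017.11.14",  # placeholder index
    body={
        "index.refresh_interval": "-1",  # disable refreshes during the load
        "index.number_of_replicas": 0,   # add replicas back after loading
    },
)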

Hi, thanks for the response. Unfortunately I can't do that, since it's streaming data (24/7) and it has to be available for search within 30 seconds.

Even though you can not disable the refresh interval, you may be able to increase it a bit and still meet your SLA. It would be interesting to see what impact this has.

For the record, the hot threads pasted above suggest there might be some sparse doc-value fields, which tend to slow down merging and would have the described side effect of making indexing slower and slower over time (as merges get bigger). Elasticsearch would perform better with fewer, denser fields, or, if changing the schema is not an option, an upgrade to Elasticsearch 6.0 would help too, since 6.0 has much better support for sparse fields.

2 Likes

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.