Indexing performance degradation over time

Hi everyone, I'm running ES 5.5.0 on Amazon EC2 (18 data nodes with 8 CPUs and 32 GB RAM each, 3 master nodes, 1 TB EBS GP2 SSD volumes at 3000 IOPS) and I have an issue with indexing performance (segment merging time).
A bit about the workflow: I'm indexing simple JSON documents (about 1 KB each), no updates, no deletes, no parent-child, etc.; the refresh interval is 30 seconds.

I'm indexing into daily indices, which are separated by tenant (each has from 2 to 15 shards, depending on the planned data volume).
Per day, per tenant, there are roughly 5 million events per shard.
Closer to the end of each day indexing degrades because segment merging takes a long time.
Currently I'm looking at changing the maximum segment size, since 5 GB looks too big to me, and merging segments like 4.8 GB + 100 KB doesn't seem fast.
The issue happens when one of the data nodes is doing a merge for 3-9 seconds and threads are left waiting to write data to it.
Or, if you have any other ideas or proposals, I'd be glad to hear them.
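To make this concrete, here is a minimal sketch of what I'm considering, lowering index.merge.policy.max_merged_segment below its 5 GB default (assuming the elasticsearch-py client; the host, index name and target size are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# Lower the merge policy's maximum segment size so merges like
# "4.8 GB + 100 KB" into a near-5 GB segment stop happening.
es.indices.put_settings(
    index="tenant-a-2017.11.14",  # placeholder daily index name
    body={"index.merge.policy.max_merged_segment": "2gb"},  # placeholder target size
)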

Here is a screenshot from cluster monitoring. The indexing rate should currently be about 20k events/s, but as you can see the indexing speed is far below 20k and the nodes are basically doing nothing; only one or two nodes are doing some slow merges. The huge spikes in indexing time and merging time line up exactly with the moments when indexing throughput is at its lowest.


Are there duplicates in your data, i.e. two JSON documents with the same primary ID?

Can you provide us an overview of the cluster by providing the full output of the cluster stats API? What does heap usage and GC look like over time? Is there anything in the logs that indicate why it is slowing down? Are you using daily indices? Do you see any correlation between slowdown and creation of new daily indices?
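If it is easier, the full output can be pulled with something like this minimal sketch (assuming the elasticsearch-py client; the host is a placeholder):

import json
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host
print(json.dumps(es.cluster.stats(), indent=2))  # same data as GET _cluster/stats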

No duplicates; we use our own custom IDs, and documents only get overwritten when the app crashed and we didn't manage to commit that the bulk had been fully processed.

If you are using custom IDs you should read this blog post. When you use a custom ID, Elasticsearch needs to check whether the document already exists for every insert, which tends to get slower as shards get larger. If you can use an efficient identifier as described in the blog post, you can reduce this effect, but it will most likely still not be as fast as allowing Elasticsearch to assign IDs automatically.
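To illustrate the difference, here is a minimal sketch using the elasticsearch-py bulk helpers (index, type and field names are placeholders): supplying _id forces the per-document existence check, while omitting it lets Elasticsearch auto-generate an ID and append the document directly.

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])  # placeholder host
events = [{"our_primary_id": "evt-1", "msg": "example"}]  # placeholder data

def actions(events, use_custom_id):
    for ev in events:
        action = {
            "_index": "tenant-a-2017.11.14",  # placeholder daily index
            "_type": "event",                 # placeholder type (ES 5.x)
            "_source": ev,
        }
        if use_custom_id:
            # Forces an existence check for this ID on every insert,
            # which gets more expensive as the shard grows.
            action["_id"] = ev["our_primary_id"]
        # Without "_id", Elasticsearch generates one and skips the lookup.
        yield action

helpers.bulk(es, actions(events, use_custom_id=False))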

1 Like

I'll provide the full cluster stats output a bit later; right now the load is pretty low, so indexing works well.
I do see a correlation between new indices and old ones: new indices process data much faster. As for the creation of new indices, I pre-create the indices for the next day using a separate app. It was hard for the ES cluster to create lots of indices and ingest data at the same time, so I had a 5-15 minute slowdown before I switched to the pre-creation approach.
I checked hot threads during a slowdown: one or two machines are doing merges, and at the same time those machines have 100-200 bulk requests queued while the others are doing nothing.

The reason we use our own custom IDs is to prevent data duplication.
Our workflow: read data from Kafka, push it to ES, then commit the offset back to Kafka.
If our app crashes after pushing to ES but before committing to Kafka, we will reprocess a couple of thousand messages and index them twice.
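Roughly, the loop looks like the following simplified sketch (assuming the kafka-python and elasticsearch-py clients; topic, broker, index and field names are placeholders). The offset is committed only after the bulk succeeds, so a crash in between just replays messages, and the deterministic _id turns the replay into harmless overwrites.

import json
from elasticsearch import Elasticsearch, helpers
from kafka import KafkaConsumer  # assumption: kafka-python client

es = Elasticsearch(["http://localhost:9200"])  # placeholder host
consumer = KafkaConsumer(
    "events",                             # placeholder topic
    bootstrap_servers="localhost:9092",   # placeholder broker
    enable_auto_commit=False,
)

while True:
    batch = consumer.poll(timeout_ms=1000)
    actions = []
    for records in batch.values():
        for record in records:
            doc = json.loads(record.value)
            actions.append({
                "_index": "tenant-a-2017.11.14",  # placeholder daily index
                "_type": "event",
                "_id": doc["our_primary_id"],     # deterministic ID = dedup on replay
                "_source": doc,
            })
    if actions:
        helpers.bulk(es, actions)
    # Commit only after ES accepted the bulk; a crash before this line means
    # the same messages are re-read and simply overwrite themselves.
    consumer.commit()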

{

"_nodes" : {
"total" : 21,
"successful" : 21,
"failed" : 0
},
"cluster_name" : "stage-escluster",
"timestamp" : 1510664066932,
"status" : "green",
"indices" : {
"count" : 1080,
"shards" : {
"total" : 6974,
"primaries" : 3487,
"replication" : 1.0,
"index" : {
"shards" : {
"min" : 2,
"max" : 60,
"avg" : 6.457407407407407
},
"primaries" : {
"min" : 1,
"max" : 30,
"avg" : 3.2287037037037036
},
"replication" : {
"min" : 1.0,
"max" : 1.0,
"avg" : 1.0
}
}
},
"docs" : {
"count" : 5130705993,
"deleted" : 1996
},
"store" : {
"size_in_bytes" : 14100406765796,
"throttle_time_in_millis" : 0
},
"fielddata" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
},
"query_cache" : {
"memory_size_in_bytes" : 0,
"total_count" : 0,
"hit_count" : 0,
"miss_count" : 0,
"cache_size" : 0,
"cache_count" : 0,
"evictions" : 0
},
"completion" : {
"size_in_bytes" : 0
},
"segments" : {
"count" : 118255,
"memory_in_bytes" : 26172275408,
"terms_memory_in_bytes" : 19511538176,
"stored_fields_memory_in_bytes" : 4999585136,
"term_vectors_memory_in_bytes" : 0,
"norms_memory_in_bytes" : 15136384,
"points_memory_in_bytes" : 1058100340,
"doc_values_memory_in_bytes" : 587915372,
"index_writer_memory_in_bytes" : 7610750628,
"version_map_memory_in_bytes" : 85053111,
"fixed_bit_set_memory_in_bytes" : 0,
"max_unsafe_auto_id_timestamp" : -1,
"file_sizes" : { }
}
},
"nodes" : {
"count" : {
"total" : 21,
"data" : 18,
"coordinating_only" : 0,
"master" : 3,
"ingest" : 21
},
"versions" : [
"5.5.0"
],
"os" : {
"available_processors" : 156,
"allocated_processors" : 156,
"names" : [
{
"name" : "Linux",
"count" : 21
}
],
"mem" : {
"total_in_bytes" : 648554434560,
"free_in_bytes" : 15388495872,
"used_in_bytes" : 633165938688,
"free_percent" : 2,
"used_percent" : 98
}
},
"process" : {
"cpu" : {
"percent" : 1187
},
"open_file_descriptors" : {
"min" : 977,
"max" : 1745,
"avg" : 1618
}
},
"jvm" : {
"max_uptime_in_millis" : 1532237262,
"versions" : [
{
"version" : "1.8.0_31",
"vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
"vm_version" : "25.31-b07",
"vm_vendor" : "Oracle Corporation",
"count" : 21
}
],
"mem" : {
"heap_used_in_bytes" : 164761887000,
"heap_max_in_bytes" : 333647708160
},
"threads" : 2443
},
"fs" : {
"total_in_bytes" : 39368577171456,
"free_in_bytes" : 10708432044032,
"available_in_bytes" : 8729318752256
},
"plugins" : [
{
"name" : "discovery-ec2",
"version" : "5.5.0",
"description" : "The EC2 discovery plugin allows to use AWS API for the unicast discovery mechanism.",
"classname" : "org.elasticsearch.discovery.ec2.Ec2DiscoveryPlugin",
"has_native_controller" : false
},
{
"name" : "x-pack",
"version" : "5.5.0",
"description" : "Elasticsearch Expanded Pack Plugin",
"classname" : "org.elasticsearch.xpack.XPackPlugin",
"has_native_controller" : true
}
],
"network_types" : {
"transport_types" : {
"netty4" : 21
},
"http_types" : {
"netty4" : 21
}
}
}
}

That is an excellent reason to use custom IDs. You may get improved performance if you can make sure they are efficient though.

Thanks, I'll test today replacing the custom ID with a timestamp plus the values of some of our fields.
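Something like this is what I have in mind (just a rough sketch; the fields are placeholders): putting a timestamp first keeps the IDs roughly ordered with a shared prefix, which should be friendlier to Lucene than a fully random ID.

import time

def make_doc_id(event):
    # Placeholder fields; the idea is a leading, roughly increasing timestamp
    # component so consecutive IDs share a common ordered prefix.
    ts = int(event.get("timestamp_ms", time.time() * 1000))
    return "%013d-%s-%s" % (ts, event["tenant"], event["sequence"])

print(make_doc_id({"timestamp_ms": 1510664066932, "tenant": "t42", "sequence": 7}))
# -> 1510664066932-t42-7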

A bit more detail about the data flow:
Our app has 10 threads reading from 40 Kafka partitions. Each batch from Kafka contains data for multiple indices, for example 1000 messages for one index, 100 for another, 30 for a third, and so on; we pack it all into one bulk request and send it to ES.
My other thought is to send the data to ES not as one bulk request mixing several indices, but as separate bulk requests, one per index.
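For the per-index idea, I mean something along the lines of this rough sketch (placeholder names, assuming the elasticsearch-py bulk helpers): group the batch by target index and send one bulk per index instead of one mixed bulk.

from collections import defaultdict
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

def send_per_index(actions):
    # Split one mixed batch into per-index buckets and bulk each bucket
    # separately, instead of one bulk mixing docs for different indices.
    by_index = defaultdict(list)
    for action in actions:
        by_index[action["_index"]].append(action)
    for index_name, index_actions in by_index.items():
        helpers.bulk(es, index_actions)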

The problem is still present: closer to the end of the day, when the indices become bigger, merging takes a lot of time and slows indexing down significantly.

Is that with a Lucene-friendly document ID? Are you grouping the outputs per index?

I tried switching to the ES auto-generated IDs, but the behaviour is still the same.

If you have not already, it may be worthwhile trying to reduce the number of merging threads to 1 through the index.merge.scheduler.max_thread_count setting.
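Applying that would look something like the following sketch (assuming the elasticsearch-py client; the host and index pattern are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

# One merge thread per shard; on constrained EBS volumes this can reduce
# contention between merging and indexing threads.
es.indices.put_settings(
    index="tenant-*",  # placeholder index pattern
    body={"index.merge.scheduler.max_thread_count": 1},
)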

Hi Andriy,
It might help to disable refresh interval during ingestion and set replicas to 0.
Of course, in case you don't need to search while importing documents.
Take a look at my post here as well: Elasticsearch indexing performance: throttle merging
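For completeness, that would be something along these lines (a sketch with placeholder names, only applicable when search during ingestion is not needed):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # placeholder host

es.indices.put_settings(
    index="tenant-a-2017.11.14",  # placeholder index
    body={
        "index.refresh_interval": "-1",  # disable refreshes during the load
        "index.number_of_replicas": 0,   # add replicas back after loading
    },
)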

Hi, thanks for the response. Unfortunately I can't do that, since it's streaming data (24/7) and it has to be available for search within 30 seconds.

Even though you can not disable the refresh interval, you may be able to increase it a bit and still meet your SLA. It would be interesting to see what impact this has.

For the record, the hot threads pasted above suggest there might be some sparse doc-value fields, which tend to slow down merging and would have the described side effect of making indexing slower and slower over time (as merges get bigger). Elasticsearch would perform better with fewer, denser fields, or, if changing the schema is not an option, an upgrade to Elasticsearch 6.0 would help too, since 6.0 has much better support for sparse fields.

2 Likes

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.