Recently I encountered a strange situation. I use ES for storing Nginx logs.
I use my own code to pull logs from Kafka and bulk-push them to ES. The indexing speed is about 10K to 20K documents per second, which comfortably handles saving 700 million logs per day. I create a new index every day. 700M logs occupy about 1.7TB of disk, and I have 8TB, so I keep 4-5 days of logs for analysis. This setup ran stably for more than 10 days.
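Roughly, the consumer works like this (a minimal sketch, assuming kafka-python and the official Python ES client; the topic name, hosts, and batch size are placeholders, not my exact code):

```python
# Minimal sketch of the Kafka -> ES bulk indexer (illustrative names only).
import json
from datetime import date

from kafka import KafkaConsumer
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["es-node1:9200", "es-node2:9200"])
consumer = KafkaConsumer("nginx-logs", bootstrap_servers="kafka:9092")

def to_actions(messages, index):
    # One bulk action per log line; _source is the parsed Nginx log record.
    for msg in messages:
        yield {"_index": index, "_type": "logs", "_source": json.loads(msg.value)}

batch = []
for msg in consumer:
    batch.append(msg)
    if len(batch) >= 5000:
        # One index per day, e.g. logstash-kafka-11.29
        index = "logstash-kafka-%s" % date.today().strftime("%m.%d")
        helpers.bulk(es, to_actions(batch, index))
        batch = []
```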
But starting one day last week, after I added a new date-type field and changed the URL field from 'not_analyzed' to an analyzed 'string' field, I saw the cluster's bulk indexing speed occasionally drop dramatically.
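The mapping change was essentially the following, applied to each new daily index (a sketch; "request_time" is a hypothetical name for the new date field, and the index name is illustrative):

```python
# Sketch of the mapping change that preceded the slowdown.
from elasticsearch import Elasticsearch

es = Elasticsearch(["es-node1:9200"])
mapping = {
    "logs": {
        "properties": {
            "URL": {"type": "string"},         # before: {"type": "string", "index": "not_analyzed"}
            "request_time": {"type": "date"},  # the newly added date field
        }
    }
}
es.indices.create(index="logstash-kafka-11.30", body={"mappings": mapping})
```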
There are two machines in this cluster. I noticed that one machine's load occasionally spikes to 50.0 with CPU above 90%, while at the same time the other machine's CPU drops below 10%. The spike alternates between the two machines. During these busy periods, my ES client sometimes receives timeout errors.
I investigated the busy periods. The /_nodes/hot_threads API gave me some information:
```
91.9% (459.5ms out of 500ms) cpu usage by thread 'elasticsearch[nginx2][[logstash-kafka-11.29][17]: Lucene Merge Thread #67]'
  2/10 snapshots sharing following 12 elements
    org.apache.lucene.codecs.lucene50.Lucene50DocValuesConsumer.addNumericField(Lucene50DocValuesConsumer.java:80)
```
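(I sampled that output during a busy period with a plain GET against the API; the host name below is a placeholder:)

```python
# Sample the hottest threads on each node while the load is high.
import requests

resp = requests.get("http://es-node1:9200/_nodes/hot_threads")
print(resp.text)
```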
After that, I upgraded ES from 1.7.2 to 2.1.0. The occasional high load still continues.
What is the cause of this?
Is it because I changed the URL field from not_analyzed to an analyzed string, so that Lucene has to do much more work when merging index data?
How can I optimize the performance, or at least make it smoother?
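To check whether merges really are the bottleneck, I have been polling the per-index merge stats while bulk indexing runs, along these lines (a sketch; the index name is illustrative):

```python
# Poll per-index merge statistics during a busy period.
from elasticsearch import Elasticsearch

es = Elasticsearch(["es-node1:9200"])
stats = es.indices.stats(index="logstash-kafka-11.29", metric="merge")
merges = stats["_all"]["primaries"]["merges"]
print(merges["current"], merges["total_throttled_time_in_millis"])
```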
Machine info: 16-core CPU, 16GB RAM, SSD storage.
Index mapping: all fields are not_analyzed except the URL field; there are two date-type fields.
Other settings: refresh_interval => 2m, number_of_shards => 20, number_of_replicas => 0, flush_threshold_ops => 5000
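In ES terms, each daily index is created roughly like this (a sketch, assuming the standard 2.x setting names such as index.translog.flush_threshold_ops; the index name is illustrative):

```python
# Sketch of how each daily index is created with the settings above.
from elasticsearch import Elasticsearch

es = Elasticsearch(["es-node1:9200"])
settings = {
    "index": {
        "refresh_interval": "2m",
        "number_of_shards": 20,
        "number_of_replicas": 0,
        "translog.flush_threshold_ops": 5000,
    }
}
es.indices.create(index="logstash-kafka-11.30", body={"settings": settings})
```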