Hi, I'm using ES on 3 servers as storage for streaming logs.
Java version is Oracle JVM 1.7.0_72 64bit.
ES version is 1.6.0
1 master node: 32 cores, 128 GB memory, 4 x 1 TB HDD, ES heap 7 GB.
2 data nodes: 28 cores, 32 GB memory, 4 x 1 TB HDD, ES heap 30 GB.
Linux setup: vm.swappiness = 1, max open files = 250,000.
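For reference, these OS settings would typically be applied via config fragments like the ones below (the file paths and the `elastic` user name are assumptions, adjust for your distribution and ES user):

```
# /etc/sysctl.conf -- keep swapping to a minimum
vm.swappiness = 1

# /etc/security/limits.conf -- raise the open-file limit for the ES user
elastic  soft  nofile  250000
elastic  hard  nofile  250000
```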
A new index is created every 15 minutes (6,000,000+ docs each).
Open index count : 288
Shards per index: 3, replicas: 1.
Configuration is almost default (except index.routing.allocation.enable is set to none).
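For clarity, that non-default setting is the per-index allocation switch; it can be applied through the update-settings API roughly like this (the index name here is hypothetical, and this needs a running cluster):

```shell
curl -XPUT 'http://localhost:9200/logs-2015.07.01-0000/_settings' -d '
{
  "index.routing.allocation.enable": "none"
}'
```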
Sometimes a single doc is as large as 400~500 KB.
The original data set has about 10x this document count (we are not storing all of it yet).
A data-ingest process (using JNI) is installed on each node and sends bulk requests to localhost, using TransportClient and BulkProcessor.
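To make the ingest path concrete, here is a minimal sketch of that setup using the ES 1.x Java API. The cluster name, index name, flush thresholds, and document body are all assumptions for illustration; this is not the actual ingest code, and it needs the ES 1.6 client jar plus a running cluster:

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

public class LogIngestSketch {
    public static void main(String[] args) {
        // Connect to the ES node on localhost (cluster name is an assumption).
        TransportClient client = new TransportClient(
                ImmutableSettings.settingsBuilder()
                        .put("cluster.name", "log-cluster") // hypothetical name
                        .build())
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // BulkProcessor batches index requests and flushes by count, size, or time.
        BulkProcessor bulk = BulkProcessor.builder(client, new BulkProcessor.Listener() {
            @Override public void beforeBulk(long id, BulkRequest req) { }
            @Override public void afterBulk(long id, BulkRequest req, BulkResponse resp) {
                if (resp.hasFailures()) {
                    System.err.println("bulk " + id + " had failures: "
                            + resp.buildFailureMessage());
                }
            }
            @Override public void afterBulk(long id, BulkRequest req, Throwable t) {
                System.err.println("bulk " + id + " failed: " + t);
            }
        })
        .setBulkActions(5000)                                // flush every 5,000 docs (assumed)
        .setBulkSize(new ByteSizeValue(10, ByteSizeUnit.MB)) // ...or every 10 MB (assumed)
        .setFlushInterval(TimeValue.timeValueSeconds(5))     // ...or every 5 seconds (assumed)
        .setConcurrentRequests(1)
        .build();

        // Hypothetical 15-minute index name and document body.
        bulk.add(new IndexRequest("logs-2015.07.01-0000", "log")
                .source("{\"message\":\"example\"}"));

        bulk.close();   // flushes any pending requests
        client.close();
    }
}
```

With this pattern, the listener's failure callbacks and response times are a good place to watch for the bulk delays described below.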
Normally, throughput is 50,000~70,000 docs per 10 seconds, with usr CPU at 15~20% and sys CPU under 5% over each 10-second interval. (CPU percentages here are relative to the whole machine: 100% corresponds to 2800% in top on a data node and 3200% on the master node.)
But recently a problem has been occurring often (about every 2~3 days). When it happens, CPU usage on all servers goes very high (up to 90%) and bulk responses are delayed.
--> Issue summary
- Bulk response time is delayed to about 2 secs (normally around 0.x sec).
- CPU is still normal at this stage. Throughput is 35,000/5 secs; bulk is delayed up to 6 secs.
- After 5 min, bulk is delayed up to 61 secs and throughput drops below 3,000.
- The ingest process's CPU climbs: sys CPU over 75%, usr CPU 15~20% over each 5-second interval. Throughput drops to 0.
- ES CPU usage climbs too, up to 3~4x its normal value (normally less than 10%).
- The problem continues while the whole ES cluster is restarting.
- The problem disappears once the restart completes (but this does not actually solve it; it is only a temporary workaround).
- The ES log level is set to debug, but no exceptions appear in the logs.
- Disk I/O and network traffic are normal.
I'd like to know how to solve this problem, and why it occurs in the first place.