We are using ElasticSearch 5.1.2:
Bootstrap is passed except system_call_filter, because we are running on CentOS 6
Here is the setting:
The document to be indexed have 1200 fields, each document is near 4K. But all the fields are consisted of three types: date, keyword, numberic. 24 millions records in total.
We want to use ElasticSearch to do near real-time aggregations.
We have three common nodes(master:true; data:true; ingest:default true) & one standalone ingest node(master:false; data:false; ingest:true). All these nodes jvm config are same as the following.Each node have 64GB memory, SSD, 24cores.
We load data into ES by using elasticsearch-spark, DataFrame.saveToEs() method, which is bulk insert. We use 10 executers which means 10 clients to parallel insert to the ingest node.
Here is the performance we got.
One of data node's
With 4 nodes, we only got index rate near 1000 document/s。 near 4 ms for indexing one document.
All the nodes' cpu usage is under 10%. it won't be bottleneck.Don't have throttle, bulk queue, bulk rejections, index queue, index rejections or other issues.
Increase client, the performance will deteriorate.
Settings of Index:
Settings of Cluster
Other settings are default.
My question is why indexing performance is so poor, only 1000 document/s and why ingest node is so heavy, happened so much young gc, even full gc.
From document, ingest does:
• intercepts bulk and index requests
• applies the transformations
• passes the documents back to the index or bulk APIs
We didn't use pipeline.