I have an ES 1.7.5 cluster with 20 servers. The maximum indexing performance of this cluster is about 6,000-7,000 docs/sec. The average document size is about 1.5 KB.
First (optional) question: what do you think about this speed?
For the upgrade to ES 2.3, I set up a cluster with another 20 servers running ES 2.3, but the maximum performance of this cluster is only about 3,000-3,500 docs/sec. How can I improve the speed?
The cluster configurations are almost default, except that all strings are not_analyzed and any server may become master.
Any ideas please.
What type of hardware are you using? What is the specification of the nodes? How are you ingesting data? What does load on the servers look like while you are indexing? How many indices/shards are you actively indexing into?
Can you specify this in MB/sec?
Sorry - I see now the 1.5 KB average size. That means 9-10 MB/sec, which is very poor performance for 20 servers; it would mean each server handles only about 500 KB/sec.
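The arithmetic behind that estimate can be checked quickly (midpoint values assumed for illustration):

```shell
# Back-of-envelope throughput check (assumed midpoints, not measured values)
docs_per_sec=6500          # midpoint of the reported 6000-7000 docs/sec
bytes_per_doc=1536         # ~1.5 KB average document size
nodes=20

cluster_kb=$(( docs_per_sec * bytes_per_doc / 1024 ))   # cluster-wide KB/s
per_node_kb=$(( cluster_kb / nodes ))                   # per-node KB/s

echo "cluster: ${cluster_kb} KB/s, per node: ${per_node_kb} KB/s"
```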
This is probably the cause. ES 2.x default for translog changed from async to sync. If you set the following in your ES 2.3 cluster, do you get better performance?
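The setting referred to here would presumably be the translog durability. A hedged sketch of how it could be applied to an existing index (the index name is a placeholder; note this trades safety for speed, since recently acknowledged writes can be lost on a crash):

```shell
# Sketch: revert an index to async translog flushing (pre-2.0 behavior).
# "my-index" is a placeholder name. Requires a running ES 2.x cluster.
curl -XPUT 'localhost:9200/my-index/_settings' -d '{
  "index.translog.durability": "async"
}'
```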
Let's take caution here. First, when making any change to the
index.translog.durability, it's immensely important to point out that the tradeoff is a loss of safety. Second, the performance reported here is so incredibly low that I'm skeptical that the best solution to getting back some of the performance is by adjusting the translog sync. I suspect that performance is being left on the table somewhere else and we should focus on understanding that.
It depends on the use case. For our use case, we're fine with taking that risk.
Agreed that the numbers reported are poor.
An ES 2.3.3 cluster with this option processes 12,000-18,000 docs/sec.
Much better than ES 1.7.5 with 6,000-7,000 docs/sec.
I understand the risks of using this option.
Yes, which is why, when you recommend that someone turn off translog durability, it's important to make them aware of the safety tradeoffs.
20 servers per cluster
CPU: Core i7-6700 CPU @ 3.40GHz
RAM: 64 GB (but 6 nodes in both clusters still have 48 GB - they are awaiting an upgrade)
HDD: software RAID0, 2x2 TB
ES heap size: 31 GB
Each cluster contains 2 daily rotated indices (index1-YYYY.MM.DD and index2-YYYY.MM.DD).
Both indices have almost the same size.
5 shards per index with 1 replica (2 copies of data).
Every daily index contains about 160,000,000-220,000,000 docs.
Do not use software RAID. Most importantly, a poor disk I/O setup will throttle a powerful CPU like the i7-6700. Also, software RAID0 combined with disabled transaction durability is an invitation to data loss. You should use hardware RAID with file system settings optimized for maximum throughput. Also check whether the 2 TB drives are built for server workloads or for archival purposes.
You can double indexing speed with the following procedure: 1) create the new index with replica level 0, 2) bulk index, 3) raise the replica level to 1 before enabling search on that index in the application.
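The three steps above could be sketched as follows (index name and date are placeholders, matching the daily-index naming pattern mentioned earlier):

```shell
# 1) Create the daily index with no replicas (placeholder name/date).
curl -XPUT 'localhost:9200/index1-2016.06.20' -d '{
  "settings": { "number_of_shards": 5, "number_of_replicas": 0 }
}'

# 2) ... run the bulk indexing against this index ...

# 3) Add the replica back before enabling search on the index.
curl -XPUT 'localhost:9200/index1-2016.06.20/_settings' -d '{
  "index.number_of_replicas": 1
}'
```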
But I think that alone still does not explain the poor performance.
Could you clarify this? Are you talking about matching stripe sizes, etc?
How is your cluster set up? Do you have data nodes only, or do you also use a dedicated master node? I have seen great performance since I started using a dedicated master node alongside the data nodes.
Also, how much memory do you allocate to the Elasticsearch instances, i.e. what value do you use for ES_HEAP_SIZE?
I cannot use hardware RAID.
I can use software RAID level 0 or level 1 over the two disks, or mount the two disks separately on each server.
I use data nodes only. Any node may become master. ES_HEAP_SIZE=31G
Are you using bulk indexing? If so, what bulk size do you use?
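For reference, a minimal _bulk request looks like the sketch below (index/type names are placeholders; in practice you would batch on the order of a few MB per request, e.g. 1,000-5,000 docs at ~1.5 KB each, and tune from there):

```shell
# Sketch of the _bulk API (ES 1.x/2.x); note the action line before each
# document, and the required trailing newline. Names are placeholders.
curl -XPOST 'localhost:9200/_bulk' -d '
{"index":{"_index":"index1-2016.06.20","_type":"doc"}}
{"field1":"value1"}
{"index":{"_index":"index1-2016.06.20","_type":"doc"}}
{"field1":"value2"}
'
```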
Can you try setting it up with one dedicated master node that contains no data? That could also improve the performance of your ES 1.7 cluster.
Any indication that ES (either version) is hitting threadpool limits and dropping events (e.g. bulk rejections)?
What about Logstash - any indication in its logs that it is not able to keep up?
The cat API gives great insight into this.
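For example, a sketch of checking bulk rejections with the cat API (column names assumed from the ES 2.x thread_pool cat output):

```shell
# A growing "bulk.rejected" count means the bulk threadpool queue is
# overflowing and the cluster is pushing back on indexing requests.
curl 'localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'
```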
Do you have any monitoring solution to give metrics on performance (Marvel, ElasticHQ, elasticsearch-head)?
Avoid RAID5/6; prefer RAID 0/1/1+0. Enlarge read-ahead settings with RAID 1+0, match the file system stripe size at creation to the controller's read-ahead block settings, add mount options (for XFS: nobarrier,noatime,nodiratime), and tune the kernel I/O scheduler/elevator for high IOPS (maybe a different queue, or even noop, allows more throughput than deadline).
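A sketch of the OS-level tuning described above (device names and mount points are placeholders; nobarrier in particular trades crash safety for speed, so benchmark and weigh the risk):

```shell
# Remount an XFS volume with throughput-oriented options
# (placeholder device/mount point; nobarrier reduces crash safety).
mount -o remount,nobarrier,noatime,nodiratime /dev/md0 /var/lib/elasticsearch

# Switch the I/O scheduler for the underlying disk (placeholder device).
echo noop > /sys/block/sda/queue/scheduler

# Enlarge read-ahead (value is in 512-byte sectors).
blockdev --setra 4096 /dev/md0
```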
Most important: run your benchmarks to be sure to find optimal settings.