I'm having problems with the Elasticsearch bulk API, but I'm still new to ES, so I don't know how to debug it properly.
The problem:
I'm trying to index 2 million documents using the bulk API, in chunks of 1000 documents each. The first 100k documents go quickly, 100k to 150k is noticeably slower, 150k to 200k is very slow, and everything past 200k is extremely slow. To give you some perspective: past 200k, ES is indexing 1000 documents every 5 minutes.
The system:
Elasticsearch: 7.4.2
Windows: 10
RAM: 16 GB
Processor: Intel Core i7-8550U
I haven't touched any of the ES configuration files.
What I tried:
Everything I tried was based on things I read on the web, and nothing really helped. When creating the index, I tried using these settings:
I suggest not setting refresh_interval at all. By default, a shard that has not received a search request for 30 seconds stops its periodic refreshes, which is exactly what you want during a large bulk load; by setting refresh_interval explicitly you override this, which results in unnecessary refreshes. I also suggest setting number_of_shards: 1. number_of_replicas: 0 is correct if this is a single-node cluster.
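As a rough illustration of the settings suggested above, here is a minimal sketch of a create-index request body built in Python. The helper name `build_index_body` is hypothetical, and sending the body to the cluster (as the body of `PUT /<your-index>`) is left out; the point is only that `refresh_interval` is deliberately absent.

```python
import json


def build_index_body(number_of_shards: int = 1, number_of_replicas: int = 0) -> dict:
    """Return a create-index body; refresh_interval is intentionally left unset."""
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


# Defaults match the suggestion for a single-node cluster.
body = build_index_body()
print(json.dumps(body, indent=2))
```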
Use the nodes hot threads API to determine exactly what the node is spending all its time doing. If you need help interpreting the output, please share it here: GET _nodes/hot_threads?threads=9999.
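If you would rather capture the output from a script than from a console, the same request can be sketched with the Python standard library. This assumes an unsecured node reachable over plain HTTP (for example `http://localhost:9200`, the default port); the function names are hypothetical, and a secured cluster would need authentication added.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url: str, threads: int = 9999) -> str:
    """Build the URL for GET _nodes/hot_threads?threads=<n>."""
    query = urlencode({"threads": threads})
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{query}"


def fetch_hot_threads(base_url: str, threads: int = 9999, timeout: float = 30.0) -> str:
    """Call the hot threads API and return its plain-text report."""
    with urlopen(hot_threads_url(base_url, threads), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_hot_threads("http://localhost:9200")` while the bulk load is crawling returns a text report you can paste into the thread.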
Thank you so much for answering. Thanks to you I could see that the node was healthy, and I found that the real problem was the way the data was being pulled from the MySQL database. After fixing that, ES is indexing around 10 million documents in less than an hour, which I'm very happy about (though I don't know whether this is slow by normal standards). One more thing I wanted to ask: could you briefly explain why number_of_shards: 1 is correct for a single-node cluster, or point me to some literature where I can read about this? Thank you very much once again.
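For anyone else hitting the same symptom, one rough way to tell whether the source or Elasticsearch is the slow side is to time the two phases separately. The sketch below is only an assumption about how such a pipeline might be structured: `chunks` stands for whatever iterator yields batches from the database, and `send_chunk` stands for whatever function sends one batch to the bulk API. Both are hypothetical placeholders.

```python
import time
from typing import Callable, Iterable, Tuple


def profile_pipeline(
    chunks: Iterable[list], send_chunk: Callable[[list], None]
) -> Tuple[float, float]:
    """Return (seconds spent fetching chunks, seconds spent sending them)."""
    fetch_seconds = 0.0
    send_seconds = 0.0
    iterator = iter(chunks)
    while True:
        # Time how long the source takes to produce the next chunk.
        start = time.perf_counter()
        try:
            chunk = next(iterator)
        except StopIteration:
            break
        fetch_seconds += time.perf_counter() - start

        # Time how long the sink (e.g. a bulk request) takes to accept it.
        start = time.perf_counter()
        send_chunk(chunk)
        send_seconds += time.perf_counter() - start
    return fetch_seconds, send_seconds
```

If the fetch time dominates and grows as the load progresses, the bottleneck is upstream of Elasticsearch.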
That was only a suggestion; you might see some benefit from more shards, depending on exactly how you're using your cluster. Too many shards is a more common problem than too few, so I suggest benchmarking the effect before adding shards.
However, number_of_replicas: 0 is certainly correct for a single-node cluster.