I need to index a large amount of data in Elasticsearch. Each document has 3 fields, and I need to index up to 20 million such documents in 5 minutes. I am currently reaching 10 million with the following settings:
- Bulk processor with default settings
- 4 GB heap assigned to Elasticsearch via VM options, which gives around 400 MB to the index buffer
- Replica shards disabled
- Refresh interval disabled
- A single thread populating requests into the bulk processor
- Default merging
- Swapping enabled
- Single node containing 5 shards running on a single server.
- Server has 8 cores and 32 GB RAM
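
For reference, the index-level items in the list above (5 shards, no replicas, refresh disabled) correspond roughly to a settings body like the one below. This is only a sketch of the shape: it is built in Python purely for illustration, and it assumes refresh is disabled with `"-1"`. Heap size, swapping and merge behaviour are configured elsewhere and are not shown.

```python
import json

# Index-level settings matching the list above (illustrative sketch).
# Assumption: "refresh_interval": "-1" is how refresh was disabled.
index_settings = {
    "settings": {
        "index": {
            "number_of_shards": 5,     # single node holding 5 shards
            "number_of_replicas": 0,   # replica shards disabled
            "refresh_interval": "-1",  # refresh disabled during the load
        }
    }
}

# Serialise to the JSON body that would be sent when creating the index.
body = json.dumps(index_settings, indent=2)
print(body)
```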
I observed that running two nodes on the same server and feeding data to both concurrently from 2 clients degraded the indexing speed.
What else can be done to hit the 20 million mark, or to improve indexing speed further?
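
To put the gap in numbers, here is a small back-of-the-envelope calculation of the sustained rates involved. It uses only the figures from this post; the per-shard number assumes documents are spread evenly across the 5 shards, which may not hold exactly in practice.

```python
# Rough throughput arithmetic from the figures in this post.
WINDOW_SECONDS = 5 * 60       # the 5 minute window
TARGET_DOCS = 20_000_000      # required
CURRENT_DOCS = 10_000_000     # currently achieved
SHARDS = 5                    # shards on the single node


def docs_per_second(total_docs: int, seconds: int) -> float:
    """Average sustained indexing rate needed to finish within the window."""
    return total_docs / seconds


current_rate = docs_per_second(CURRENT_DOCS, WINDOW_SECONDS)
target_rate = docs_per_second(TARGET_DOCS, WINDOW_SECONDS)
speedup_needed = target_rate / current_rate

# Assumption: documents are distributed evenly across shards.
target_rate_per_shard = target_rate / SHARDS

print(f"current rate:          {current_rate:,.0f} docs/s")
print(f"target rate:           {target_rate:,.0f} docs/s")
print(f"speedup needed:        {speedup_needed:.1f}x")
print(f"target rate per shard: {target_rate_per_shard:,.0f} docs/s")
```

So the current setup sustains roughly 33,000 documents per second, and the target needs roughly 67,000, i.e. about double.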