Greetings!
I have a small cluster with the following setup:
Hardware:
2 nodes, each with 64 GB RAM, a 16-core CPU, and 4x 2 TB HDDs configured in RAID 10.
There is a 1 Gbps connection between the nodes.
Configuration:
elasticsearch.yml node1.test:
cluster.name: mycluster
node.name: node1.test
node.master: true
node.data: true
#discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [node2.test]
discovery.zen.minimum_master_nodes: 2
path.data: /elasticsearch-data/
path.logs: /var/log/elasticsearch
index.store.type: niofs
network.host: 192.168.1.2
http.port: 9200
xpack.security.enabled: false
elasticsearch.yml node2.test:
cluster.name: mycluster
node.name: node2.test
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: [node1.test]
discovery.zen.minimum_master_nodes: 2
path.data: /elasticsearch-data/
path.logs: /var/log/elasticsearch
index.store.type: niofs
network.host: 192.168.1.3
http.port: 9200
xpack.security.enabled: false
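Both nodes start and the cluster forms; for what it's worth, I am only checking that with a plain health call, e.g.:
curl 'http://192.168.1.2:9200/_cluster/health?pretty'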
I am limiting the JVM heap to 32 GB per node (ES recommends leaving about 50% of the physical RAM to the OS and filesystem cache).
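Concretely, that means the usual pair of JVM options on each node (showing only the relevant lines):
-Xms32g
-Xmx32g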
I have 500 GB of newline-delimited JSON files with roughly 3,500,000,000 lines in total (each file is 2.5-4 GB and holds about 30,000,000 lines).
For the first load attempt I used esbulk (esbulk_0.5.0_amd64.deb) with the following parameters:
esbulk -size 1000 -w 16 -verbose -server 192.168.1.2:9200 -u -index bigindex myjsonfiles.json
The data was loaded into the default 5 shards, without replicas.
On start the tool applied the following setting (status 200 OK): {"index": {"refresh_interval": "-1"}}
The load ran overnight for approximately 12 hours, but out of the 3,500,000,000 lines I managed to load only about 2,000,000,000.
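(The indexed document count can be checked with the _cat API, e.g.:
curl 'http://192.168.1.2:9200/_cat/count/bigindex?v'
)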
During the load I noticed the following error occurring all the time:
2018/06/29 12:15:25 error during bulk operation, check error details, try less workers (lower -w value) or increase thread_pool.bulk.queue_size in your nodes
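The error suggests raising thread_pool.bulk.queue_size. I have not changed it yet, but as far as I understand it is a static node setting that would go into elasticsearch.yml on both nodes, along these lines (the value 2000 is just a guess on my part):
thread_pool.bulk.queue_size: 2000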
I tried limiting the tool to two workers with a batch size of 50, and the same error kept happening (not to mention that at that rate the load would take forever).
I also tried splitting the files into smaller ones (100,000 lines, approx. 8 MB each), and still ran into the same problem.
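(For reference, splitting like that can be done with GNU split, e.g.:
split -l 100000 -d --additional-suffix=.json myjsonfiles.json part_
with each resulting part then fed to esbulk separately.)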
What would you recommend for loading this 500 GB of data at a reasonable speed and in a reasonable time?
Should I tweak the default number of shards? (I have read a recommendation not to go above 20-25 shards per 1 GB of JVM heap.)
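For example, would pre-creating the index with explicit settings before the load be the right direction? A sketch of what I have in mind (the shard count of 10 is only a guess):
curl -X PUT 'http://192.168.1.2:9200/bigindex' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 10,
    "number_of_replicas": 0,
    "refresh_interval": "-1"
  }
}'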
Is there a different way I could approach loading this data?
Thank you for your time in advance!