Loading 500 GB of data / 3,500,000,000 documents into an ES cluster

Greetings!
I have a small cluster with the following setup:

Hardware:
2 nodes, each with 64 GB RAM, a 16-core CPU, and 4x 2 TB HDDs configured in RAID10.
There is a 1 Gbps connection between the nodes.

Configuration:
elasticsearch.yml node1.test:
cluster.name: mycluster
node.name: node1.test
node.master: true
node.data: true
#discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [node2.test]
discovery.zen.minimum_master_nodes: 2
path.data: /elasticsearch-data/
path.logs: /var/log/elasticsearch
index.store.type: niofs
network.host: 192.168.1.2
http.port: 9200
xpack.security.enabled: false

elasticsearch.yml node2.test:
cluster.name: mycluster
node.name: node2.test
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: [node1.test]
discovery.zen.minimum_master_nodes: 2
path.data: /elasticsearch-data/
path.logs: /var/log/elasticsearch
index.store.type: niofs
network.host: 192.168.1.3
http.port: 9200
xpack.security.enabled: false

I am limiting the JVM heap to 32 GB per node (as ES recommends leaving 50% of physical RAM to the OS).

I have 500 GB of newline-delimited JSON files containing 3,500,000,000 lines in total (each file has about 30,000,000 lines and is 2.5-4 GB in size).

The first time, I used esbulk (esbulk_0.5.0_amd64.deb) to load the data, with these parameters:

esbulk -size 1000 -w 16 -verbose -server 192.168.1.2:9200 -u -index bigindex myjsonfiles.json
The data was loaded into the 5 default shards, without replicas.
The tool applied the setting {"index": {"refresh_interval": "-1"}} and reported 200 OK.
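For reference, this is roughly what that setting change looks like when done by hand with the Python elasticsearch client, including putting things back after the load (a minimal sketch; the host, the 30s interval and the replica count are illustrative values I picked, not what esbulk does internally):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://192.168.1.2:9200"])

# before the load: stop periodic refreshes and (optionally) drop replicas
es.indices.put_settings(index="bigindex", body={
    "index": {"refresh_interval": "-1", "number_of_replicas": 0}
})

# ... run the bulk load here ...

# after the load: restore a refresh interval, add replicas back, force one refresh
es.indices.put_settings(index="bigindex", body={
    "index": {"refresh_interval": "30s", "number_of_replicas": 1}
})
es.indices.refresh(index="bigindex")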

The load ran overnight and took approximately 12 hours, but out of the 3,500,000,000 documents I managed to load only 2,000,000,000.

I have noticed during the load following error occurring all the time:
2018/06/29 12:15:25 error during bulk operation, check error details, try less workers (lower -w value) or increase thread_pool.bulk.queue_size in your nodes

I tried limiting the tool to two workers with a batch size of 50 and the same error kept happening (not to mention that the load would take forever at that rate).

I also tried splitting the files into smaller ones (100,000 lines, approx. 8 MB each), and still hit the same problem.
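One option would be to retry rejected chunks with backoff instead of adding more workers; a rough sketch with the elasticsearch-py bulk helpers (this is not the esbulk internals, and the file path, type name and numbers are placeholders):

import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://192.168.1.2:9200"], timeout=120)

def actions(path):
    # stream one NDJSON file line by line instead of reading it all into memory
    with open(path) as f:
        for line in f:
            yield {"_index": "bigindex", "_type": "doc", "_source": json.loads(line)}

# streaming_bulk retries chunks rejected with HTTP 429 (full bulk queue)
# using exponential backoff, which is gentler than adding more workers
for ok, item in helpers.streaming_bulk(
        es,
        actions("myjsonfiles.json"),
        chunk_size=1000,       # documents per bulk request
        max_retries=5,         # retry rejected chunks instead of failing
        initial_backoff=2,     # seconds; doubles on every retry
        raise_on_error=False):
    if not ok:
        print(item)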

What would you recommend for loading those 500 GB of data at a reasonable speed and in a reasonable time?
Should I tweak the default number of shards? (I have read a recommendation that I should not go above 20-25 shards per 1 GB of JVM heap.)

How else could I approach the problem of loading this much data?
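Regarding the shard question above: as far as I understand, the primary shard count can only be set at index creation time, so tweaking it would look roughly like this (a sketch with the Python client; the shard count of 10 is purely an illustrative value, not a tested recommendation):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://192.168.1.2:9200"])

# create the index up front with an explicit primary shard count;
# replicas and refresh can be relaxed for the duration of the load
es.indices.create(index="bigindex", body={
    "settings": {
        "number_of_shards": 10,     # illustrative value only
        "number_of_replicas": 0,
        "refresh_interval": "-1"
    }
})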

Thank you for your time in advance!

Have you followed these guidelines and set index.merge.scheduler.max_thread_count to an appropriate value? Why are you setting index.store.type: niofs?

I have configured: "max_thread_count": "1"

"settings": {
"index": {
"refresh_interval": "-1",
"number_of_shards": "5",
"provided_name": "bigindex",
"merge": {
"scheduler": {
"max_thread_count": "1"
}
},
"creation_date": "1530270568301",
"number_of_replicas": "1",
"uuid": "QLplGm7bQmyc3Y1N-JVR-A",
"version": {
"created": "6020499"
}
}
}
}

niofs was set because I thought it was a good idea and would increase performance, after reading: https://thoughts.t37.net/how-we-reindexed-36-billions-documents-in-5-days-within-the-same-elasticsearch-cluster-cd9c054d1db8

Do you have monitoring installed? What does CPU usage, disk I/O and iowait look like while you are indexing?

At this stage I was just monitoring with htop, Kibana monitoring, and iostat.
Kibana shows indexing speeds of approx. 50k-60k docs/s.
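As a rough cross-check: at 50-60k docs/s, all 3,500,000,000 documents would need about 3.5e9 / 55,000 ≈ 64,000 s, i.e. 17-18 hours, which is consistent with getting through roughly 2 billion of them in the 12-hour overnight window.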

Result from iostat during loading:

avg-cpu: %user %nice %system %iowait %steal %idle
8.77 0.00 0.56 8.96 0.00 81.70

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 94.00 0.00 121.00 0.00 3260.00 53.88 0.99 7.80 0.00 7.80 5.39 65.20
sdb 0.00 94.00 0.00 121.00 0.00 3260.00 53.88 0.90 7.11 0.00 7.11 5.06 61.20
md0 0.00 0.00 0.00 369.00 0.00 6528.00 35.38 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 110.00 0.00 106.00 0.00 3192.00 60.23 0.61 5.40 0.00 5.40 4.42 46.80
sdd 0.00 110.00 0.00 106.00 0.00 3192.00 60.23 0.56 4.94 0.00 4.94 3.85 40.80
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 4.00 0.00 16.00 8.00 0.11 28.00 0.00 28.00 24.00 9.60
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 349.00 0.00 6512.00 37.32 3.50 9.78 0.00 9.78 2.85 99.60

If you look at the last line there, it looks like you may be limited by the speed of your storage, as it shows 99.6% utilisation. To improve performance you may therefore need to invest in faster storage or scale out the cluster. You could also try RAID0 instead of RAID10: with four drives, RAID10 mirrors every write, so you only get roughly two drives' worth of write throughput, while RAID0 stripes across all four (at the cost of redundancy).

Right, good catch.
I will break up the RAID10 and go with RAID0 across all drives.

I am now using RAID0 on the 4 spinning drives. The performance improvement is noticeable and it has been stable.

Side note: before, I was using esbulk; now I am using esload to load data into the ES cluster. Both are available on GitHub.

iowait is too high. Try with async.

With the esload tool I was able to load approximately 500 GB / 2.2 billion documents (newline-delimited JSON) in 12 hours.

INFO loader.py:index:105 uploaded 80000 documents (errors: n/a) (6.14MB) in 1.10s (5.60MB/s)
INFO loader.py:index:105 uploaded 80000 documents (errors: n/a) (6.09MB) in 0.94s (6.47MB/s)
INFO loader.py:index:105 uploaded 80000 documents (errors: n/a) (6.20MB) in 0.96s (6.47MB/s)
INFO loader.py:index:105 uploaded 80000 documents (errors: n/a) (6.13MB) in 0.99s (6.20MB/s)
INFO loader.py:index:105 uploaded 80000 documents (errors: n/a) (6.17MB) in 0.88s (7.02MB/s)
INFO loader.py:index:105 uploaded 80000 documents (errors: n/a) (6.16MB) in 2.23s (2.76MB/s)
INFO loader.py:index:105 uploaded 80000 documents (errors: n/a) (6.12MB) in 0.92s (6.67MB/s)

I start it after work, and overnight it gets the job done.


Small update:

With loader.py I was able to load approximately 700 GB (3,000,000,000 documents, newline-delimited JSON with a maximum line length of 400 characters). The load ran with batches of 10,000 documents per bulk request and 16 threads.
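For anyone who wants to reproduce this without the exact tool, the equivalent approach with the stock elasticsearch-py helpers looks roughly like this (a sketch, not loader.py itself; the glob path and type name are placeholders, and the thread count / chunk size mirror the settings above):

import glob
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://192.168.1.2:9200", "http://192.168.1.3:9200"], timeout=120)

def actions():
    # stream every NDJSON file line by line so nothing large is held in memory
    for path in sorted(glob.glob("/data/ndjson/*.json")):
        with open(path) as f:
            for line in f:
                yield {"_index": "bigindex", "_type": "doc", "_source": json.loads(line)}

# parallel_bulk fans the bulk requests out over a thread pool
for ok, item in helpers.parallel_bulk(es, actions(), thread_count=16, chunk_size=10000):
    if not ok:
        print(item)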

Hardware used:

2 ES nodes in a cluster, connected over 1 Gbps Ethernet.
Node configuration:
64 GB ECC RAM, AMD Threadripper 1900X CPU (16 threads), 4x 2 TB HDDs configured in RAID0, running Debian 9.3.

I am not running aggressive queries against the cluster; I am just using it to store data.

At the moment I have:
Indices: 44
Memory: 40.5 GB / 67.8 GB
Total Shards: 215
Unassigned Shards: 2
Documents: 3,249,292,622
Data: 1.0 TB