Loading 500 GB of data / 3500000000 documents to ES cluster

pastechecker · June 29, 2018, 10:32am

Greetings!
I have a small cluster with following setup:

Hardware:
2x 64 GB RAM, 16 cores CPU, 4x2TB HDD configured in RAID10.
There is 1 Gbps connection between the nodes.

Configuration:
elasticsearch.yml node1.test:
cluster.name: mycluster
node.name: node1.test
node.master: true
node.data: true
#discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: [node2.test]
discovery.zen.minimum_master_nodes: 2
path.data: /elasticsearch-data/
path.logs: /var/log/elasticsearch
index.store.type: niofs
network.host: 192.168.1.2
http.port: 9200
xpack.security.enabled: false

elasticsearch.yml node2.test:
cluster.name: mycluster
node.name: node1.test
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: [node1.test]
discovery.zen.minimum_master_nodes: 2
path.data: /elasticsearch-data/
path.logs: /var/log/elasticsearch
index.store.type: niofs
network.host: 192.168.1.3
http.port: 9200
xpack.security.enabled: false

I am limiting my JVM heap to have only 32 GB per node (as ES recommends to leave 50% of the physical RAM).

I have 500 GB of json files with total number of new lines 3,500,000,000 (new line delimited json files, 30,000,000 lines, 2.5GB-4GB in size each).

First time I have used the tool to load my data:
esbulk_0.5.0_amd64.deb with parameters:

esbulk -size 1000 -w 16 -verbose -server 192.168.1.2:9200 -u -index bigindex myjsonfiles.json
it was loaded into 5 default shards without replicas.
Tool applies: applied setting: {"index": {"refresh_interval": "-1"}} with status 200 OK

The load of that data was over night lasted approx 12h, but out 3,500,000,000 I have managed to load only 2,000,000,000.

I have noticed during the load following error occurring all the time:
2018/06/29 12:15:25 error during bulk operation, check error details, try less workers (lower -w value) or increase thread_pool.bulk.queue_size in your nodes

I have tried to limit the processing in the tool to two workers with size 50 and the same error was happening (not to mention that the data load will take forever).

I have tried to split the files into smaller ones (100000 lines, approx 8MB each), and still the same problem was experienced.

What would you recommend to load that 500 GB worth data in a reasonable speed and time?
Should I tweak the default number of shards and (I have read about one recommendation that I should not go above 20-25 shards per 1GB of JVM memory HEAP)?

How I can approach the problem of loading such data in a different way?

Thank you for your time in advance!

Christian_Dahlqvist · June 29, 2018, 10:44am

Have you followed these guidelines and set index.merge.scheduler.max_thread_count to an appropriate value? why are you setting index.store.type: niofs?

pastechecker · June 29, 2018, 11:28am

I have configured: "max_thread_count": "1"

"settings": {
"index": {
"refresh_interval": "-1",
"number_of_shards": "5",
"provided_name": "bigindex",
"merge": {
"scheduler": {
"max_thread_count": "1"
}
},
"creation_date": "1530270568301",
"number_of_replicas": "1",
"uuid": "QLplGm7bQmyc3Y1N-JVR-A",
"version": {
"created": "6020499"
}
}
}
}

niofs was set because I thought that it is good idea and will increase performance after reading: https://thoughts.t37.net/how-we-reindexed-36-billions-documents-in-5-days-within-the-same-elasticsearch-cluster-cd9c054d1db8

Christian_Dahlqvist · June 29, 2018, 11:54am

Do you have monitoring installed? What does CPU usage, disk I/O and iowait look like while you are indexing?

pastechecker · June 29, 2018, 12:02pm

At this stage I was just monitoring with htop, kibana monitoring and iostat
Kibana shows indexing speeds approx. 50k/s - 60k/s

Result from iostat during loading:

avg-cpu: %user %nice %system %iowait %steal %idle
8.77 0.00 0.56 8.96 0.00 81.70

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 94.00 0.00 121.00 0.00 3260.00 53.88 0.99 7.80 0.00 7.80 5.39 65.20
sdb 0.00 94.00 0.00 121.00 0.00 3260.00 53.88 0.90 7.11 0.00 7.11 5.06 61.20
md0 0.00 0.00 0.00 369.00 0.00 6528.00 35.38 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 110.00 0.00 106.00 0.00 3192.00 60.23 0.61 5.40 0.00 5.40 4.42 46.80
sdd 0.00 110.00 0.00 106.00 0.00 3192.00 60.23 0.56 4.94 0.00 4.94 3.85 40.80
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 4.00 0.00 16.00 8.00 0.11 28.00 0.00 28.00 24.00 9.60
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 349.00 0.00 6512.00 37.32 3.50 9.78 0.00 9.78 2.85 99.60

Christian_Dahlqvist · June 29, 2018, 12:05pm

If you look at the last line there it looks like you may be limited by the speed of your storage as it looks like it shows a 99.6% utilisation. To improve performance you may therefore need to invest in faster storage to scale out the cluster. You could also try using RAID0 instead of RAID10.

pastechecker · June 29, 2018, 12:06pm

Right, good catch.
I will strip the raid10 and go with raid0 on all drives.

pastechecker · July 24, 2018, 11:32am

I am using RAID0 on 4 spinning drives. Performance is noticeable with stability.

Side note: Before I was using esbulk, right now I am using esload for placing data into ES cluster. Both of them are in the github.

zqc0512 · July 25, 2018, 2:30am

iowait is too high. try with asnyc

pastechecker · July 25, 2018, 7:28am

With the tool esload I was able to load in 12h approximately 500GB/2.2b of documents (New Line Json Delimited).

INFO loader.py:index:105 uploaded 80000 documents (errors: n/a) (6.14MB) in 1.10s (5.60MB/s)
INFO loader.py:index:105 uploaded 80000 documents (errors: n/a) (6.09MB) in 0.94s (6.47MB/s)
INFO loader.py:index:105 uploaded 80000 documents (errors: n/a) (6.20MB) in 0.96s (6.47MB/s)
INFO loader.py:index:105 uploaded 80000 documents (errors: n/a) (6.13MB) in 0.99s (6.20MB/s)
INFO loader.py:index:105 uploaded 80000 documents (errors: n/a) (6.17MB) in 0.88s (7.02MB/s)
INFO loader.py:index:105 uploaded 80000 documents (errors: n/a) (6.16MB) in 2.23s (2.76MB/s)
INFO loader.py:index:105 uploaded 80000 documents (errors: n/a) (6.12MB) in 0.92s (6.67MB/s)

Start after work, over night it will do the job.

system · August 22, 2018, 7:28am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

pastechecker · August 30, 2018, 7:54am

Small update:

With the loader.py I was able to load approximately 700GB (3,000,000,000,000 documents, json new line delimited with max line len not exceeding 400 characters). Average indexing speed used was 10,000 doc per batch with 16 threads.

Hardware used:

2 ES nodes in a cluster connected by 1Gbps Ethernet card.
Node configuration:
64 GB RAM ECC, Thread Ripper x1900 16 cores CPU, 4x2TB HDD configured in RAID0 using Debian 9.3 OS.
There was 1 Gbps connection between the nodes.

I am not doing aggressive queries on the cluster, just using it for storing data.

At the moment I have:
Indices: 44
Memory: 40.5 GB / 67.8 GB
Total Shards: 215
Unassigned Shards: 2
Documents: 3,249,292,622
Data: 1.0 TB

Topic		Replies	Views
Looking for advice on bulk loading Elasticsearch	6	946	July 6, 2017
Questions --- Regarding to Size of Bulk Load., Capasity of a Shards and Performance Elasticsearch	3	394	July 6, 2017
Bulk load performance Elasticsearch	3	914	July 6, 2017
Adding millions of documents, performance decay Elasticsearch	6	675	July 6, 2017
Bulkload performance issue Elasticsearch	2	400	September 14, 2019

Loading 500 GB of data / 3500000000 documents to ES cluster

Related topics