Anyone with Petabyte indexing experience using parallel tasks?

Hi,

I'm currently at the research stage, so my test configuration is not suitable for big data indexing. However, I'm trying to find a way to extrapolate measurements from a comfortable amount of data and see what happens during indexing, how the cluster behaves, etc.
Having tested a solution I built myself in Erlang that concurrently pushes data into an Elasticsearch cluster, I start getting timeouts from the cluster when sending bulks of even fewer than 500 documents per parallel task, with documents of fewer than 10 fields.

Test cluster is 3 nodes with the following config:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
Stepping: 2
CPU MHz: 2400.046
BogoMIPS: 4800.09
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0

Index configuration is: 2 shards, 2 replicas

I would like to pick someone's brain on the matter to help me figure out the best configuration for the cluster, the number of nodes (horizontal scaling only), etc.

Things to take into account:
Data will be stored in the cluster for only a few days, so what matters is indexing performance and optimization; after that the indices will be destroyed and a new indexing job may start.

Questions:
Do I need an extra load balancer?
Has anyone used parallel or concurrent indexing, and how many workers did they run at the same time without destabilizing the cluster?
Is it better to have multiple clusters and/or an extra load balancer, or a tribe node?

Thanks.

What do you want to achieve here exactly? Because you are nowhere near PB scale, and it's a big jump to get there from 3 nodes.

Are you using bulk? What version of ES? What OS? What JVM? Why 2 shards + 2 replicas? What sort of document is it?

CentOS 7, 2 shards, 2 replicas. It's a document with up to 10 fields, basically a parsed log; 2 of the fields are analyzed.
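
For illustration, the mapping is roughly along these lines (the field names here are hypothetical, and I'm using ES 5.x syntax where analyzed fields are text and the rest are exact-value types):

import requests

# Hypothetical mapping for the parsed-log document: two analyzed (text)
# fields, the rest left as exact-value types. Field names are made up.
mapping = {
    "mappings": {
        "logline": {
            "properties": {
                "timestamp":  {"type": "date"},
                "host":       {"type": "keyword"},
                "level":      {"type": "keyword"},
                "client_ip":  {"type": "ip"},
                "message":    {"type": "text"},       # analyzed field 1
                "user_agent": {"type": "text"},       # analyzed field 2
            }
        }
    }
}

# Create the test index with this mapping on one of the cluster nodes.
resp = requests.put("http://localhost:9200/logs-test", json=mapping)
resp.raise_for_status()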

What I'm doing is testing with a few nodes doing bulk inserts from about 6 threads at a time. The bulks are about 500 documents each, which is not big. I have tried 10,000 documents using a single thread and it works perfectly, but the Elasticsearch cluster starts responding with timeouts after a few bulks when multithreaded.

I need to know if someone has experience doing bulks from multiple threads: what the ideal bulk size is when doing so, what issues came from it, and what cluster size they ended up using.
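
For reference, the shape of what I'm doing looks roughly like this (my real client is in Erlang; this Python sketch with requests and a thread pool is only meant to illustrate, and make_doc and the index name are made up):

import json
from concurrent.futures import ThreadPoolExecutor

import requests

BULK_URL = "http://localhost:9200/_bulk"
BULK_SIZE = 500      # documents per bulk request
NUM_WORKERS = 6      # concurrent indexing threads

def make_doc(i):
    # Stand-in for the real parsed-log document (~10 fields, trimmed here).
    return {"timestamp": "2017-01-01T00:00:00Z",
            "host": "web-%d" % (i % 10),
            "level": "INFO",
            "message": "request %d completed" % i}

def bulk_index(batch):
    # The _bulk API takes NDJSON: an action line, then the document source.
    lines = []
    for doc in batch:
        lines.append(json.dumps({"index": {"_index": "logs-test", "_type": "logline"}}))
        lines.append(json.dumps(doc))
    resp = requests.post(BULK_URL,
                         data="\n".join(lines) + "\n",
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()
    return resp.json()["errors"]

# 60,000 documents split into bulks of BULK_SIZE, indexed from NUM_WORKERS threads.
batches = [[make_doc(i) for i in range(start, start + BULK_SIZE)]
           for start in range(0, 60000, BULK_SIZE)]

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
    for had_errors in pool.map(bulk_index, batches):
        if had_errors:
            print("a bulk response reported item-level errors")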

Could you try setting the index refresh interval to -1 before you start the bulk upload, and once you are done with the bulk upload set it back to the value best suited to your scenario? Here is the link for index_refresh_interval

Also, if feasible, turn off replicas before the bulk upload and turn them back on after finishing.
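
Something along these lines, for example (a sketch against the update-settings REST API; the index name and the restored values are placeholders for whatever fits your scenario):

import requests

SETTINGS_URL = "http://localhost:9200/logs-test/_settings"

# Before the bulk load: disable refresh and drop replicas so each document
# is indexed only once and no refreshes happen mid-load.
requests.put(SETTINGS_URL,
             json={"index": {"refresh_interval": "-1",
                             "number_of_replicas": 0}})

# ... run the bulk upload here ...

# After the load: restore a refresh interval that suits searching and bring
# the replicas back so they are rebuilt in one pass.
requests.put(SETTINGS_URL,
             json={"index": {"refresh_interval": "30s",
                             "number_of_replicas": 2}})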

That's probably too small!

The refresh interval setting is interesting, thanks.

Now I need to do some calculations based on the real data I currently have, starting with 4TB a day. I need to reach a decent indexing rate per second using r4.large instances in AWS (https://aws.amazon.com/es/ec2/instance-types/). That said, how many nodes would you recommend for the cluster? And is it wise to define a dedicated master node and make the rest data nodes? Please take into account that these instances use elastic (EBS) SSD storage, so there is no practical limit on disk size.
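
As a back-of-the-envelope check on what 4TB a day means as a sustained rate (my own arithmetic, ignoring replicas and indexing overhead, so the real write load will be higher):

# 4TB of raw data per day expressed as a sustained ingest rate.
tb_per_day = 4
bytes_per_day = tb_per_day * 1024**4
seconds_per_day = 24 * 60 * 60

mb_per_second = bytes_per_day / float(seconds_per_day * 1024**2)
print("sustained ingest: %.1f MB/s" % mb_per_second)   # ~48.5 MB/s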

Thanks for your help, I appreciate the effort.

So far I've tested successfully with the following configuration:

index.refresh_interval: 60s
index.store.type: mmapfs
indices.memory.index_buffer_size: 30%
index.translog.flush_threshold_ops: 50000

Set index to: 2 shards, 0 replicas.
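
For reference, the index-level part of that is applied roughly like this at index creation time (indices.memory.index_buffer_size is a node-level setting, so it stays in elasticsearch.yml; whether index.translog.flush_threshold_ops is accepted depends on the ES version; the index name is a placeholder):

import requests

# Index-level settings from above, applied when creating the test index.
settings = {
    "settings": {
        "index": {
            "number_of_shards": 2,
            "number_of_replicas": 0,
            "refresh_interval": "60s",
            "store": {"type": "mmapfs"},
            "translog": {"flush_threshold_ops": 50000},
        }
    }
}

resp = requests.put("http://localhost:9200/logs-test", json=settings)
resp.raise_for_status()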

Bulk size = 3000
Thread pool = 10
Number of nodes = 3
Cores = 2, RAM = 16GB

Result = 30 MB/s

If you are aiming for large scale, you should really test using larger instances. We generally recommend assigning 50% of memory to heap and keeping this below 32GB in size, which makes r4.2xlarge instances a good fit. You generally want to get to this size before starting to scale out beyond the first 3 nodes. Also install X-Pack monitoring so you can see what the cluster is up to and what is limiting performance.
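
For example, on an r4.2xlarge (61GB of RAM) that 50%-but-under-32GB guideline would look something like this in jvm.options (an illustration, not a tuned value):

# jvm.options on an r4.2xlarge (61GB RAM): roughly half the memory as heap,
# kept under 32GB so compressed object pointers stay enabled; min equals max.
-Xms30g
-Xmx30g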

Thanks for your comment, I will try scaling up the instances to see what comes of it.
