I'm currently at the research stage, so my test configuration is not suitable for big-data indexing. However, I'm trying to find a way to extrapolate measurements taken with a comfortable amount of data: see what happens during indexing, how the cluster behaves, etc.
I tested a solution I built myself in Erlang that concurrently pushes data into an Elasticsearch cluster, and I start getting timeouts from the cluster when sending bulks of even fewer than 500 documents per parallel task, with documents of fewer than 10 fields.
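For context, here is a minimal sketch of how each worker's payload is assembled (in Python rather than Erlang, and with a made-up index name and field layout; the real documents just have fewer than 10 fields):

```python
import json

def bulk_payload(docs, index, doc_type="doc"):
    """Build an Elasticsearch _bulk request body (NDJSON):
    one action line plus one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline

def chunked(docs, size=500):
    """Split documents into bulks of at most `size`
    (500 is roughly where I start seeing timeouts)."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

# Hypothetical documents with 10 small fields each.
docs = [{"field%d" % f: f for f in range(10)} for _ in range(1200)]
payloads = [bulk_payload(chunk, "test-index") for chunk in chunked(docs)]
# 1200 docs in bulks of 500 -> 3 payloads (500, 500, 200 docs)
```

Each payload is then POSTed to a node's `/_bulk` endpoint by one parallel task.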
The test cluster has 3 nodes, each with the following configuration:
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model name: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
CPU MHz: 2400.046
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0
The index configuration is 2 shards and 2 replicas.
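Concretely, the index is created with settings equivalent to the first dict below (index name and the second variant are illustrative assumptions). Since each replica multiplies the indexing work, a common tweak for a load-then-discard workload like mine is to index with `number_of_replicas` at 0 and a relaxed `refresh_interval`, then raise them afterwards if needed:

```python
# Settings equivalent to my current index: 2 primary shards, 2 replicas.
current_settings = {
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 2,
    }
}

# Assumed alternative for a bulk-load-then-destroy job: no replicas
# during indexing, and a 30s refresh instead of the default 1s.
bulk_load_settings = {
    "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 0,
        "refresh_interval": "30s",
    }
}
```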
I would like to draw on someone's experience in the matter to help me figure out the best configuration for the cluster: number of nodes (horizontal scaling only), etc.
Things to take into account:
Data will be stored in the cluster for only a few days, so what matters is indexing performance and optimization; after that, the indices will be destroyed and a new indexing job may be started.
Do I need an extra load balancer?
Has anyone used parallel or concurrent indexing, and how many workers can run at the same time without overwhelming the cluster?
Is it better to have multiple clusters and/or an extra load balancer, or a tribe node?
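To make the concurrency question concrete, this is the worker model I'm describing, sketched in Python with a stubbed send function (`send_bulk` is a stand-in for the HTTP POST to `/_bulk` that the real Erlang tasks perform):

```python
from concurrent.futures import ThreadPoolExecutor

def send_bulk(payload):
    # Stand-in for POSTing `payload` to an Elasticsearch node's
    # /_bulk endpoint; here it just reports the bulk's document count.
    return payload["docs"]

def index_concurrently(bulks, max_workers=4):
    """Push bulks through a bounded pool; max_workers is exactly the
    knob I'm asking about (how many parallel tasks before timeouts)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send_bulk, bulks))

bulks = [{"docs": 500} for _ in range(8)]
results = index_concurrently(bulks, max_workers=4)
# 8 bulks of 500 docs each -> 4000 docs total
```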