Hi all,
I've hit a performance wall scaling up operations on a large index and would appreciate any opinions and/or insight this community may offer.
##Summary
Our all_the_data index is 500gb (primaries) with 128m documents. It represents 2.5 years' worth of data we need to search and aggregate, and it grows at ~200k documents per day. Simple match queries are very slow (benchmarked with 10k sample queries via JMeter):
- 5s average
- 21s worst-case
We have numerous 1gb - 20gb indices on the same cluster and they all perform well enough. I created a smaller sample_data index with 2m documents (~11gb primaries), representing one month of data. The same match queries have an acceptable average there (though the worst case is still not great):
- 0.216s average
- 7s worst-case
Have we simply overtaxed the cluster with this massive index? What would be the next logical step?
- Scale out / parallelize: add more commodity machines so more of the index fits in memory
- Scale up: replace the spinny disks with SSDs
- Other options?
Thanks in advance. Details about the cluster, index, and benchmarking follow below.
##Hardware / Cluster
- Elasticsearch 2.1.1; running in Docker containers on Debian hosts on a local OpenStack cloud
- 8 Data Nodes: 64GB RAM / 8 VCPU / 1TB (HD; Not SSD)
- 2 Search/Client Nodes: 8GB RAM / 4 VCPU / 10GB
- 3 Master Nodes: 8GB RAM / 4 VCPU / 10GB
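For anyone who wants to see how the roles and heap sizes line up with the above, something like the following _cat/nodes call shows it. This is just a sketch; the host/port is a placeholder, not our actual endpoint:

```
# Sketch: list node name, role (d=data, c=client, m candidate), heap and RAM
# The endpoint is a placeholder; adjust host/port for the real cluster
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,master,heap.max,ram.max'
```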
##Big Index
- approx. 2.5 years of data
- 128.8m documents (73.9m text documents / 54.9m metadata docs)
_cat/indices
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open all_the_data 8 1 202736781 28374432 992.4gb 496.1gb
_cat/shards (primaries)
index shard prirep state docs store ip node
all_the_data 6 p STARTED 25342031 61.5gb 10.xxx.xxx.xxx dev-data-5
all_the_data 2 p STARTED 25330942 61.7gb 10.xxx.xxx.xxx dev-data-2
all_the_data 4 p STARTED 25345800 63.7gb 10.xxx.xxx.xxx dev-data-3
all_the_data 3 p STARTED 25344074 61.1gb 10.xxx.xxx.xxx dev-data-5
all_the_data 1 p STARTED 25336429 62.3gb 10.xxx.xxx.xxx dev-data-1
all_the_data 7 p STARTED 25344437 60.9gb 10.xxx.xxx.xxx dev-data-8
all_the_data 5 p STARTED 25347993 61.9gb 10.xxx.xxx.xxx dev-data-3
all_the_data 0 p STARTED 25345075 62.6gb 10.xxx.xxx.xxx dev-data-4
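(The listing above is filtered to primaries; for reference, it can be reproduced with something like the sketch below. The host is a placeholder.)

```
# Sketch: show only primary shards of the big index (3rd column is prirep)
curl -s 'http://localhost:9200/_cat/shards/all_the_data?v' | awk 'NR==1 || $3 == "p"'
```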
JMeter Suite
| Query | samples | avg (ms) | min (ms) | max (ms) | throughput |
|---|---|---|---|---|---|
| Match Query, Multiple Fields (Bool), Highlight (Not FVH) | 10000 | 5647 | 21 | 16456 | 5.9/sec |
| Match Query, Multiple Fields (Bool), No Highlight | 10000 | 5653 | 19 | 23918 | 5.9/sec |
| Match Query, Single Field, No Highlight | 10000 | 5231 | 15 | 21403 | 5.9/sec |
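For reference, the "multiple fields (bool), highlight" case in the table looks roughly like the sketch below. The field names (title, body) and the search term are placeholders, not our real mapping:

```
# Sketch of the benchmarked query shape (field names and term are placeholders)
curl -s 'http://localhost:9200/all_the_data/_search?pretty' -d '{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "example term" } },
        { "match": { "body":  "example term" } }
      ]
    }
  },
  "highlight": {
    "fields": { "body": {} }
  }
}'
```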
##Small Index
- 1 Month of Data
- 2.02m documents (1.76m text docs / 262k metadata docs)
_cat/indices
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open sample_data 8 1 3783481 45397 22.8gb 11.2gb
_cat/shards (primaries)
index shard prirep state docs store ip node
sample_data 1 p STARTED 473064 1.3gb 10.xxx.xxx.xxx dev-data-1
sample_data 6 p STARTED 472793 1.4gb 10.xxx.xxx.xxx dev-data-4
sample_data 2 p STARTED 473384 1.3gb 10.xxx.xxx.xxx dev-data-7
sample_data 3 p STARTED 473627 1.3gb 10.xxx.xxx.xxx dev-data-6
sample_data 7 p STARTED 473234 1.5gb 10.xxx.xxx.xxx dev-data-3
sample_data 4 p STARTED 473058 1.3gb 10.xxx.xxx.xxx dev-data-2
sample_data 5 p STARTED 473132 1.4gb 10.xxx.xxx.xxx dev-data-5
sample_data 0 p STARTED 471189 1.3gb 10.xxx.xxx.xxx dev-data-8
JMeter Suite
| Query | samples | avg (ms) | min (ms) | max (ms) | throughput |
|---|---|---|---|---|---|
| Match Query, Multiple Fields (Bool), Highlight (Not FVH) | 10000 | 225 | 4 | 13157 | 111.5/sec |
| Match Query, Multiple Fields (Bool), No Highlight | 10000 | 223 | 4 | 6989 | 111.5/sec |
| Match Query, Single Field, No Highlight | 10000 | 200 | 3 | 7137 | 111.5/sec |
##iostat
While running the benchmarks, iostat on one of the hosts shows low %iowait (1.07%) and reasonable (for a spinny disk) tps (92.3).
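Those numbers came from something along the lines of the sketch below (the sampling interval is an assumption):

```
# Sketch: sample CPU stats (incl. %iowait) and per-device tps every 5 seconds
# Add -x for extended per-device stats (await, %util)
iostat 5
```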