Next Steps to Improve Performance on a Large (500GB+) Index

Hi all,

I've hit a performance wall scaling up operations on a large index and would appreciate any opinions and/or insight this community may offer.

## Summary

Our all_the_data index is 500GB (primaries) with 128M documents. It represents 2.5 years' worth of data we need to search and aggregate. Data grows at ~200k documents per day. Simple match queries are very slow (benchmarked with 10k sample queries via JMeter):

  • 5s average
  • 21s worst-case

We have numerous 1GB–20GB indices on the same cluster and they all perform well enough. I created a smaller sample_data index with 2M documents (~11GB primaries), representing one month of data. The same match queries have acceptable average performance (though not great worst-case):

  • 0.216s average
  • 7s worst-case

Have we simply overtaxed the cluster with this massive index? What would be the next logical step?

  • Scale-out / parallelization: add more commodity machines and try to fit more of the index into memory
  • Scale-up: Replace spinny disks with SSDs
  • Other options?
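Before committing to either, it may be worth confirming where the time actually goes under load. A hedged sketch using the standard nodes hot-threads API (available in ES 2.x); run it while the JMeter suite is active:

```
# Snapshot the busiest threads on each node during the benchmark.
# Search threads blocked reading Lucene segments point at disk /
# filesystem-cache pressure; high CPU in scoring or highlighting
# points at the queries themselves.
GET /_nodes/hot_threads?threads=5
```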

Thanks in advance. Details about the cluster, indices, and benchmarking follow below.


## Hardware / Cluster

  • Elasticsearch 2.1.1; running in Docker containers on Debian hosts on a local OpenStack cloud
    • 8 Data Nodes: 64GB RAM / 8 VCPU / 1TB (HDD, not SSD)
    • 2 Search/Client Nodes: 8GB RAM / 4 VCPU / 10GB
    • 3 Master Nodes: 8GB RAM / 4 VCPU / 10GB
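One thing worth double-checking that isn't shown above is the JVM heap on the data nodes. Common guidance for ES 2.x is to give the heap about half the machine's RAM but keep it under ~32GB so compressed object pointers stay enabled; the remainder is left to the OS filesystem cache, which Lucene leans on heavily. A minimal sketch, assuming heap is set via the standard `ES_HEAP_SIZE` environment variable (the value here is illustrative):

```
# ES 2.x sizes its heap from ES_HEAP_SIZE, e.g. in the container's
# environment. ~31g keeps compressed oops enabled on a 64GB host
# and leaves the other half of RAM for the filesystem cache.
ES_HEAP_SIZE=31g
```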

## Big Index

  • approx. 2.5 years of data
  • 128.8m documents (73.9m text documents / 54.9m metadata docs)

_cat/indices

health status index          pri rep docs.count    docs.deleted   store.size     pri.store.size 
green  open   all_the_data   8   1   202736781     28374432       992.4gb        496.1gb 

_cat/shards (primaries)

index        shard prirep state       docs  store ip             node       
all_the_data 6     p      STARTED 25342031 61.5gb 10.xxx.xxx.xxx dev-data-5 
all_the_data 2     p      STARTED 25330942 61.7gb 10.xxx.xxx.xxx dev-data-2 
all_the_data 4     p      STARTED 25345800 63.7gb 10.xxx.xxx.xxx dev-data-3 
all_the_data 3     p      STARTED 25344074 61.1gb 10.xxx.xxx.xxx dev-data-5 
all_the_data 1     p      STARTED 25336429 62.3gb 10.xxx.xxx.xxx dev-data-1 
all_the_data 7     p      STARTED 25344437 60.9gb 10.xxx.xxx.xxx dev-data-8 
all_the_data 5     p      STARTED 25347993 61.9gb 10.xxx.xxx.xxx dev-data-3 
all_the_data 0     p      STARTED 25345075 62.6gb 10.xxx.xxx.xxx dev-data-4 

JMeter Suite

Query                                                        samples   avg(ms)  min(ms) max(ms)   throughput
Match Query, Multiple Fields (Bool), Highlight (Not FVH)     10000     5647     21     16456     5.9/sec
Match Query, Multiple Fields (Bool), No Highlight            10000     5653     19     23918     5.9/sec
Match Query, Single Field, No Highlight                      10000     5231     15     21403     5.9/sec

## Small Index

  • 1 Month of Data
  • 2.02m documents (1.76m text docs / 262k metadata docs)

_cat/indices

health status index         pri rep  docs.count  docs.deleted  store.size   pri.store.size 
green  open   sample_data   8   1    3783481     45397         22.8gb       11.2gb 

_cat/shards (primaries)

index       shard prirep state     docs store ip             node       
sample_data 1     p      STARTED 473064 1.3gb 10.xxx.xxx.xxx dev-data-1 
sample_data 6     p      STARTED 472793 1.4gb 10.xxx.xxx.xxx dev-data-4 
sample_data 2     p      STARTED 473384 1.3gb 10.xxx.xxx.xxx dev-data-7 
sample_data 3     p      STARTED 473627 1.3gb 10.xxx.xxx.xxx dev-data-6 
sample_data 7     p      STARTED 473234 1.5gb 10.xxx.xxx.xxx dev-data-3 
sample_data 4     p      STARTED 473058 1.3gb 10.xxx.xxx.xxx dev-data-2 
sample_data 5     p      STARTED 473132 1.4gb 10.xxx.xxx.xxx dev-data-5 
sample_data 0     p      STARTED 471189 1.3gb 10.xxx.xxx.xxx dev-data-8 

JMeter Suite

Query                                                        samples   avg(ms)  min(ms) max(ms)   throughput
Match Query, Multiple Fields (Bool), Highlight (Not FVH)     10000     225      4      13157     111.5/sec
Match Query, Multiple Fields (Bool), No Highlight            10000     223      4      6989      111.5/sec
Match Query, Single Field, No Highlight                      10000     200      3      7137      111.5/sec

## iostat

While running the benchmarks, iostat on one of the hosts shows low %iowait (1.07%) and reasonable (for a spinny disk) tps (92.3).

You should really be splitting this into time-based indices. As your tests show, you will get better performance.
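For what it's worth, a minimal sketch of what that layout could look like (the index names, template pattern, and alias are illustrative, not from the post). ES 2.x index templates let each new monthly index pick up its settings and an alias automatically, so the full history stays searchable while most queries touch only recent, small indices:

```
PUT /_template/all_the_data_monthly
{
  "template": "all_the_data-*",
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "aliases": {
    "all_the_data_search": {}
  }
}

# Queries scoped to recent data hit only the matching indices:
GET /all_the_data-2016.02,all_the_data-2016.03/_search

# ...while the alias still spans the whole 2.5-year history:
GET /all_the_data_search/_search
```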