I'm fairly new to Elasticsearch, and we're looking to move our analytics from SQL to Elasticsearch.
I'm using the AWS Elasticsearch Service, and here are some stats from the test data I indexed:
- Indices: time-based (YYYY-MM-DD), like Logstash, but the data is not static
- Total documents: ~500 million
- Largest index: ~32 GB (we expect our production indices to all be roughly this size)
- Documents in the largest index: ~10 million
- Cluster: 3 dedicated master nodes, 7 × r4.8xlarge data nodes
- Shards: 5 per index
- Replicas: 2 per index
- doc_values enabled on roughly 90% of the fields
The data is essentially financial data with a lot of numbers and some text fields (think invoices). For confidentiality reasons I can't share the mappings/structures, but our use case mainly boils down to several levels of nested aggregations with a sum/avg at the end, over a particular date range (so spanning multiple indices).
However, any query with more than 3-4 levels of nested aggregations (most of our requirements need around 4 levels) takes 30+ seconds to execute.
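For context, here is the rough shape of the queries I'm describing, built as a Python dict (all field names are made up, since I can't share the real mappings): four levels of terms aggregations with an avg at the bottom, over a date range, and `size: 0` because we only need the aggregation results.

```python
# Sketch of a 4-level nested aggregation with hypothetical field names
# (region -> customer -> product -> currency, then avg of amount).
query = {
    "size": 0,  # no hits needed, only aggregations
    "query": {
        "range": {"invoice_date": {"gte": "2017-01-01", "lte": "2017-03-31"}}
    },
    "aggs": {
        "by_region": {
            "terms": {"field": "region"},
            "aggs": {
                "by_customer": {
                    "terms": {"field": "customer_id"},
                    "aggs": {
                        "by_product": {
                            "terms": {"field": "product_id"},
                            "aggs": {
                                "by_currency": {
                                    "terms": {"field": "currency"},
                                    "aggs": {
                                        "avg_amount": {
                                            "avg": {"field": "amount"}
                                        }
                                    },
                                }
                            },
                        }
                    },
                }
            },
        }
    },
}
```

Each extra terms level multiplies the number of buckets the data nodes have to build and merge, which is where I suspect the time is going.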
I was mainly testing against a single index (the largest), and the queries are mostly searches with the nested aggregations described above.
Initially I wasn't using any filters, but applying multiple terms filters on the fields used in the terms aggregations showed a clear improvement (down to ~850 ms). Ideally, though, I'd rather not have to use these filters.
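The filtered variant that got me down to ~850 ms looks roughly like this (again with hypothetical field names and values, and trimmed to one aggregation level for brevity): terms filters in the query's `bool`/`filter` clause cut down the document set before the aggregations run, and filter context doesn't compute scores and is cacheable.

```python
# Same style of query, but with terms filters on the aggregated fields
# added in filter context (hypothetical field names and values).
filtered_query = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                # date range, as before
                {"range": {"invoice_date": {"gte": "2017-01-01",
                                            "lte": "2017-03-31"}}},
                # restrict to the bucket values we actually care about
                {"terms": {"region": ["emea", "apac"]}},
                {"terms": {"currency": ["USD", "EUR"]}},
            ]
        }
    },
    "aggs": {
        "by_region": {
            "terms": {"field": "region"},
            "aggs": {"avg_amount": {"avg": {"field": "amount"}}},
        }
    },
}
```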
My questions are:
- Is my bottleneck the number of shards per index? I had some success with more replicas, but response times are still in the 30 s range.
- Would it be better to filter first and then run the nested aggregations?
Thanks in advance and apologies for the vagueness.