Good day
I have an Elasticsearch index with 50 million records in it. This is working as expected.
If I add aggregations to my query, it takes quite a long time to get the results. Any advice?
Thanks in advance.
Kind regards.
Greetings,
Try narrowing the date range so you don't need to aggregate that many records.
Hi David,
Thanks.
It is slow.
With aggregations:
Number of records in the result set: 7982
Time taken: 35 seconds
Without aggregations:
The same data is returned in 10 seconds.
It is not only slow for the first request.
I am using version 7.13.3.
Hardware is top notch (if I don't use aggregations, it is quite quick).
My request is a POCO built from the parameters below:
productform: "",
keyword: "",
contributor: "",
cop: "",
language: "",
publicationStatus: "",
imprint: "",
publisher: "",
wholesalers: "",
salesRights: "",
subject: "",
identifier: "",
populateAggregations: true,
pubDateFrom: "2010-09-10",
targetAudienceCode: "",
pubDateTo: "2022-09-10",
isAutoCompleteSearch: false
10 seconds is super slow.
Please share the request sent to Elasticsearch.
You might be running into this bug in 7.13.x, which was fixed in 7.14.0: "Slow StringTermsAggregatorFromFilters", Issue #76104 in elastic/elasticsearch (github.com).
You can check whether this is the case by applying the workaround given in the issue, which is to set the following cluster setting:
"search.aggs.rewrite_to_filter_by_filter": false
I ran into a fairly similar issue not too long ago with a somewhat simple agg, and this turned out to be the issue.
But as mentioned previously, being able to see the query and agg you're actually running would be helpful here.
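For anyone who wants to try the workaround from a script rather than from Kibana, here is a minimal sketch that builds the corresponding `PUT /_cluster/settings` request. The URL `http://localhost:9200`, the choice of `persistent` over `transient`, and the helper name `build_workaround_request` are assumptions for illustration; a cluster with security enabled will also need credentials.

```python
import json
import urllib.request

# Assumption: a node reachable at this address; adjust and add auth as needed.
ES_URL = "http://localhost:9200"


def build_workaround_request(enabled: bool = False) -> urllib.request.Request:
    """Build (but do not send) the cluster settings update from the issue."""
    body = {
        # Assumption: "persistent" so the setting survives a restart.
        "persistent": {"search.aggs.rewrite_to_filter_by_filter": enabled}
    }
    return urllib.request.Request(
        url=f"{ES_URL}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_workaround_request()
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually apply it: urllib.request.urlopen(request)
```

If the aggregation time does not change after applying the setting, the bug is probably not the cause and the setting can be reverted.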
Hi David,
Below is the data I am sending to ES from my GQL.
Do you want something else?
{gqlbooks(title:"", isbn:"", productform:"", keyword:"", contributor:"", cop:"", language:"", publicationStatus:"", imprint:"", publisher:"", wholesalers:"", salesRights:"", subject:"", identifier:"0123b1e0-b723-470b-9143-8a2a74edcfb2", populateAggregations:true, pubDateFrom:"1996-10-14", targetAudienceCode:"", pubDateTo:"2022-10-14",isAutoCompleteSearch:false) { resultCount publicationdate publisher author isbn13 title productform languagetext audiences author noofpages publicationstatus productclassifiers productclassifiercodes subtitle wholesalers markets publicationstatus cop bookName imprint imageurl rpgList bucketDTO distributors}}
I'd like to see the HTTP Request which is sent to Elasticsearch.
I cannot guess from that how it is then translated to the query DSL.
Hi David,
Apologies for the delay. Please find the HTTP request below:
GET /idx-myelasticindex/_search
{
  "size": 0,
  "query": {
    "match": {
      "bookName": "Prooi"
    }
  },
  "aggs": {
    "Terms_Aggregation": {
      "terms": {
        "field": "cop.keyword"
      }
    },
    "Author_Aggregation": {
      "terms": {
        "field": "author.keyword"
      }
    },
    "Format_Aggregation": {
      "terms": {
        "field": "productform.keyword"
      }
    },
    "Status_Aggregation": {
      "terms": {
        "field": "publicationstatus.keyword"
      }
    },
    "Readership_Aggregation": {
      "terms": {
        "field": "audiences.keyword"
      }
    }
  }
}
And what is the full response from Elasticsearch?
Please share it in both cases. One with the aggs and one without any aggregation, but same query.
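One way to make sure both runs really use the same query is to derive the second request body from the first. The sketch below assumes the request body posted earlier in the thread; the helper name `without_aggs` is hypothetical.

```python
import copy
import json

# Request body as posted in the thread.
search_with_aggs = {
    "size": 0,
    "query": {"match": {"bookName": "Prooi"}},
    "aggs": {
        "Terms_Aggregation": {"terms": {"field": "cop.keyword"}},
        "Author_Aggregation": {"terms": {"field": "author.keyword"}},
        "Format_Aggregation": {"terms": {"field": "productform.keyword"}},
        "Status_Aggregation": {"terms": {"field": "publicationstatus.keyword"}},
        "Readership_Aggregation": {"terms": {"field": "audiences.keyword"}},
    },
}


def without_aggs(body: dict) -> dict:
    """Return a copy of the search body with the aggregations removed."""
    stripped = copy.deepcopy(body)
    stripped.pop("aggs", None)
    stripped.pop("aggregations", None)  # the long form of the same key
    return stripped


search_without_aggs = without_aggs(search_with_aggs)
print(json.dumps(search_without_aggs, indent=2))
```

Sending both bodies to the same `_search` endpoint and comparing the `took` values isolates the cost of the aggregations.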
Hi David
The response is quite a large object and exceeds the size allowed here, so I have uploaded it to the location below. The first response is with aggregations and the second one is without:
https://www.kwiksnoop.com/documents/elastic.json
Thanks in advance.
So. A first look at this gives me:
With aggregations:
{
"took" : 15311,
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
}
...
}
Without aggregations:
"took" : 21,
First of all, we can see that without aggs, the time is only 21ms. Not 10s.
Then, the time spent on the aggregation is 15s for only 6 documents. Which does not make sense at all.
Could you run the same agg again and again, and give the output (only the first lines until hits is enough) after some runs?
Is it still slow?
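The repeated runs can be scripted. This is a rough sketch only: the URL, the index name in it, and the helper names `send_search` and `collect_took` are assumptions, and authentication is left out.

```python
import json
import urllib.request
from typing import Callable, Dict, List

# Assumption: local node and the index name used in the request above.
SEARCH_URL = "http://localhost:9200/idx-myelasticindex/_search"


def send_search(body: Dict) -> Dict:
    """POST a search body to Elasticsearch and return the parsed response."""
    request = urllib.request.Request(
        url=SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect_took(send: Callable[[Dict], Dict], body: Dict, runs: int = 5) -> List[int]:
    """Run the same search several times and collect the reported 'took' (ms)."""
    return [send(body)["took"] for _ in range(runs)]


# Usage against a live cluster (not executed here):
#   print(collect_took(send_search, {"size": 0, "query": {"match_all": {}}}))
```

If `took` stays in the same range across runs, caching is not what makes the difference.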
You said "Hardware is top notch".
What kind of hardware do you have?
Hi David,
Thank you
"First of all, we can see that without aggs, the time is only 21ms. Not 10s.
Then, the time spent on the aggregation is 15s for only 6 documents. Which does not make sense at all."
I am not too worried about the response time without aggregations; the 10 s figure was for a result set of almost 8,000 records. With aggregations, the response time is huge for large data sets.
"Could you run the same agg again and again, and give the output (only the first lines until hits is enough) after some runs?
Is it still slow?"
There is very little improvement. After running it 3 times, the response time is 13.5 seconds.
"took" : 13597,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 6,
"relation" : "eq"
},
Below is the server configuration:
RAM : 64 GB
64-bit OS
16 Core Processor
2.80 GHz
Windows 10
Could you add "profile": true when you run with the aggs?
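As a sketch of what that looks like, the snippet below adds the flag to a search body and summarizes the aggregation timings from a profiled response. The helper names are hypothetical, and the response layout read here (`profile.shards[].aggregations[]` with `type`, `description` and `time_in_nanos`) is my understanding of the profile output, so check it against the actual response.

```python
import json
from typing import Dict, List, Tuple


def with_profile(body: Dict) -> Dict:
    """Return a copy of the search body with profiling enabled."""
    return {**body, "profile": True}


def aggregation_timings(response: Dict) -> List[Tuple[str, str, float]]:
    """List (aggregator type, description, milliseconds) per shard, slowest first."""
    rows = []
    for shard in response.get("profile", {}).get("shards", []):
        for agg in shard.get("aggregations", []):
            millis = agg["time_in_nanos"] / 1_000_000
            rows.append((agg["type"], agg["description"], millis))
    return sorted(rows, key=lambda row: row[2], reverse=True)


profiled_body = with_profile({
    "size": 0,
    "query": {"match": {"bookName": "Prooi"}},
    "aggs": {"Author_Aggregation": {"terms": {"field": "author.keyword"}}},
})
print(json.dumps(profiled_body, indent=2))
```

The aggregator type reported in the profile is useful here because the bug mentioned earlier is tied to one specific aggregator implementation.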
What is the output of:
GET /
GET /_cat/nodes?v
GET /_cat/health?v
GET /_cat/indices?v
If some outputs are too big, please share them on gist.github.com and link them here.
Please don't post images of text as they are hard to read, may not display correctly for everyone, and are not searchable.
Instead, paste the text and format it with the </> icon or pairs of triple backticks (```), and check the preview window to make sure it's properly formatted before posting it. This makes it more likely that your question will receive a useful answer.
It would be great if you could update your post to solve this.
Also add the first query I asked for:
GET /
Thanks
Hi David
Please see below. You may need to copy and paste the stats into a text editor to read them properly.
I will send the output of GET / in my next message.
Result for GET /_cat/nodes?v:
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1 44 85 65 cdfhilmrstw * <<server name>>
GET /_cat/health?v
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1635152186 08:56:26 elasticsearch yellow 1 1 18 18 0 0 2 0 - 90.0%
GET /_cat/indices?v
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open .kibana_7.13.3_001 zTV0ZBDYR7OHohvEaxo_ng 1 0 345 39 4.9mb 4.9mb
green open .apm-agent-configuration kQqZGYSQHaQN16hLlHQsQ 1 0 0 0 208b 208b
green open .tasks 2eEdiJmdTgmOEtKudXhHAQ 1 0 136 0 81kb 81kb
green open .kibana_task_manager_7.13.3_001 4UjZ7VU7RQ2qP9YeFXHKGw 1 0 11 1005 4.2mb 4.2mb
green open .security-7 TBOvtIRWRCipq97T1Oew3Q 1 0 55 0 268.4kb 268.4kb
green open .apm-custom-link 6IkqyS-sTICtQt44VtI1Iw 1 0 0 0 208b 208b
green open .kibana-event-log-7.13.3-000001 MoXe_5IARTqYnLcanYYgrg 1 0 85 0 31kb 31kb
green open .kibana-event-log-7.13.3-000003 JJD5JlrPQfiMnqP_s9Sf2Q 1 0 5 0 27.1kb 27.1kb
green open .kibana-event-log-7.13.3-000002 N5M6KzlHQkOQO1zJRJRd5g 1 0 25 0 36.6kb 36.6kb
green open .kibana-event-log-7.13.3-000004 jeYUrCRNQLW4_wgzk5npkg 1 0 13 0 18.6kb 18.6kb
yellow open idx-autherindex FK7YBB16SIOzMz_TxrvP_w 1 1 3 0 70.6kb 70.6kb
green open .async-search Ev7YbBA1SFWlwCAvM1Ij9w 1 0 0 0 6.7kb 6.7kb
yellow open idx-discoveryproduct Gw5EN2OSSkCw0kEhiTeKMg 1 1 55990270 24907862 96.2gb 96.2gb
Hi David,
Please find below as requested.
{
"name" : "my-server-name",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "uTFhdzzdSDuSo7RB9NUg1g",
"version" : {
"number" : "7.13.3",
"build_flavor" : "default",
"build_type" : "zip",
"build_hash" : "5d21bea28db1e89ecc1f66311ebdec9dc3aa7d64",
"build_date" : "2021-07-02T12:06:10.804015202Z",
"build_snapshot" : false,
"lucene_version" : "8.8.2",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
I'm guessing that you are searching in idx-discoveryproduct, right?
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open idx-discoveryproduct 1 1 55990270 24907862 96.2gb 96.2gb
So you have a single 96 GB shard. That's too much IMO.
You should have at least 2 shards, and maybe more.
I'd try to split this index into 5 shards (roughly 20 GB each) to see if that improves things.
You can try the Split API.
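A rough sketch of the two requests involved in a split is shown below. The target index name `idx-discoveryproduct-split` is hypothetical, and the target of 5 shards is simply the number suggested above; whether Elasticsearch accepts that count for this index depends on its routing-shard settings, so check the Split API documentation before running anything. The source index has to be made read-only first, which means writes are blocked until you switch over to the new index.

```python
import json

SOURCE_INDEX = "idx-discoveryproduct"
TARGET_INDEX = "idx-discoveryproduct-split"  # hypothetical name
TARGET_SHARDS = 5                            # value suggested in the thread

# Each step is (HTTP method, path, JSON body).
split_steps = [
    # 1. Block writes on the source index; the split requires a read-only source.
    ("PUT", f"/{SOURCE_INDEX}/_settings",
     {"settings": {"index.blocks.write": True}}),
    # 2. Split the source into a new index with more primary shards.
    ("POST", f"/{SOURCE_INDEX}/_split/{TARGET_INDEX}",
     {"settings": {"index.number_of_shards": TARGET_SHARDS}}),
]

for method, path, body in split_steps:
    print(method, path)
    print(json.dumps(body, indent=2))

# Expected size per shard if the 96.2 GB were spread evenly.
approx_gb_per_shard = 96.2 / TARGET_SHARDS
print(f"about {approx_gb_per_shard:.1f} GB per shard")
```

The split creates a new index next to the original, so the application (or an alias) has to be pointed at the new name afterwards.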
Also I can see that you have a lot of deletes. Are you doing a lot of updates?
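To put the delete count in perspective, the share of deleted documents can be worked out from the two numbers in the `_cat/indices` line above. This is plain arithmetic on the posted figures.

```python
# Figures from the _cat/indices output for idx-discoveryproduct.
docs_count = 55_990_270
docs_deleted = 24_907_862

# Deleted documents still occupy space in the segments until they are merged away.
deleted_share = docs_deleted / (docs_count + docs_deleted)
print(f"{deleted_share:.1%} of the documents in the index are marked as deleted")
```

That works out to a little under a third of everything stored in the shard.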
Yes, idx-discoveryproduct is the index I am working on, and we have lots of updates.
It is already in production. What is the impact if I split it into multiple shards now?
Thanks