Scroll in ElasticSearch Aggregation

Vinothkumar_Ganeshan · November 28, 2019, 4:12pm

I have an index with around 50 million data points, where each document has an ID.

I need each unique ID and its count. As there are more than 10,000 of them, I used scrolling, but unexpectedly the scroll returns the same scroll ID on every iteration.

> data = es.search(index="ttd-conversions-2019-05*", scroll='1m', body= bodys) 
> sid = data['_scroll_id']
> scroll_size = len(data['hits']['hits'])
> count = list()
> tdid = list()
> while(scroll_size > 0):
>     log_time = list()
>     tdid = list()
>     print('Scroll id', sid)
>     page = es.scroll(scroll_id= sid, scroll = '1m')
>     sid = page['_scroll_id']
> 
>     count = list()
>     tdid = list()
>     for i in data['aggregations']['2']['buckets']:
>         count.append(i['doc_count'])
>         tdid.append(i['key'])
> 
>     scroll_size = len(page['hits']['hits'])
>     with open(save_path + "/out.csv", "a", newline="") as f:
>         writer = csv.writer(f)
>         writer.writerows(zip(count, tdid))

Please let me know what is wrong. The same code works fine for a search with scroll, but with an aggregation the same scroll ID is repeated.

Thanks for your help.

Mark_Harwood · November 28, 2019, 4:49pm

The scroll api is used to scroll through documents, not aggregations.

To look at options for large numbers of unique terms, try running this wizard to pick the right approach.

Vinothkumar_Ganeshan · November 28, 2019, 4:51pm

Hi, thanks for your response. I understand that the scroll ID is for scrolling through documents, so I'm expecting the scroll_id to change on the next iterations, but it doesn't, and I always get the first 10 results. I'm confused by it.

Mark_Harwood · November 28, 2019, 5:13pm

Aggregations summarize the entire result set, not the current page of documents.
If you want to page through aggregation results you need to see my previous advice.

Vinothkumar_Ganeshan · November 29, 2019, 10:23am

Thank you. I followed the wizard, which suggested the composite aggregation:

GET ttd-conversions-2019-05*/_search?scroll=1m
{
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          { "TDID": { "terms": { "field": "TDID" } } }
        ]
      }
    }
  }
}

Well, it returns the TDIDs with counts only for the first scroll request. When I make the next request with the scroll_id, it doesn't return the TDIDs with counts but just lists the same documents. I'm a bit confused. Sorry if I misunderstood something. Thanks for your support, it means a lot.

Mark_Harwood · November 29, 2019, 10:25am

Don't use scroll.
Use a regular search with the composite agg, then another search using a composite agg with the after parameter set to the after_key returned in the previous result. Repeat as necessary.
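
A minimal sketch of that loop in Python, in case it helps. The aggregation name my_buckets, the field TDID and the index pattern come from the posts above. The helper name iter_composite_buckets, the page_size default, and passing the search callable in as an argument are assumptions made for illustration. It also assumes a client whose search accepts index= and body= keywords, as in the code earlier in the thread.

```python
def iter_composite_buckets(search, index, field, page_size=1000):
    """Yield (key, doc_count) for every unique value of `field`.

    `search` is any callable with the keyword signature
    search(index=..., body=...), for example `es.search`. Passing it in
    is an assumption made here so the loop can be tried without a cluster.
    """
    after = None
    while True:
        composite = {
            "size": page_size,  # number of buckets per request
            "sources": [{field: {"terms": {"field": field}}}],
        }
        if after is not None:
            # Resume from the last bucket of the previous response.
            composite["after"] = after

        # "size": 0 skips the document hits; only the buckets are needed.
        body = {"size": 0, "aggs": {"my_buckets": {"composite": composite}}}
        result = search(index=index, body=body)

        agg = result["aggregations"]["my_buckets"]
        buckets = agg["buckets"]
        if not buckets:
            break  # an empty page means every bucket has been returned

        for bucket in buckets:
            yield bucket["key"][field], bucket["doc_count"]

        # The response carries `after_key`; it goes into the next request
        # as `after`.
        after = agg.get("after_key")
        if after is None:
            break
```

Something like csv.writer(f).writerows(iter_composite_buckets(es.search, "ttd-conversions-2019-05*", "TDID")) should then write one row per unique TDID. No scroll parameter is involved at any point.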

Vinothkumar_Ganeshan · November 29, 2019, 3:53pm

Thanks for your time and effort. It would be helpful if you could share your insights on the following problem: Count in other index based on Current Index field
