How can I reindex the results of an aggregation into another index?

Vaishnavi_Nandakumar · July 16, 2021, 12:22pm

Hello,

I'm using the 7.7.0 version of Elasticsearch. I have an aggregation query that works on the ingested data to produce the output but I am not able to re-index it.

Data Context

The given dataset has around 500,000 records consisting of Incoming File ID, Outgoing File ID, combination of both ID’s and the time elapsed during the transaction of both ID’s over a year. There can be multiple instances where both the File ID’s combinations (IN_OUT_ID) are repeated on different days. The aggregation query I have groups these records based on IN_OUT_ID and takes the document with the smallest Outgoing ID.

I have used top hits aggregations to get the document source but when I tried reindexing with the query given below, all of the documents are being ingested into the destination index instead of just the filtered ones.

POST /_reindex
{
  "source": {
    "index": "src-new-data",
    "aggs": {
      "groupbyID": {
        "terms": {
          "field": "IN_OUT_ID",
          "order": {
            "lowest_score": "asc"
          }
        },
        "aggs": {
          "lowest_score": {
            "min": {
              "field": "OUT_FILE_ID"
            }
          },
          "lowest_score_top_hits": {
            "top_hits": {
              "size": 1,
              "sort": [
                {
                  "OUT_FILE_ID": {
                    "order": "asc"
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "dest": {
    "index": "dest-filter-data"
  }
}

I want to visualize a histogram of the elapsed time from the filtered documents. If no other approach can be used for reindexing, is there any other way I can create the visualization based on the data context given? I have tried using Vega but I am not able to access the elapsed time value from the top hits bucket aggregation.
Thank You

Hendrik_Muhs · July 16, 2021, 2:12pm

Have a look at transform.

However top hits is not supported in transform, but you can use a scripted_metric aggregation, similar to this example.

Vaishnavi_Nandakumar · July 17, 2021, 7:00pm

Hi Hendrik,
I checked out the link but I am not able to use _transform as I'm using an earlier version of Elasticsearch.
Is there any other approach I can use to either :
a. Reindex from the top hits aggregations
b. Use Vega to visualize top hits aggregations.

Hendrik_Muhs · July 19, 2021, 6:47am

Transform has been made GA(General Availability) in elasticsearch as of version 7.5, it's predecessor (beta program) has been available since 7.2.

Are you using the OSS distribution of elasticsearch? If not, in order to use transform you need to activate the basic license. This license comes free of charge.

Transform has been significantly improved after 7.7, however what I proposed should work in 7.7, too.

Re-index can not write aggregation results into the destination index.

I am not a Vega expert, but this seems possible to me. How many results are you expecting, how many do you want to show? If you are ok with the "top" top_hits, this might work.

Vaishnavi_Nandakumar · July 19, 2021, 10:38am

Hello,
I'm using Open Distro for Elasticsearch. I tried to activate the basic license but I'm getting an error.

{
  "error" : {
    "root_cause" : [
      {
        "type" : "parse_exception",
        "reason" : "request body is required"
      }
    ],
    "type" : "parse_exception",
    "reason" : "request body is required"
  },
  "status" : 400
}

The top hits aggregation is just producing 1 document per bucket. I want to access the value of a elapsed time field, within the source of that document.

dadoonet · July 19, 2021, 10:57am

You need to download the official version of elasticsearch if you want to use those features are they are not available within any other distribution.

Hendrik_Muhs · July 19, 2021, 11:05am

ok, in this case there is unfortunately no transform and no other basic licensed features.

I actually meant how many buckets are you expecting. Do you want to show all buckets in 1 page?

The reason for computing aggregations offline with e.g. transform is usually:

the number of buckets is large, like 10k or more (the limit can be lifted, but performance is another reason, offline pre-aggregation avoids running expensive queries)
you want to further analyze on top of the results, e.g. to answer questions like "average duration over all transactions"
you want to feed the results into another process like building a machine learning model

Vaishnavi_Nandakumar · July 19, 2021, 11:16am

Alright. I saw Ingest Pipelines can also be used for reindexing. Do you think a painless script can be used to filter out the documents?

Ideally, yes. But I understand the limitations and I just want to see if the data could be visualized.

Vaishnavi_Nandakumar · July 19, 2021, 11:17am

Oh I didn't know about this. Thank you.

Hendrik_Muhs · July 19, 2021, 12:13pm

Ingest pipelines and the contained ingest processors can only take 1 document manipulate it and output 1 document, basically a "map" type of operation. They can not "reduce"[1], meaning you can not combine a set of documents. Aggregations are conceptually such a "reducer".

I don't think re-index or ingest help you in this case. Vega will let you run and visualize/print aggregations, but might be limited to a small set.

The only other option - if an upgrade of the cluster / switch to the official Elastic elasticsearch distribution[2] is not possible - is implementing your requirement in code, e.g. a python script that pulls the data and writes it back into another index.

[1](Some rare special cases which are of type "reduce" are possible though with ingest pipelines, e.g. you can "enrich" one document with another)
[2]Opendistro is a fork of elasticsearch not supported by Elastic

Vaishnavi_Nandakumar · July 19, 2021, 12:58pm

Got it.
I'm currently limited to the platform I'm working with so I can't change that. But thank you for your response.

system · August 16, 2021, 12:59pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Can i reindex aggregation's result to other index? Elasticsearch	3	1565	February 27, 2019
Re-index aggregation ElasticSearch Elasticsearch	2	4764	March 17, 2017
Can i add Aindex's documents aggregation result into B Index's document? Elasticsearch	2	471	February 28, 2019
Reindex to data streams Elasticsearch	2	253	October 4, 2022
Reindex while writing to index Elasticsearch	1	668	March 16, 2018

How can I reindex the results of an aggregation into another index?

Related topics