How can I reindex the results of an aggregation into another index?

Hello,

I'm using the 7.7.0 version of Elasticsearch. I have an aggregation query that works on the ingested data to produce the output but I am not able to re-index it.

Data Context

The given dataset has around 500,000 records consisting of Incoming File ID, Outgoing File ID, combination of both ID’s and the time elapsed during the transaction of both ID’s over a year. There can be multiple instances where both the File ID’s combinations (IN_OUT_ID) are repeated on different days. The aggregation query I have groups these records based on IN_OUT_ID and takes the document with the smallest Outgoing ID.

I have used top hits aggregations to get the document source but when I tried reindexing with the query given below, all of the documents are being ingested into the destination index instead of just the filtered ones.

POST /_reindex
{
  "source": {
    "index": "src-new-data",
    "aggs": {
      "groupbyID": {
        "terms": {
          "field": "IN_OUT_ID",
          "order": {
            "lowest_score": "asc"
          }
        },
        "aggs": {
          "lowest_score": {
            "min": {
              "field": "OUT_FILE_ID"
            }
          },
          "lowest_score_top_hits": {
            "top_hits": {
              "size": 1,
              "sort": [
                {
                  "OUT_FILE_ID": {
                    "order": "asc"
                  }
                }
              ]
            }
          }
        }
      }
    }
  },
  "dest": {
    "index": "dest-filter-data"
  }
}

I want to visualize a histogram of the elapsed time from the filtered documents. If no other approach can be used for reindexing, is there any other way I can create the visualization based on the data context given? I have tried using Vega but I am not able to access the elapsed time value from the top hits bucket aggregation.
Thank You

Have a look at transform.

However top hits is not supported in transform, but you can use a scripted_metric aggregation, similar to this example.

1 Like

Hi Hendrik,
I checked out the link but I am not able to use _transform as I'm using an earlier version of Elasticsearch.
Is there any other approach I can use to either :
a. Reindex from the top hits aggregations
b. Use Vega to visualize top hits aggregations.

Transform has been made GA(General Availability) in elasticsearch as of version 7.5, it's predecessor (beta program) has been available since 7.2.

Are you using the OSS distribution of elasticsearch? If not, in order to use transform you need to activate the basic license. This license comes free of charge.

Transform has been significantly improved after 7.7, however what I proposed should work in 7.7, too.

Re-index can not write aggregation results into the destination index.

I am not a Vega expert, but this seems possible to me. How many results are you expecting, how many do you want to show? If you are ok with the "top" top_hits, this might work.

Hello,
I'm using Open Distro for Elasticsearch. I tried to activate the basic license but I'm getting an error.

{
  "error" : {
    "root_cause" : [
      {
        "type" : "parse_exception",
        "reason" : "request body is required"
      }
    ],
    "type" : "parse_exception",
    "reason" : "request body is required"
  },
  "status" : 400
}

The top hits aggregation is just producing 1 document per bucket. I want to access the value of a elapsed time field, within the source of that document.

You need to download the official version of elasticsearch if you want to use those features are they are not available within any other distribution.

ok, in this case there is unfortunately no transform and no other basic licensed features.

I actually meant how many buckets are you expecting. Do you want to show all buckets in 1 page?

The reason for computing aggregations offline with e.g. transform is usually:

  • the number of buckets is large, like 10k or more (the limit can be lifted, but performance is another reason, offline pre-aggregation avoids running expensive queries)
  • you want to further analyze on top of the results, e.g. to answer questions like "average duration over all transactions"
  • you want to feed the results into another process like building a machine learning model

Alright. I saw Ingest Pipelines can also be used for reindexing. Do you think a painless script can be used to filter out the documents?

Ideally, yes. But I understand the limitations and I just want to see if the data could be visualized.

Oh I didn't know about this. Thank you.

Ingest pipelines and the contained ingest processors can only take 1 document manipulate it and output 1 document, basically a "map" type of operation. They can not "reduce"[1], meaning you can not combine a set of documents. Aggregations are conceptually such a "reducer".

I don't think re-index or ingest help you in this case. Vega will let you run and visualize/print aggregations, but might be limited to a small set.

The only other option - if an upgrade of the cluster / switch to the official Elastic elasticsearch distribution[2] is not possible - is implementing your requirement in code, e.g. a python script that pulls the data and writes it back into another index.

[1](Some rare special cases which are of type "reduce" are possible though with ingest pipelines, e.g. you can "enrich" one document with another)
[2]Opendistro is a fork of elasticsearch not supported by Elastic

Got it.
I'm currently limited to the platform I'm working with so I can't change that. But thank you for your response.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.