Finding optimal solution to use filter in aggregations of transform index

InfiniteDreamer · July 28, 2020, 10:32pm

We are trying to create a transform index. The following sample data is a subset of ours; we want to find the most recent visit date for a given duty code.

code     enterprise      duty     visitdate
2247     HASTY&TASTY     1        Jul 27, 2020 @ 00:00
2247     HASTY&TASTY     2        Jul 26, 2020 @ 00:00
2247     HASTY&TASTY     0        Jul 25, 2020 @ 00:00
2247     HASTY&TASTY     2        Jun 30, 2020 @ 00:00
2247     HASTY&TASTY     1        Jun 22, 2020 @ 00:00
2213     DunkinDonut     0        Jul 28, 2020 @ 00:00
2213     DunkinDonut     2        Jul 27, 2020 @ 00:00
2213     DunkinDonut     2        Jul 26, 2020 @ 00:00

The transform index should contain only the most recent document per group where duty=2.

code     enterprise      visitdate
2247     HASTY&TASTY     Jul 26, 2020 @ 00:00
2213     DunkinDonut     Jul 27, 2020 @ 00:00

The transform script below works fine, but we would like to know whether there is an alternative way to implement the same thing. Our use cases need multiple scripted fields in the same transform index, which may cause load issues.

POST _transform/_preview
{
  "source": {
    "index": [
      "visitdata*"
    ],
    "query": {
      "exists": {
        "field": "visitdate"
      }
    }
  },
  "dest": {
    "index": "nx009"
  },
  "pivot": {
    "group_by": {
      "code": {
        "terms": {
          "field": "code.keyword"
        }
      },
      "enterprise": {
        "terms": {
          "field": "enterprise.keyword"
        }
      }
    },
    "aggregations": {
      "visitdate_doc": {
        "scripted_metric": {
          "init_script": "state.timestamp_latest = 0L; state.last_doc = ''",
          "map_script": """ 
        
        // Keep the newest document with duty == 2 seen on this shard.
        def current_date = doc['visitdate'].getValue().toInstant().toEpochMilli();
        def visited = doc['duty'].getValue();
        if (current_date > state.timestamp_latest && visited == 2) {
          state.timestamp_latest = current_date;
          state.last_doc = new HashMap(params['_source']);
        }
      """,
          "combine_script": "return state",
          "reduce_script": """ 
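        // Pick the newest document across all shard states and return its visitdate.
        def last_doc = '';
        def timestamp_latest = 0L;
        for (s in states) {
          if (s.timestamp_latest > timestamp_latest) {
            timestamp_latest = s.timestamp_latest;
            last_doc = s.last_doc;
          }
        }
        if (last_doc != null && !last_doc.isEmpty()) {
          return last_doc.visitdate;
        }
        // No document with duty == 2 in this bucket.
        return null;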
      """
        }
      }
    }
  }
}

InfiniteDreamer · July 30, 2020, 12:10pm

Hi, kindly guide us

Hendrik_Muhs · July 30, 2020, 1:32pm

Hi,

I do not see anything wrong with your approach. Getting the latest state is one of the top asks for transforms, and we might have an out-of-the-box solution for it in the future. Today you need scripted_metric. We know this is complicated and fragile but, as said, there is no alternative at the moment.

Regarding your config: are you always interested only in data points with duty == 2 (or duty >= 2)? If so, I think it is better for performance to filter in the query instead of filtering as part of scripted_metric. It would also be possible to put a filter aggregation right before scripted_metric, in case you do not want to filter globally.
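A minimal sketch of the query-level filter, assuming the `duty` field from the sample data is indexed so that a `term` query on `2` matches it. Only the `source` section of the transform request is shown; the `exists` clause is carried over from the original config and the `term` clause is the addition:

```json
{
  "source": {
    "index": ["visitdata*"],
    "query": {
      "bool": {
        "filter": [
          { "exists": { "field": "visitdate" } },
          { "term": { "duty": 2 } }
        ]
      }
    }
  }
}
```

With this in place, the `visited==2` check in `map_script` becomes redundant. For the duty >= 2 case, a `range` query with `gte` would take the place of the `term` clause.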

InfiniteDreamer · July 30, 2020, 7:40pm

Thanks @Hendrik_Muhs

Actually, we want several max dates, one per duty code, using multiple scripted_metric fields in the same transform index. Filtering at the index (query) level or with a filter aggregation may not suit our case.

Hendrik_Muhs · July 31, 2020, 9:11am

FWIW, a sub-agg filter would look like this:

{
  "source": {
    ...
  },
  "pivot": {
    "group_by": {
      ...
    },
    "aggregations": {
      "visited_twice": {
        "filter": {
          "term": {
            "duty": 2
          }
        },
        "aggs": {
          "visitdate_doc": {
            "scripted_metric": {
...

That means documents with duty != 2 are filtered out before the scripted_metric runs (and you can remove the check in the script). Note that the result will have an additional level, which can later be moved using an ingest pipeline:

"visited_twice": {
  "visitdate_doc": { ... }
}
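
One way to move the nested value back to the top level, sketched under assumptions: the pipeline name `flatten_visitdate` is made up for illustration, and the field names follow the snippet above. The body below would be sent with `PUT _ingest/pipeline/flatten_visitdate`:

```json
{
  "description": "Move visited_twice.visitdate_doc to the top level (illustrative)",
  "processors": [
    {
      "rename": {
        "field": "visited_twice.visitdate_doc",
        "target_field": "visitdate_doc",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "visited_twice",
        "ignore_missing": true
      }
    }
  ]
}
```

The transform would then reference it through `"pipeline": "flatten_visitdate"` in its `dest` section, next to `"index"`. This is untested for groups where the script returns no value, so treat it as a starting point.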

This solution should work; however, I cannot say whether, or by how much, performance improves.

(FWIW, our benchmark tool Rally has support for transforms.)
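
If only the latest visit date per duty code is needed, the same filter pattern can be repeated once per code. The sketch below makes two assumptions that are not from this thread: it swaps `scripted_metric` for a plain `max` aggregation on `visitdate`, which is enough only when the date itself is the desired output, and the aggregation names (`duty_1`, `duty_2`, `latest_visit`) are invented for illustration. It also needs an Elasticsearch version in which transforms accept the `filter` aggregation. The body is meant for `POST _transform/_preview`:

```json
{
  "source": {
    "index": ["visitdata*"],
    "query": { "exists": { "field": "visitdate" } }
  },
  "pivot": {
    "group_by": {
      "code": { "terms": { "field": "code.keyword" } },
      "enterprise": { "terms": { "field": "enterprise.keyword" } }
    },
    "aggregations": {
      "duty_1": {
        "filter": { "term": { "duty": 1 } },
        "aggs": {
          "latest_visit": { "max": { "field": "visitdate" } }
        }
      },
      "duty_2": {
        "filter": { "term": { "duty": 2 } },
        "aggs": {
          "latest_visit": { "max": { "field": "visitdate" } }
        }
      }
    }
  }
}
```

Each group would then carry one `latest_visit` per duty code, nested under the name of its filter. If other fields of the latest document are needed as well, `max` is not sufficient and the `scripted_metric` approach still applies.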

InfiniteDreamer · August 11, 2020, 1:29pm

Hi @Hendrik_Muhs,

We are creating a few fields based on the sub-agg filter above, combined with a range query. We want fields in the transform index that hold the average value over the last 30 days, 45 days, and so on.

{
  "source": {
    "index": [
      "sales*"
    ]
  },
  "pivot": {
    "group_by": {
      "customer.keyword": {
        "terms": {
          "field": "customer.keyword"
        }
      },
      "locationcode.keyword": {
        "terms": {
          "field": "locationcode.keyword"
        }
      }
    },
    "aggregations": {
      "avg30days": {
        "filter": {
          "range": {
            "journeydate": {
              "gte": "now-30d/d",
              "lte": "now/d"
            }
          }
        },
        "aggs": {
          "avg_30val": {
            "avg": {
              "field": "linenetamount"
            }
          }
        }
      },
      "avg45days": {
        "filter": {
          "range": {
            "journeydate": {
              "gte": "now-45d/d",
              "lte": "now/d"
            }
          }
        },
        "aggs": {
          "avg_45val": {
            "avg": {
              "field": "linenetamount"
            }
          }
        }
      }
    }
  }
}

This is the error we are getting:

{
  "error": {
    "root_cause": [
      {
        "type": "status_exception",
        "reason": "Unsupported aggregation type [filter]"
      }
    ],
    "type": "status_exception",
    "reason": "Failed to validate configuration",
    "caused_by": {
      "type": "status_exception",
      "reason": "Unsupported aggregation type [filter]"
    }
  },
  "status": 400
}

Please let us know how to fix this error, or whether there is other logic that achieves the same result.

BenTrent · August 11, 2020, 2:16pm

Hey @InfiniteDreamer,

filter agg support was added in Elasticsearch 7.7. Can you confirm your version?

InfiniteDreamer · August 12, 2020, 5:29am

Hi @BenTrent,

Thanks for the reply. We were testing the earlier transform script on Elasticsearch 7.5. I have now cross-checked the feature with Elasticsearch 7.8 on the sample ecommerce data. The transform runs, but it returns null in place of the avg aggregations:

POST _transform/_preview
{
  "source": {
    "index": [
      "kibana_sample_data_ecommerce*"
    ]
  },
  "pivot": {
    "group_by": {
      "customer_full_name.keyword": {
        "terms": {
          "field": "customer_full_name.keyword"
        }
      },
      "day_of_week": {
        "terms": {
          "field": "day_of_week"
        }
      }
    },
    "aggregations": {
      "avg7days": {
        "filter": {
          "range": {
            "order_date": {
              "gte": "now-7d/d",
              "lte": "now/d"
            }
          }
        },
        "aggs": {
          "avg_7val": {
            "avg": {
              "field": "taxful_total_price"
            }
          }
        }
      },
      "avg3days": {
        "filter": {
          "range": {
            "order_date": {
              "gte": "now-3d/d",
              "lte": "now/d"
            }
          }
        },
        "aggs": {
          "avg_3val": {
            "avg": {
              "field": "taxful_total_price"
            }
          }
        }
      }
    }
  }
}

However, taxful_total_price is present in the docs. The preview output:

{
  "preview" : [
    {
      "customer_full_name" : {
        "keyword" : "Abd Adams"
      },
      "avg7days" : {
        "avg_7val" : null
      },
      "avg3days" : {
        "avg_3val" : null
      },
      "day_of_week" : "Friday"
    },
    {
      "customer_full_name" : {
        "keyword" : "Abd Allison"
      },
      "avg7days" : {
        "avg_7val" : null
      },
      "avg3days" : {
        "avg_3val" : null
      },
      "day_of_week" : "Sunday"
    }............

Kindly help us

BenTrent · August 12, 2020, 2:58pm

You are grouping by day_of_week, so your buckets will be the customer and each individual day of the week (Sunday, Monday, etc.).

Since you are filtering on now-7d, if a particular customer, say John, did not place an order on a Monday in the last 7 days, that value will be null because there are no documents to average.

FWIW, I tried this myself, and did get some results, but yes, there are many null ones. Which makes sense, as not every customer makes an order every day of the week.
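
One way to check this explanation is to count the matching values next to the average. The sketch below adds a `value_count` sub-aggregation inside the same filter (the name `orders_7d` is made up for illustration). It would replace the `aggregations` object inside `pivot` in the preview request above, and is shown for the 7-day window only:

```json
{
  "aggregations": {
    "avg7days": {
      "filter": {
        "range": {
          "order_date": { "gte": "now-7d/d", "lte": "now/d" }
        }
      },
      "aggs": {
        "avg_7val": { "avg": { "field": "taxful_total_price" } },
        "orders_7d": { "value_count": { "field": "taxful_total_price" } }
      }
    }
  }
}
```

If the explanation holds, buckets where `avg_7val` is null should report an `orders_7d` of 0, while buckets with recent orders should show both a count and an average.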

InfiniteDreamer · August 14, 2020, 7:59am

Thanks @BenTrent. Instead of avg, other metrics such as sum and distinct count can be used to group the sample data.

system · September 11, 2020, 7:59am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
