Top hits on transform: group by one field, not all fields

So I have a case where I need to use top hits in a transform. I want to show data based on the latest time per email. I have this data:

email   col2   col3   col4   col5   Time
a.com   a      a      a      a      11:00
a.com   a      a      a      a      11:01
a.com   a      b      a      a      11:02

I want to remove the duplicate emails and only show each one based on the latest time. I'm using a transform that aggregates on max time, and for the group_by I chose every field I needed. It returns data such as:

a.com   a      a      a      a      11:01
a.com   a      b      a      a      11:02

I only want it to show the row a.com a b a a 11:02.
How can I make the transform group by email only, instead of by every field?

Can you post your configuration?

It sounds like you group_by on more than the email address, but if you only want email you should not define other group_by fields.

The top_hits aggregation is currently not supported, but we have a top-hits-like scripted metric example in the docs.

Dear Hendrik, yes, you are right. I added every field to group_by because I need them to be shown as terms. Since there are only the aggregation and group_by kinds, I added them all to group_by. They are not numbers, so I don't think I should aggregate them.
group by email, col2,col3,col4,col5
aggregate max(Time)
Please kindly help me, thank you @Hendrik_Muhs
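
Roughly, my transform looks like this sketch (the transform and index names are placeholders; the fields follow my example above):

PUT _transform/dedupe_emails
{
  "source": { "index": "my-email-index" },
  "dest": { "index": "my-email-index-latest" },
  "pivot": {
    "group_by": {
      "email": { "terms": { "field": "email.keyword" } },
      "col2": { "terms": { "field": "col2.keyword" } },
      "col3": { "terms": { "field": "col3.keyword" } },
      "col4": { "terms": { "field": "col4.keyword" } },
      "col5": { "terms": { "field": "col5.keyword" } }
    },
    "aggregations": {
      "Time": { "max": { "field": "Time" } }
    }
  }
}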

Dear Hendrik,
How do we add terms fields without aggregating them, but outside of the group_by?

Did you try the example I mentioned in the last answer?

It is an example of getting the last document based on a timestamp field. That means you configure the email as the group_by and use the latest_doc aggregation from the given example as the aggregation. I think that should work for your case.

Thanks @Hendrik_Muhs for the example, this was very useful for getting top hits in transforms.
Looking forward to direct support for the top_hits aggs in transform soon, thank you.

@charles97 I used this example to get the latest update for a ticket, same as your use case. Here is the example I used:

PUT _transform/tickets_tranform
{
  "source": {
    "index": [
      "tickets"
    ]
  },
  "pivot": {
    "group_by": {
      "ticket_id.keyword": {
        "terms": {
          "field": "ticket_id.keyword"
        }
      }
    },
    "aggregations": {
      "latest_doc": {
        "scripted_metric": {
          "init_script": "state.timestamp_latest = 0L; state.last_doc = ''",
          "map_script": """ 
        def current_date = doc['updated_at'].getValue().toInstant().toEpochMilli();
        if (current_date > state.timestamp_latest)
        {state.timestamp_latest = current_date;
        state.last_doc = new HashMap(params['_source']);}
      """,
          "combine_script": "return state",
          "reduce_script": """ 
        def last_doc = '';
        def timestamp_latest = 0L;
        for (s in states) {if (s.timestamp_latest > (timestamp_latest))
        {timestamp_latest = s.timestamp_latest; last_doc = s.last_doc;}}
        return last_doc
      """
        }
      }
    }
  },
  "description": "Get the latest update per ticket_id",
  "dest": {
    "index": "tickets_tranform"
  },
  "frequency": "5m",
  "sync": {
    "time": {
      "field": "updated_at",
      "delay": "60s"
    }
  }
}
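
One note in case it helps: after the PUT, the transform still needs to be started, and you can follow its progress with the stats API (standard transform endpoints, shown here for the transform above):

POST _transform/tickets_tranform/_start

GET _transform/tickets_tranform/_stats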

Dear Hendrik,
Yes, I've tried it out and it is still returning duplicate emails. Here is the transform config:

{
  "id": "transform_latest_doc",
  "source": {
    "index": [
      "email-profile-nov-bug*"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "transform_latest_doc"
  },
  "sync": {
    "time": {
      "field": "extract_date",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "Email.keyword": {
        "terms": {
          "field": "Email.keyword"
        }
      },
      "bln.keyword": {
        "terms": {
          "field": "bln.keyword"
        }
      },
      "fa.keyword": {
        "terms": {
          "field": "fa.keyword"
        }
      },
      "ever.keyword": {
        "terms": {
          "field": "ever.keyword"
        }
      },
      "Status_Category.keyword": {
        "terms": {
          "field": "Status_Category.keyword"
        }
      },
      "Email_age_category.keyword": {
        "terms": {
          "field": "Email_age_category.keyword"
        }
      },
      "Vintage_cat.keyword": {
        "terms": {
          "field": "Vintage_cat.keyword"
        }
      },
      "domain.keyword": {
        "terms": {
          "field": "domain.keyword"
        }
      }
    },
    "aggregations": {
      "latest_doc": {
        "scripted_metric": {
          "init_script": "state.timestamp_latest = 0L; state.last_doc = ''",
          "map_script": " \n        def current_date = doc['extract_date'].getValue().toInstant().toEpochMilli();\n        if (current_date > state.timestamp_latest)\n        {state.timestamp_latest = current_date;\n        state.last_doc = new HashMap(params['_source']);}\n      ",
          "combine_script": "return state",
          "reduce_script": " \n        def last_doc = '';\n        def timestamp_latest = 0L;\n        for (s in states) {if (s.timestamp_latest > (timestamp_latest))\n        {timestamp_latest = s.timestamp_latest; last_doc = s.last_doc;}}\n        return last_doc\n      "
        }
      }
    }
  },
  "settings": {},
  "version": "7.8.0",
  "create_time": 1607952403488
}

Dear Yassine,
Thank you for sharing, but I still don't have mine working, since it is still returning duplicate email and bln combinations.

Data sample

email   bln   cat   time
a.com   10    a     10.11
a.com   10    b     10.15
a.com   11    b     10.20
a.com   11    b     10.50

returns

email   bln   cat   time
a.com   10    a     10.11
a.com   10    b     10.15
a.com   11    b     10.50

result I need

email   bln   cat   time
a.com   10    b     10.15
a.com   11    b     10.50

Technically it should be grouped by email and bln, but I also want to show the last status of cat.
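
In transform terms, I think I'm after something like this sketch (index and field names as in my config above, previewing a group_by on email and bln with the same latest_doc scripted metric to carry cat along):

POST _transform/_preview
{
  "source": { "index": ["email-profile-nov-bug*"] },
  "pivot": {
    "group_by": {
      "Email.keyword": { "terms": { "field": "Email.keyword" } },
      "bln.keyword": { "terms": { "field": "bln.keyword" } }
    },
    "aggregations": {
      "latest_doc": {
        "scripted_metric": {
          "init_script": "state.timestamp_latest = 0L; state.last_doc = ''",
          "map_script": "def current_date = doc['extract_date'].getValue().toInstant().toEpochMilli(); if (current_date > state.timestamp_latest) {state.timestamp_latest = current_date; state.last_doc = new HashMap(params['_source']);}",
          "combine_script": "return state",
          "reduce_script": "def last_doc = ''; def timestamp_latest = 0L; for (s in states) {if (s.timestamp_latest > timestamp_latest) {timestamp_latest = s.timestamp_latest; last_doc = s.last_doc;}} return last_doc"
        }
      }
    }
  }
}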

Dear @ylasri and @Hendrik_Muhs,
Is it possible to add a field outside of the group_by or aggregation?

That's the idea: group only by email, and get all the other fields from the scripted aggs.

Dear @ylasri,
Pardon, I don't get it. How do we do it? Please pardon my ELK skills. I don't see how we add the rest of the fields if we only add email as the group_by. Would you please explain it to me step by step so I can understand? Thank you so much.

Like this; you will get all the other fields under latest_doc:

PUT _transform/transform_latest_doc
{
  "source": {
    "index": [
      "email-profile-nov-bug*"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "transform_latest_doc"
  },
  "sync": {
    "time": {
      "field": "extract_date",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "Email.keyword": {
        "terms": {
          "field": "Email.keyword"
        }
      }
    },
    "aggregations": {
      "latest_doc": {
        "scripted_metric": {
          "init_script": "state.timestamp_latest = 0L; state.last_doc = ''",
          "map_script": " \n        def current_date = doc['extract_date'].getValue().toInstant().toEpochMilli();\n        if (current_date > state.timestamp_latest)\n        {state.timestamp_latest = current_date;\n        state.last_doc = new HashMap(params['_source']);}\n      ",
          "combine_script": "return state",
          "reduce_script": " \n        def last_doc = '';\n        def timestamp_latest = 0L;\n        for (s in states) {if (s.timestamp_latest > (timestamp_latest))\n        {timestamp_latest = s.timestamp_latest; last_doc = s.last_doc;}}\n        return last_doc\n      "
        }
      }
    }
  }
}
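
Each destination document should then come out roughly like this (the values are illustrative, taken from the earlier sample; latest_doc holds the full _source of the newest document in the bucket):

{
  "Email.keyword" : "a.com",
  "latest_doc" : {
    "Email" : "a.com",
    "bln" : "11",
    "cat" : "b",
    "extract_date" : "10.50"
  }
}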

Conceptually, group_by is about forming your buckets. They are built from the combinations of the values extracted/created for each document, and each combination is considered a composite bucket.

The aggregation part defines what to return from the buckets built in group_by.
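
For example, with the sample data above: a group_by on email, bln, and cat produces three composite buckets, (a.com, 10, a), (a.com, 10, b), and (a.com, 11, b), which is exactly why three rows came back. A group_by on email and bln alone produces only the two buckets (a.com, 10) and (a.com, 11), and the latest_doc aggregation then picks the newest cat inside each one.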

To get a better understanding of transform fundamentals, I suggest the webinar recording. It doesn't cover your case, but it hopefully helps you understand the difference between group_by and aggregation.

Dear @ylasri, thank you so much for your assistance and the sample; you made it much clearer how to use the scripted aggregation. I had used it before but never got it right, as I was still adding every field I needed to the group_by. Thank you for explaining how the script works.

Dear @Hendrik_Muhs, thank you so much for the resources you gave me to learn from. I will study them some more.

Thank you everyone for your assistance. Appreciate that so much. Have a nice day.

@hendrik,
I've tested some more things, but it seems it doesn't return all unique emails. Some weird behaviour happened:
All data transformed: 83402
Unique count by email: 83066

When I try to find the duplicate emails by aggregating on them, it returns that nothing has a doc count of more than 1.

GET transformwithout12/_search
{
  "size": 0,
  "aggs": {
    "duplicateNames": {
       "terms": {
         "field": "Email.keyword",
         "min_doc_count": 2
       }
     }
  }
}

I checked it with the query above and it found no duplicate emails:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "duplicateNames" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 83377,
      "buckets" : [ ]
    }
  }
}

For the unique count, I suppose you used the cardinality aggs, which will never give a 100% exact count. You can increase the precision_threshold to get a more accurate count: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html#_precision_control
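
For example, something like this against your destination index (the agg name is just for illustration; 40000 is the maximum supported precision_threshold):

GET transformwithout12/_search
{
  "size": 0,
  "aggs": {
    "unique_emails": {
      "cardinality": {
        "field": "Email.keyword",
        "precision_threshold": 40000
      }
    }
  }
}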

Transform is 100% exact, as it uses composite aggs to scroll through all possible buckets...
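
Under the hood that means paging through the buckets with a composite aggregation, roughly like this simplified sketch (the agg name is illustrative). Each response carries an after_key that is passed back via "after" to fetch the next page, so no bucket is ever missed:

GET email-profile-nov-bug*/_search
{
  "size": 0,
  "aggs": {
    "pivot_buckets": {
      "composite": {
        "size": 500,
        "sources": [
          { "Email.keyword": { "terms": { "field": "Email.keyword" } } }
        ]
      }
    }
  }
}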
