I have a case where I need to use top hits in a transform.
I want to show data based on the latest time for each email.
I have this data:
email  col2  col3  col4  col5  Time
a.com  a     a     a     a     11:00
a.com  a     a     a     a     11:01
a.com  a     b     a     a     11:02
I want to remove the duplicate emails and only show the row with the latest time. I'm using a transform that aggregates on max time, and for the group by I chose every field I needed. It returns data such as:
a.com a a a a 11:01
a.com a b a a 11:02
I only want it to show the row a.com a b a a 11:02.
How can I make the transform group by email only, instead of by every field?
Dear Hendrik, yes, you are right. I added every field to the group by because I need them shown as terms. Since a transform only offers aggregations and group by, I put them in the group by. They are not numbers, so I don't think I should aggregate them.
group by email, col2,col3,col4,col5
aggregate max(Time)
Please kindly help me, thank you @Hendrik_Muhs
Did you try the example I mentioned in the last answer?
It is an example for getting the last document based on a timestamp field. That means you configure the email as group_by and use the latest_doc aggregation from the given example as aggregation. I think that should work for your case.
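For readers following along, here is a rough sketch of what such a transform body might look like, built as a Python dict so it can be inspected before sending it to the `PUT _transform/<id>` endpoint. It is adapted from the scripted metric "latest document" pattern and is not a verified copy of the linked example. The helper name `build_latest_doc_transform` and the index names are hypothetical, and it assumes `Time` is mapped as a date and `email` is a keyword field (you may need `email.keyword` instead).

```python
import json


def build_latest_doc_transform(source_index, dest_index, key_field, time_field):
    """Return a transform body that keeps the newest document per key_field.

    Hypothetical helper: index and field names are assumptions, not taken
    from the thread's actual mapping.
    """
    # Each shard starts with "no document seen yet".
    init_script = "state.timestamp_latest = 0L; state.last_doc = ''"
    # Per document: remember the whole _source if its timestamp is newer.
    map_script = (
        f"def current_date = doc['{time_field}'].getValue().toInstant().toEpochMilli(); "
        "if (current_date > state.timestamp_latest) { "
        "state.timestamp_latest = current_date; "
        "state.last_doc = new HashMap(params['_source']); }"
    )
    # Across shards: pick the newest of the per-shard winners.
    reduce_script = (
        "def last_doc = ''; def timestamp_latest = 0L; "
        "for (s in states) { if (s.timestamp_latest > timestamp_latest) { "
        "timestamp_latest = s.timestamp_latest; last_doc = s.last_doc; } } "
        "return last_doc"
    )
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
        "pivot": {
            # Only the key goes into group_by: one bucket per email.
            "group_by": {key_field: {"terms": {"field": key_field}}},
            "aggregations": {
                "latest_doc": {
                    "scripted_metric": {
                        "init_script": init_script,
                        "map_script": map_script,
                        "combine_script": "return state",
                        "reduce_script": reduce_script,
                    }
                }
            },
        },
    }


if __name__ == "__main__":
    body = build_latest_doc_transform("my-source", "my-dest", "email", "Time")
    print(json.dumps(body, indent=2))
```

With this shape, the other columns (col2 to col5) are not listed anywhere: they come along inside `latest_doc`, because the script stores the full `_source` of the newest document in each email bucket.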
Thanks @Hendrik_Muhs for the example, this was very useful for getting top hits in transforms.
Looking forward to direct support for top hits aggs in transforms soon, thank you.
@charles97 I used this example to get the latest update for a ticket, same as your use case. Here is the example I used
Dear @ylasri,
Pardon, I don't get it. How do we do it? Please pardon my ELK skills. I don't understand how we add the rest of the fields if we only add email as the group by. Would you please explain it to me step by step so I can understand? Thank you so much.
Conceptually, group_by is about forming your buckets. It's built from the combinations of the values extracted/created for each document, and each combination is considered a composite bucket.
The aggregation part defines what to return from the buckets built in group_by.
To get a better understanding of transform fundamentals, I suggest the webinar recording. It doesn't cover your case, but hopefully it helps to understand the difference between group_by and aggregation.
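To make the group_by versus aggregation distinction concrete, here is a small plain-Python simulation using the sample rows from the first post. It does not call Elasticsearch; `latest_per_bucket` is a made-up helper that mimics "form buckets from the group_by fields, then return the newest row of each bucket".

```python
rows = [
    {"email": "a.com", "col2": "a", "col3": "a", "col4": "a", "col5": "a", "Time": "11:00"},
    {"email": "a.com", "col2": "a", "col3": "a", "col4": "a", "col5": "a", "Time": "11:01"},
    {"email": "a.com", "col2": "a", "col3": "b", "col4": "a", "col5": "a", "Time": "11:02"},
]


def latest_per_bucket(rows, group_fields, time_field="Time"):
    """Bucket rows by the group_fields values and keep the newest row per bucket."""
    buckets = {}
    for row in rows:
        # The bucket key is the combination of all group_by values.
        key = tuple(row[f] for f in group_fields)
        best = buckets.get(key)
        # "HH:MM" strings of equal width compare correctly as text.
        if best is None or row[time_field] > best[time_field]:
            buckets[key] = row
    return list(buckets.values())


# Grouping by every column: col3 differs, so there are two buckets.
by_all = latest_per_bucket(rows, ["email", "col2", "col3", "col4", "col5"])

# Grouping by email only: one bucket, and the newest row carries the other columns.
by_email = latest_per_bucket(rows, ["email"])
```

Here `by_all` holds the 11:01 and 11:02 rows (the result from the first post), while `by_email` holds only the 11:02 row with col3 equal to b.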
Dear @ylasri, thank you so much for your assistance and the sample. You made it clearer how to use the scripted aggregation. I had used it before, but it never became clear to me, because I still added every field I needed to the group by. Thank you for explaining how the script works.
Dear @Hendrik_Muhs, thank you so much for the resources you gave me to learn from. I will learn some more from them.
Thank you everyone for your assistance. Appreciate that so much. Have a nice day.
@hendrik,
I've tested some more, but it seems it is not returning only unique emails. Some weird behaviour happened:
All data transformed : 83402
Unique Count by email : 83066
When I try to find the duplicate emails by aggregating with cardinality, it returns that nothing has a value greater than 1.
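One way to double-check a gap like this is an exact duplicate count on the exported documents, outside Elasticsearch. The sketch below is plain Python with a hypothetical `duplicate_keys` helper and made-up sample data. Keep in mind that the cardinality aggregation returns an approximate count, so a small difference between a document count and a unique count does not by itself prove duplicates; whether that explains the numbers above is not confirmed in this thread.

```python
from collections import Counter


def duplicate_keys(docs, key_field="email"):
    """Return {key: count} for every key that appears more than once (exact)."""
    counts = Counter(doc[key_field] for doc in docs)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up sample standing in for documents read from the destination index.
sample = [{"email": "a.com"}, {"email": "b.com"}, {"email": "a.com"}]
dupes = duplicate_keys(sample)
```

If the result is empty for the real destination index, every email is unique and the two totals should match once counted exactly.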