Filter latest record for each user and aggregate on different field in elasticsearch

Need to filter the latest record for each user and aggregate the count on each vote type. Still cannot able to figure out. any help pls?

{user:"sam", vote: 20, "createdDate": 1601285178238}

{user:"sam", vote: 21, "createdDate": 1601285178239}

{user:"harry", vote: 30, "createdDate": 1601285178238}

Any help

Hello @matrix1,

first change the createdDate in the human readable format from epoch and then apply the aggregation.

Hope it works for you

@shrikantgulia pls find my index details after the changes. i need to count only the votes of the latest user response

POST vote_result/_doc
{
"user": "sam",
"vote": "20",
"createdDate": "2020-09-30T09:05:30.113Z"
}

POST vote_result/_doc
{
"user": "sam",
"vote": "21",
"createdDate": "2020-09-30T09:05:32.113Z"
}

POST vote_result/_doc
{
"user": "harry",
"vote": "30",
"createdDate": "2020-09-30T09:05:30.113Z"
}

GET vote_result/_search
{
"aggs":{
"by_user":{
"terms": {
"field": "user"
},
"aggs": {
"tops": {
"top_hits": {
"size": 1
}
}
}
}
}
}

Response:

"aggregations" : {
"by_user" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "sam",
"doc_count" : 2,
"tops" : {
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "vote_result",
"_type" : "_doc",
"_id" : "UdFS3nQBmPKI5bzQIM1V",
"_score" : 1.0,
"_source" : {
"user" : "sam",
"vote" : "21",
"createdDate" : "2020-09-30T09:05:32.113Z"
}
}
]
}
}
},
{
"key" : "harry",
"doc_count" : 1,
"tops" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "vote_result",
"_type" : "_doc",
"_id" : "UtFS3nQBmPKI5bzQKM0B",
"_score" : 1.0,
"_source" : {
"user" : "harry",
"vote" : "30",
"createdDate" : "2020-09-30T09:05:30.113Z"
}
}
]
}
}
}
]
}
}
}

After this i need to aggregate the count of each unique vote in this case i need

{
key : 21
count : 1
} ,
{
key : 30
count: 1
}

how to achieve this result

Hello,

to get the first result put the size parameter.
size =1;

  1. You marked the votes as string Please change them to number

POST test_1/_doc
{
"user": "sam",
"vote": 21,
"createdDate": "2020-09-30T09:05:32.113Z"
}

POST test_1/_doc
{
"user": "harry",
"vote": 30,
"createdDate": "2020-09-30T09:05:30.113Z"
}

POST test_1/_search
{
"size": 0,
"aggs": {
"max_votes": {
"max": {
"field": "vote"
}
}
}
}

@shrikantgulia

i can get the max vote from size 1. the problem is need to filter only the latest record for each user then aggregate on count to get each vote type

eg:

i have 3 document . just a sample pls ignore the timestamp format

{user:"sam", vote: 20, "createdDate": 1601285178238}

{user:"sam", vote: 21, "createdDate": 1601285178239}

{user:"harry", vote: 30, "createdDate": 1601285178238}

filter only the latest of document for each user which will give me

{user:"sam", vote: 21, "createdDate": 1601285178239}

{user:"harry", vote: 30, "createdDate": 1601285178238}

And aggregate the field vote to get count of each vote type which will provide me

{
key : 21
count : 1
} ,
{
key : 30
count: 1
}

how to achieve this result?

Any reasoning based on last-known state of entities is tricky in a distributed system where related records are scattered across nodes. For this sort of analysis you typically need to bring related data physically closer together in a special index.
This can be done in your client app by updating a dedicated entity-state document keyed on the entity ID or using the transform API to periodically fuse multiple related records into a single entity-centric document. This fusion process is simple for some operations (e.g. counting the number of urls a user has accessed using a cardinality agg) but requires more complex scripting for some things like recording data only from the last known state.

@Mark_Harwood
GET opinion_test/_search
{
"aggs": {
"instances": {
"terms": {
"field": "user",
"size": 1000,
"order": {
"_key": "desc"
}
},
"aggs": {
"latest_mem": {
"top_hits": {
"sort": [
{
"createdDate": {
"order": "desc"
}
}
]
}
}
}
},
"latest_doc": {
"scripted_metric": {
"init_script": "state.timestamp_latest = 0L; state.last_doc = ''",
"map_script": "def current_date = doc['@timestamp'].getValue().toInstant().toEpochMilli(); if (current_date > state.timestamp_latest) {state.timestamp_latest = current_date;state.last_doc = new HashMap(params['_source']);}",
"combine_script": "return state",
"reduce_script": "def last_doc = '';def timestamp_latest = 0L; for (s in states) {if (s.timestamp_latest > (timestamp_latest)) {timestamp_latest = s.timestamp_latest; last_doc = s.last_doc;}} return last_doc"
}
}
}
}

tried this getting some error .can u help me with a es query for this scenerio

@Mark_Harwood

fixed the query issue but is not returning each user latest values it is giving only one user value

GET test/_search
{
"aggs": {
"instances": {
"terms": {
"field": "user",
"size": 1000,
"order": {
"_key": "desc"
}
},
"aggs": {
"latest_mem": {
"top_hits": {
"sort": [
{
"createdDate": {
"order": "desc"
}
}
]
}
}
}
},
"latest_doc": {
"scripted_metric": {
"init_script": "state.timestamp_latest = 0L; state.last_doc = ''",
"map_script": "def current_date = doc['createdDate'].getValue().toInstant().toEpochMilli(); if (current_date > state.timestamp_latest) {state.timestamp_latest = current_date;state.last_doc = new HashMap(params['_source']);}",
"combine_script": "return state",
"reduce_script": "def last_doc = '';def timestamp_latest = 0L; for (s in states) {if (s.timestamp_latest > (timestamp_latest)) {timestamp_latest = s.timestamp_latest; last_doc = s.last_doc;}} return last_doc"
}
}
}
}

Please let me know if any issues in query

The code I linked to is used to fuse data in a transform job not a query.
You should read up on the transform API.

Fundamentally you need to fuse data into a new index which only holds the latest information for each user. You can then query that index to determine the breakdown of vote types.

@Mark_Harwood thank you understood. is it any way possible to do in a query without Transform API ?

How many unique user IDs do you have?

@Mark_Harwood this for user poll aggregation so we have more user so no fixed count

Rough guess on largest numbers?

@Mark_Harwood maybe 10000 - 12000

The good news is that's sufficiently small that you may be able to get away with doing it in one search request without needing the transform API. When numbers are in the millions it requires too much memory.
The bad news is it's essentially a programming activity on your part. The scripted_metric gives you the framework to have custom scripts collect search result data from data nodes and fuse these at the coordinating node for a final result. It will involve the use of temporary maps etc. to organise results into buckets.

@Mark_Harwood thank you for the solution. I am new to scripted metrics and checking on the documentation now. Will be helpful if u can provide a sample script for this user based lastest aggregation using scripted metrics to get each vote count