Elastic query

I have a few duplicate documents in a particular index. Is there any way to skip the duplicate entries in an Elasticsearch query?
I have the following documents in an index. I want the query to return a single document based on the status field. That is, if 10 documents have "status: done", it should return only one of them (the latest one).

{
_id:  dxsafaf
status:  done
pid:  123
}

{
_id: dadadfe
status: done
pid: 123
}

{
_id: qewqert
status:done
pid: 123
}

I am using the following query from Python, but it returns all the documents:

res = es.search(index='data-ver2-*', size=5000, request_timeout=60,
                body={"sort": [{"@timestamp": {"order": "asc"}}],
                      "query": {"bool": {"must": [{"match_all": {}}, {"match_phrase": {"pid": {"query": 123}}}]}}})

Could you please let me know what I need to add to this query, and how?

Thanks
Niraj

What would help to make the decision that qewqert is the "latest" one?
There is nothing else to sort on.

@timestamp

{
'@timestamp': '2019-02-12T11:05:36.124Z'
_id:  dxsafaf
status:  done
pid:  123
}

{
'@timestamp': '2019-02-12T11:05:45.124Z'
_id: dadadfe
status: done
pid: 123
}

{
'@timestamp': '2019-02-12T11:05:52.124Z'
_id: qewqert
status:done
pid: 123
}

What about using size: 1?

size: 1 would be fine if there were only a single status (done). But if there are other statuses as well, it would not work, right?
Let me give the complete example here:

{
'@timestamp': '2019-02-12T11:03:52.124Z'
_id: qewqert
status:run
pid: 123
}
{
'@timestamp': '2019-02-12T11:04:36.124Z'
_id: dxsafah
status: run
pid: 123
}
{
'@timestamp': '2019-02-12T11:05:52.124Z'
_id: qewwert
status:done
pid: 123
}
{
'@timestamp': '2019-02-12T11:05:45.124Z'
_id: dadadfe
status: done
pid: 123
}
{
'@timestamp': '2019-02-12T11:05:52.124Z'
_id: qewqert
status:done
pid: 123
}

And I want only these two documents

{
'@timestamp': '2019-02-12T11:04:36.124Z'
_id: dxsafah
status: run
pid: 123
}
{
'@timestamp': '2019-02-12T11:05:52.124Z'
_id: qewqert
status: done
pid: 123
}

Giving the complete example up front would have helped avoid spending time on wrong answers. Please do that next time.

It's always better to provide a full recreation script as described in About the Elasticsearch category. It helps to better understand what you are doing.

A full reproduction script will help readers to understand, reproduce and if needed fix your problem. It will also most likely help to get a faster answer.

Anyway, you can run a terms aggregation and then a top_hits agg with size:1 in it.
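As a rough sketch of that suggestion in Python: a terms aggregation on status, with a top_hits sub-aggregation of size 1 sorted by @timestamp descending, so each status bucket carries only its newest document. The helper names latest_per_status_body and latest_docs, the aggregation names by_status and latest, and the bucket size of 10 are made up for illustration. It also assumes status is mapped as a keyword field; if it was dynamically mapped as text, the field would probably need to be status.keyword instead.

```python
def latest_per_status_body(pid):
    """Build a search body that keeps only the newest document per status."""
    return {
        "size": 0,  # the regular hit list is not needed, only the aggregation
        "query": {"bool": {"must": [{"match_phrase": {"pid": {"query": pid}}}]}},
        "aggs": {
            "by_status": {
                # Assumption: "status" is a keyword field (else "status.keyword").
                "terms": {"field": "status", "size": 10},
                "aggs": {
                    "latest": {
                        "top_hits": {
                            "size": 1,
                            "sort": [{"@timestamp": {"order": "desc"}}],
                        }
                    }
                },
            }
        },
    }


def latest_docs(response):
    """Pull the single top hit out of each status bucket of the response."""
    buckets = response["aggregations"]["by_status"]["buckets"]
    return [bucket["latest"]["hits"]["hits"][0] for bucket in buckets]
```

With the client from the question, this would be called along the lines of res = es.search(index='data-ver2-*', body=latest_per_status_body(123), request_timeout=60) followed by docs = latest_docs(res). Note that when two documents share the same @timestamp within a status, which of them comes back is not determined by this sort alone.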

Thanks for your support. It works now.
