How to remove duplicate values in Elasticsearch

narasiman1986 · December 18, 2017, 12:16pm

I am using Elasticsearch 5.6.4.

GET /clinicaldata/_search
{
"query": {
"match" : {
"mrdno" : "112627"
}
}
}

mrdno 112627 has 75 duplicate records.

Running the query above gives the following output:

{
"took": 2,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 75,
"max_score": 1

The total is 75 records, and all 75 are duplicates.

I want to delete the duplicate records using Elasticsearch.

How do I write the query for that?

dadoonet · December 18, 2017, 12:32pm

You can probably use the delete by query feature and then index one of the docs again.

narasiman1986 · December 18, 2017, 3:23pm

OK. Please tell me how to do that, because I am new to Elasticsearch.

dadoonet · December 18, 2017, 3:35pm

See https://www.elastic.co/guide/en/elasticsearch/reference/6.1/docs-delete-by-query.html

narasiman1986 · December 18, 2017, 3:49pm

I have 75 duplicate records for mrdno 11657.

POST twitter/_delete_by_query?scroll_size=5000
{
"query": {
"term": {
"user": "kimchy"
}
}
}

In the example above, I am replacing the field name with my own.

My index name is test.

POST test/_delete_by_query?scroll_size=5000
{
"query": {
"term": {
"mrdno": "11657"
}
}
}

Will the above query delete the duplicate values for the field mrdno?

Is it correct? Please let me know.

dadoonet · December 18, 2017, 4:38pm

Please format your code using </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

Your query looks good, except for the index name, which should probably be clinicaldata.

narasiman1986 · December 18, 2017, 4:44pm

clinicaldata is my index name.

POST clinicaldata/_delete_by_query?scroll_size=5000
{
"query": {
"term": {
"mrdno": "11657"
}
}
}

Is the above query correct for deleting the duplicate records of an mrdno?

dadoonet · December 18, 2017, 5:45pm

It will delete ALL records that match "mrdno": "11657", not only the duplicates.
So you will need to create a new document again afterwards...

narasiman1986 · December 19, 2017, 4:14am

I do not want to delete all the records that match mrdno 11657.

I want to delete only the duplicates.

POST clinicaldata/_delete_by_query?scroll_size=5000
{
"query": {
"term": {
"mrdno": "11657"
}
}
}

You said that the above query will delete all the records that match mrdno 11657.

I want to delete only the duplicate records of mrdno 11657.

What changes do I have to make to the above query?

Please do the needful.

dadoonet · December 19, 2017, 5:41am

I can see that you asked the same question at

So the answer is: there is no way to remove duplicates in one single call.

As I said, you need to:

Remove all docs
Add back one of the docs you removed
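
A minimal sketch of those two steps as Kibana Console requests. It assumes the `clinicaldata` index and the `mrdno` value from the posts above, and the `logs` type shown later in the thread. Using the mrdno value as the document ID in the last request is an assumption, not something the thread prescribes; it would make a later re-import overwrite the document instead of duplicating it.

```
# Before deleting: fetch one matching doc and save its _source
GET clinicaldata/_search
{
  "size": 1,
  "query": { "term": { "mrdno": "11657" } }
}

# Step 1: remove all docs that match
POST clinicaldata/_delete_by_query
{
  "query": { "term": { "mrdno": "11657" } }
}

# Step 2: add back the saved doc. The body is abbreviated to one field here;
# use the full _source saved from the first request.
PUT clinicaldata/logs/11657
{
  "mrdno": "11657"
}
```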

Is it a one-time operation or something you want to do in the long run? If the latter, what is the use case for allowing duplicates?

dadoonet · December 19, 2017, 5:54am

Unless you have another field which can help like a timestamp...

narasiman1986 · December 19, 2017, 6:09am

The data is displayed in the Kibana Discover tab. Below are the Discover tab details.

Index name: clinicaldata

Selected fields: _source

Available fields: @timestamp, @version, _id, _index, _score, _type

Field names: age, cpa_addr_1, cpa_addr_2, cpa_addr_3, cpa_addr_area, cpa_addr_city, cpa_country_cd, cpa_pin_code, cpa_state_cd, mcs_case_summary, mcs_crt_dt, mcs_crt_uid, rrh_first_name, rrh_location_cd, rrh_mr_num, rrh_pat_dob, rrh_pat_sex, rrh_regn_dt

Sample document (_source):
cpa_addr_2:POST- EKBARNA cpa_addr_area: - mcs_crt_uid:MACTCS cpa_addr_1:VILL- EKB ARNA rrh_mr_num:3416558 cpa_pin_code:732 204 rrh_pat_sex:Fema mcs_crt_dt:2015-01-03T07:23:00.000Z @timestamp:2017-12-15T08:54:28.231Z cpa_country_cd:INDIA cpa_state_cd:WB @version:1 rrh_regn_dt:2014-11-18T18:30:00.000Z rrh_first_name:ANITA KARMAKAR rrh_pat_dob:1987-08-24T18:30:00.000Z mcs_case_summary:External File Uploaded - EXTNFILEINFO SHEET age:28 rrh_location_cd:MAIN cpa_addr_3:PS- RATUA cpa_addr_city:india
_id:AWBZYcj9fbvNIqwo2O6C _type:logs _index:clinicaldata _score:1

But I also want to display this data in a Kibana visualization using a tag cloud.

These are the steps I follow to display the data in the visualization:

Create a new visualization

Select the tag cloud (in visualization type)

Select the index name

Click the Add a filter + button
In the popup that opens, select the field name from the filter list, choose "is" as the operator, type india in the value textbox, and click the Save button

Then the message "No results found" is shown.

Why is the data not displayed in the visualization tab?

Is there any mistake in the steps above? Please do the needful.

dadoonet · December 19, 2017, 6:23am

So you can run a first request to get the min value of the timestamp field with a min aggregation. And then exclude this value in your search body.
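
That two-request approach could look roughly like this in Kibana Console syntax. The sketch assumes the `@timestamp` field from the Discover output, the `mrdno` term from the earlier queries, and that all matching docs are copies of the same record. The aggregation name `oldest` is arbitrary, and the date in the second request is only a placeholder: replace it with the `value_as_string` returned by the first request.

```
# Request 1: find the oldest @timestamp among the matching docs
GET clinicaldata/_search
{
  "size": 0,
  "query": { "term": { "mrdno": "11657" } },
  "aggs": {
    "oldest": { "min": { "field": "@timestamp" } }
  }
}

# Request 2: delete every matching doc except the one with that timestamp
POST clinicaldata/_delete_by_query
{
  "query": {
    "bool": {
      "must": { "term": { "mrdno": "11657" } },
      "must_not": { "term": { "@timestamp": "2017-12-15T08:54:28.231Z" } }
    }
  }
}
```

If several copies share the same minimum timestamp, all of them are kept, so this only leaves a single doc when the timestamps differ.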

dadoonet · December 19, 2017, 6:51am

And please format your code using </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

narasiman1986 · December 19, 2017, 7:22am

OK. You said to run a first request to get the min value of the timestamp field with a min aggregation, and then exclude this value in the search body.

Please tell me how to do each step.

I am new to Elasticsearch. Please do the needful.

dadoonet · December 19, 2017, 7:53am

min aggregation: Min Aggregation | Elasticsearch Reference [6.1] | Elastic
exclude from resultset, see must_not clause from bool query: Bool Query | Elasticsearch Reference [6.1] | Elastic

If you still don't know, please provide a sample data set with some docs that we can use as described in

It will help to better understand what you are doing.
Please, try to keep the example as simple as possible.

Please read carefully the instructions. It must be easily runnable.

narasiman1986 · December 19, 2017, 9:01am

The code below, which is meant to remove the duplicates, is not working.

input {
jdbc {
jdbc_driver_library => "D:\mysql-connector-java-5.1.44\mysql-connector-java-5.1.44\mysql-connector-java-5.1.44-bin.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://localhost:3306/sample"
jdbc_user => "root"
jdbc_password => "root"
jdbc_fetch_size => 10000
schedule => "* * * * *"
statement => "SELECT * from sample"
#codec => "json"
}

output {
elasticsearch {
hosts => ["localhost:9200"]
manage_template => false
index => "clinical" }

document_id => "%{[@metadata][_RRH_MR_NUM]}"

stdout { codec => rubydebug }
}

To remove the duplicates in Elasticsearch, I wrote the code above in my conf file, but it is not working.

What is the mistake in my code? Please do the needful.

dadoonet · December 19, 2017, 9:22am

Please format your code if you expect someone to read it.

narasiman1986 · December 19, 2017, 9:44am

I formatted my code and am sending it again.

input {
jdbc {
jdbc_driver_library => "D:\mysql-connector-java-5.1.44\mysql-connector-java-5.1.44\mysql-connector-java-5.1.44-bin.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://localhost:3306/sample"
jdbc_user => "root"
jdbc_password => "root"
jdbc_fetch_size => 10000
schedule => "* * * * *"
statement => "SELECT * from sample"
#codec => "json"
}

output {
elasticsearch {
hosts => ["localhost:9200"]
manage_template => false
index => "clinical" }

document_id => "%{[@metadata][_RRH_MR_NUM]}"

stdout { codec => rubydebug }
}

To remove the duplicates in Elasticsearch, I wrote the code above in my conf file, but it is not working.

Please help me. What is wrong in my code? I tried several times and it is not working.

dadoonet · December 19, 2017, 10:22am

Please read what I wrote here.

Edit your post. Make sure in the preview window that it is correct. Thanks.
