How to remove duplicate values in Elasticsearch


(narasiman) #9

I do not want to delete all the records that match mrdno 11657.

I want to delete only the duplicates.

POST clinicaldata/_delete_by_query?scroll_size=5000
{
  "query": {
    "term": {
      "mrdno": "11657"
    }
  }
}

You said that the above query will delete all the records that match mrdno 11657.

I want to delete only the duplicate records of mrdno 11657.

What changes do I have to make to the above query for that?

Please do the needful.


(David Pilato) #10

I can see that you asked the same question at

So the answer is: there is no way to remove duplicates in one single call.

As I said, you need to:

  • Remove all docs
  • Add back one of the docs you removed
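The two steps above could be planned roughly like this. It is only a minimal Python sketch, not something posted in the thread: `plan_remove_and_restore` is a hypothetical helper, and it assumes `hits` has the usual `hits.hits` shape of a search response (each entry carrying `_id` and `_source`). It only builds request bodies and does not talk to a cluster.

```python
from typing import Any, Dict, List


def plan_remove_and_restore(field: str, value: Any, hits: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Plan the two steps: delete every match, then index one copy back.

    `hits` is assumed to be the `hits.hits` list from a search on `field: value`.
    """
    if not hits:
        raise ValueError("no matching documents to deduplicate")

    # Save a copy of ONE document before anything is deleted.
    survivor = dict(hits[0]["_source"])

    # Same shape as the _delete_by_query body shown earlier in the thread.
    delete_body = {"query": {"term": {field: value}}}

    return {"delete_by_query": delete_body, "reindex_source": survivor}


# Illustrative data only; these documents are made up.
example_hits = [
    {"_id": "a1", "_source": {"mrdno": "11657", "name": "first copy"}},
    {"_id": "b2", "_source": {"mrdno": "11657", "name": "first copy"}},
]
plan = plan_remove_and_restore("mrdno", "11657", example_hits)
```

The copy has to be saved before the delete runs. `plan["delete_by_query"]` would be sent to `_delete_by_query`, and `plan["reindex_source"]` would then be indexed back as a single document.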

Is it a one-time operation or something you want to do in the long run? If the latter, what is the use case for allowing duplicates?


How to avoid duplicate values in Elasticsearch 5.6.4
(David Pilato) #11

Unless you have another field which can help like a timestamp...


(narasiman) #12

In the Kibana Discover tab the data is displayed. The Discover tab details are shown below.

Index name: clinicaldata

Selected Fields

? _source

Available Fields

? @timestamp
? @version
t _id
t _index
_score
t _type

Field names:

? age
? cpa_addr_1
? cpa_addr_2
? cpa_addr_3
? cpa_addr_area
? cpa_addr_city
? cpa_country_cd
? cpa_pin_code
? cpa_state_cd
? mcs_case_summary
? mcs_crt_dt
? mcs_crt_uid
? rrh_first_name
? rrh_location_cd
? rrh_mr_num
? rrh_pat_dob
? rrh_pat_sex
? rrh_regn_dt

_source

cpa_addr_2: POST- EKBARNA
cpa_addr_area: -
mcs_crt_uid: MACTCS
cpa_addr_1: VILL- EKB ARNA
rrh_mr_num: 3416558
cpa_pin_code: 732 204
rrh_pat_sex: Fema
mcs_crt_dt: 2015-01-03T07:23:00.000Z
@timestamp: 2017-12-15T08:54:28.231Z
cpa_country_cd: INDIA
cpa_state_cd: WB
@version: 1
rrh_regn_dt: 2014-11-18T18:30:00.000Z
rrh_first_name: ANITA KARMAKAR
rrh_pat_dob: 1987-08-24T18:30:00.000Z
mcs_case_summary: External File Uploaded - EXTNFILEINFO SHEET
age: 28
rrh_location_cd: MAIN
cpa_addr_3: PS- RATUA
cpa_addr_city: india
_id: AWBZYcj9fbvNIqwo2O6C
_type: logs
_index: clinicaldata
_score: 1

But when I try to display the data in a Kibana visualization using a tag cloud, nothing is shown.

These are the steps I follow to display the data in a visualization:

  • Create a new visualization
  • Select the tag cloud (as the visualization type)
  • Select the index name
  • Click the "Add a filter +" button. In the popup that opens, select the field name from the filter list, choose "is" as the operator, type india in the value textbox and click the Save button

Then the message "no results found" is shown.

Why is the data not displayed in the visualization tab? Is there any mistake in the steps above?

Please do the needful.


(David Pilato) #13

So you can run a first request to get the min value of the timestamp field with a min aggregation. And then exclude this value in your search body.
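A rough Python sketch of those two requests, building the bodies only. The helper names and the aggregation name `oldest` are made up for illustration. It assumes the timestamp field is `@timestamp` (as in the Discover listing above), that `mrdno` is mapped so a `term` query matches it exactly, and that the min aggregation response on a date field carries `value_as_string`.

```python
from typing import Any, Dict


def min_timestamp_search(field: str, value: Any, ts_field: str = "@timestamp") -> Dict[str, Any]:
    """First request: a search body asking only for the smallest timestamp among the matches."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "query": {"term": {field: value}},
        "aggs": {"oldest": {"min": {"field": ts_field}}},
    }


def delete_all_but_oldest(field: str, value: Any, search_response: Dict[str, Any],
                          ts_field: str = "@timestamp") -> Dict[str, Any]:
    """Second request: a _delete_by_query body that excludes the oldest timestamp."""
    oldest = search_response["aggregations"]["oldest"]["value_as_string"]
    return {
        "query": {
            "bool": {
                "filter": [{"term": {field: value}}],
                # Nothing is older than the minimum, so "lte oldest" matches
                # exactly the document(s) carrying the minimum timestamp.
                "must_not": [{"range": {ts_field: {"lte": oldest}}}],
            }
        }
    }


# Made-up response fragment, shaped like a min aggregation result on a date field.
fake_response = {
    "aggregations": {
        "oldest": {"value": 1513328068231.0, "value_as_string": "2017-12-15T08:54:28.231Z"}
    }
}
search_body = min_timestamp_search("mrdno", "11657")
delete_body = delete_all_but_oldest("mrdno", "11657", fake_response)
```

If several duplicates share the same minimum timestamp, all of them survive, so this only helps when the timestamps actually differ.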


(David Pilato) #14

And please format your code using </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

(narasiman) #15

OK. You said to run a first request to get the min value of the timestamp field with a min aggregation, and then to exclude this value in the search body.

How do I do that? Please tell me each step.

I am new to Elasticsearch. Please do the needful.


(David Pilato) #16

If you still don't know, please provide a sample data set with some docs that we can use as described in

It will help to better understand what you are doing.
Please, try to keep the example as simple as possible.

Please read carefully the instructions. It must be easily runnable.


(narasiman) #17

The code below for removing the duplicates is not working.

input {
jdbc {
jdbc_driver_library => "D:\mysql-connector-java-5.1.44\mysql-connector-java-5.1.44\mysql-connector-java-5.1.44-bin.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://localhost:3306/sample"
jdbc_user => "root"
jdbc_password => "root"
jdbc_fetch_size => 10000
schedule => "* * * * *"
statement => "SELECT * from sample"
#codec => "json"
}
}

output {
elasticsearch {
hosts => ["localhost:9200"]
manage_template => false
index => "clinical" }
document_id => "%{[@metadata][_RRH_MR_NUM]}"
stdout { codec => rubydebug }
}

To remove the duplicates in Elasticsearch, I wrote the above code in my conf file, but it is not working.

What is the mistake in my code?

Please do the needful.


(David Pilato) #18

Please format your code if you expect someone to read it.


(narasiman) #19

I formatted my code and am sending it again.

input {
jdbc {
jdbc_driver_library => "D:\mysql-connector-java-5.1.44\mysql-connector-java-5.1.44\mysql-connector-java-5.1.44-bin.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://localhost:3306/sample"
jdbc_user => "root"
jdbc_password => "root"
jdbc_fetch_size => 10000
schedule => "* * * * *"
statement => "SELECT * from sample"
#codec => "json"
}
}

output {
elasticsearch {
hosts => ["localhost:9200"]
manage_template => false
index => "clinical" }
document_id => "%{[@metadata][_RRH_MR_NUM]}"
stdout { codec => rubydebug }
}

To remove the duplicates in Elasticsearch, I wrote the above code in my conf file, but it is not working.

Please help me. What is wrong with my code? I have tried several times and it does not work.


(David Pilato) #20

Please read what I wrote here.

Edit your post. Make sure in the preview window that it is correct. Thanks.


(narasiman) #21

I formatted my code using the </> icon and am sending it again.

input {
jdbc {
jdbc_driver_library => "D:\mysql-connector-java-5.1.44\mysql-connector-java-5.1.44\mysql-connector-java-5.1.44-bin.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://localhost:3306/sample"
jdbc_user => "root"
jdbc_password => "root"
jdbc_fetch_size => 10000
    schedule => "* * * * *"
    statement => "SELECT * from sample"
#codec => "json"
  }
}

filter {
  fingerprint {
    source => "RRH_MR_NUM"
    target => "[@metadata][fingerprint]"
    method => "MURMUR3"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"] 
	index => "clinical" 
    # document_id => "%{[@metadata][fingerprint]}"	
	}
	stdout { codec => rubydebug }
    }

I tried the above configuration to eliminate the duplicate records in Elasticsearch.

Is there any mistake in my code?

Please let me know.


(Arthur Silva Sens) #22

One strategy that you could use is to query where "mrdno" is "11657", then save ONE of the ids

Then delete where "mrdno" is "11657" AND _id NOT = the id you saved

It should delete all the duplicates
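A minimal Python sketch of that strategy, again building the request body only. `delete_duplicates_except` is a hypothetical helper name; it assumes `hits` is the `hits.hits` list from a search on the field, and that the `term` query matches `mrdno` exactly.

```python
from typing import Any, Dict, List, Tuple


def delete_duplicates_except(field: str, value: Any,
                             hits: List[Dict[str, Any]]) -> Tuple[str, Dict[str, Any]]:
    """Keep the first hit's _id and build a _delete_by_query body for all other matches."""
    if not hits:
        raise ValueError("no matching documents")

    keep_id = hits[0]["_id"]  # the ONE id that is saved
    body = {
        "query": {
            "bool": {
                # field == value ...
                "filter": [{"term": {field: value}}],
                # ... AND _id is NOT the saved id
                "must_not": [{"ids": {"values": [keep_id]}}],
            }
        }
    }
    return keep_id, body


# Made-up hits for illustration.
example_hits = [
    {"_id": "AWBZ-1", "_source": {"mrdno": "11657"}},
    {"_id": "AWBZ-2", "_source": {"mrdno": "11657"}},
    {"_id": "AWBZ-3", "_source": {"mrdno": "11657"}},
]
kept, delete_body = delete_duplicates_except("mrdno", "11657", example_hits)
```

Which hit is kept is arbitrary here (the first one returned); sorting the search on a timestamp would make the choice predictable.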


(David Pilato) #23

About the format: please try to indent your code like you did for the fingerprint part.

Why is this commented out?

# document_id => "%{[@metadata][fingerprint]}"
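Why that line matters can be illustrated with a small Python simulation. This is not Logstash or Elasticsearch code: the `dict` stands in for an index, `fingerprint` and `index_rows` are made-up names, and SHA-1 from `hashlib` is swapped in for the MURMUR3 method used in the config. The point is only that an id derived from the key makes a repeated row overwrite the earlier document, while an auto-generated id adds a new one each time.

```python
import hashlib
from typing import Any, Dict, List


def fingerprint(row: Dict[str, Any], source: str = "RRH_MR_NUM") -> str:
    """Deterministic id derived from one field (SHA-1 here, not MURMUR3)."""
    return hashlib.sha1(str(row[source]).encode("utf-8")).hexdigest()


def index_rows(rows: List[Dict[str, Any]], use_fingerprint_id: bool) -> Dict[str, Dict[str, Any]]:
    """Simulate indexing: a dict keyed by document id stands in for the index."""
    index: Dict[str, Dict[str, Any]] = {}
    for n, row in enumerate(rows):
        # With document_id commented out, every event gets a fresh id.
        doc_id = fingerprint(row) if use_fingerprint_id else f"auto-{n}"
        index[doc_id] = row  # same id -> overwrite, new id -> new document
    return index


# Made-up rows: the first two share the same key, as when the schedule re-reads the table.
rows = [{"RRH_MR_NUM": 3416558}, {"RRH_MR_NUM": 3416558}, {"RRH_MR_NUM": 11657}]
with_ids = index_rows(rows, use_fingerprint_id=True)
without_ids = index_rows(rows, use_fingerprint_id=False)
```

One thing worth checking, as an observation rather than something the thread confirms: the Discover listing earlier shows the field as `rrh_mr_num` in lowercase, so the `source` name in the fingerprint filter may need to match that spelling.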

(narasiman) #24

I have not commented out the line below.

Why are you saying that I commented it out?

document_id => "%{[@metadata][fingerprint]}"

Are you saying that # means a comment?


(David Pilato) #25

Seriously... Please...


(Adebiyi Abdurrahman) #26

:rofl:


(system) closed #27

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.


(Alex Marquardt) #28

I have written a blog post about removing duplicate documents from Elasticsearch, which can be found at https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/