How to remove duplicate values in Elasticsearch


(narasiman) #1

I am using Elasticsearch 5.6.4.

    GET /clinicaldata/_search
    {
      "query": {
        "match": {
          "mrdno": "112627"
        }
      }
    }

In that index, mrdno 112627 has 75 duplicate records.

When I run the query above, the output is as follows:

    {
      "took": 2,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": 75,
        "max_score": 1

There are 75 records in total, and all of them are duplicates.

I want to delete the duplicate records using Elasticsearch.

How do I write the query for that?


(David Pilato) #2

You can probably use the delete-by-query feature and then index one of the docs again.


(narasiman) #3

OK. Please tell me how to do it, because I am new to Elasticsearch.


(David Pilato) #4

See https://www.elastic.co/guide/en/elasticsearch/reference/6.1/docs-delete-by-query.html


(narasiman) #5

I have 75 duplicate records for mrdno 11657.

    POST twitter/_delete_by_query?scroll_size=5000
    {
      "query": {
        "term": {
          "user": "kimchy"
        }
      }
    }

Starting from the example above, I replaced the field name with my own.

My index name is test.

    POST test/_delete_by_query?scroll_size=5000
    {
      "query": {
        "term": {
          "mrdno": "11657"
        }
      }
    }

The above query will delete the duplicate values for the field mrdno.

Is the above correct? Please let me know.


(David Pilato) #6

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

Your query looks good, except for the index name, which should probably be clinicaldata.


(narasiman) #7

clinicaldata is my index name.

    POST clinicaldata/_delete_by_query?scroll_size=5000
    {
      "query": {
        "term": {
          "mrdno": "11657"
        }
      }
    }

Is the above query correct for deleting the duplicate records of that mrdno?


(David Pilato) #8

It will delete ALL records that match "mrdno": "11657", not only the duplicates.
So you will need to index one document again afterwards...


(narasiman) #9

I do not want to delete all the records that match mrdno 11657.

I want to delete only the duplicates.

    POST clinicaldata/_delete_by_query?scroll_size=5000
    {
      "query": {
        "term": {
          "mrdno": "11657"
        }
      }
    }

You said that the above query will delete all the records that match mrdno 11657.

I want to delete only the duplicate records of mrdno 11657.

What changes do I have to make to the above query for that?

Please help.


(David Pilato) #10

I can see that you asked the same question at "How to avoid duplicate values in ealstic search 5.6.4".

So the answer is: there is no way to remove duplicates in one single call.

As I said, you need to:

  • Remove all docs
  • Add back one of the docs you removed

Is it a one-time operation or something you want to do in the long run? If the latter, what is the use case for allowing duplicates?

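
The two steps listed above (remove all docs, add back one) can be sketched as follows. This is a minimal Python sketch under stated assumptions, not a definitive implementation: it only builds the two requests and does not talk to a cluster. The helper name `plan_dedup` and the sample hits are hypothetical, the `PUT /index/type/id` path assumes the 5.x index API, and it assumes the duplicates are identical copies, so any one of them can be kept. The copy to keep has to be fetched with a search before the delete runs.

```python
import json


def plan_dedup(index, field, value, hits):
    """Build the two requests for "remove all docs, add back one".

    `hits` is the `hits.hits` list from a _search for the same value,
    fetched BEFORE anything is deleted.
    """
    if not hits:
        raise ValueError("no matching documents, nothing to deduplicate")
    keeper = hits[0]  # assumption: the copies are identical, so any one will do
    delete_request = {
        "method": "POST",
        "path": f"/{index}/_delete_by_query",
        "body": {"query": {"term": {field: value}}},
    }
    reindex_request = {
        "method": "PUT",
        "path": f"/{index}/{keeper['_type']}/{keeper['_id']}",
        "body": keeper["_source"],
    }
    return delete_request, reindex_request


# Hypothetical search hits: two copies of the same record.
sample_hits = [
    {"_type": "logs", "_id": "id-1", "_source": {"mrdno": "11657", "age": 28}},
    {"_type": "logs", "_id": "id-2", "_source": {"mrdno": "11657", "age": 28}},
]

for request in plan_dedup("clinicaldata", "mrdno", "11657", sample_hits):
    print(request["method"], request["path"])
    print(json.dumps(request["body"], indent=2))
```

The delete request removes every matching document, including the one that is kept, which is why the kept copy is written back afterwards from the saved `_source`.
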
(David Pilato) #11

Unless you have another field which can help like a timestamp...


(narasiman) #12

In the Kibana Discover tab the data is displayed. The details from the Discover tab are shown below.

clinicaldata (index name)

Selected Fields: _source

Available Fields: @timestamp, @version, _id, _index, _score, _type

The field names are as follows:

age, cpa_addr_1, cpa_addr_2, cpa_addr_3, cpa_addr_area, cpa_addr_city, cpa_country_cd, cpa_pin_code, cpa_state_cd, mcs_case_summary, mcs_crt_dt, mcs_crt_uid, rrh_first_name, rrh_location_cd, rrh_mr_num, rrh_pat_dob, rrh_pat_sex, rrh_regn_dt

_source

    cpa_addr_2: POST- EKBARNA
    cpa_addr_area: -
    mcs_crt_uid: MACTCS
    cpa_addr_1: VILL- EKB ARNA
    rrh_mr_num: 3416558
    cpa_pin_code: 732 204
    rrh_pat_sex: Fema
    mcs_crt_dt: 2015-01-03T07:23:00.000Z
    @timestamp: 2017-12-15T08:54:28.231Z
    cpa_country_cd: INDIA
    cpa_state_cd: WB
    @version: 1
    rrh_regn_dt: 2014-11-18T18:30:00.000Z
    rrh_first_name: ANITA KARMAKAR
    rrh_pat_dob: 1987-08-24T18:30:00.000Z
    mcs_case_summary: External File Uploaded - EXTNFILEINFO SHEET
    age: 28
    rrh_location_cd: MAIN
    cpa_addr_3: PS- RATUA
    cpa_addr_city: india
    _id: AWBZYcj9fbvNIqwo2O6C
    _type: logs
    _index: clinicaldata
    _score: 1

But I also want to display the data in a Kibana visualization, using a tag cloud.

These are the steps I follow to display the data in a visualization:

Create a new visualization.

Select the tag cloud (as the visualization type).

Select the index name.

Then click the "Add a filter +" button. Another popup opens; from the filter list, select the field name, select "is" as the operator, type india in the value text box, and click the Save button.

Then a message says that no results were found.

Why is the data not displayed in the visualization tab?

Is there any mistake in the steps above? Please help.


(David Pilato) #13

So you can run a first request to get the min value of the timestamp field with a min aggregation. And then exclude this value in your search body.
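
The two requests described above can be sketched like this. It is a minimal Python sketch that only builds the request bodies, under a few assumptions that are not from the thread: the function names are hypothetical, the timestamp field is assumed to be `@timestamp`, the sample response value is made up, and "exclude this value" is expressed as a range filter for timestamps strictly greater than the minimum. The first body goes to `_search`, the second to `_delete_by_query`.

```python
import json


def min_timestamp_search(field, value, ts_field="@timestamp"):
    """Body for step 1: a _search that only returns the oldest timestamp."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "query": {"term": {field: value}},
        "aggs": {"oldest": {"min": {"field": ts_field}}},
    }


def oldest_millis(search_response):
    """Read the min aggregation result (epoch milliseconds for a date field)."""
    return int(search_response["aggregations"]["oldest"]["value"])


def delete_newer_body(field, value, min_millis, ts_field="@timestamp"):
    """Body for step 2: match every copy except the one with the oldest timestamp."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {field: value}},
                    {"range": {ts_field: {"gt": min_millis, "format": "epoch_millis"}}},
                ]
            }
        }
    }


# Hypothetical response to step 1, trimmed to the part that is used.
sample_response = {"aggregations": {"oldest": {"value": 1513328068231.0}}}

print(json.dumps(min_timestamp_search("mrdno", "11657"), indent=2))
print(json.dumps(delete_newer_body("mrdno", "11657", oldest_millis(sample_response)), indent=2))
```

One caveat of this approach: if several copies share exactly the same oldest timestamp, all of them survive the delete.
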


(David Pilato) #14

And please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

(narasiman) #15

OK. You said to run a first request to get the min value of the timestamp field with a min aggregation, and then to exclude this value in the search body.

How do I do that? Please explain each step.

I am new to Elasticsearch, so please help.


(David Pilato) #16

If you still don't know, please provide a sample data set with some docs that we can use as described in

It will help to better understand what you are doing.
Please, try to keep the example as simple as possible.

Please read the instructions carefully. It must be easily runnable.


(narasiman) #17

The code below, which is meant to remove the duplicates, is not working:

    input {
      jdbc {
        jdbc_driver_library => "D:\mysql-connector-java-5.1.44\mysql-connector-java-5.1.44\mysql-connector-java-5.1.44-bin.jar"
        jdbc_driver_class => "com.mysql.jdbc.Driver"
        jdbc_connection_string => "jdbc:mysql://localhost:3306/sample"
        jdbc_user => "root"
        jdbc_password => "root"
        jdbc_fetch_size => 10000
        schedule => "* * * * *"
        statement => "SELECT * from sample"
        #codec => "json"
      }
    }

    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        manage_template => false
        index => "clinical"
      }

      document_id => "%{[@metadata][_RRH_MR_NUM]}"

      stdout { codec => rubydebug }
    }

I wrote the above code in my conf file to remove the duplicates in Elasticsearch, but it is not working.

What is the mistake in the above code? Please help.


(David Pilato) #18

Please format your code if you expect someone to read it.


(narasiman) #19

I have formatted my code and am sending it again:

    input {
      jdbc {
        jdbc_driver_library => "D:\mysql-connector-java-5.1.44\mysql-connector-java-5.1.44\mysql-connector-java-5.1.44-bin.jar"
        jdbc_driver_class => "com.mysql.jdbc.Driver"
        jdbc_connection_string => "jdbc:mysql://localhost:3306/sample"
        jdbc_user => "root"
        jdbc_password => "root"
        jdbc_fetch_size => 10000
        schedule => "* * * * *"
        statement => "SELECT * from sample"
        #codec => "json"
      }
    }

    output {
      elasticsearch {
        hosts => ["localhost:9200"]
        manage_template => false
        index => "clinical"
      }

      document_id => "%{[@metadata][_RRH_MR_NUM]}"

      stdout { codec => rubydebug }
    }

I wrote the above code in my conf file to remove the duplicates in Elasticsearch, but it is not working.

Please help me. What is wrong in the above code? I have tried several times and it still does not work.


(David Pilato) #20

Please read what I wrote here.

Edit your post. Make sure in the preview window that it is correct. Thanks.
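
For readers of the config above, two things stand out. The `document_id` setting, which is an option of the `elasticsearch` output, appears after the `}` that closes that block. It also refers to `[@metadata][_RRH_MR_NUM]`, which nothing in the posted pipeline sets. The schedule also re-runs `SELECT * from sample` every minute, so the same rows are sent again on each run. The sketch below is not Logstash code. It is a small Python simulation, with hypothetical names and sample rows, of why a stable document ID matters: indexing with an `_id` that already exists replaces the document, while auto-generated IDs add a new copy each time. Using `rrh_mr_num` as the ID field is an assumption based on the Discover output earlier in the thread, and only fits if that value is unique per record.

```python
def index_rows(store, rows, id_field=None):
    """Simulate one Logstash run writing `rows` into `store` (a dict keyed by _id)."""
    for row in rows:
        if id_field is None:
            doc_id = f"auto-{len(store) + 1}"  # stands in for an auto-generated _id
        else:
            doc_id = str(row[id_field])  # the same row always maps to the same _id
        store[doc_id] = row
    return store


# Hypothetical table contents; the schedule re-reads them on every run.
rows = [{"rrh_mr_num": 3416558, "age": 28}, {"rrh_mr_num": 3416559, "age": 41}]

auto_ids, fixed_ids = {}, {}
for _ in range(3):  # three scheduled runs of "SELECT * from sample"
    index_rows(auto_ids, rows)
    index_rows(fixed_ids, rows, id_field="rrh_mr_num")

print(len(auto_ids))   # 6: every run added new copies
print(len(fixed_ids))  # 2: later runs overwrote the existing documents
```
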