I am searching for and counting duplicated phrases within a single human-readable document or a group of them. I break each document into phrases/sentences and populate an Elasticsearch index with these phrases, one per ES document.
I have 707 documents in my index. I know that there should be at least 25 duplicate documents, but my search returns only 23. I don't understand why I am missing some matches.
Notes:
I am creating my index with 1 shard and 0 replicas.
I am matching on the entire string content of a field, so I md5 hash the field value and index the hash.
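For illustration, the hashing step could look roughly like this. It is only a sketch: it assumes Python and UTF-8 encoding, and the helper name `phrase_hash` is hypothetical. No whitespace or case normalization is applied, so two phrases only hash the same when they are identical character for character.

```python
import hashlib


def phrase_hash(text: str) -> str:
    """Return the md5 hex digest of a phrase (hypothetical helper).

    Assumption: the phrase is encoded as UTF-8 before hashing and is
    not normalized in any way (no trimming, no lowercasing).
    """
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Identical phrases produce identical hashes; any difference,
# even a trailing space, produces a different hash.
same = phrase_hash("The quick brown fox.") == phrase_hash("The quick brown fox.")
different = phrase_hash("The quick brown fox.") != phrase_hash("The quick brown fox. ")
```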
Here is my query:
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "content",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}
My process:
- Create index
- Build bulk insert data objects
- Bulk insert documents into index
- Reindex documents
- Run duplicates query (above)
- Parse results: sum the doc_count of every bucket in the response
- Delete index
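As a rough sketch of the "build bulk insert data" and "parse results" steps, assuming Python: the function names and the index name `phrases` are hypothetical, and depending on the ES version the bulk action line may also need a `_type`. The summing function only sees the buckets that are actually present in the response it is given.

```python
import json


def build_bulk_body(hashed_phrases, index_name="phrases"):
    """Build a newline-delimited body for the _bulk endpoint.

    Each phrase becomes two lines: an action line and a source line.
    `index_name` is an assumed, illustrative index name.
    """
    lines = []
    for hashed in hashed_phrases:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps({"content": hashed}))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"


def sum_duplicate_docs(response):
    """Sum doc_count over the buckets returned for "duplicateCount"."""
    buckets = response["aggregations"]["duplicateCount"]["buckets"]
    return sum(bucket["doc_count"] for bucket in buckets)


# Usage with a hand-written response shaped like an aggregation result.
example_response = {
    "aggregations": {
        "duplicateCount": {
            "buckets": [
                {"key": "hash-a", "doc_count": 2},
                {"key": "hash-b", "doc_count": 3},
            ]
        }
    }
}
total = sum_duplicate_docs(example_response)
```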
Why is ES not returning all duplicates?
Thanks
