I am searching for and counting duplicated phrases within a single human-readable document or a group of them. I break each document into phrases/sentences and populate an Elasticsearch index with these phrases, one phrase per ES document.
I have 707 documents in my index, and I know there should be at least 25 duplicate documents. My search returns only 23 duplicate docs, and I don't understand why some matches are missing.
Notes:
- I create my index with 1 shard and 0 replicas.
- I match on the entire string content of a field, so I MD5-hash the field value before indexing it.
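For reference, the hashing step looks roughly like the sketch below. It is written in Python for illustration only; the helper names `content_hash` and `build_doc` are hypothetical, and the UTF-8 encoding and the absence of any whitespace or case normalization are assumptions. The field name `content` is the one used in my query.

```python
import hashlib


def content_hash(phrase: str) -> str:
    """Return the hex MD5 digest of a phrase (assumes UTF-8 encoding)."""
    return hashlib.md5(phrase.encode("utf-8")).hexdigest()


def build_doc(phrase: str) -> dict:
    """Build the body of one ES document; `content` holds the hash."""
    # No normalization is applied here, so phrases that differ only in
    # whitespace or letter case produce different hashes.
    return {"content": content_hash(phrase)}
```

With this approach two phrases count as duplicates only if they are identical byte for byte, so a trailing space or a different capitalization is enough to keep them apart.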
Here is my query:
{
  "size": 0,
  "aggs": {
    "duplicateCount": {
      "terms": {
        "field": "content",
        "min_doc_count": 2
      },
      "aggs": {
        "duplicateDocuments": {
          "top_hits": {}
        }
      }
    }
  }
}
My process:
- Create index
- Build bulk insert data objects
- Bulk insert documents into index
- Reindex documents
- Run duplicates query (above)
- Parse results: sum the doc_count of every bucket
- Delete index
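The result-parsing step can be sketched as follows. This is an illustrative Python version, not my exact code: the function name `count_duplicates` and the sample response are made up, and the response is assumed to be the search result already decoded into a dict. The aggregation name `duplicateCount` comes from the query above.

```python
def count_duplicates(response: dict, agg_name: str = "duplicateCount") -> int:
    """Sum doc_count over the buckets returned for the terms aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    # Only buckets that are present in the response are counted.
    return sum(bucket["doc_count"] for bucket in buckets)


# Hypothetical response fragment, trimmed to the fields the function reads.
sample_response = {
    "aggregations": {
        "duplicateCount": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "hash-a", "doc_count": 3},
                {"key": "hash-b", "doc_count": 2},
            ],
        }
    }
}

total = count_duplicates(sample_response)
```

The total is only as complete as the bucket list in the response, so any duplicate group that does not come back as a bucket is not included in the sum.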
Why is ES not returning all of the duplicates?
Thanks