Duplicates Query not returning all results

I am searching for and counting duplicated phrases within a single human-readable document or a group of them. I break each document into phrases/sentences and populate an Elasticsearch index with these phrases, one phrase per ES document.

I have 707 documents in my index. I know that I should have at least 25 duplicate documents, but my search returns only 23. I don't understand why I am missing some matches.

Notes:
I am creating my index with 1 shard and 0 replicas.
I am matching on the full string content of a field, so I am MD5-hashing the field value.

Here is my query:

{
    "size": 0,
    "aggs": {
        "duplicateCount": {
            "terms": {
                "field": "content",
                "min_doc_count": 2
            },
            "aggs": {
                "duplicateDocuments": {
                    "top_hits": {}
                }
            }
        }
    }
}

My process:

  1. Create index
  2. Build bulk insert data objects
  3. Bulk insert documents into index
  4. Reindex documents
  5. Run duplicates query (above)
  6. Parse results - SUM buckets.doc_counts
  7. Delete index
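
For reference, step 6 could look roughly like the sketch below. This is only an illustration: the function name `count_duplicates` and the hand-written `sample_response` are made up for the example, and the response is trimmed to the parts a terms aggregation returns under `aggregations.<name>.buckets`.

```python
from typing import Any


def count_duplicates(response: dict[str, Any], agg_name: str = "duplicateCount") -> int:
    """Sum doc_count over every bucket the terms aggregation returned."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return sum(bucket["doc_count"] for bucket in buckets)


# Hand-written, trimmed sample in the shape of a terms aggregation response.
# The keys are placeholders, not real MD5 hashes.
sample_response = {
    "aggregations": {
        "duplicateCount": {
            "buckets": [
                {"key": "hash-a", "doc_count": 3},
                {"key": "hash-b", "doc_count": 2},
            ]
        }
    }
}

print(count_duplicates(sample_response))  # 5
```

Note that the sum only covers buckets that are actually present in the response, so any bucket the aggregation leaves out is silently missing from the total.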

Why is ES not returning all duplicates?

Thanks

By default, the terms aggregation returns only the top 10 buckets, so in your case at most 10 MD5 hashes. Those 10 hashes may account for only 23 of your duplicate docs.
If you set the size parameter on the terms agg, you can increase the number of hashes under consideration.
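
To make the effect concrete, here is a small Python sketch that imitates the behaviour on a single shard. It is a simplified model and not Elasticsearch code: `duplicate_total` is a hypothetical helper, and the hash values are made-up placeholders.

```python
from collections import Counter


def duplicate_total(hashes, size=10, min_doc_count=2):
    """Simplified single-shard model of a terms aggregation: keep the `size`
    most frequent values, then sum the counts that meet min_doc_count."""
    top = Counter(hashes).most_common(size)
    return sum(count for _, count in top if count >= min_doc_count)


# Made-up data: 12 hashes that each occur twice, plus 5 hashes that occur once.
hashes = [f"dup-{i}" for i in range(12) for _ in range(2)]
hashes += [f"uniq-{i}" for i in range(5)]

print(duplicate_total(hashes))                    # 20: only 10 of the 12 duplicated hashes are kept
print(duplicate_total(hashes, size=len(hashes)))  # 24: every duplicated hash is counted
```

With the default of 10, two of the duplicated hashes never make it into the result, which is the same kind of shortfall as 23 versus 25.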

You're the man, Mark! Thanks so much - that fixed it.
My new query:

{
    "size": 0,
    "aggs": {
        "duplicateCount": {
            "terms": {
                "size": 708,
                "field": "content",
                "min_doc_count": 2
            },
            "aggs": {
                "duplicateDocuments": {
                    "top_hits": {}
                }
            }
        }
    }
}

Since I always know how many docs I have, I can set size to match and keep my query reasonably efficient. When I hit 10000 I'll have to figure out how to use scroll :)

Thanks again,
John