Duplicates Query not returning all results

johns · August 22, 2016, 5:17pm

I am searching for and counting duplicated phrases within a single human-readable document or a group of them. I break each document into phrases/sentences and populate an Elasticsearch index with these phrases, one per ES document.

I have 707 documents in my index. I KNOW that I should have at least 25 duplicate documents, but my search is returning only 23. I don't understand why I am missing some matches.

Notes:
I am creating my index with 1 shard and 0 replicas.
I am matching on the entire string content of a field, so I am md5 hashing the field value and indexing the hash.
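
A minimal sketch of that hashing step, assuming Python and UTF-8 encoded phrases. `content_hash` is a hypothetical helper name for illustration, not something taken from the original setup:

```python
import hashlib


def content_hash(phrase: str) -> str:
    """Return the md5 hex digest of a phrase (hypothetical helper).

    Identical phrases always produce the same digest, so a terms
    aggregation on the hashed field groups exact duplicates together.
    """
    return hashlib.md5(phrase.encode("utf-8")).hexdigest()


# Two identical phrases hash to the same value; a different phrase does not.
a = content_hash("The quick brown fox.")
b = content_hash("The quick brown fox.")
c = content_hash("The quick brown fox!")
```

Note that this only catches exact matches: any difference in whitespace, case, or punctuation yields a different hash.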

Here is my query:

{
    "size": 0,
    "aggs": {
        "duplicateCount": {
            "terms": {
                "field": "content",
                "min_doc_count": 2
            },
            "aggs": {
                "duplicateDocuments": {
                    "top_hits": {

                    }
                }
            }
        }
    }
}

My process:

Create index
Build bulk insert data objects
Bulk insert documents into index
Reindex documents
Run duplicates query (above)
Parse results - sum the doc_count of every bucket
Delete index
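
The "parse results" step could be sketched roughly as below, assuming Python and a response already decoded into a dict. The function name `count_duplicate_docs` and the sample response are made up for illustration; only the aggregation name `duplicateCount` comes from the query above:

```python
def count_duplicate_docs(response: dict, agg_name: str = "duplicateCount") -> int:
    """Sum doc_count over all buckets returned by the terms aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return sum(bucket["doc_count"] for bucket in buckets)


# Made-up response fragment with two duplicated hashes (not real output).
sample_response = {
    "aggregations": {
        "duplicateCount": {
            "buckets": [
                {"key": "hash-a", "doc_count": 3},
                {"key": "hash-b", "doc_count": 2},
            ]
        }
    }
}

total = count_duplicate_docs(sample_response)
```

The sum can only cover the buckets that actually come back in the response, so it depends on how many buckets the aggregation is allowed to return.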

Why is ES not returning all duplicates????

Thanks

Mark_Harwood · August 23, 2016, 10:45am

By default the terms aggregation returns only the top 10 buckets, so in your case at most 10 md5 values. It may be that only 23 docs share those 10 hashes, and the remaining duplicates sit in buckets that were cut off.
If you use the size setting on the terms agg, you can increase the number of md5s under consideration.
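
To make the arithmetic concrete, here is a simplified single-shard model of that behaviour in plain Python (no Elasticsearch involved). The function `simulate_terms_agg` and the data distribution are assumptions for illustration: the real distribution of duplicates in the index is not known, this one is simply chosen so that 707 docs contain 25 duplicates spread over 11 hashes:

```python
from collections import Counter


def simulate_terms_agg(values, size=10, min_doc_count=2):
    """Rough model of a terms agg: keep the `size` largest buckets by count."""
    counts = Counter(values)
    buckets = [(key, n) for key, n in counts.items() if n >= min_doc_count]
    # Largest buckets first; ties broken by key so the result is deterministic.
    buckets.sort(key=lambda b: (-b[1], b[0]))
    return buckets[:size]


# Hypothetical data: 3 hashes shared by 3 docs each, 8 hashes shared by
# 2 docs each, and 682 unique hashes (707 docs in total).
values = (
    [f"dup3-{i}" for i in range(3) for _ in range(3)]
    + [f"dup2-{i}" for i in range(8) for _ in range(2)]
    + [f"uniq-{i}" for i in range(682)]
)

default_total = sum(n for _, n in simulate_terms_agg(values))
full_total = sum(n for _, n in simulate_terms_agg(values, size=708))
print(len(values), default_total, full_total)  # prints: 707 23 25
```

With the default of 10 buckets, the eleventh duplicated hash is dropped and its 2 docs go uncounted; raising size brings them back.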

johns · August 23, 2016, 2:37pm

You're the Man Mark! Thanks so much - that fixed it.
My new query:

{
    "size": 0,
    "aggs": {
        "duplicateCount": {
            "terms": {
                "size": 708,
                "field": "content",
                "min_doc_count": 2
            },
            "aggs": {
                "duplicateDocuments": {
                    "top_hits": {

                    }
                }
            }
        }
    }
}

Since I always know how many docs I have, I can set size to match and keep the query reasonably efficient. When I hit 10000 I'll have to figure out how to use scroll.
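
A sketch of building that query body from a known document count, assuming Python. `build_duplicates_query` is a hypothetical helper name, and how the body is sent to Elasticsearch is left out on purpose:

```python
import json


def build_duplicates_query(doc_count: int) -> dict:
    """Build the duplicates query with the terms size set to doc_count."""
    return {
        "size": 0,  # no search hits needed, only the aggregation
        "aggs": {
            "duplicateCount": {
                "terms": {
                    "size": doc_count,
                    "field": "content",
                    "min_doc_count": 2,
                },
                "aggs": {"duplicateDocuments": {"top_hits": {}}},
            }
        },
    }


# 708 is the size used in the query above for a 707-doc index.
query = build_duplicates_query(708)
body = json.dumps(query)
```

Because each duplicated hash covers at least 2 docs, there can never be more duplicate buckets than half the document count, so a size equal to the document count is always enough.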

Thanks again,
John
