Aggregation based on terms field is not working. I loaded the data with different (_id), with the same data. I Am trying to find a duplicate of records based on one of the index fields. using the below simple terms query, it's not returning the results which were expected.
This is what I call the "Elizabeth Taylor" problem.
She was an actress who famously married many times and would be hard to find in a distributed database if you were looking for people with the most marriages.
Marriages, like your records, preferably only occur once and in most cases this is true. If your system holding marriage certificates spreads them across many shards randomly then each shard could end up thinking Elizabeth Taylor was like everyone else in their subset of data - only married once. This makes it hard for the system as a whole to determine which of the millions of people held on file were married more than once.
If you use custom routing of documents you can ensure each person's documents end up on the same shard and get the accurate answers.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.