We recently ran into a problem using Elasticsearch. We are trying to detect records that share a duplicate value (a hash) and to patch them with a flag, so that we can work through all the records as we iterate (27 million records in total, of which 6 million have the hash populated).
This workflow worked fine while the index sat on a single node. Eventually the index grew and we moved it to multiple nodes to retain performance.
After the move the aggregation results were no longer accurate: some keys which we know for sure have a doc count greater than 1 are not returned. I also ran the query with a min_doc_count of 1, and the aggregation still does not return some of the values (hashes) that should be returned.
Removing the must_not from our query and adding a different condition that should return the missing hashes does not work either. If we add an explicit equality condition, key = value, on a hash we know is missing, then the aggregation returns the correct result and count; otherwise the key is missing altogether.
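To make this concrete, here is a stripped-down sketch of the kind of request we run (the index name and the field names `hash` and `duplicate_checked` are simplified placeholders, not our real mapping):

```
POST my-index/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must_not": [
        { "term": { "duplicate_checked": true } }
      ]
    }
  },
  "aggs": {
    "duplicate_hashes": {
      "terms": {
        "field": "hash",
        "min_doc_count": 2,
        "size": 1000
      }
    }
  }
}
```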
I am hoping there are a few of you who can explain what is going on, whether there is something we might be doing wrong, or whether there is a way to rectify the situation.
The problem is that randomly sharded data is only suited to finding the most popular things when the frequency of those top things is greater than the number of shards.
Assuming there are millions of hashes that match your query and they are spread somewhat randomly across shards, how would multiple remote shards independently decide on the same subset of 10k (shard_size) terms in a way that would guarantee finding the duplicates? Each of the millions of hashes on each shard occurs only once, so they are all equally promising candidates in this isolated view.
When the required global frequency is low (2 in your example) and there are millions of values to choose from, you have to get each shard to focus its analysis on the same subset of all the matching terms in a request, e.g. just the hashes beginning with "a".
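A crude sketch of what I mean, using an `include` regular expression so that every shard only considers the hashes beginning with "a" (index and field names are placeholders):

```
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "duplicate_hashes": {
      "terms": {
        "field": "hash",
        "include": "a.*",
        "min_doc_count": 2,
        "size": 10000
      }
    }
  }
}
```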
A more effective way to do this is the term partitioning feature in the terms agg. It means you have to make multiple requests, one for each partition, but the results can be made accurate.
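A minimal sketch of a partitioned request (one request per partition, with `partition` stepping from 0 up to num_partitions - 1; names are placeholders again):

```
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "duplicate_hashes": {
      "terms": {
        "field": "hash",
        "include": {
          "partition": 0,
          "num_partitions": 20
        },
        "min_doc_count": 2,
        "size": 10000
      }
    }
  }
}
```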
Another alternative is to reindex using routing to send docs with the same hash to the same shard.
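A sketch of that approach: a reindex script can copy the hash into the routing value so that all docs with the same hash land on the same shard (index and field names are placeholders):

```
POST _reindex
{
  "source": { "index": "my-index" },
  "dest": { "index": "my-index-routed" },
  "script": {
    "source": "ctx._routing = ctx._source.hash"
  }
}
```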
Thank you very much for your answer. The query constraint is what we patch as we iterate in a loop, so that records already returned are excluded from the next page. But as I mentioned, patching all records found in the terms aggregation with no min_doc_count constraint still does not get through some of the hashes (I do not even care about the accuracy of the count at this time). I also tried what you suggested and split the aggregation into parts; the missing hashes are still not returned in any of the sets. The only way I was able to get a hash to be part of the aggregation was to add an equality constraint on it.
Do you have any other suggestions without reindexing?
I will read up on the reindexing suggestion, but I think that will be my last option.
Thank you yet again for your help in this matter,
Gabe.
The last set on which I am running my tests should return fewer than 10,000 keys, which is less than the default maximum number of buckets allowed for an aggregation. A single aggregation returns the same result as the partitioned requests.
I have tried to iterate through them using 20 x 500 partitions and, in another run, 10 x 1000 partitions.
None of the sets includes the missing hashes. I am just wondering if there is some hidden mechanism which excludes them. If it were only a matter of accuracy I should still have been able to get a match on them, even with a doc count of 1, but they are missing altogether unless I add a strict equality query matching a specific hash.
I believe I had tried it before without success, but I tried it yet again just to make sure, and the results still do not include the missing hashes. Both shard_min_doc_count 0 and 1 return the same result, without the missing hashes.
I am using the query to restrict the number of records returned so that it is easier to focus on one hash instance which is missing from the result set.
I am collecting and cleaning the data, so it should be here soon.
Thanks for that. I can see from the results that doc_count_error_upper_bound > 0, which means that not all of the relevant data is being returned to the coordinating node for consideration.
This means that the number of partitions is too low for the size of results being considered. By increasing the number of partitions you will be reducing the number of terms being considered in any one request to a manageable subset and you should see the doc_count_error_upper_bound value become zero (meaning nothing was left behind on shards).
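For reference, a sketch of the shape of the aggregation section of each partitioned response, with placeholder values; the doc_count_error_upper_bound to watch sits at the top level of the terms aggregation, next to the buckets array:

```
"aggregations": {
  "duplicate_hashes": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      { "key": "example-hash-value", "doc_count": 2 }
    ]
  }
}
```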
Good news: I was finally able to retrieve the hash which was missing, as part of the partitioned result set. Thank you very much for the breakthrough.
I do have one more question: how would I go about finding the correct number of partitions for a certain page size so that the error will be 0? Is there some sort of formula or algorithm I could use, so I do not have to do guesswork to get the optimal values?
The docs include some guidance. I notice they say to tweak the settings until the partitions have sorted results that include things you don't want (e.g. terms that only occur once).
They should also say to pay attention to doc_count_error_upper_bound, to make sure the calculations are including all counts from all shards.
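As a rough rule of thumb rather than an exact formula: num_partitions divides the unique terms across the requests, so in your case, with roughly 6 million populated hashes and around 10,000 terms (shard_size) considered per request, something on the order of 6,000,000 / 10,000 = 600 partitions is a starting point, ideally with some headroom above that. A doc_count_error_upper_bound of 0 in every response is the confirmation that the partition count was high enough.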