How to get all unique values of a field for a single index?

Hey All,
I'm running into an issue that I cannot seem to find a solution for. I currently have a number of indexes, each holding billions of docs, and I need a way to extract all unique values for any given field in an index. Most of the values in the field are duplicates; I would expect about 50m-100m unique values across the 1bil+ docs.

When I do a terms aggregation, I'm limited to 10,000 buckets, so I cannot get more than 10k unique values, and I cannot figure out how to paginate the aggregation buckets. (I can scroll the hits, but that is of no value to me, as it just gives me every doc, 10k at a time, until I have scrolled through all 1bil+ docs.)

Also, I can't seem to get a composite aggregation to work. Again, the ['hits']['hits'] are not unique values, and when I look at the buckets, I am being returned values that do not exist in the field: they start with hyphens or contain characters that are illegal for the field type (like * and :).

Can anyone explain to me how I can grab all 50m unique values of a field in an index? Currently the only way I can actually get this to work is to scroll through every doc, extract the value, and dedupe. But with 2-3 second queries at 10k docs at a time, this will take over 3 days!

I put my queries below. I have also tried playing around with the doc and bucket size parameters, with no luck.


What I have tried:

Attempt 1: Scrolling terms agg

GET /dns-2020.01.13/_search?scroll=1m
{
  "aggs": {
    "unique_quieries": {
      "terms": {
        "field": "query.keyword",
        "size": 10000
      }
    }
  }
}

Followed up with:
GET /_search/scroll
{
  "scroll": "5m",
  "scroll_id": "DnF1ZXJ5VGhlbkZldGNoBAAAAAAAEVZiFkx1REFCM0FXU3dpNGpFOEZXNnZaRGcAAAAAABB4zRY1NHdiS2dzc1JaZXhoSjNTaFZvbmVRAAAAAAAP690WOEZWeFd5cU1Td3VIVWFvbEo0MXh0ZwAAAAAAPubpFkl5RDl6b1d2U1JxMmMzejQ3V05odlE="
}
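
On the pagination problem in Attempt 1: a scroll only pages through hits, and the aggregation results come back with the first response only, so scrolling does not walk the buckets. One option the terms aggregation does offer is partitioning, via `include` with `partition` and `num_partitions`, which splits the terms into hash-based groups that can be fetched one request at a time. Below is a minimal sketch in Python that only builds the request bodies. The function name, the `expected_unique` estimate, and the `headroom` factor are assumptions for illustration, not something taken from the queries above.

```python
import math
from typing import Any, Dict, Iterator


def partitioned_terms_bodies(
    field: str,
    expected_unique: int,
    per_request: int = 10000,
    headroom: float = 1.5,
) -> Iterator[Dict[str, Any]]:
    """Yield one terms-aggregation request body per partition.

    Partitions are hash-based and therefore uneven, so `headroom`
    (a hypothetical safety factor) adds extra partitions to keep each
    one comfortably below `per_request` buckets.
    """
    num_partitions = math.ceil(expected_unique * headroom / per_request)
    for partition in range(num_partitions):
        yield {
            "size": 0,  # no hits needed, only the buckets
            "aggs": {
                "unique_quieries": {
                    "terms": {
                        "field": field,
                        "size": per_request,
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                    }
                }
            },
        }
```

Each body would be sent to `GET /dns-2020.01.13/_search` in turn. If a partition comes back with exactly `per_request` buckets, it was probably truncated, and the run should be repeated with more partitions.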

Attempt 2: Composite aggregation, returns odd values and non-unique hits.

GET /dns-2020.01.13/_search
{
  "track_total_hits": false,
  "aggs": {
    "my_buckets": {
      "composite": {
        "sources": [
          { "query": { "terms": { "field": "query.keyword" } } }
        ]
      }
    }
  }
}
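
The composite aggregation is meant to be paged: each response carries an `after_key`, and passing it back as `after` returns the next page of buckets. Setting the top-level `size` to 0 suppresses the hits, which are ordinary search hits and are never deduplicated. Below is a minimal sketch of that loop in Python. Here `search` is assumed to be any function that sends a body to `GET /dns-2020.01.13/_search` and returns the parsed JSON. The source name `value`, the helper names, and the in-memory `make_fake_search` stand-in are all hypothetical, added only so the loop can be tried without a cluster.

```python
import bisect
from collections import Counter
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

SearchFn = Callable[[Dict[str, Any]], Dict[str, Any]]


def build_body(field: str, page_size: int,
               after: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build one composite-aggregation request body."""
    composite: Dict[str, Any] = {
        "size": page_size,  # buckets per page
        "sources": [{"value": {"terms": {"field": field}}}],
    }
    if after is not None:
        composite["after"] = after  # resume after the previous page
    return {
        "size": 0,  # suppress hits; only the buckets matter
        "track_total_hits": False,
        "aggs": {"my_buckets": {"composite": composite}},
    }


def iter_unique_values(search: SearchFn, field: str,
                       page_size: int = 1000) -> Iterator[Any]:
    """Yield every unique value of `field`, one page of buckets at a time."""
    after = None
    while True:
        response = search(build_body(field, page_size, after))
        agg = response["aggregations"]["my_buckets"]
        for bucket in agg["buckets"]:
            yield bucket["key"]["value"]
        after = agg.get("after_key")
        if after is None or not agg["buckets"]:
            return  # no more pages


def make_fake_search(values: Iterable[str]) -> SearchFn:
    """Hypothetical in-memory stand-in that mimics composite paging."""
    counts = Counter(values)
    ordered = sorted(counts)

    def search(body: Dict[str, Any]) -> Dict[str, Any]:
        comp = body["aggs"]["my_buckets"]["composite"]
        after = comp.get("after")
        start = 0 if after is None else bisect.bisect_right(ordered, after["value"])
        page = ordered[start:start + comp["size"]]
        agg: Dict[str, Any] = {
            "buckets": [{"key": {"value": v}, "doc_count": counts[v]} for v in page]
        }
        if page:
            agg["after_key"] = {"value": page[-1]}
        return {"aggregations": {"my_buckets": agg}}

    return search
```

Against a real cluster only `search` changes; the loop keeps requesting pages until a response has no buckets or no `after_key`.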
