DISTINCT values DSL query

Hello,

I've been reading a lot about this topic because I've seen it has been asked before, but I can't get it to work yet.

I am trying to get unique values from an index.

I have something like this:

id | app_name       | url
1  | app_1          | https://subdomain.app_1.com
2  | app_1          | https://app_1.com
3  | app_2          | https://app_1.com
4  | app_3          | https://subdomain.app_3.com
5  | app_1          | https://app_3.com

I would like to receive just the distinct app_name values:

app_1
app_2
app_3

The query I tried with aggs is:

GET app_index/_search
{
  "aggs": {
    "unique_apps": {
      "terms": {
        "field": "app_name",
      }
    }
  }
}

I also tried a kind of group by here:

GET app_index/_search
{
  "aggs": {
    "unique_apps": {
      "terms": {
        "field": "app_name.keyword"
      },
      "aggs": {
        "oneRecord": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}

But I still receive all the apps.

  • Is there a way to receive unique values?
  • Or is there maybe a way to check in Logstash whether some value already exists in the database and avoid sending it again? Perhaps using the fingerprint plugin to generate a unique _id based on the value of the field? If I receive the same information in that field it would generate the same ID, so it wouldn't be saved again.

I also checked whether there's any way to create unique fields in Elasticsearch, but I see it's not possible.

Thank you very much for your help and time :slight_smile:

Your second query (with app_name.keyword) is correct. I think you are seeing all the hits in the results, which are the records returned by the query. If you scroll to the bottom of the response you should see the aggregation. Most of the time when doing aggregations you don't need the hits, so you can remove them with "size": 0 as below and it will only return the aggs.

GET app_index/_search
{
  "size": 0,
  "aggs": {
    "unique_apps": {
      "terms": {
        "field": "app_name",
      }
    }
  }
}
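
For reference, the aggregations part of the response should then look something like this (the buckets below are just illustrative, based on your sample rows):

"aggregations": {
  "unique_apps": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      { "key": "app_1", "doc_count": 3 },
      { "key": "app_2", "doc_count": 1 },
      { "key": "app_3", "doc_count": 1 }
    ]
  }
}

The distinct app_name values are the "key" entries inside the buckets array.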

Thank you for your answer @aaron-nimocks

In this case I see that there are some values in the buckets list inside the aggregations, but unfortunately not the documents themselves.
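
Combining that "size": 0 with my second query (the .keyword field plus the top_hits sub-aggregation) should in theory return one sample document per distinct app_name inside each bucket; a sketch:

GET app_index/_search
{
  "size": 0,
  "aggs": {
    "unique_apps": {
      "terms": {
        "field": "app_name.keyword"
      },
      "aggs": {
        "oneRecord": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}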

In the end, though, I created another index with unique values based on that string, using the fingerprint plugin. I'm not sure it's the best option, but I need to extract a lot of information and it was taking a lot of time otherwise.

I'll share what I did; maybe it can be helpful to someone else who wants to do something similar.



  • Is there a way to receive unique values?

I used the fingerprint plugin in this case. It generates a unique ID based on the string; e.g., if I receive the same app_name it will always generate the same _id, so it won't be duplicated in Elasticsearch. I added this config to the logstash.conf pipeline, specifically in the filter section:

  fingerprint {
    source => ["app_name"]
    target => "unique_id_by_app_name"
    method => "SHA1"
  }

Then in the output:

    elasticsearch {
      hosts => "localhost:9200"
      index => "logstash_apps"
      document_id => "%{[unique_id_by_app_name]}"
    }

If I receive app_1 again, with the same or even different data, I'll get the same ID because of the hashing:

$ -> echo -n "app_1" | sha1sum | awk -F '  -' '{print $1}'
87dbad46d7c47f3714eb02ff70e18b94e4ee6523
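
With that in place, the document for app_1 can be fetched directly by its hash (index and ID as configured above):

GET logstash_apps/_doc/87dbad46d7c47f3714eb02ff70e18b94e4ee6523

And if app_1 comes in again, Logstash just overwrites that same document instead of creating a duplicate.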
