How to return distinct values from query based on a field

Hello @Aurel_Drejta

First of all, you are going to search on the content field.
Thanks to your query, you will obtain N (size) hits from the index or indices you're querying on.

Now, you are requesting to get the documents having distinct url values.

First of all, the drawback of your mapping is that url is analyzed as text.

If you can reindex the data using the following mapping (or use keyword instead of text for the field url):

PUT discussunique?include_type_name=false
{
  "mappings": {
    "properties": {
      "author": {
        "type": "text"
      },
      "content": {
        "type": "text"
      },
      "url": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
# Demo data
PUT discussunique/_doc/1
{
  "author": "author1",
  "content": "Some text i want to search",
  "url": "https://elastic.co"
}


PUT discussunique/_doc/2
{
  "author": "author2",
  "content": "Some text i want to search",
  "url": "https://elastic.co"
}


PUT discussunique/_doc/3
{
  "author": "author3",
  "content": "Some text i want to search",
  "url": "https://discuss.elastic.co"
}

You will be able to get unique values using collapse:

GET discussunique/_search
{
  "query": {
    "match": {
      "content": "Some text i want to search"
    }
  },
  "collapse": {
    "field": "url.keyword"
  }
}
# Response
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "discussunique",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.80118835,
        "_source" : {
          "author" : "author1",
          "content" : "Some text i want to search",
          "url" : "https://elastic.co"
        },
        "fields" : {
          "url.keyword" : [
            "https://elastic.co"
          ]
        }
      },
      {
        "_index" : "discussunique",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.80118835,
        "_source" : {
          "author" : "author3",
          "content" : "Some text i want to search",
          "url" : "https://discuss.elastic.co"
        },
        "fields" : {
          "url.keyword" : [
            "https://discuss.elastic.co"
          ]
        }
      }
    ]
  }
}

Or you can use a terms aggregation. If you expect a lot of results, you should use a composite aggregation with a terms sub aggregation.
If you want to get the most occurring unique values, you can use terms

# All unique values for `url`
GET discussunique/_search
{
  "query": {
    "match": {
      "content": "Some text i want to search"
    }
  },
  "size": 0,
  "aggs": {
    "urls": {
      "composite": {
        "sources": [
          {
            "urls": {
              "terms": {
                "field": "url.keyword"
              }
            }
          }
        ]
      }
    }
  }
}

# Top 100 unique values for `url`
GET discussunique/_search
{
  "query": {
    "match": {
      "content": "Some text i want to search"
    }
  },
  "size": 0,
  "aggs": {
    "urls": {
      "terms": {
        "field": "url.keyword",
        "size": 100
      }
    }
  }
}

If you're on a Basic license, you can try out the SQL API:

POST _sql?format=txt
{
  "query": """SELECT url FROM discussunique WHERE QUERY('content:"Some text i want to search"') GROUP by url"""
}

# If you want to see the derived Elasticsearch DSL
POST _sql/translate
{
  "query": """SELECT url FROM discussunique WHERE QUERY('content:"Some text i want to search"') GROUP by url"""
}

If you cannot modify the mappings, there are no options except using a scripted field, but it will be REALLY slow and memory consuming because you will be accessing the _source of the documents each time. The following ones are not recommended.

GET discussunique/_search
{
  "query": {
    "match": {
      "content": "Some text i want to search"
    }
  },
  "size": 0,
  "aggs": {
    "urls": {
      "composite": {
        "sources": [
          {
            "urls": {
              "terms": {
                 "script": {
                   "source": "params['_source'].url",
                   "lang": "painless"
                 }
              }
            }
          }
        ]
      }
    }
  }
}
# Or...
GET discussunique/_search
{
  "query": {
    "match": {
      "content": "Some text i want to search"
    }
  },
  "size": 0,
  "aggs": {
    "urls": {
      "terms": {
        "script": {
          "source": "params['_source'].url",
          "lang": "painless"
        },
        "size": 100
      }
    }
  }
}
2 Likes