How to return distinct values from query based on a field

I'm trying to return documents from a query and i want to return distinct values in my results.

I don't want to count how many documents contain the same value i just want to return some documents based on distinct or unique values from a field.

Let's say i have an index like this:

"mapping": {
  "properties": {
    "author": {
      "type": "text"
    },
    "content": {
      "type": "text"
    },
    "url": {
      "type": "text"
    }
  }}

And i have a query like this:

   "query":{
      "match": {
        "content": "Some text i want to search"
    }

I want to return documents with unique "url" field values that also come up from the text I'm searching for.
Even if 2 documents have the same url value I only want one of them to be returned.

How can i do it in this case and can i do it on a field that is a "text" type?

Hello @Aurel_Drejta

First of all, you are going to search on the content field.
Thanks to your query, you will obtain N (size) hits from the index or indices you're querying on.

Now, you are requesting to get the documents having distinct url values.

First of all, the drawback of your mapping is that url is analyzed as text.

If you can reindex the data using the following mapping (or use keyword instead of text for the field url):

PUT discussunique?include_type_name=false
{
  "mappings": {
    "properties": {
      "author": {
        "type": "text"
      },
      "content": {
        "type": "text"
      },
      "url": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
# Demo data
PUT discussunique/_doc/1
{
  "author": "author1",
  "content": "Some text i want to search",
  "url": "https://elastic.co"
}


PUT discussunique/_doc/2
{
  "author": "author2",
  "content": "Some text i want to search",
  "url": "https://elastic.co"
}


PUT discussunique/_doc/3
{
  "author": "author3",
  "content": "Some text i want to search",
  "url": "https://discuss.elastic.co"
}

You will be able to get unique values using collapse:

GET discussunique/_search
{
  "query": {
    "match": {
      "content": "Some text i want to search"
    }
  },
  "collapse": {
    "field": "url.keyword"
  }
}
# Response
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "discussunique",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.80118835,
        "_source" : {
          "author" : "author1",
          "content" : "Some text i want to search",
          "url" : "https://elastic.co"
        },
        "fields" : {
          "url.keyword" : [
            "https://elastic.co"
          ]
        }
      },
      {
        "_index" : "discussunique",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.80118835,
        "_source" : {
          "author" : "author3",
          "content" : "Some text i want to search",
          "url" : "https://discuss.elastic.co"
        },
        "fields" : {
          "url.keyword" : [
            "https://discuss.elastic.co"
          ]
        }
      }
    ]
  }
}

Or you can use a terms aggregation. If you expect a lot of results, you should use a composite aggregation with a terms sub aggregation.
If you want to get the most occurring unique values, you can use terms

# All unique values for `url`
GET discussunique/_search
{
  "query": {
    "match": {
      "content": "Some text i want to search"
    }
  },
  "size": 0,
  "aggs": {
    "urls": {
      "composite": {
        "sources": [
          {
            "urls": {
              "terms": {
                "field": "url.keyword"
              }
            }
          }
        ]
      }
    }
  }
}

# Top 100 unique values for `url`
GET discussunique/_search
{
  "query": {
    "match": {
      "content": "Some text i want to search"
    }
  },
  "size": 0,
  "aggs": {
    "urls": {
      "terms": {
        "field": "url.keyword",
        "size": 100
      }
    }
  }
}

If you're on a Basic license, you can try out the SQL API:

POST _sql?format=txt
{
  "query": """SELECT url FROM discussunique WHERE QUERY('content:"Some text i want to search"') GROUP by url"""
}

# If you want to see the derived Elasticsearch DSL
POST _sql/translate
{
  "query": """SELECT url FROM discussunique WHERE QUERY('content:"Some text i want to search"') GROUP by url"""
}

If you cannot modify the mappings, there are no options except using a scripted field, but it will be REALLY slow and memory consuming because you will be accessing the _source of the documents each time. The following ones are not recommended.

GET discussunique/_search
{
  "query": {
    "match": {
      "content": "Some text i want to search"
    }
  },
  "size": 0,
  "aggs": {
    "urls": {
      "composite": {
        "sources": [
          {
            "urls": {
              "terms": {
                 "script": {
                   "source": "params['_source'].url",
                   "lang": "painless"
                 }
              }
            }
          }
        ]
      }
    }
  }
}
# Or...
GET discussunique/_search
{
  "query": {
    "match": {
      "content": "Some text i want to search"
    }
  },
  "size": 0,
  "aggs": {
    "urls": {
      "terms": {
        "script": {
          "source": "params['_source'].url",
          "lang": "painless"
        },
        "size": 100
      }
    }
  }
}
2 Likes

All right i will reindex my index since i am already low on resources for the moment in my production builds. I will come back and mark the solution as a correct one when i implement it.

Thank you for the fast response :slight_smile:

2 Likes

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.