How to get Unique Records

Hello Team

I am using Elasticsearch version 7.8.0.

I have an index that contains the following field:

"userId" : {
          "type" : "text",
         "fields" : {
              "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
          }
         }
         }

In this field I am storing values like:

7deeeefrrgcdoonskhdhdwl345

I want to get the list of unique values in the "userId" field of my index.

I tried using a terms aggregation, but it shows the count of records. I want the actual list of userId values.

I tried using the SQL API, but Elasticsearch SQL has no DISTINCT function.

Could you please help me with this?

Thanks
Tushar Nemade

Maybe a terms agg with an inner top_hits?
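Sketched as a request body, that suggestion could look like the following. This is only an illustration: the aggregation names, the userId.keyword sub-field, and the bucket size are assumptions based on the mapping shown earlier, not something confirmed in this thread.

```python
import json

# Terms aggregation with an inner top_hits, as suggested above.
# Field name and sizes are illustrative assumptions.
body = {
    "size": 0,  # we only want the aggregation, not the search hits
    "aggs": {
        "unique_userids": {
            "terms": {
                "field": "userId.keyword",
                "size": 10000  # raise from the default of 10 buckets
            },
            "aggs": {
                "sample_doc": {
                    "top_hits": {"size": 1}  # one example document per userId
                }
            }
        }
    }
}

print(json.dumps(body, indent=2))
```

The body can be sent with curl as in the other examples in this thread.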

Hi David

I used a terms aggregation, but it returns the count of how many times each value is repeated in the index.

I want a list of all the distinct values, for example:

657dhfhfhfdnn
7565jdjdjddmd
958jdjdjdjdjdd

I want the output to look like this.

Thanks
Tushar Nemade

For example, in SQL:

select distinct userid from employee;
userid
-------
4844hddjdj
74747dhdjdjd
48848ddddd

This is the output I am trying to get from Elasticsearch.

Thanks
Tushar Nemade

That's what the terms agg will give at the end. Just ignore the count part in the result.
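Stripping the response down to just the keys is a short post-processing step. A minimal sketch, assuming an aggregation named unique_userids and the bucket shape Elasticsearch returns (the sample values are taken from the examples earlier in this thread):

```python
def unique_keys(response):
    """Return only the bucket keys from a terms aggregation response,
    ignoring the doc_count values."""
    buckets = response["aggregations"]["unique_userids"]["buckets"]
    return [b["key"] for b in buckets]

# A trimmed-down response in the shape Elasticsearch returns:
sample = {
    "aggregations": {
        "unique_userids": {
            "buckets": [
                {"key": "657dhfhfhfdnn", "doc_count": 3},
                {"key": "7565jdjdjddmd", "doc_count": 1},
            ]
        }
    }
}

print(unique_keys(sample))  # ['657dhfhfhfdnn', '7565jdjdjddmd']
```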

The output is like the one below:

curl -X GET "localhost:9200/myindex/_search?pretty" -H 'Content-Type: application/json' -d'
{     "query": {
    "match_all": {}
  },
      "size": 0,
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "userId",
            "filter_duplicate_text": true
          }
        }
      }
    }
' -u elastic:elastic


{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "keywords" : {
      "doc_count" : 67007,
      "bg_count" : 70659,
      "buckets" : [
        {
          "key" : "3a7135074ee9715b8ec07bfda0a75336",
          "doc_count" : 28531,
          "score" : 0.02306479074069242,
          "bg_count" : 28540
        },
        {
          "key" : "99c9a97ce78b4fd1a124a173222a013a",
          "doc_count" : 2617,
          "score" : 0.0020500645248022376,
          "bg_count" : 2622
        },
        {
          "key" : "7a919773e80c43a9b499b1947caa163e",
          "doc_count" : 2532,
          "score" : 0.0019496069985887787,
          "bg_count" : 2539
        },
        {
          "key" : "4b1ed7601091487b982c8632ca86481a",
          "doc_count" : 1876,
          "score" : 0.0015258900958613191,
          "bg_count" : 1876
        },
        {
          "key" : "41b3ef01c68c47cea1cbde66cf8aff3a",
          "doc_count" : 1418,
          "score" : 9.197778131438137E-4,
          "bg_count" : 1433
        },
        {
          "key" : "c7589a964b6349078df094a8872d97a3",
          "doc_count" : 1061,
          "score" : 8.629900808682623E-4,
          "bg_count" : 1061
        },
        {
          "key" : "aa426b628ffa4f7fbc66344d03157178",
          "doc_count" : 1024,
          "score" : 8.328952335618312E-4,
          "bg_count" : 1024
        },
        {
          "key" : "ad77884486564074a2d7885b1698b816",
          "doc_count" : 928,
          "score" : 7.548113054154072E-4,
          "bg_count" : 928
        },
        {
          "key" : "7d7cc295fbe64fdabdcddf15df8a28c1",
          "doc_count" : 2108,
          "score" : 7.519206332956028E-4,
          "bg_count" : 2171
        },
        {
          "key" : "acb4314538bb43eca9027e5efa0ad3a5",
          "doc_count" : 773,
          "score" : 6.287382964289966E-4,
          "bg_count" : 773
        }
      ]
    }
  }
}

This output is not even the complete output of myindex. There are supposed to be more records, and I am not getting the entire output.

Also, I want only the key values in the output, with the rest ignored, as I want to store this output in another index.

Thanks
Tushar Nemade

Use a terms aggregation, not significant_text.

Then carefully read the terms aggregation documentation. Specifically:

If you want to retrieve all terms or all combinations of terms in a nested terms aggregation you should use the Composite aggregation which allows to paginate over all possible terms rather than setting a size greater than the cardinality of the field in the terms aggregation. The terms aggregation is meant to return the top terms and does not allow pagination.
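The pagination loop the documentation describes can be sketched as follows. This is only an illustration of the after_key mechanics: the search function here is a stand-in (mocked with fixed pages) for whatever client call sends the body to /myindex/_search, and the source name uid and the userId.keyword field are assumptions.

```python
# Fake result pages, in the shape a composite aggregation returns.
# The last page carries no "after_key", which ends the loop.
PAGES = [
    {"buckets": [{"key": {"uid": "a1"}}, {"key": {"uid": "b2"}}],
     "after_key": {"uid": "b2"}},
    {"buckets": [{"key": {"uid": "c3"}}]},
]

def search(body):
    """Stand-in for the real _search call; picks a fake page based on "after"."""
    after = body["aggs"]["uids"]["composite"].get("after")
    page = PAGES[0] if after is None else PAGES[1]
    return {"aggregations": {"uids": page}}

def all_unique_ids():
    body = {
        "size": 0,
        "aggs": {"uids": {"composite": {
            "size": 2,  # buckets per page; real requests would use a larger size
            "sources": [{"uid": {"terms": {"field": "userId.keyword"}}}],
        }}},
    }
    ids = []
    while True:
        agg = search(body)["aggregations"]["uids"]
        ids += [b["key"]["uid"] for b in agg["buckets"]]
        if "after_key" not in agg:
            return ids
        # Resume the next page from where the previous one stopped.
        body["aggs"]["uids"]["composite"]["after"] = agg["after_key"]

print(all_unique_ids())  # ['a1', 'b2', 'c3']
```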


Hi David

Thanks for your response and page.

I went through that page and tried "terms" before using significant_text:

curl -X GET "localhost:9200/myindex/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"unique_usrid": {
"terms": { "field": "userId"}
}
}
}'

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [userId] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "myindex",
        "node" : "-ZiUhOCjRb-EEcSLojZ4YQ",
        "reason" : {
          "type" : "illegal_argument_exception",
          "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [userId] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
        }
      }
    ],
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [userId] in order to load field data by uninverting the inverted index. Note that this can use significant memory.",
      "caused_by" : {
        "type" : "illegal_argument_exception",
        "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [userId] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
      }
    }
  },
  "status" : 400
}

You need to change the mapping and use a keyword type instead of text.
If you are using the default mapping, append .keyword to the field name and the terms agg will work.

This is my mapping ... it already has a keyword sub-field defined.

As I said, use userId.keyword field.

Okay ...

I will check with development team about it.

Secondly, even if I achieve this by changing the mapping of my index, I still have the issue of getting all records in the output: curl only shows some 1000 rows, due to some setting I am not aware of.

Could you share an example of the request you are running? If you mean the number of hits, you can use:

  • the size and from parameters, which by default let you display up to 10000 records to your users. If you want to change this limit, you can change the index.max_result_window setting, but be aware of the consequences (i.e. memory).
  • the search_after feature to do deep pagination.
  • the Scroll API if you want to extract a result set to be consumed by another tool later.
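The Scroll API option can be sketched as a simple loop: keep calling _search/scroll with the returned _scroll_id until a page comes back empty. In this illustration es_search and es_scroll are mocked stand-ins for the real HTTP calls (POST /{index}/_search?scroll=1m and POST /_search/scroll), so only the loop logic is shown; the index name and page size are assumptions.

```python
# Fake pages in the shape a scroll search returns; the empty page ends the loop.
PAGES = [
    {"_scroll_id": "s1", "hits": {"hits": [{"_source": {"userId": "a1"}}]}},
    {"_scroll_id": "s1", "hits": {"hits": [{"_source": {"userId": "b2"}}]}},
    {"_scroll_id": "s1", "hits": {"hits": []}},
]

def scroll_all():
    calls = iter(PAGES)

    def es_search(index, body, scroll):  # stand-in for POST /{index}/_search?scroll=...
        return next(calls)

    def es_scroll(scroll_id, scroll):    # stand-in for POST /_search/scroll
        return next(calls)

    resp = es_search("myindex", {"size": 1000}, scroll="1m")
    ids = []
    while resp["hits"]["hits"]:
        ids += [h["_source"]["userId"] for h in resp["hits"]["hits"]]
        resp = es_scroll(resp["_scroll_id"], scroll="1m")
    return ids

print(scroll_all())  # ['a1', 'b2']
```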

Hi David

In this topic I gave an example of how my output using curl is incomplete.

https://discuss.elastic.co/t/curl-output-is-not-complete/255048

Regarding the options you have presented:

  1. size and from I cannot use, as the index is continuously receiving data and the count is not fixed. If I use high values, it gives an error because they exceed the index's actual document count.

  2. max_result_window I cannot set due to memory consumption.

  3. scroll API and search_after: I will check whether these help me get the output in one go, so that I can proceed with those records further.

Thanks
Tushar Nemade

And I answered there (Curl Output is not complete - #2 by dadoonet). So let's keep the discussion there.

Hi David

Yes, I agree it's not complete, and I am aware of it. Hence I raised a topic to find out how to get it completed.

Pagination is not helping me automate my work, as I have mentioned there.

Further discussion on this is in the other topic raised.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.