How to get Unique Records

tusharnemade · November 11, 2020, 9:08am

Hello Team

I am using Elasticsearch version 7.8.0.

I am having an INDEX in which there is one field.

"userId" : {
          "type" : "text",
         "fields" : {
              "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
          }
         }
         }

In this i am storing records like

7deeeefrrgcdoonskhdhdwl345

I want to get the list of unique records in this "userId" field of my index.

I tried using aggs - terms - but it shows the count of records .. I want actual list of userId.

I tried using SQL api --> but in elasticsearch there is no DISTINCT function ....

Could you please help me in this.

Thanks
Tushar Nemade

dadoonet · November 11, 2020, 9:22am

May be a terms agg with an inner top_hits?

tusharnemade · November 11, 2020, 9:29am

Hi David

terms with aggs - i used it but it records count of time record is repeated in index.

I want list of all records for example :

657dhfhfhfdnn
7565jdjdjddmd
958jdjdjdjdjdd

like this to be the output ...

Thanks
Tushar Nemade

tusharnemade · November 11, 2020, 9:34am

For Example in SQL :

select distinct userid from employee :
userid
-------
4844hddjdj
74747dhdjdjd
48848ddddd

Like this output I am trying to get from elasticsearch ...

Thanks
Tushar Nemade

dadoonet · November 11, 2020, 10:13am

That's what the terms agg will give at the end. Just ignore the count part in the result.

tusharnemade · November 11, 2020, 10:28am

The output s like below one :

curl -X GET "localhost:9200/myindex/_search?pretty" -H 'Content-Type: application/json' -d'
{     "query": {
    "match_all": {}
  },
      "size": 0,
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "userId",
            "filter_duplicate_text": true
          }
        }
      }
    }
' -u elastic:elastic


{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "keywords" : {
      "doc_count" : 67007,
      "bg_count" : 70659,
      "buckets" : [
        {
          "key" : "3a7135074ee9715b8ec07bfda0a75336",
          "doc_count" : 28531,
          "score" : 0.02306479074069242,
          "bg_count" : 28540
        },
        {
          "key" : "99c9a97ce78b4fd1a124a173222a013a",
          "doc_count" : 2617,
          "score" : 0.0020500645248022376,
          "bg_count" : 2622
        },
        {
          "key" : "7a919773e80c43a9b499b1947caa163e",
          "doc_count" : 2532,
          "score" : 0.0019496069985887787,
          "bg_count" : 2539
        },
        {
          "key" : "4b1ed7601091487b982c8632ca86481a",
          "doc_count" : 1876,
          "score" : 0.0015258900958613191,
          "bg_count" : 1876
        },
        {
          "key" : "41b3ef01c68c47cea1cbde66cf8aff3a",
          "doc_count" : 1418,
          "score" : 9.197778131438137E-4,
          "bg_count" : 1433
        },
        {
          "key" : "c7589a964b6349078df094a8872d97a3",
          "doc_count" : 1061,
          "score" : 8.629900808682623E-4,
          "bg_count" : 1061
        },
        {
          "key" : "aa426b628ffa4f7fbc66344d03157178",
          "doc_count" : 1024,
          "score" : 8.328952335618312E-4,
          "bg_count" : 1024
        },
        {
          "key" : "ad77884486564074a2d7885b1698b816",
          "doc_count" : 928,
          "score" : 7.548113054154072E-4,
          "bg_count" : 928
        },
        {
          "key" : "7d7cc295fbe64fdabdcddf15df8a28c1",
          "doc_count" : 2108,
          "score" : 7.519206332956028E-4,
          "bg_count" : 2171
        },
        {
          "key" : "acb4314538bb43eca9027e5efa0ad3a5",
          "doc_count" : 773,
          "score" : 6.287382964289966E-4,
          "bg_count" : 773
        }
      ]
    }
  }
}

This output is not even complete output of myindex. There are suppose to be more records .. I am not getting entire output ...

Also i want only KEY value as output and rest to be ignored.... as i want to store this output in another index.

Thanks
Tushar Nemade

dadoonet · November 11, 2020, 1:37pm

Use a terms aggregation. Not the significant_text.

Then read carefully the documentation at:

Specifically:

If you want to retrieve all terms or all combinations of terms in a nested terms aggregation you should use the Composite aggregation which allows to paginate over all possible terms rather than setting a size greater than the cardinality of the field in the terms aggregation. The terms aggregation is meant to return the top terms and does not allow pagination.

tusharnemade · November 11, 2020, 1:50pm

Hi David

Thanks for your response and page.

I have been through this page and "term" before using significant_text and keywords

curl -X GET "localhost:9200/myindex/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"unique_usrid": {
"terms": { "field": "userId"}
}
}
}'

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [userId] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "myindex",
        "node" : "-ZiUhOCjRb-EEcSLojZ4YQ",
        "reason" : {
          "type" : "illegal_argument_exception",
          "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [userId] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
        }
      }
    ],
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [userId] in order to load field data by uninverting the inverted index. Note that this can use significant memory.",
      "caused_by" : {
        "type" : "illegal_argument_exception",
        "reason" : "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [userId] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
      }
    }
  },
  "status" : 400
}

dadoonet · November 11, 2020, 6:46pm

You need to change the mapping and use a keyword type instead of text.
In case you're using the default mapping, then append .keyword to your field name and the terms agg will work.

tusharnemade · November 12, 2020, 3:52am

tusharnemade:

"userId" : {
          "type" : "text",
         "fields" : {
              "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
          }
         }
         }

This is my mapping .... it already has keyword defined.

dadoonet · November 12, 2020, 4:12am

As I said, use userId.keyword field.

tusharnemade · November 12, 2020, 4:15am

Okay ...

I will check with development team about it.

Secondly , even if I achieve this by changing my mapping in INDEX , I still have another issue of getting all records in output, as CURL only shows some 1000 rows as due to some settings which I am not aware of.

dadoonet · November 12, 2020, 4:30am

Could you share an example of the request you are running? If you meant something like the number of hits, you can use:

the size and from parameters to display by default up to 10000 records to your users. If you want to change this limit, you can change index.max_result_window setting but be aware of the consequences (ie memory).
the search after feature to do deep pagination.
the Scroll API if you want to extract a resultset to be consumed by another tool later.

tusharnemade · November 12, 2020, 4:48am

Hi David

In this Topic i have given example of how my output using CURL is incomplete.

https://discuss.elastic.co/t/curl-output-is-not-complete/255048

Regards the options you have presented :

size and from I cannot use as index is continuously getting data and its not fixed. If i use high values , it will give error as it would be out of index actual data count.
max_result_window i cannot set due to memory consumptions.
scroll api and search after : I will check these are helpful to get the output in one go , so that i can proceed with those records further.

Thanks
Tushar Nemade

dadoonet · November 12, 2020, 8:05am

And I answered there (Curl Output is not complete - #2 by dadoonet). So let's keep the discussion there.

tusharnemade · November 12, 2020, 12:23pm

Hi David

Yes, I agree its not complete and I am aware of it. Hence I raise an topic to findout the way to have it completed.

Pagination is not helping me in automation of my work. like i have mentioned in there.

Further discussion on this on another topic raised.

system · December 10, 2020, 12:23pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Get distinct data Elasticsearch	4	423	July 6, 2017
Pagination on unique data Elasticsearch	2	1595	July 6, 2017
Get unique users in db Elasticsearch	6	2306	June 29, 2018
How to fetch the distinct records from Elastic Search Elasticsearch	1	327	July 6, 2017
Get unique record(fields) Elasticsearch	1	323	July 6, 2017

How to get Unique Records

Related topics