“distinct” query in elasticsearch java api

hmkhitaryan · November 19, 2022, 9:50pm

How can I create a “distinct” query with the Elasticsearch Java API, like we do in SQL? Here is the query that Kibana generated. I need to do the same in Java; I have tried many approaches, but I still get duplicates.

"aggs": {
    "3": {
      "terms": {
        "field": "policy.weight",
        "order": {
          "1": "asc"
        },
        "size": 5
      },
      "aggs": {
        "1": {
          "cardinality": {
            "field": "item.item_doc_id.keyword"
          }
        },
        "2": {
          "terms": {
            "field": "policy.name.keyword",
            "order": {
              "1": "desc"
            },
            "size": 100
          },
          "aggs": {
            "1": {
              "cardinality": {
                "field": "item.item_doc_id.keyword"
              }
            },
            "4": {
              "terms": {
                "field": "policy.description.keyword",
                "order": {
                  "1": "desc"
                },
                "size": 100
              },
              "aggs": {
                "1": {
                  "cardinality": {
                    "field": "item.item_doc_id.keyword"
                  }
                }
              }
            }
          }
        }
      }
    }
  }

And this is the Java code I'm trying, but it still returns duplicates.

final SearchRequest searchRequest = new SearchRequest(bucketListInfo.getIndexName());
        final String field1 = "policy.name.keyword";
        final String field2 = "policy.description.keyword";
//        final String field3 = "policy.weight";
        final String field3 = "item.item_doc_id.keyword";
        final SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder()
                .query(getRangeQueryBuilderWithOptionalFilter(bucketListInfo.getTimestampFieldName(), bucketListInfo.getFilterFieldName(), bucketListInfo.getFilterFieldValue(),
                        bucketListInfo.getFrom(), bucketListInfo.getTo())).size(100);

        TermsAggregationBuilder termsAggregationBuilder;
        AggregationBuilder cardinalityAggregationBuilder = AggregationBuilders.cardinality(field3).field(field3);
        termsAggregationBuilder = AggregationBuilders.terms(field1).field(field1).size(100);
        searchSourceBuilder.aggregation(termsAggregationBuilder);

        termsAggregationBuilder = AggregationBuilders.terms(field2).field(field2).size(100);
        searchSourceBuilder.aggregation(termsAggregationBuilder);

        searchSourceBuilder.aggregation(cardinalityAggregationBuilder);
        searchRequest.source(searchSourceBuilder);
        try {
            final SearchResponse response = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
            BucketList bucketList = new BucketList();
            final Terms terms1 = response.getAggregations().get(field1);
            final Terms terms2 = response.getAggregations().get(field2);
            Cardinality terms3 = response.getAggregations().get(field3);
            bucketList.getBuckets().add(terms1.getBuckets());
            bucketList.getBuckets().add(terms2.getBuckets());
//            bucketList.getBuckets().add(terms3.getBuckets());
            return bucketList;
        } catch (Exception e) {
            log.error(e.getMessage(), e);
            return null;
        }

So please tell me: what is the correct way to write this Kibana-generated query in Java?

RabBit_BR · November 20, 2022, 1:05am

Hi @hmkhitaryan

This is code adapted from your query.

    SearchRequest searchRequest = new SearchRequest("idx_name");

    // Cardinality of the document id, used as sub-aggregation "1" at every level.
    CardinalityAggregationBuilder aggsCardinalityDocId = AggregationBuilders.cardinality("1")
        .field("item.item_doc_id.keyword");

    // "3": terms on policy.weight, ordered by sub-aggregation "1" ascending.
    // BucketOrder is org.elasticsearch.search.aggregations.BucketOrder.
    TermsAggregationBuilder aggsTermPolicyWeight = AggregationBuilders.terms("3")
        .field("policy.weight").size(5)
        .order(BucketOrder.aggregation("1", true));
    aggsTermPolicyWeight.subAggregation(aggsCardinalityDocId);

    // "2": terms on policy.name.keyword, ordered by "1" descending.
    TermsAggregationBuilder aggsTermPolicyName = AggregationBuilders.terms("2")
        .field("policy.name.keyword").size(100)
        .order(BucketOrder.aggregation("1", false));
    aggsTermPolicyName.subAggregation(aggsCardinalityDocId);

    // "4": terms on policy.description.keyword, ordered by "1" descending.
    TermsAggregationBuilder aggsTermPolicyDescription = AggregationBuilders.terms("4")
        .field("policy.description.keyword").size(100)
        .order(BucketOrder.aggregation("1", false));
    aggsTermPolicyDescription.subAggregation(aggsCardinalityDocId);

    // Nest the levels: "3" -> "2" -> "4".
    aggsTermPolicyName.subAggregation(aggsTermPolicyDescription);
    aggsTermPolicyWeight.subAggregation(aggsTermPolicyName);

    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.aggregation(aggsTermPolicyWeight);
    searchRequest.source(searchSourceBuilder);

    SearchResponse searchResponse = getClient().search(searchRequest, RequestOptions.DEFAULT);

hmkhitaryan · November 20, 2022, 7:02am

Hi @RabBit_BR, I tried this, but it still returns hits with duplicates.
Actually, the final response should look like this:

   [
     {
       "policyName" : "policy name from db", 
       "policyDescription" : "policy desc from db",
       "count" :  the count from db
     },
     ....
   ]
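
A minimal sketch of how the nested buckets from the Kibana query above could be flattened into rows of that shape. It uses plain Java records as stand-ins for the response (the `NameBucket`, `DescriptionBucket`, and `Row` types are hypothetical, not Elasticsearch client classes), and it assumes the wanted count is the value of the cardinality sub-aggregation "1" under each description bucket:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the nested aggregation response; these are
// not Elasticsearch client classes.
final class BucketFlattener {

    // One bucket of the terms aggregation "4" (policy.description.keyword),
    // carrying the value of its cardinality sub-aggregation "1".
    record DescriptionBucket(String description, long distinctDocIds) {}

    // One bucket of the terms aggregation "2" (policy.name.keyword).
    record NameBucket(String name, List<DescriptionBucket> descriptions) {}

    // One row of the desired output.
    record Row(String policyName, String policyDescription, long count) {}

    // Walks name -> description and emits one row per (name, description) pair.
    static List<Row> flatten(List<NameBucket> nameBuckets) {
        List<Row> rows = new ArrayList<>();
        for (NameBucket nameBucket : nameBuckets) {
            for (DescriptionBucket d : nameBucket.descriptions()) {
                rows.add(new Row(nameBucket.name(), d.description(), d.distinctDocIds()));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<NameBucket> sample = List.of(
                new NameBucket("policy-a", List.of(
                        new DescriptionBucket("desc-a1", 3),
                        new DescriptionBucket("desc-a2", 1))),
                new NameBucket("policy-b", List.of(
                        new DescriptionBucket("desc-b1", 7))));
        flatten(sample).forEach(System.out::println);
    }
}
```

Because the description buckets are nested inside the name buckets, each name is paired only with descriptions that actually occur together with it.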

RabBit_BR · November 20, 2022, 11:00am

The problem must be in the query. The code matches the query that you presented.

Christian_Dahlqvist · November 20, 2022, 11:11am

What do you mean by duplicates? Be aware that cardinality aggregations are approximations and not necessarily exact.

hmkhitaryan · November 20, 2022, 11:25am

@Christian_Dahlqvist, coming back to your question: the whole query retrieves rows where doc_id is not unique. I want it to be unique, like DISTINCT, so that I get only rows where the doc_id field doesn't repeat.

hmkhitaryan · November 20, 2022, 11:31am

Actually, these two methods do what I need. The only shortcoming is that I get rows where the doc_id field values repeat, so the problem now is how to make the values of this field distinct.

    public BucketList getListOfBucketsTimeRestricted(BucketListInfo bucketListInfo) {
        final SearchRequest searchRequest = new SearchRequest(bucketListInfo.getIndexName());

        final SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder()
                .query(getRangeQueryBuilderWithOptionalFilter(bucketListInfo.getTimestampFieldName(), bucketListInfo.getFilterFieldName(), bucketListInfo.getFilterFieldValue(),
                        bucketListInfo.getFrom(), bucketListInfo.getTo())).size(100);

        for (String aggrField : bucketListInfo.getAggrFieldList()) {
            final TermsAggregationBuilder aggregationBuilder = AggregationBuilders.terms(aggrField).field(aggrField);
            searchSourceBuilder.aggregation(aggregationBuilder);
        }
        searchRequest.source(searchSourceBuilder);
        try {
            final SearchResponse response = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
            BucketList bucketList = new BucketList();
            for (String aggrField : bucketListInfo.getAggrFieldList()) {
                final Terms terms = response.getAggregations().get(aggrField);
                bucketList.getBuckets().add(terms.getBuckets());
            }
            return bucketList;
        } catch (Exception e) {
            log.error(e.getMessage(), e);
            return null;
        }
    }

    private static final List<String> AGGR_FIELDS = List.of("policy.name.keyword", "policy.description.keyword");
    private static final String IS_COMPLIANT_FIELD = "is_compliant";

    private final SearchClient searchClient;

    public List<PolicyViolation> getEvaluationTimeRestricted(DateTime from, DateTime to) {
        final BucketListInfo bucketListInfo = BucketListInfo.builder()
                .indexName(EVALUATION.getIndexName())
                .timestampFieldName(EVALUATION.getTimestampFieldName())
                .aggrFieldList(AGGR_FIELDS)
                .filterFieldName(IS_COMPLIANT_FIELD)
                .filterFieldValue(false)
                .from(from)
                .to(to)
                .build();

        BucketList buckets = searchClient.getListOfBucketsTimeRestricted(bucketListInfo);
        if (Objects.isNull(buckets)) {
            return Collections.emptyList();
        }

        List<? extends Terms.Bucket> firstBuckets = buckets.getBuckets().get(0);
        List<? extends Terms.Bucket> secondBuckets = buckets.getBuckets().get(1);
        final List<PolicyViolation> evaluationCounts = new ArrayList<>();

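        // The two lists come from independent terms aggregations, so their lengths can differ
        // and index i is not guaranteed to pair a name with its own description.
        final int pairCount = Math.min(firstBuckets.size(), secondBuckets.size());
        for (int i = 0; i < pairCount; i++) {
            evaluationCounts.add(
                    new PolicyViolation(firstBuckets.get(i).getKeyAsString(), secondBuckets.get(i).getKeyAsString(), firstBuckets.get(i).getDocCount()));
        }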
        return evaluationCounts;
    }

Christian_Dahlqvist · November 20, 2022, 11:42am

The documents retrieved, if you have not set size to 0, are those matching the filter; they are not affected by aggregations, so they will not be unique.

hmkhitaryan · November 20, 2022, 11:48am

@Christian_Dahlqvist will you please show me where to set that size?

Christian_Dahlqvist · November 20, 2022, 4:54pm

Please have a look at the docs.
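
For readers looking for the setting being referred to: it is the top-level `size` of the search request, which controls how many hits are returned and is independent of the `size` inside each terms aggregation. In the Java code earlier in the thread it corresponds to the `.size(100)` call on `SearchSourceBuilder`, which would become `.size(0)`. A minimal sketch of a request body with hits suppressed, built as a plain string so it runs without the client library (the aggregation name `by_policy_name` is made up for illustration):

```java
final class AggregationOnlyRequest {

    // Builds a search body that returns no hits ("size": 0) and only the
    // buckets of a single terms aggregation on the given field.
    static String body(String field, int maxBuckets) {
        return """
                {
                  "size": 0,
                  "aggs": {
                    "by_policy_name": {
                      "terms": { "field": "%s", "size": %d }
                    }
                  }
                }
                """.formatted(field, maxBuckets);
    }

    public static void main(String[] args) {
        System.out.println(body("policy.name.keyword", 100));
    }
}
```

With the top-level size at 0 the response carries only aggregation buckets, so there are no per-document hits left to appear as duplicates.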

hmkhitaryan · November 20, 2022, 6:16pm

@Christian_Dahlqvist sorry, but I think we don't understand each other.
I want to do the equivalent of a "select distinct" query in the Elasticsearch Java API, and then group the results of that select distinct query. Just that.

hmkhitaryan · November 20, 2022, 6:36pm

like this:
select distinct doc_id, ..., ... from ... where ..., group by ....

Christian_Dahlqvist · November 20, 2022, 6:37pm

As far as I know, that is not possible in Elasticsearch.

hmkhitaryan · November 20, 2022, 7:14pm

No way?

stephenb · November 20, 2022, 7:29pm

Here is a thought...

Use the SQL API interface

Then use the translate API to get the DSL

Perhaps that will help. One word of caution: the SQL API is very picky about single quotes, double quotes, etc.

hmkhitaryan · November 21, 2022, 10:11am

@stephenb I started using the SQL search API you suggested, but the problem now is that my index name is "*-evaluation", and I'm using a query like this:

        String query = "{\"query\":\"SELECT 'name' FROM *-evaluation\"" + ","
                + "\"filter\": {"
                   + " \"range\": {"
                      + " \"@cspm.ingested_at\": {"
                         + " \"from\": \"15/11/2022\","
                         + " \"to\": \"16/11/2022\","
                         + " \"format\": \"dd/MM/yyyy\""
                     + " }"
                + " }"
                + "}}";

With this I get a very unclear JSON object graph, which I think is not related to the data I am expecting. I even tried escaping the index name in the string, like '*-evaluation', but I still don't get the expected results. Any idea how to resolve this?
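
One likely contributor here is quoting. In Elasticsearch SQL, single quotes delimit string literals, so `SELECT 'name'` yields the constant string `name` rather than a field, while index patterns containing characters such as `*` or `-` are wrapped in double quotes. Below is a sketch of a request body along those lines. It assumes the field names from the Kibana query earlier in the thread, assumes that a distinct doc_id count per policy is the goal, and uses the `gte`/`lte` range bounds in place of `from`/`to`; the alias `doc_count` is made up for illustration. The same body can be sent to `_sql/translate` to see the generated DSL:

```java
final class SqlRequestBody {

    // Builds the JSON body for the Elasticsearch SQL endpoint. Inside the JSON
    // string the index pattern is wrapped in escaped double quotes (\"...\").
    static String body(String indexPattern, String fromDate, String toDate) {
        String sql = "SELECT policy.name.keyword, policy.description.keyword, "
                + "COUNT(DISTINCT item.item_doc_id.keyword) AS doc_count "
                + "FROM \\\"" + indexPattern + "\\\" "
                + "GROUP BY policy.name.keyword, policy.description.keyword";
        return """
                {
                  "query": "%s",
                  "filter": {
                    "range": {
                      "@cspm.ingested_at": {
                        "gte": "%s",
                        "lte": "%s",
                        "format": "dd/MM/yyyy"
                      }
                    }
                  }
                }
                """.formatted(sql, fromDate, toDate);
    }

    public static void main(String[] args) {
        System.out.println(body("*-evaluation", "15/11/2022", "16/11/2022"));
    }
}
```

Whether the fields resolve as written depends on the index mapping, which the thread does not show, so the column names should be treated as placeholders.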

bhuvahh · November 21, 2022, 1:21pm

I want to do a "select distinct" query analog in elasticsearch java api, and then group by the results, got from that select distinct query. just that.

Christian_Dahlqvist · November 21, 2022, 1:36pm

I do not think there is any query construct that based on a filter returns all unique values (terms) from a field. The closest I think you can get is to perform a terms aggregation, but that will give you a count together with each term. There is also a limit to the size of the result set, so you may not get all if there are many values.

linkerc · November 29, 2022, 11:19pm

You can do nested aggregation, but I'm not sure if that's what you are looking for.
You can group by unique doc_id and within each unique doc_id, you can further group by other fields.
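
To make the shape of that suggestion concrete, here is a plain-Java illustration of the grouping itself, run over in-memory rows rather than against Elasticsearch (the `Hit` record and the sample values are hypothetical). The outer map has one entry per unique doc_id, which is the "distinct" part; the inner map counts the rows per policy name within that doc_id:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class NestedGrouping {

    // Hypothetical flattened view of a matching document.
    record Hit(String docId, String policyName) {}

    // Outer key: unique doc_id. Inner key: policy name. Value: number of rows.
    static Map<String, Map<String, Long>> groupByDocIdThenPolicy(List<Hit> hits) {
        Map<String, Map<String, Long>> result = new TreeMap<>();
        for (Hit hit : hits) {
            result.computeIfAbsent(hit.docId(), k -> new TreeMap<>())
                  .merge(hit.policyName(), 1L, Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
                new Hit("doc-1", "policy-a"),
                new Hit("doc-1", "policy-a"),
                new Hit("doc-1", "policy-b"),
                new Hit("doc-2", "policy-a"));
        // Each doc_id appears once as a key, however many rows share it.
        groupByDocIdThenPolicy(hits).forEach((docId, policies) ->
                System.out.println(docId + " -> " + policies));
    }
}
```

In Elasticsearch the outer level would be a terms aggregation on the doc_id field with a terms sub-aggregation inside it, subject to the result-size limit mentioned earlier in the thread.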
