Sorting aggregation results with Java API

mrbarret · March 20, 2020, 8:49pm

I have the query below, which buckets results and sorts them in the order I want, but I am unable to translate it to the Java API.

GET /my_data_set/_search
{
  "size": 0,
  "aggregations": {
    "toBeOrdered": {
      "terms": {
        "field": "Content.keyword",
        "size": 1000000,
        "order": [
          {
            "topSort": "asc"
          },
          {
            "nextSort": "asc"
          }
        ]
      },
      "aggregations": {
        "topAnswer": {
          "top_hits": {
            "size": 20,
            "from": 0,
            "sort": {
              "DateTime": "asc"
            },
            "_source": { "includes": ["LastName", "Content", "DateTime"]}
          }
        },
        "topSort": {
          "max": {
            "field": "LastName.raw"
          }
        },
        "nextSort": {
          "max": {
            "field": "Content.raw"
          }
        }
      }
    }
  }
}

An example document is

{
   "Content": ...,
   "LastName": ...,
   "DateTime": ...
}

Edit 1:

The basic idea of the GET request above is the following. Imagine looking for duplicate contents and taking the oldest one. After obtaining each oldest content value you want to sort by the last names of the authors.

Sample input documents.

{
    "Content": "My first tweet",
    "LastName": "Doe",
    "DateTime": "2020-01-29'T'00:00:01"
},
{
    "Content": "My first tweet",
    "LastName": "Smith",
    "DateTime": "2020-01-29'T'01:00:00"
},
{
    "Content": "Some other tweet",
    "LastName": "Locke",
    "DateTime": "2020-01-30'T'00:00:01"
}

I would want to receive a result of

{
    "Content": "My first tweet",
    "LastName": "Doe",
    "DateTime": "2020-01-29'T'00:00:01"
},
{
    "Content": "Some other tweet",
    "LastName": "Locke",
    "DateTime": "2020-01-30'T'00:00:01"
}

The tweets by Doe and Smith were bucketed together and Doe's tweet was kept because they have the same "Content" value but Doe posted first. In the result we have Doe's tweet before Locke's tweet because the last name Doe comes before Locke alphabetically.

How do I write this in Java code?

Opster_Community1 · March 21, 2020, 6:12am

I haven't checked, but the following should translate to your requirement:

List<BucketOrder> bucketOrderList = new ArrayList<>();
bucketOrderList.add(BucketOrder.aggregation("topSort", true));
bucketOrderList.add(BucketOrder.aggregation("nextSort", true));
String[] topHitIncludes = {"LastName", "Content", "DateTime"};
TermsAggregationBuilder termsAggregation = AggregationBuilders.terms("Content.keyword")
    .size(1000000).order(bucketOrderList);
termsAggregation
    .subAggregation(AggregationBuilders.topHits("topAnswer").size(10).from(0)
        .sort("DateTime", org.elasticsearch.search.sort.SortOrder.ASC)
        .fetchSource(topHitIncludes, null))
    .subAggregation(AggregationBuilders.max("LastName.raw"))
    .subAggregation(AggregationBuilders.max("Content.raw"));

Links to the documentation: TermsAggregationBuilder, BucketOrder

mrbarret · March 23, 2020, 8:21pm

Thanks. I think the sorts "topSort" and "nextSort" are applied within the buckets, instead of across the buckets' representatives. Is there a way to get the buckets, flatten them and then sort, all inside of Elasticsearch?

I modified your code a little; I think in two places you put the field value where the name is expected. Here is what I am using.

    List<BucketOrder> bucketOrderList = query.getSorts().stream().map(sort ->
      BucketOrder.aggregation(sort.getField() + "-sort", sort.getDirection() == Sort.Direction.ASC)
    ).collect(Collectors.toList());

    String[] topHitIncludes = new String[query.getSelections().get(indexName).size()];
    query.getSelections().get(indexName).toArray(topHitIncludes);
    TermsAggregationBuilder termsAggregation = AggregationBuilders
            .terms("terms-aggregation")
            .field(query.getCollapser().getField() + ".keyword")
            .size(10)
            .order(bucketOrderList);

    termsAggregation
            .subAggregation(
                    AggregationBuilders
                            .topHits("topAnswer")
                            .size(1)
                            .from(0)
                            .sort("DateTime", org.elasticsearch.search.sort.SortOrder.ASC)
                            .fetchSource(topHitIncludes, null)
            );

    query.getSorts().forEach(sort -> {
      if (sort.getDirection() == Sort.Direction.ASC) {
        termsAggregation.subAggregation(
                AggregationBuilders.min(sort.getField() + "-sort").field(sort.getField() + ".raw")
        );
      } else {
        termsAggregation.subAggregation(
                AggregationBuilders.max(sort.getField() + "-sort").field(sort.getField() + ".raw")
        );
      }
    });

    searchBuilder.aggregation(termsAggregation);

And I get results that look like this, not ordered as expected.

{
  "result": [
    {
      "Content": "Stand-alone systemic toolset",
      "LastName": "35cabb8a-fc11-47c6-b1cf-7bb24781506c",      
      "DateTime": "2015-07-30T23:30:09Z"
    },
    {
      "Content": "Programmable scalable capability",
      "LastName": "15ee0f86-3b5b-405c-8411-9b0c7439eafe",
      "DateTime": "2013-01-23T19:01:06Z"
    },
    {
      "Content": "Mandatory methodical knowledge base",
      "LastName": "9f3c52ef-6ff1-4f0c-8d6e-ff2ab386ff87",
      "DateTime": "1988-09-05T03:46:51Z"
    },
    {
      "Content": "Phased client-server paradigm",
      "LastName": "ab0c3680-a771-4c44-9c68-d6bd71080725",
      "DateTime": "1981-03-20T20:24:43Z"
    },
    {
      "Content": "Object-based fresh-thinking open architecture",
      "LastName": "215070ff-f637-45a3-ae4f-f32229ad7be9",
      "DateTime": "2009-07-20T18:42:54Z"
    },
    {
      "Content": "Inverse local groupware",
      "LastName": "cef240ed-bbeb-41f5-80ac-a72eb95e28b0",
      "DateTime": "1987-12-09T00:42:15Z"
    },
    {
      "Content": "Extended explicit project",
      "LastName": "0df4ef72-48e7-4b82-9d49-e19071059701",
      "DateTime": "1996-05-13T03:20:03Z"
    },
    {
      "Content": "Cross-group web-enabled capability",
      "LastName": "399acef6-1379-42f4-aadf-0c6f7122022b",
      "DateTime": "1986-05-08T23:57:34Z"
    },
    {
      "Content": "Quality-focused transitional portal",
      "LastName": "289cb6bb-bb54-4ec1-81b7-bcda7a0aa46f",
      "DateTime": "2013-03-04T10:08:21Z"
    }
  ],
  "totalCount": 2000
}

Thoughts? Would it be worth my time to collect the buckets in my service layer and sort the flattened result list there?
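For reference, the service-layer idea can be sketched in plain Java. This is only a minimal sketch under stated assumptions: the `Tweet` class and the `dedupeAndSort` method are hypothetical names for illustration (not part of any Elasticsearch API), all candidate hits are assumed to be already loaded into memory, and the DateTime values are assumed to be ISO-8601 strings in a single time zone so that comparing them as text matches chronological order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical document type; the fields mirror the sample documents.
final class Tweet {
    final String content;
    final String lastName;
    final String dateTime; // assumed ISO-8601 text, one time zone

    Tweet(String content, String lastName, String dateTime) {
        this.content = content;
        this.lastName = lastName;
        this.dateTime = dateTime;
    }
}

final class OldestPerContent {
    // Keeps the oldest tweet for each distinct content value,
    // then sorts those representatives by last name (ties broken by content).
    static List<Tweet> dedupeAndSort(List<Tweet> hits) {
        Map<String, Tweet> oldest = new LinkedHashMap<>();
        for (Tweet t : hits) {
            // On a duplicate content value, keep whichever tweet has the earlier timestamp.
            oldest.merge(t.content, t,
                    (a, b) -> a.dateTime.compareTo(b.dateTime) <= 0 ? a : b);
        }
        List<Tweet> result = new ArrayList<>(oldest.values());
        result.sort(Comparator.comparing((Tweet t) -> t.lastName)
                .thenComparing((Tweet t) -> t.content));
        return result;
    }

    public static void main(String[] args) {
        List<Tweet> hits = Arrays.asList(
                new Tweet("My first tweet", "Doe", "2020-01-29T00:00:01"),
                new Tweet("My first tweet", "Smith", "2020-01-29T01:00:00"),
                new Tweet("Some other tweet", "Locke", "2020-01-30T00:00:01"));
        for (Tweet t : dedupeAndSort(hits)) {
            System.out.println(t.lastName + ": " + t.content);
        }
        // prints:
        // Doe: My first tweet
        // Locke: Some other tweet
    }
}
```

With about 2000 matching documents this is cheap, but it means pulling every candidate hit out of Elasticsearch first, so it would not scale the way a server-side solution does.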

mrbarret · March 24, 2020, 2:08am

It looks like the Java code you provided matches the GET low-level call I gave above. Unfortunately, I need something else -- it seems I gave you a bad request to start with. Sorry.

The final results are not sorted across buckets. What I need is to take one representative from each bucket (the oldest) and then sort those representatives by other fields.

Opster_Community1 · March 24, 2020, 2:57am

I'm sorry, but I didn't quite understand your requirement. It would be easier for me if you provided the exact query that you are looking to translate into Java code.

mrbarret · March 24, 2020, 12:12pm

Unfortunately, it turns out that I don't have a correct query for what I want to do, so I'll need your help constructing either the query or the Java code. All I have is the following requirement.

The idea of the request is the following. Imagine looking for duplicate contents and taking the oldest one. After obtaining each oldest content value you want to sort by the last names of the authors.

Sample input documents.

{
    "Content": "My first tweet",
    "LastName": "Doe",
    "DateTime": "2020-01-29'T'00:00:01"
},
{
    "Content": "My first tweet",
    "LastName": "Smith",
    "DateTime": "2020-01-29'T'01:00:00"
},
{
    "Content": "Some other tweet",
    "LastName": "Locke",
    "DateTime": "2020-01-30'T'00:00:01"
}

I would want to receive a result of

{
    "Content": "My first tweet",
    "LastName": "Doe",
    "DateTime": "2020-01-29'T'00:00:01"
},
{
    "Content": "Some other tweet",
    "LastName": "Locke",
    "DateTime": "2020-01-30'T'00:00:01"
}

The tweets by Doe and Smith were bucketed together and Doe's tweet was kept because they have the same "Content" value but Doe posted first. In the result we have Doe's tweet before Locke's tweet because the last name Doe comes before Locke alphabetically.

Any help you can offer would be greatly appreciated.

Matthew_Isett · March 26, 2020, 2:37pm

I think field collapsing is what you are looking for, not an aggregation.

I am going to work in Query DSL rather than the Java REST API.

Let's use a dataset from demo.elastic.co so we can share examples in Dev Tools.

GET twitter-sentiment-2020.03/_search
{
  "query": {
    "match": {
      "extended_tweet.full_text": "Elasticsearch"
    }
  },
  "collapse": {
    "field": "extended_tweet.full_text.keyword"
  },
  "sort": ["@timestamp", "user.name.keyword"],
  "from": 10
}

Would this allow you to get closer to your goal?

mrbarret · March 26, 2020, 2:43pm

Matt! That's amazing! Thanks! I am able to translate this to the high level API with ease so I'll mark this as Done and Done.

system · April 23, 2020, 2:43pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
