How to optimize elasticsearch aggregations to just take any for specific fields

jj1234 · March 26, 2019, 4:02pm

Let's say I am trying to get the 10 items with the highest sales over a time period. I want to show them in a UI with the item id, item name, item store id, item section id, item brand, and item sales amount in the period. All documents for a given item id have the same name, item store id, item section id, and item brand. What is the right way to do this, and how do I make it perform well?

At first, I thought I could just use the hits array to create a lookup map, but the ES result window limit is 10,000. That worked, but a problem occurred when a store had more than 10,000 hits within a given period (see query below).

{
  "size" : 10000,
  "timeout" : 1500,
  "query" : {
    "bool" : {
      "must" : [ {
        "term" : {
          "itemStoreId_string" : "12342323232"
        }
      }, {
        "range" : {
          "orderDate_timestamp" : {
            "from" : "2019-01-01T00:00:00.000Z",
            "to" : "2019-06-01T00:00:00.000Z",
            "include_lower" : true,
            "include_upper" : true
          }
        }
      } ]
    }
  },
  "aggregations" : {
    "byItemTopSales" : {
      "terms" : {
        "field" : "itemId_string",
        "size" : 10,
        "order" : {
          "aggTotalSales" : "desc"
        }
      },
      "aggregations" : {
        "aggTotalSales" : {
          "sum" : {
            "field" : "totalSales_double_not_indexed"
          }
        }
      }
    }
  }
}

Then, I tried aggregating more fields so I wouldn't have to rely on a lookup map and wouldn't run into the 10k limit on window size. I noticed that this query (below) is significantly slower. Is this expected? How can I make this faster?

{
  "size" : 0,
  "timeout" : 1500,
  "query" : {
    "bool" : {
      "must" : [ {
        "term" : {
          "itemStoreId_string" : "12342323232"
        }
      }, {
        "range" : {
          "orderDate_timestamp" : {
            "from" : "2019-01-01T00:00:00.000Z",
            "to" : "2019-06-01T00:00:00.000Z",
            "include_lower" : true,
            "include_upper" : true
          }
        }
      } ]
    }
  },
  "aggregations" : {
    "byItemTopSales" : {
      "terms" : {
        "field" : "itemId_string",
        "size" : 10,
        "order" : {
          "aggTotalSales" : "desc"
        }
      },
      "aggregations" : {
        "aggTotalSales" : {
          "sum" : {
            "field" : "totalSales_double_not_indexed"
          }
        },
        "itemName" : {
          "terms" : {
            "field" : "itemName_string_not_indexed"
          }
        },
        "itemId" : {
          "terms" : {
            "field" : "menuItemId_string_not_indexed"
          }
        },
        "itemBrandName" : {
          "terms" : {
            "field" : "itemBrandName_string_not_indexed"
          }
        }
      }
    }
  }
}

Again, all items for a given item id have the same name, item store id, item section, and item brand.

How do I change my query so that it just takes the first matching name, item store id, etc.? Any tips would be much appreciated. Thanks!

Ignacio_Vera · March 26, 2019, 5:16pm

Have you tried the Top Hits aggregation? It sounds like a good match:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html

jj1234 · March 26, 2019, 5:32pm

Isn't top hits sort of doing the reverse? It seems to be meant for use as a sub-aggregation, so the use case would be getting the total sales of the top 10 items. What I am looking for is the total sales aggregated for each item, and then the top 10 of those items ordered by aggregated sales.

Ignacio_Vera · March 26, 2019, 5:48pm

I think I misunderstood your question.

What do you mean by the ES window limit being 10,000? This limit applies to the returned documents, but aggregations still run on all hits from the query.

Would it be possible for you to share an example of what you are trying to achieve?

jj1234 · March 26, 2019, 6:25pm

If I change the size to be > 10000, there is a complaint about max_result_window. I realize that I can change this setting, but I am hesitant to.

I am using the hits (along with the aggregations) from the output. In what I was originally doing, I turned the hits array into a lookup map. Then, for the top 10 items, I would look up the other fields (itemName, sectionName, etc.) using that lookup map. However, there are cases where a store has so many documents over a period that the 10,000 hits allowed by max_result_window don't include any document for one of the top 10 items. In that case I get no itemName, sectionName, etc. to go along with the itemId and sales in the aggregation. Please see the example below (the resulting 10,000-hit array is used to look up the name, etc.).

{
  "size" : 10000,
  "timeout" : 1500,
  "query" : {
    "bool" : {
      "must" : [ {
        "term" : {
          "itemStoreId_string" : "12342323232"
        }
      }, {
        "range" : {
          "orderDate_timestamp" : {
            "from" : "2019-01-01T00:00:00.000Z",
            "to" : "2019-06-01T00:00:00.000Z",
            "include_lower" : true,
            "include_upper" : true
          }
        }
      } ]
    }
  },
  "aggregations" : {
    "byItemTopSales" : {
      "terms" : {
        "field" : "itemId_string",
        "size" : 10,
        "order" : {
          "aggTotalSales" : "desc"
        }
      },
      "aggregations" : {
        "aggTotalSales" : {
          "sum" : {
            "field" : "totalSales_double_not_indexed"
          }
        }
      }
    }
  }
}
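
To make the failure mode concrete, here is a minimal Python sketch of the lookup-map join I described. The response dict is a made-up, trimmed-down sample (two hits standing in for the 10,000-hit window; the item ids, names, and amounts are invented), and `build_lookup` / `join_buckets` are hypothetical helper names, not part of any client library.

```python
# Hypothetical, trimmed-down search response. Only two hits were returned
# (standing in for the 10,000-hit window), but the aggregation covered
# every matching document, including item "B7".
response = {
    "hits": {
        "hits": [
            {"_source": {"itemId_string": "A1", "itemName_string_not_indexed": "Latte"}},
            {"_source": {"itemId_string": "A1", "itemName_string_not_indexed": "Latte"}},
        ]
    },
    "aggregations": {
        "byItemTopSales": {
            "buckets": [
                {"key": "B7", "doc_count": 900, "aggTotalSales": {"value": 4500.0}},
                {"key": "A1", "doc_count": 2, "aggTotalSales": {"value": 9.0}},
            ]
        }
    },
}


def build_lookup(resp):
    """Map itemId -> _source of the first returned hit for that item."""
    lookup = {}
    for hit in resp["hits"]["hits"]:
        src = hit["_source"]
        lookup.setdefault(src["itemId_string"], src)
    return lookup


def join_buckets(resp):
    """Attach item names from the lookup map to the aggregation buckets."""
    lookup = build_lookup(resp)
    rows = []
    for bucket in resp["aggregations"]["byItemTopSales"]["buckets"]:
        src = lookup.get(bucket["key"])  # None if no hit for this item came back
        rows.append({
            "itemId": bucket["key"],
            "itemName": src["itemName_string_not_indexed"] if src else None,
            "totalSales": bucket["aggTotalSales"]["value"],
        })
    return rows


rows = join_buckets(response)
```

In this sample the best-selling item "B7" ends up with no name, because none of its documents happened to be among the returned hits.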

jj1234 · April 16, 2019, 8:52pm

Should I be increasing the size or is there a better way without making 2 queries? I feel like this is a common query (aggregate first and then do a top 10 based on those aggregations).

Ignacio_Vera · April 17, 2019, 10:20am

I still think you can use the top_hits aggregation. Why is it not working for you, if the values are the same in all the documents you are aggregating by itemId_string?

Something like:

{
  "size": 0,
  "timeout": 1500,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "itemStoreId_string": "12342323232"
          }
        },
        {
          "range": {
            "orderDate_timestamp": {
              "from": "2019-01-01T00:00:00.000Z",
              "to": "2019-06-01T00:00:00.000Z",
              "include_lower": true,
              "include_upper": true
            }
          }
        }
      ]
    }
  },
  "aggregations": {
    "byItemTopSales": {
      "terms": {
        "field": "itemId_string",
        "size": 10,
        "order": {
          "aggTotalSales": "desc"
        }
      },
      "aggregations": {
        "item": {
          "top_hits": {
            "size": 1,
            "_source": {
              "includes": [
                "itemId_string",
                "itemName_string_not_indexed",
                "menuItemId_string_not_indexed",
                "itemBrandName_string_not_indexed"
              ]
            }
          }
        },
        "aggTotalSales": {
          "sum": {
            "field": "totalSales_double_not_indexed"
          }
        }
      }
    }
  }
}
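
On the client side, each bucket of such a response should then carry both the summed sales and one representative document. A rough Python sketch of flattening that into UI rows, assuming the top_hits sub-aggregation is named item and sits next to aggTotalSales under each byItemTopSales bucket; the sample response values and the `rows_from_response` helper are made up for illustration:

```python
# Hypothetical sample of the aggregation part of a response; the ids,
# names, and amounts are invented. With "size": 0 the top-level hits
# array is empty, so everything needed comes from the buckets.
response = {
    "aggregations": {
        "byItemTopSales": {
            "buckets": [
                {
                    "key": "B7",
                    "doc_count": 900,
                    "aggTotalSales": {"value": 4500.0},
                    "item": {
                        "hits": {
                            "hits": [
                                {
                                    "_source": {
                                        "itemId_string": "B7",
                                        "itemName_string_not_indexed": "Espresso",
                                        "itemBrandName_string_not_indexed": "HouseBlend",
                                    }
                                }
                            ]
                        }
                    },
                }
            ]
        }
    }
}


def rows_from_response(resp):
    """Flatten each terms bucket into one row for the UI."""
    rows = []
    for bucket in resp["aggregations"]["byItemTopSales"]["buckets"]:
        hits = bucket["item"]["hits"]["hits"]
        # top_hits with size 1 returns at most one document per bucket
        src = hits[0]["_source"] if hits else {}
        rows.append({
            "itemId": bucket["key"],
            "itemName": src.get("itemName_string_not_indexed"),
            "itemBrandName": src.get("itemBrandName_string_not_indexed"),
            "totalSales": bucket["aggTotalSales"]["value"],
        })
    return rows


rows = rows_from_response(response)
```

Since all documents for an item id share the same name and brand, it does not matter which document top_hits picks for the bucket.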

jj1234 · May 9, 2019, 7:58pm

Ah, you are right, top hits with a size of 1 gave me what I wanted. I guess I thought top hits wasn't meant to be used with a size of 1 like this, but it is interesting that this is the way to get the other fields of a document in an aggregation.

Thanks for your help!

system · June 6, 2019, 7:58pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
