How to optimize elasticsearch aggregations to just take any for specific fields

Let's say I am trying to get the top 10 items with the top sales across a time period. I want to just show it on a UI with the item id, item name, item store id, item section id, item brand and item sales amount in period. All items for a given item id have the same name, item store id, item section id, and item brand. What is the right way to do this? How do I get performance while doing this?

At first, I thought I could just use the hits array to create a lookup map, but the ES window limit is 10,000. That worked, but a problem occurred when a store had more than 10,000 hits with a given period (see query below).

{
  "size" : 10000,
  "timeout" : 1500,
  "query" : {
    "bool" : {
      "must" : [ {
        "term" : {
          "itemStoreId_string" : "12342323232"
        }
      }, {
        "range" : {
          "orderDate_timestamp" : {
            "from" : "2019-01-01T00:00:00.000Z",
            "to" : "2019-06-01T00:00:00.000Z",
            "include_lower" : true,
            "include_upper" : true
          }
        }
      } ]
    }
  },
  "aggregations" : {
	    "byItemTopSales" : {
	      "terms" : {
	        "field" : "itemId_string",
	        "size" : 10,
	        "order" : {
	          "aggTotalSales" : "desc"
	        }
	      },
	      "aggregations" : {
	        "aggTotalSales" : {
	          "sum" : {
	            "field" : "totalSales_double_not_indexed"
	          }
	        }
        }
      }
    }
  }
}

Then, I tried aggregating more fields so I wouldn't have to rely on a lookup map and wouldn't run into the 10k limit on window size. I noticed that this query (below) is significantly slower. Is this expected? How can I make this faster?

{
  "size" : 0,
  "timeout" : 1500,
  "query" : {
    "bool" : {
      "must" : [ {
        "term" : {
          "itemStoreId_string" : "12342323232"
        }
      }, {
        "range" : {
          "orderDate_timestamp" : {
            "from" : "2019-01-01T00:00:00.000Z",
            "to" : "2019-06-01T00:00:00.000Z",
            "include_lower" : true,
            "include_upper" : true
          }
        }
      } ]
    }
  },
  "aggregations" : {
	    "byItemTopSales" : {
	      "terms" : {
	        "field" : "itemId_string",
	        "size" : 10,
	        "order" : {
	          "aggTotalSales" : "desc"
	        }
	      },
	      "aggregations" : {
	        "aggTotalSales" : {
	          "sum" : {
	            "field" : "totalSales_double_not_indexed"
	          }
	        },
	       "itemName" : {
	          "terms" : {
	            "field" : "itemName_string_not_indexed"
	          }
	        },
	       "itemId" : {
	          "terms" : {
	            "field" : "menuItemId_string_not_indexed"
	          }
	        },
	       "itemBrandName" : {
	          "terms" : {
	            "field" : "itemBrandName_string_not_indexed"
	          }
	        }
      }
    }
  }
}

Again, all items for a given item id have the same name, item store id, item section, and item brand.

How do I change my query to have it just take the first matching name, item store id, .etc.? Any tips would be much appreciated. Thanks!

Have you tried the Top Hits aggregator? it sounds like a good match:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html

Isn't top hits sort of doing the reverse? It seems to be for a subaggregation so the use case would be if I wanted to get the Total sales of the top 10 items, but what I am looking for is to get the total sales for each of the items aggregated and I want the top 10 of those items ordered by aggregated sales.

I think I misunderstand your question.

What do you mean with ES window limit is 10,000? This limit is for the returned documents but aggregations will still run in all hits from the query.

Would it be possible for you to share an example of what you are trying to achieve?

If I change the size to be > 10000, there is a complaint about max_result_window. I realize that I can change this, but I am hesitant.

I am using the hits (along with the aggregations) from the output. In what I was originally doing, I was turning the hits array into a lookup map. Then for the top 10 items I would lookup the other fields (itemName, sectionName, etc.) using that lookup map. However, there are cases where a store can have many items over a period and that the 10,000 max_result_window size doesn't encompass a top 10 item. In that case I would get no itemName, sectionName, etc. to go along with the itemId and sales in the aggregation. Please see the example below (the resulting 10,000 hits array is used to lookup the name etc.

{
  "size" : 10000,
  "timeout" : 1500,
  "query" : {
    "bool" : {
      "must" : [ {
        "term" : {
          "itemStoreId_string" : "12342323232"
        }
      }, {
        "range" : {
          "orderDate_timestamp" : {
            "from" : "2019-01-01T00:00:00.000Z",
            "to" : "2019-06-01T00:00:00.000Z",
            "include_lower" : true,
            "include_upper" : true
          }
        }
      } ]
    }
  },
  "aggregations" : {
	    "byItemTopSales" : {
	      "terms" : {
	        "field" : "itemId_string",
	        "size" : 10,
	        "order" : {
	          "aggTotalSales" : "desc"
	        }
	      },
	      "aggregations" : {
	        "aggTotalSales" : {
	          "sum" : {
	            "field" : "totalSales_double_not_indexed"
	          }
	        }
        }
      }
    }
  }
}

Should I be increasing the size or is there a better way without making 2 queries? I feel like this is a common query (aggregate first and then do a top 10 based on those aggregations).

I still think you can use the top_hits_aggregation, why is it not working from you if the values are the same in all documents you are aggregating by itemId_string?

Something like:

{
  "size": 0,
  "timeout": 1500,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "itemStoreId_string": "12342323232"
          }
        },
        {
          "range": {
            "orderDate_timestamp": {
              "from": "2019-01-01T00:00:00.000Z",
              "to": "2019-06-01T00:00:00.000Z",
              "include_lower": true,
              "include_upper": true
            }
          }
        }
      ]
    }
  },
  "aggregations": {
    "byItemTopSales": {
      "terms": {
        "field": "itemId_string",
        "size": 10,
        "order": {
          "aggTotalSales": "desc"
        }
      },
      "aggregations": {
        "item": {
          "top-hits": {
             "size" : 1,
            "_source": {
              "includes": [
                "itemId_string",
                "itemName_string_not_indexed",
                "menuItemId_string_not_indexed",
                "itemBrandName_string_not_indexed"
              ]
            }
          },
          "aggTotalSales": {
            "sum": {
              "field": "totalSales_double_not_indexed"
            }
          }
        }
      }
    }
  }
}

Ah you are right, top hits with a size of 1 gave me what I wanted. I guess I thought top hits wasn't used for this use case of 1, but it is interesting that this is the option to get the other fields of a document in an aggregation.

Thanks for your help!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.