Averaging more than 10,000 results

cohenran · April 3, 2018, 11:01am

I have millions of records. I want to get the average of a specific field, but can only return less that 10,000 rows.
Is the average returned is the real average of the millions of rows, or the average of the returned rows?

This is the query I run:

{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "ccpair":"EUR / USD"
                }
              },
              {
                "term": {
                  "ccpair":"EUR/USD"
                }
              },
              {
                "term": {
                  "ccpair":"EUR/USD-"
                }
              }]
          }
        },
        {
          "bool": {
            "must": [
              {
                "term": {
                  "tnetServerName": "saxo-p.tradair.com"
                }
              }]
          }
        }
      ]
    }
  },
  "aggs": {
    "platformNames": {
      "terms": {
        "field": "platformName",
        "size": 5
      },
      "aggs": {
        "per_hour": {
          "date_histogram": {
            "field": "timestamp",
            "interval": "hour"
          },
          "aggs": {
            "by_top_hit": {
              "top_hits": {
                "size": <does the size really matter?>
              }
            },
            "aggs": {
              "nested": {
                "path": "tiers"
              },
              "aggs": {
                "avg_spread2": {
                  "avg": {
                    "field": "tiers.spread"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

polyfractal · April 6, 2018, 12:51pm

First, I think your agg syntax is a bit messed up. This section:

cohenran:

       "aggs": {
            "by_top_hit": {
              "top_hits": {
                "size": ...
              }
            },
            "aggs": {
              "nested": {
                "path": "tiers"
              },

looks to be bad syntax. E.g. the next level "aggs" needs to embed inside a named aggregation, next to the agg type. E.g. it'd look more like:

       "aggs": {
            "by_top_hit": {
              "top_hits": {
                "size": ...
              },
              "aggs": {
                "nested": {
                  "path": "tiers"
                },
              }
            }

With the nested agg going next to "top_hits", not below it.

That said, this will not work, because top_hits doesn't accept any child aggregations.

Ok, back to your original question:

Is the average returned is the real average of the millions of rows, or the average of the returned rows?

The average will be calculated on whatever the query and partitioning is configured for. For example, this aggregation:

{
  "size": 100,
  "aggs": {
    "top_level_avg": {
      "avg": {
        "field": "tiers.spread"
      }
    }
  }
}

will return 100 search hits, but because there is no filtering condition in the aggregation, and no partitioning in the aggregation, the average will be across the entire index.

In contrast, this aggregation:

{
  "size": 100,
  "aggs": {
    "top_level_avg": {
      "avg": {
        "field": "tiers.spread"
      }
    },
    "my_terms": {
      "terms": {
        "field": "title",
        "size": 5
      },
      "aggs": {
        "avg_by_title": {
          "avg": {
             "field": "tiers.spread"
           }
        }
      }
    }
  }
}

Will give you the same "global" average of all the documents, but also an average-per-title because of the avg embedded in the terms aggregation. This second average will be complete for all documents that match each of the titles

You'll note the size: 5 on the terms agg. If you have more than five title in your data, then some of the titles will be omitted.

This isn't quite what you were asking since you were using top_hits, but if you can update your query to correct syntax I might be able to answer more specifically.

system · May 4, 2018, 12:51pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Top average list of terms Elasticsearch	4	725	March 22, 2019
Aggs don't return all records Elasticsearch	2	2007	January 5, 2018
Using aggr query over a specific size of result instead of entire index Elasticsearch	2	512	August 11, 2017
Why aggregation result return max results only 10 Elasticsearch	5	2997	February 8, 2019
Return the top k hits from each bucket after aggregation? Elasticsearch	3	470	February 27, 2021

Averaging more than 10,000 results

Related topics