Averaging more than 10,000 results

I have millions of records. I want to get the average of a specific field, but can only return less that 10,000 rows.
Is the average returned is the real average of the millions of rows, or the average of the returned rows?

This is the query I run:

{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": [
              {
                "term": {
                  "ccpair":"EUR / USD"
                }
              },
              {
                "term": {
                  "ccpair":"EUR/USD"
                }
              },
              {
                "term": {
                  "ccpair":"EUR/USD-"
                }
              }]
          }
        },
        {
          "bool": {
            "must": [
              {
                "term": {
                  "tnetServerName": "saxo-p.tradair.com"
                }
              }]
          }
        }
      ]
    }
  },
  "aggs": {
    "platformNames": {
      "terms": {
        "field": "platformName",
        "size": 5
      },
      "aggs": {
        "per_hour": {
          "date_histogram": {
            "field": "timestamp",
            "interval": "hour"
          },
          "aggs": {
            "by_top_hit": {
              "top_hits": {
                "size": <does the size really matter?>
              }
            },
            "aggs": {
              "nested": {
                "path": "tiers"
              },
              "aggs": {
                "avg_spread2": {
                  "avg": {
                    "field": "tiers.spread"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

First, I think your agg syntax is a bit messed up. This section:

looks to be bad syntax. E.g. the next level "aggs" needs to embed inside a named aggregation, next to the agg type. E.g. it'd look more like:

       "aggs": {
            "by_top_hit": {
              "top_hits": {
                "size": ...
              },
              "aggs": {
                "nested": {
                  "path": "tiers"
                },
              }
            }

With the nested agg going next to "top_hits", not below it.

That said, this will not work, because top_hits doesn't accept any child aggregations.

Ok, back to your original question:

Is the average returned is the real average of the millions of rows, or the average of the returned rows?

The average will be calculated on whatever the query and partitioning is configured for. For example, this aggregation:

{
  "size": 100,
  "aggs": {
    "top_level_avg": {
      "avg": {
        "field": "tiers.spread"
      }
    }
  }
}

will return 100 search hits, but because there is no filtering condition in the aggregation, and no partitioning in the aggregation, the average will be across the entire index.

In contrast, this aggregation:

{
  "size": 100,
  "aggs": {
    "top_level_avg": {
      "avg": {
        "field": "tiers.spread"
      }
    },
    "my_terms": {
      "terms": {
        "field": "title",
        "size": 5
      },
      "aggs": {
        "avg_by_title": {
          "avg": {
             "field": "tiers.spread"
           }
        }
      }
    }
  }
}

Will give you the same "global" average of all the documents, but also an average-per-title because of the avg embedded in the terms aggregation. This second average will be complete for all documents that match each of the titles

You'll note the size: 5 on the terms agg. If you have more than five title in your data, then some of the titles will be omitted.

This isn't quite what you were asking since you were using top_hits, but if you can update your query to correct syntax I might be able to answer more specifically.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.