Very slow aggregate query with 15M records

Hello, I have an index with more than 15M records of website logs. When I try to run an aggregation to sum the totals of some fields, the request takes more than 10 seconds and then fails with this error:

```
{
  "statusCode": 502,
  "error": "Bad Gateway",
  "message": "socket hang up"
}
```
This is the query:

```
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "day": {
              "gte": "2024-04-03",
              "lte": "2024-04-04",
              "format": "yyyy-MM-dd"
            }
          }
        },
        {
          "term": {
            "country_alias": "kw"
          }
        }
      ]
    }
  },
  "aggs": {
    "entities": {
      "terms": {
        "field": "entity_title",
        "size": 100000
      },
      "aggs": {
        "records": {
          "top_hits": {
            "size": 1,
            "_source": {
              "includes": [
                "entity_id",
                "entity_type",
                "entity_title",
                "country_alias",
                "main_taxonomy",
                "main_taxonomy_title",
                "owner_phone",
                "created_at",
                "updated_at"
              ]
            }
          }
        },
        "total_visits": {
          "sum": {
            "script": {
              "lang": "painless",
              "source": "doc['visit_ios_count'].value + doc['visit_android_count'].value + doc['visit_huawei_count'].value + doc['visit_web_count'].value"
            }
          }
        },
        "total_calls": {
          "sum": {
            "script": {
              "lang": "painless",
              "source": "doc['call_ios_count'].value + doc['call_android_count'].value + doc['call_huawei_count'].value + doc['call_web_count'].value"
            }
          }
        },
        "total_whatsapp": {
          "sum": {
            "script": {
              "lang": "painless",
              "source": "doc['whatsapp_ios_count'].value + doc['whatsapp_android_count'].value + doc['whatsapp_huawei_count'].value + doc['whatsapp_web_count'].value"
            }
          }
        },
        "total_chat": {
          "sum": {
            "script": {
              "lang": "painless",
              "source": "doc['chat_ios_count'].value + doc['chat_android_count'].value + doc['chat_huawei_count'].value + doc['chat_web_count'].value"
            }
          }
        }
      }
    }
  }
}
```

Can anyone help me fix this issue?

@dadoonet Can you help me?

This is a community forum where everyone volunteers, so it is considered rude to ping people who are not already involved in the conversation. Please also be patient, as it can take time to get questions answered. The more specific a question is, the more time it may take, as not everyone may know the answer or that area. If you have not received any response after 2 or 3 business days, it is usually fine to bump the thread.

When asking a question, it is always useful to specify which version of Elasticsearch you are using and some details about the size and configuration of your cluster.

I have never seen the error you posted. Do you have a proxy in between the client and Elasticsearch that could cause the error?

How many shards is the data spread across? What is the size of these?
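
If you are not sure, the cat shards API lists shard count and size per index, for example:

```
GET _cat/shards?v
```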

When your query runs and is slow, do you notice high CPU usage?

Given that you use scripting to sum up different fields within each document, have you tried adding these precomputed sums to the documents instead, so you do not have to use scripts?
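
As a rough sketch of that idea (the pipeline name and the stored field names here are illustrative, not taken from your mapping), an ingest pipeline could compute the per-document totals once at index time:

```
PUT _ingest/pipeline/precompute-totals
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx.total_visits = ctx.visit_ios_count + ctx.visit_android_count + ctx.visit_huawei_count + ctx.visit_web_count; ctx.total_calls = ctx.call_ios_count + ctx.call_android_count + ctx.call_huawei_count + ctx.call_web_count;"
      }
    }
  ]
}
```

The same pattern applies to the whatsapp and chat counters. Each sum aggregation can then reference a plain field, e.g. `"total_visits": { "sum": { "field": "total_visits" } }`, and existing documents could be backfilled by running `_update_by_query` with that pipeline.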


I'm so sorry for the mistake, and all respect to the community.

I have never seen the error you posted. Do you have a proxy in between the client and Elasticsearch that could cause the error?

I asked our DevOps engineer, and he said that Cloudflare has been added in between.

I'm using the latest version of Elasticsearch, 8.15.

This is the cat shards output:

```
classified-log-report-per-day 0 p STARTED    54000 28.4mb 28.4mb 172.18.0.2 Elasticsearch
classified-log-report-per-day 0 r UNASSIGNED
```

Finally, I'm using the sum script because this data is synced from MongoDB.

=========================

I'll share an example of the final result. In this example, I'm grouping the totals for each entity_title to get the totals over a date range, for example one month (the adjusted range clause is shown after the result).

```
{
  "took": 42,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "entities": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 27344,
      "buckets": [
        {
          "key": "ابحث عن عمل ",
          "doc_count": 21,
          "records": {
            "hits": {
              "total": {
                "value": 21,
                "relation": "eq"
              },
              "max_score": 1.6797194,
              "hits": [
                {
                  "_index": "classified-log-report-per-day",
                  "_id": "66a9ad4f72912a8f4a00cf71",
                  "_score": 1.6797194,
                  "_source": {
                    "entity_title": "ابحث عن عمل ",
                    "created_at": "2024-07-31T03:19:43.072000",
                    "updated_at": "2024-07-31T03:19:43.072000",
                    "country_alias": "kw",
                    "main_taxonomy": 195,
                    "owner_phone": "+96599615720",
                    "entity_id": 10464740,
                    "entity_type": "post",
                    "main_taxonomy_title": {
                      "ar": "وظائف / باحثون عن عمل",
                      "en": "Jobs"
                    }
                  }
                }
              ]
            }
          },
          "total_visits": {
            "value": 53
          },
          "total_calls": {
            "value": 0
          },
          "total_whatsapp": {
            "value": 3
          },
          "total_chat": {
            "value": 0
          }
        }
      ]
    }
  }
}
```
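
For a one-month window, only the range clause in the query above changes; the dates here are just an example:

```
"range": {
  "day": {
    "gte": "2024-04-01",
    "lte": "2024-04-30",
    "format": "yyyy-MM-dd"
  }
}
```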

That probably explains the error you are seeing.

It looks like you have a single node. As the index is very small, it is odd that the query takes that long, even if it uses scripts.

It would help if you could provide details about the amount of resources available to this node, e.g. CPU cores, RAM, heap size and the type of storage used.
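
Most of those details can be pulled from the nodes stats API, for example:

```
GET _nodes/stats/os,jvm,fs
```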

It would also help if you could monitor CPU usage while the query is running.



Can you check the image above?

That does not really answer my question about provisioned resources. Was it taken when the query was running?