Date histogram agg crashes cluster

javadevmtl · April 9, 2016, 12:14am

Hi, we are running Elasticsearch 2.3.0.

This issue is 100% reproducible.

We have a 12-node cluster with 20GB of RAM per node, so 240GB total.

There are 2 indexes: Index1 has about 35,000,000 records and Index2 about 6,000,000. Both indexes are "identical" except for the mapping difference noted below (we attempted a new mapping).

Index1 has a mapping of...

"myDate": {
        "format": "dateOptionalTime",
        "type": "date"
},

Index2 has a mapping of...

"myDate": {
        "type": "long"
 },

The type is the same on both indexes. The documents are inserted with myDate as yyyyMMdd (no time, just the day).

Index1 has...
8,000,000 documents for 20160101
6,000,000 documents for 20160102
7,000,000 documents for 20160103
8,000,000 documents for 20160104

Index2 has
6,000,000 documents for 20160407

When we run the query below, the cluster crashes. We lose nodes...

GET index*/myType/_search
{
  "size" : 0,
  "aggregations" : {
    "Date" : {
      "date_histogram" : {
        "field" : "myDate",
        "interval" : "1d"
      },
      "aggregations" : {
        "Record Count" : {
          "value_count" : {
            "field" : "myId"
          }
        }
      }
    }
  },
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "myUser" : {
            "type" : "phrase",
            "query" : "user1"
          }
        }
      }
    }
  }
}

If we run the same agg on each index individually, without the wildcard, it seems to work. We have noticed, though, that on Index1 the agg returns a couple thousand records for each "day", while on Index2 it returns a bucket with a doc count.

The problem occurs when we run the agg with a wildcard across both indexes; that is when we lose the cluster. From the application standpoint, we are trying to rectify the issue by revising the mapping and the data inserted.

Just letting you know that the above combination wreaks havoc on Elasticsearch. Hopefully it is something you can reproduce and fix to avoid this kind of crash.
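As a minimal sketch of what the data side of that fix could look like: convert each yyyyMMdd value to an ISO 8601 date string before indexing, so both indexes can share one date mapping. The helper name `yyyymmdd_to_iso` is hypothetical and not part of any Elasticsearch client.

```python
from datetime import datetime


def yyyymmdd_to_iso(value):
    """Convert a yyyyMMdd value (int or str) to an ISO 8601 date string.

    Hypothetical helper for illustration; raises ValueError on bad input.
    """
    return datetime.strptime(str(value), "%Y%m%d").date().isoformat()


# Works for both the numeric and the string form of the same day.
print(yyyymmdd_to_iso(20160407))
print(yyyymmdd_to_iso("20160101"))
```

Sending an unambiguous date string avoids relying on how each mapping interprets a bare number.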

Thanks

warkolm · April 9, 2016, 1:47am

Crashes it how?

javadevmtl · April 11, 2016, 2:22am

We lose nodes. You see them disconnect, and the cluster tries to rebalance itself. They come back eventually, I think, but the first time I manually bounced the nodes. I'll reproduce it again and see what happens...

warkolm · April 11, 2016, 2:54am

Yeah but why are they lost - CPU, OOM, other?

javadevmtl · April 11, 2016, 12:42pm

I didn't see an OOM exception, but it looks like a memory issue.

It has something to do with the fact that one index is date optional and the other is long, and it seems to try to load too much data into RAM?

javadevmtl · April 11, 2016, 2:33pm

It's GC. For the node that received the query, KOPF reported RAM usage at 100%.

The query below is OK.
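For anyone who wants to confirm the heap pressure without KOPF, the nodes stats API (`GET _nodes/stats/jvm`) reports `heap_used_percent` per node. Below is a minimal sketch that reads that field from an already-parsed response; the function name and the sample values are invented for illustration.

```python
def heap_usage_by_node(node_stats):
    """Map node name -> JVM heap used percent.

    Expects the parsed JSON body of GET _nodes/stats/jvm.
    """
    return {
        node["name"]: node["jvm"]["mem"]["heap_used_percent"]
        for node in node_stats["nodes"].values()
    }


# Hypothetical sample shaped like the API response; the numbers are made up.
sample = {
    "nodes": {
        "id-a": {"name": "node-1", "jvm": {"mem": {"heap_used_percent": 99}}},
        "id-b": {"name": "node-2", "jvm": {"mem": {"heap_used_percent": 41}}},
    }
}
print(heap_usage_by_node(sample))
```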

Where myDate is dateOptionalTime
GET index1-201601/myType/_search
{
  "size": 0,
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "myUserId" : {
            "type" : "phrase",
            "query" : 100000
          }
        }
      }
    }
  },
  "aggs": {
    "bydate": {
      "date_histogram": {
        "field": "myDate",
        "interval": "day",
        "format" : "yyyyMMdd"
      }
    }
  }
}

The query below is also OK. It returns just a doc count. Should date histogram even work on a long?

Where myDate is long
GET index2-201604/myType/_search
{
  "size": 0,
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "myUserId" : {
            "type" : "phrase",
            "query" : 100000
          }
        }
      }
    }
  },
  "aggs": {
    "bydate": {
      "date_histogram": {
        "field": "myDate",
        "interval": "day",
        "format" : "yyyyMMdd"
      }
    }
  }
}
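On the question of the long field: assuming the date histogram reads a numeric value as milliseconds since the epoch, a long such as 20160407 would not mean 7 April 2016. The sketch below shows where that number lands on the timeline under that assumption; `as_epoch_millis` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def as_epoch_millis(value):
    """Interpret an integer as milliseconds since 1970-01-01T00:00:00Z."""
    return EPOCH + timedelta(milliseconds=value)


# 20160407 ms is only about five and a half hours after the epoch.
print(as_epoch_millis(20160407).isoformat())
```

Under that reading every document in the long-mapped index falls on the same day in 1970, which would be consistent with the single doc count returned above.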

The query below is the culprit.

Query on wildcard.
GET index*/myType/_search
{
  "size": 0,
  "query" : {
    "bool" : {
      "must" : {
        "match" : {
          "myUserId" : {
            "type" : "phrase",
            "query" : 100000
          }
        }
      }
    }
  },
  "aggs": {
    "bydate": {
      "date_histogram": {
        "field": "myDate",
        "interval": "day",
        "format" : "yyyyMMdd"
      }
    }
  }
}

It causes immediate GC thrashing on the node that received the query.
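A possible contributor, not confirmed anywhere in this thread: if the two indexes place myDate values at very different points on the timeline, the merged histogram has to cover the whole range with one bucket per day, including empty ones when min_doc_count is 0. The sketch below only counts how many daily buckets a given span needs. The example dates are assumptions chosen for illustration, with 1970-01-01 standing in for a long value read as epoch milliseconds.

```python
from datetime import date


def daily_bucket_count(first_day, last_day):
    """Number of 1-day buckets needed to cover first_day..last_day inclusive."""
    return (last_day - first_day).days + 1


# A span that stays inside one index's data versus one that spans decades.
print(daily_bucket_count(date(2016, 1, 1), date(2016, 1, 4)))
print(daily_bucket_count(date(1970, 1, 1), date(2016, 1, 4)))
```

The wider the gap between the smallest and largest interpreted value, the more buckets the coordinating node has to build when it merges the shard results.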

Logs here: http://pastebin.com/Dbczc8qK

The rest of the nodes seem OK, but that one node still hasn't recovered.

javadevmtl · April 11, 2016, 2:42pm

OK, the node core dumped. The rest of the nodes are fine. So whichever node that query is run on, it causes chaos there.

warkolm · April 11, 2016, 9:01pm

Maybe check with the validate API (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-validate.html) and see what is happening?

Though "one index is date optional and the other is long" doesn't sound good even if it isn't related.

javadevmtl · April 13, 2016, 7:42pm

Validate doesn't support aggs.

Anyway, I can reproduce this every time. @warkolm, is there anybody who can take a closer look at this?

javadevmtl · April 19, 2016, 1:23pm

@warkolm

Hello Mark, what should I do? File a bug? Any other thoughts on this?
