Query, aggregation (date_histogram) and timezone performance ES 5.6.2

Hi!

We are running ES 5.6.2 on an 8-core, 64 GB machine (30 GB heap). We are using a SAN (yes, I know this is not recommended, but let me continue). During the tests we are performing at the moment we use a single data node plus a separate master.

Yesterday I started running some basic ingest performance tests with the settings below (tuning some of the parameters a little).

"index.number_of_replicas": 0,    
"index.number_of_shards" : 1,    
"index.refresh_interval" : "-1",    
"index.translog.durability" : "async",    
"index.translog.flush_threshold_size": "1gb"
 
PUT _cluster/settings { "transient" : { "indices.store.throttle.type" : "none"    }  } (reverted to merge before search)
PUT _cluster/settings { "persistent" : { "indices.store.throttle.max_bytes_per_sec" : "200mb"   } }

This is what we are indexing:

 "_source": {
    "@version": "1",
    "@timestamp": "2017-10-18T15:18:24.595Z",
    "message": "contents of an access log file",
    "kafka": {
      "topic": "perf_test",
      "consumer_group": "perf_test",
      "partition": 1,
      "offset": 2165328553,
      "key": null
    }
  }

Mapping:

 "mappings" : {
    "_default_" : {
      "_all" : {"enabled" : false, "norms" : false},
      "dynamic_templates" : [ {
        "message_field" : {
          "path_match" : "message",
          "match_mapping_type" : "string",
          "mapping" : {
            "type" : "text",
            "norms" : false
          }
        }
      }, {
        "string_fields" : {
          "match" : "*",
          "match_mapping_type" : "string",
          "mapping" : {
            "type" : "text", "norms" : false
          }
        }
      } ],
      "properties" : {
        "@timestamp": { "type": "date", "include_in_all": false },
        "@version": { "type": "keyword", "include_in_all": false },
        "geoip"  : {
          "dynamic": true,
          "properties" : {
            "ip": { "type": "ip" },
            "location" : { "type" : "geo_point" },
            "latitude" : { "type" : "half_float" },
            "longitude" : { "type" : "half_float" }
          }
        }
      }
    }
  }

We had an average throughput of 55,000 docs per second, which I think is quite OK. The index now contains 364 million records across two hours. I was not able to track down anything weird on the I/O side during indexing, so the SAN seems to work OK for now.
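
As a side note, indexing and merge activity can also be sanity-checked via the index stats API (again using perf_test as a placeholder index name):

GET perf_test/_stats/indexing,merge,store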

The problem: After this I went to Kibana just to have a look at the result, but I kept getting timeouts until I increased the timeout a lot. The default Kibana view (Discover page, last 24 hours, query=*) takes over 130 seconds, and one CPU core goes to 100%. During this time we have no other queries and no indexing is being performed. If I start a second query in parallel, CPU goes to 200%, but the response times ("took": time_ms) stay the same.

This is the query that Kibana issues:

{
  "version": true,
  "size": 500,
  "sort": [
    {
      "@timestamp": {
        "order": "desc",
        "unmapped_type": "boolean"
      }
    }
  ],
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "*",
            "analyze_wildcard": true
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": 1508316014801,
              "lte": 1508402414801,
              "format": "epoch_millis"
            }
          }
        }
      ],
      "must_not": []
    }
  },
  "_source": {
    "excludes": []
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "30m",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      }
    }
  },
  "stored_fields": [
    "*"
  ],
  "script_fields": {},
  "docvalue_fields": [
    "@timestamp"
  ],
  "highlight": {
    "pre_tags": [
      "@kibana-highlighted-field@"
    ],
    "post_tags": [
      "@/kibana-highlighted-field@"
    ],
    "fields": {
      "*": {
        "highlight_query": {
          "bool": {
            "must": [
              {
                "query_string": {
                  "query": "*",
                  "analyze_wildcard": true,
                  "all_fields": true
                }
              },
              {
                "range": {
                  "@timestamp": {
                    "gte": 1508316014801,
                    "lte": 1508402414801,
                    "format": "epoch_millis"
                  }
                }
              }
            ],
            "must_not": []
          }
        }
      }
    },
    "fragment_size": 2147483647
  }
}

Remove aggregations: If I remove the "aggs" section (the "2" date_histogram) from the same query, it takes approx. 7 seconds, which is of course much faster but still a fairly long time.
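
The complement is easy to test too: running only the aggregation with "size": 0, so that no hits are fetched or highlighted. A sketch of such a request, reusing the time range from above (perf_test again being a placeholder index name):

GET perf_test/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": {
        "gte": 1508316014801,
        "lte": 1508402414801,
        "format": "epoch_millis"
      }
    }
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "30m",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      }
    }
  }
}

Adding "profile": true to the request body should also break down where the time is spent.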

Remove timezone: If I keep the aggregation but remove the "time_zone" parameter, the query goes from 130 seconds down to 26 seconds. I still think this is way too long, but I am not sure what to expect here.
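
For clarity, this is the aggregation with the time_zone parameter removed, everything else in the query unchanged:

  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "30m",
        "min_doc_count": 1
      }
    }
  }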

The question: Is this expected behaviour? I was expecting this to be a bit faster and not to consume this amount of CPU. System load is low and I cannot find any I/O issues during search either. Or are 364 million docs simply too many to work with, even with this simple query?

Any suggestions are highly appreciated.

Regards /Johan
