Improving performance of string id filtering and aggregation


With the query i get a result of 900ms - 1200ms, this is not good enough. i want to improve it and i feel i got stuck.

my SPEC:

3 nodes
8 ram (each node)
4 cpu (each node) - 2.90GHz, capacity: 4230MHz, width: 64 bits
5 shards, 3 replica each shard
ES: 5.2.1
about 7,000,000 docs


        "properties" : {
          "user_id": {
            "type": "string",
            "index": "not_analyzed"
          "date": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          "tran_value": {
            "type": "float"
          "type": {
            "type": "string",
            "index": "not_analyzed"


    "profile": true,
  "query": {
    "bool": {
      "must": [
          "term": {
            "doc.type": "transaction"
          "range": {
            "": {
              "lt": "2018-10-14T04:51:19.807Z"
          "terms": {
            "doc.user_id": [
  "size": 0,
  "from": 0,
  "aggs": {
    "transactionTotalValue": {
      "sum": {
        "field": "doc.tran_value"
    "transactionMinDate": {
      "min": {
        "field": ""

profile result:

(Christian Dahlqvist) #2

How large are your documents? How large are your shards? How many unique user_ids do you have in the data set? Is this the full data set you will be using in production? What number of concurrent queries are you expecting?


i have per shard about 1,266,414 docs with the total size of 820mb.
unique user_ids i have about 6 million.
in prod i have a bigger mapping and more aggs inside the query, but i found that the this filter and those aggs are the longest.
i will have about 4-5 concurrent queries that will run about 9-10 time in a min.
i got the time written here only for running this query alone.

is mapping the user_id as long number instead of a string can improve the filter time?

(Christian Dahlqvist) #4

As your data set is quite small it is likely that it will be cached, at least after the first query has run. I would recommend looking at this guide, and especially try forcemerging down to a single segment to see what effect that has. You can also experiment with the shard count and try increasing it a bit, and try using preference to get a spread of processing across the nodes in the cluster.

(ddorian43) #5

String user_id is faster as a filter compared to long. You can make it shorter though.
And if you can, try to encode the integers in there as base128/256 strings, to maybe make them shorter and maybe make the term-dictionary faster. Otherwise you have slow disk / low-ram / low-cpu if that simple filter takes that much time.

Or retry the query several times, to get the cached response and see if your disk is too slow.

(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.