Improving performance of string id filtering and aggregation

With the query below I get response times of 900ms - 1200ms, which is not good enough. I want to improve it and I feel stuck.

My spec:

3 nodes
8 GB RAM (each node)
4 CPUs (each node) - 2.90GHz, capacity: 4230MHz, width: 64 bits
5 shards, 3 replicas per shard
ES: 5.2.1
about 7,000,000 docs

mapping:

    {
      "properties": {
        "user_id": {
          "type": "string",
          "index": "not_analyzed"
        },
        "date": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        },
        "tran_value": {
          "type": "float"
        },
        "type": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }

query:

{
    "profile": true,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "doc.type": "transaction"
          }
        },
        {
          "range": {
            "doc.date": {
              "lt": "2018-10-14T04:51:19.807Z"
            }
          }
        },
        {
          "terms": {
            "doc.user_id": [
              "u::2-1482136850986-166310",
              "u::2-1476646137089-722440",
              "u::2-1479194700963-852066",
              "u::2-1480265460875-343974",
              "u::2-1481480223561-687220",
              "u::2-1490893533742-495973",
              "u::2-1491369718244-637855",
              "u::2-1516387240762-347825",
              "u::2-1528653244185-72739",
              "u::2-1528655615289-795763",
              "u::2-1529855538547-966658",
              "u::2-1530351338917-849377",
              "u::2-1530370314753-233523",
              "u::2-1530895620180-904769",
              "u::2-1534446475987-300580",
              "u::2-1535168457544-236782",
              "u::2-1537987992702-74474",
              "u::2-1538136688139-639988",
              "u::2-1538496245078-5097"
            ]
          }
        }
      ]
    }
  },
  "size": 0,
  "from": 0,
  "aggs": {
    "transactionTotalValue": {
      "sum": {
        "field": "doc.tran_value"
      }
    },
    "transactionMinDate": {
      "min": {
        "field": "doc.date"
      }
    }
  }
}

profile result: (profile screenshot attached: profile_1_shard)

How large are your documents? How large are your shards? How many unique user_ids do you have in the data set? Is this the full data set you will be using in production? What number of concurrent queries are you expecting?

I have about 1,266,414 docs per shard, with a total size of 820 MB per shard.
I have about 6 million unique user_ids.
In prod I have a bigger mapping and more aggs inside the query, but I found that this filter and these aggs are the slowest part.
I will have about 4-5 concurrent queries, each running about 9-10 times a minute.
The times given here are from running this query on its own.
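
For reference, these numbers can be pulled with the cat and aggregation APIs; a minimal sketch with the Python client (the localhost endpoint and index name myindex are placeholders):

    from elasticsearch import Elasticsearch

    # Endpoint and index name are placeholders; adjust for your cluster.
    es = Elasticsearch(["http://localhost:9200"])

    # Per-shard doc counts and on-disk sizes:
    print(es.cat.shards(index="myindex", v=True))

    # Approximate unique user_id count via a cardinality aggregation:
    resp = es.search(index="myindex", body={
        "size": 0,
        "aggs": {"unique_users": {"cardinality": {"field": "doc.user_id"}}}
    })
    print(resp["aggregations"]["unique_users"]["value"])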

Can mapping the user_id as a long number instead of a string improve the filter time?

As your data set is quite small, it is likely that it will be cached, at least after the first query has run. I would recommend looking at this guide, and especially trying a force-merge down to a single segment to see what effect that has. You can also experiment with the shard count (try increasing it a bit), and try using the preference parameter to get a spread of processing across the nodes in the cluster.
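
A minimal sketch of both suggestions with the Python client (the endpoint, index name myindex, and preference string are placeholders):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint

    # Merge each shard of the index down to one segment. Do this on an
    # index that is no longer being written to; merging is expensive and
    # new writes create fresh segments again.
    es.indices.forcemerge(index="myindex", max_num_segments=1)

    # Use a preference string so repeated queries with different strings
    # are routed to different shard copies, spreading load across nodes.
    query = {"size": 0, "query": {"term": {"doc.type": "transaction"}}}
    es.search(index="myindex", body=query, preference="session-42")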

A string user_id is faster as a filter than a long. You can make it shorter, though.
And if you can, try to encode the integers in there as base128/256 strings, to make the terms shorter and possibly the term dictionary faster. Otherwise, if that simple filter takes that much time, you likely have a slow disk, low RAM, or a weak CPU.
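
To illustrate the encoding idea, here is a sketch using a base-62 printable alphabet as a stand-in for the suggested base128/256 (which would need non-ASCII bytes); it re-encodes the two decimal integers in ids of the shape shown above:

    import string

    # Base-62 alphabet (digits + letters) as a printable stand-in.
    ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
    BASE = len(ALPHABET)  # 62

    def encode_int(n):
        """Encode a non-negative integer using the larger alphabet."""
        if n == 0:
            return ALPHABET[0]
        out = []
        while n:
            n, rem = divmod(n, BASE)
            out.append(ALPHABET[rem])
        return "".join(reversed(out))

    def shorten_id(user_id):
        """Re-encode the two decimal integers in ids like
        'u::2-1482136850986-166310' to shorten the indexed term."""
        prefix, ts, suffix = user_id.rsplit("-", 2)
        return "-".join([prefix, encode_int(int(ts)), encode_int(int(suffix))])

    print(shorten_id("u::2-1482136850986-166310"))

For the ids above, the 13-digit timestamp fits in 7 base-62 characters and the suffix in at most 4, which shortens every term in the dictionary.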

Or retry the query several times to get the cached response and see whether your disk is too slow.
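
A quick way to check: time the same request repeatedly; if the first run is much slower than the rest, the cold run is disk-bound (a sketch, assuming the same placeholder endpoint and index as above):

    import time
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://localhost:9200"])  # placeholder endpoint
    query = {"size": 0, "query": {"term": {"doc.type": "transaction"}}}

    # Run 0 is cold; later runs should be served from the caches.
    for i in range(5):
        start = time.time()
        es.search(index="myindex", body=query)
        print("run %d: %.0f ms" % (i, (time.time() - start) * 1000))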
