Improving performance of string id filtering and aggregation


#1

With this query I get a response time of 900ms-1200ms, which is not good enough. I want to improve it, but I feel stuck.

my SPEC:

3 nodes
8 GB RAM (each node)
4 CPUs (each node) - 2.90GHz, capacity: 4230MHz, width: 64 bits
5 shards, 3 replicas per shard
ES: 5.2.1
about 7,000,000 docs

mapping:

     {
        "properties" : {
          "user_id": {
            "type": "string",
            "index": "not_analyzed"
          },
          "date": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "tran_value": {
            "type": "float"
          },
          "type": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
     }

query:

{
    "profile": true,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "doc.type": "transaction"
          }
        },
        {
          "range": {
            "doc.date": {
              "lt": "2018-10-14T04:51:19.807Z"
            }
          }
        },
        {
          "terms": {
            "doc.user_id": [
              "u::2-1482136850986-166310",
              "u::2-1476646137089-722440",
              "u::2-1479194700963-852066",
              "u::2-1480265460875-343974",
              "u::2-1481480223561-687220",
              "u::2-1490893533742-495973",
              "u::2-1491369718244-637855",
              "u::2-1516387240762-347825",
              "u::2-1528653244185-72739",
              "u::2-1528655615289-795763",
              "u::2-1529855538547-966658",
              "u::2-1530351338917-849377",
              "u::2-1530370314753-233523",
              "u::2-1530895620180-904769",
              "u::2-1534446475987-300580",
              "u::2-1535168457544-236782",
              "u::2-1537987992702-74474",
              "u::2-1538136688139-639988",
              "u::2-1538496245078-5097"
            ]
          }
        }
      ]
    }
  },
  "size": 0,
  "from": 0,
  "aggs": {
    "transactionTotalValue": {
      "sum": {
        "field": "doc.tran_value"
      }
    },
    "transactionMinDate": {
      "min": {
        "field": "doc.date"
      }
    }
  }
}

profile result: profile_1_shard (screenshot attachment)


(Christian Dahlqvist) #2

How large are your documents? How large are your shards? How many unique user_ids do you have in the data set? Is this the full data set you will be using in production? What number of concurrent queries are you expecting?
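
For reference, shard sizes and doc counts can be checked with the _cat APIs (the index name transactions is a placeholder here):

GET _cat/indices/transactions?v&h=index,docs.count,store.size
GET _cat/shards/transactions?v&h=index,shard,prirep,docs,store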


#3

Each shard has about 1,266,414 docs, with a total size of 820 MB per shard.
I have about 6 million unique user_ids.
In production I have a bigger mapping and more aggs inside the query, but I found that this filter and these aggs take the longest.
I will have about 4-5 concurrent queries, each running about 9-10 times a minute.
The times quoted here are from running this query on its own.
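
The unique user_id count can be estimated with a cardinality aggregation, for example (a sketch; transactions is a placeholder index name):

GET transactions/_search
{
  "size": 0,
  "aggs": {
    "uniqueUserIds": {
      "cardinality": {
        "field": "doc.user_id"
      }
    }
  }
}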

Could mapping the user_id as a long number instead of a string improve the filter time?
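
For illustration, the remapped field would look something like this (hypothetical; only the user_id field shown):

{
  "properties": {
    "user_id": {
      "type": "long"
    }
  }
}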


(Christian Dahlqvist) #4

As your data set is quite small, it is likely that it will be cached, at least after the first query has run. I would recommend looking at this guide, and especially trying a force merge down to a single segment to see what effect that has. You can also experiment with the shard count and try increasing it a bit, and use the preference parameter to get a spread of processing across the nodes in the cluster.
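
A sketch of those two suggestions (transactions is a placeholder index name; the preference value is an arbitrary string):

POST transactions/_forcemerge?max_num_segments=1

GET transactions/_search?preference=my-session-1

The search request would carry the same body as the original query. Reusing one preference string keeps a query on the same shard copies, while giving each client its own string spreads the load across replicas.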


(ddorian43) #5

A string user_id is faster as a filter than a long. You can make it shorter, though.
And if you can, try encoding the integers in the ids as base128/base256 strings, which may make them shorter and the term dictionary faster. Otherwise, if that simple filter takes that much time, you likely have a slow disk, low RAM, or low CPU.

Or retry the query several times to get the cached response and see whether your disk is too slow.
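
One way to test that (a sketch; transactions is a placeholder index name) is to repeat the search with the shard request cache explicitly enabled; size: 0 requests like this one are normally cached anyway, so the flag just makes it explicit:

GET transactions/_search?request_cache=true

with the same size: 0 body as the original query. If the second and later runs are much faster than the first, the first run was most likely disk-bound.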


(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.