3 big indices with very slow query

#1

I have 3 big indices (logs_2016, logs_2017 and logs_2018). They are about 200 GB in size, with a doc count of about 200,000,000. I do 99% of my queries by the user_id parameter. Here's the mapping for logs_2017, for example:

{
  "logs_2017": {
    "mappings": {
      "log_item": {
        "_routing": {
          "required": true
        },
        "properties": {
          "@timestamp": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "app_id": {
            "type": "string",
            "index": "not_analyzed"
          },
          "user_id": {
            "type": "string",
            "index": "not_analyzed"
          },
          "description": {
            "type": "string"
          },
          "type_app": {
            "type": "long"
          },
          "os": {
            "type": "string",
            "index": "not_analyzed"
          },
          "session_id": {
            "type": "string",
            "index": "not_analyzed"

          ...
}
}
}
}
}
}

How I query (getting all logs for a specific event_id and device_type, ordered by timestamp and session_id):

GET logs_2016,logs_2017,logs_2018/_search
{
  "query": {
    "bool": {
      "must": [
        {
         "match": {
           "user_id": "32232699237"
         } 
        },
        {
          "match": {
            "device_type": "31"
          }
        }
      ]
    }
  },
  "size":100,
  "from":0,
  "sort":[{"timestamp":{"order":"desc"}}, {"session_id":{"order":"desc"}}]
}

My results are very slow, 2-4 seconds. How can I improve query performance? Should I create more indices, e.g. split by the last digit of the event_id, or maybe monthly indices (12 per year, but then I would be searching over 36 indices for 3 years)?

Running ES on 2 client nodes (4 CPU, 6GB RAM, 2GB HDD), 2 data nodes (8 CPU, 16GB RAM, 400GB HDD).

In the logs index template I set the number of shards to 16 and the number of replicas to 1.
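Roughly, the template looks like this (the template name logs and the pattern logs_* are illustrative, not the exact names I use):

```
PUT _template/logs
{
  "template": "logs_*",
  "settings": {
    "number_of_shards": 16,
    "number_of_replicas": 1
  }
}
```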

Ty

(Christian Dahlqvist) #2

What does CPU and disk I/O (and iowait) look like on the data nodes when you are querying? Are you indexing at the same time?

(Safwan Rahman) #3

Your query does not reflect your mapping type. Can you provide the correct mapping here, where event_id and device_type are present?

#4

It should be user_id instead of event_id; I just copied the wrong query..

#5

Where can I check this? With _cat/nodes?v ?

(Christian Dahlqvist) #6

If you are on Linux you can use 'iostat -x'.

#7

Unfortunately, I can only connect to the client nodes via ssh..

(Safwan Rahman) #8

I think you should use the keyword datatype instead of the string datatype. Moreover, it seems like you should use a filter clause with term queries instead of must with match queries.

Can you run the query below, without changing the mapping, and check whether performance improves?

GET logs_2016,logs_2017,logs_2018/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "user_id": "32232699237"
          }
        },
        {
          "term": {
            "device_type": "31"
          }
        }
      ]
    }
  },
  "size":100,
  "from":0,
  "sort":[{"timestamp":{"order":"desc"}}, {"session_id":{"order":"desc"}}]
}

#9

I already tried a filter query, but I'm getting pretty much the same results (3-4 seconds) :confused:

(Safwan Rahman) #10

What about sorting the indexes by timestamp and session_id at index time (index sorting)?
https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules-index-sorting.html
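As a sketch (index sorting requires Elasticsearch 6.x or later, can only be set at index creation, and the index name below is illustrative; the field names are taken from your query):

```
PUT logs_sorted
{
  "settings": {
    "index": {
      "sort.field": ["timestamp", "session_id"],
      "sort.order": ["desc", "desc"]
    }
  }
}
```

With the index sorted this way, queries that sort by the same fields can stop scanning segments early instead of scoring and sorting every matching document.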

(Safwan Rahman) #11

You may also use track_total_hits: false so that Elasticsearch does not try to count the total number of matching documents.
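For example (track_total_hits as a search request option is only available in newer versions, not in 2.3; the query body is shortened here):

```
GET logs_2016,logs_2017,logs_2018/_search
{
  "track_total_hits": false,
  "query": {
    "bool": {
      "filter": [
        { "term": { "user_id": "32232699237" } }
      ]
    }
  },
  "size": 100,
  "sort": [{ "timestamp": { "order": "desc" } }]
}
```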

#12

Forgot to mention I'm using ES version 2.3; there is no index sorting or keyword type in this version..

(Safwan Rahman) #13

In that case the best advice I can give is to upgrade your cluster to at least 5.x.
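On 5.x and later, the not_analyzed string fields from your mapping become keyword fields. A sketch of the relevant part of the mapping (the index name logs_2017_v2 is hypothetical; new mappings only apply to new indices, so existing data would need a reindex):

```
PUT logs_2017_v2
{
  "mappings": {
    "log_item": {
      "properties": {
        "user_id":     { "type": "keyword" },
        "app_id":      { "type": "keyword" },
        "os":          { "type": "keyword" },
        "session_id":  { "type": "keyword" },
        "description": { "type": "text" }
      }
    }
  }
}
```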

#14

I'd need to reindex everything, it would take 4 days :slight_smile:

(Safwan Rahman) #15

If you are not able to upgrade, then I think you should add more data nodes to your cluster.
It seems every index has 16 shards and 1 replica, and there are 3 indexes, so 16x2x3 = 96 shards in total. 2 nodes are overwhelmed by these 96 shards. Maybe adding data nodes will solve the slow query issue.

#16

I don't think that will be possible in the near future.. I'll probably try the things you mentioned on a newer ES version. I was just testing without "sort" and it's about 5 times faster (that's acceptable), but I need the data sorted by timestamp. I also just realized all users have a user_id in number format, while I have "string" in my mapping. Does that matter, or would it be the same? And thanks a lot..

(Safwan Rahman) #17

I don't know if how you save the user_id will help, but if you do not need the store mapping for a field, disabling it may give some performance boost.
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/mapping-store.html

#18

Should I maybe add more data nodes with less disk each? E.g. 8 data nodes x 50 GB instead of 2 data nodes x 200 GB?

(Alex Marquardt) #19

How many shards do you have in each index?

#20

16, and 1 replica