3 big indices with very slow query

#1

I have 3 big indices (logs_2016, logs_2017 and logs_2018). They are about 200 GB in size, with a doc count of about 200,000,000. I do 99% of my queries by the user_id parameter. Here's the mapping for logs_2017, for example:

{
  "logs_2017": {
    "mappings": {
      "log_item": {
        "_routing": {
          "required": true
        },
        "properties": {
          "@timestamp": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "app_id": {
            "type": "string",
            "index": "not_analyzed"
          },
          "user_id": {
            "type": "string",
            "index": "not_analyzed"
          },
          "description": {
            "type": "string"
          },
          "type_app": {
            "type": "long"
          },
          "os": {
            "type": "string",
            "index": "not_analyzed"
          },
          "session_id": {
            "type": "string",
            "index": "not_analyzed"

          ...
}
}
}
}
}
}

How I query (getting all logs for a specific event_id and device_type, ordered by timestamp and session_id):

GET logs_2016,logs_2017,logs_2018/_search
{
  "query": {
    "bool": {
      "must": [
        {
         "match": {
           "user_id": "32232699237"
         } 
        },
        {
          "match": {
            "device_type": "31"
          }
        }
      ]
    }
  },
  "size":100,
  "from":0,
  "sort":[{"timestamp":{"order":"desc"}}, {"session_id":{"order":"desc"}}]
}

My results are very slow, 2-4 seconds. How can I improve query performance? Should I create more indices, e.g. split by the last digit of the event_id, or maybe monthly indices (12 per year, but then I would be searching over 36 indices for 3 years)?

Running ES on 2 client nodes (4 CPU, 6GB RAM, 2GB HDD), 2 data nodes (8 CPU, 16GB RAM, 400GB HDD).

In the logs index template I set the number of shards to 16 and the number of replicas to 1.
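Roughly, the template looks like this (the template name logs and the pattern logs_* are illustrative, not the exact names I use):

```
PUT _template/logs
{
  "template": "logs_*",
  "settings": {
    "number_of_shards": 16,
    "number_of_replicas": 1
  }
}
```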

Ty

(Christian Dahlqvist) #2

What does CPU and disk I/O (and iowait) look like on the data nodes when you are querying? Are you indexing at the same time?

(Safwan Rahman) #3

Your query does not reflect your mapping type. Can you provide the correct mapping here, where event_id and device_type are present?

#4

It should be user_id instead of event_id; I just copied the wrong query..

#5

Where can I check this? With _cat/nodes?v ?

(Christian Dahlqvist) #6

If you are on Linux you can use 'iostat -x'.

#7

Unfortunately, I can only connect to the client nodes via ssh..

(Safwan Rahman) #8

I think you should use the keyword datatype instead of the string datatype. Moreover, it seems like you should use a filter clause with term queries instead of must with match queries.

Can you run the query below, without changing the mapping, and check whether performance improves?

GET logs_2016,logs_2017,logs_2018/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "user_id": "32232699237"
          }
        },
        {
          "term": {
            "device_type": "31"
          }
        }
      ]
    }
  },
  "size":100,
  "from":0,
  "sort":[{"timestamp":{"order":"desc"}}, {"session_id":{"order":"desc"}}]
}

#9

I already tried a filter query, but I'm getting pretty much the same results (3-4 seconds) :confused:

(Safwan Rahman) #10

What about sorting the indexes by timestamp and session_id at index time (index sorting)?
https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules-index-sorting.html
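As a sketch (index sorting requires Elasticsearch 6.x or later, can only be set at index creation, and the index name below is illustrative; the field names are taken from your query):

```
PUT logs_sorted
{
  "settings": {
    "index": {
      "sort.field": ["timestamp", "session_id"],
      "sort.order": ["desc", "desc"]
    }
  }
}
```

With the index sorted this way, queries that sort by the same fields can stop scanning segments early instead of scoring and sorting every matching document.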

(Safwan Rahman) #11

You may also use track_total_hits: false so that Elasticsearch does not try to count the total number of matching documents.
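For example (track_total_hits as a search request option is only available in newer versions, not in 2.3; the query body is shortened here):

```
GET logs_2016,logs_2017,logs_2018/_search
{
  "track_total_hits": false,
  "query": {
    "bool": {
      "filter": [
        { "term": { "user_id": "32232699237" } }
      ]
    }
  },
  "size": 100,
  "sort": [{ "timestamp": { "order": "desc" } }]
}
```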

#12

Forgot to mention I'm using ES version 2.3; there is no index sorting or keyword type in this version..

(Safwan Rahman) #13

In that case the best advice I can give is to upgrade your cluster to at least 5.x.
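On 5.x and later, the not_analyzed string fields from your mapping become keyword fields. A sketch of the relevant part of the mapping (the index name logs_2017_v2 is hypothetical; new mappings only apply to new indices, so existing data would need a reindex):

```
PUT logs_2017_v2
{
  "mappings": {
    "log_item": {
      "properties": {
        "user_id":     { "type": "keyword" },
        "app_id":      { "type": "keyword" },
        "os":          { "type": "keyword" },
        "session_id":  { "type": "keyword" },
        "description": { "type": "text" }
      }
    }
  }
}
```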

#14

I'd need to reindex everything, it would take 4 days :slight_smile:

(Safwan Rahman) #15

If you are not able to upgrade, then I think you should add more data nodes to your cluster.
It seems every index has 16 shards and 1 replica, and there are 3 indexes, so 16x2x3 = 96 shards in total. 2 nodes are overwhelmed by these 96 shards. Maybe adding data nodes will solve the slow query issue.

#16

I don't think that will be possible in the near future.. I'll probably try the things you mentioned on a newer ES version. I was just testing without "sort" and it's about 5 times faster (that's acceptable), but I need the data sorted by timestamp. I also just realized all users have a user_id in number format, while I have "string" in my mapping. Does that matter, or would it be the same? And thanks a lot..

(Safwan Rahman) #17

I don't know if how you save the user_id will help, but if you do not need the store mapping for a field, disabling it may give some performance boost.
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/mapping-store.html

#18

Should I maybe add more data nodes with less disk each? E.g. 8 data nodes x 50 GB instead of 2 data nodes x 200 GB?

(Alex Marquardt) #19

How many shards do you have in each index?

#20

16, and 1 replica