Query performance questions

Hello!

I have some questions about how queries work and about the effect of some configuration options.

I have 14 indexes (named logstash-YYYY-MM so that documents from the same time range end up in the same index), each with 5 shards and 1 replica. Each index holds roughly 10 GB of data and 25 million documents.
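
For reference, the per-index sizes above come from the cat indices API; this is roughly what I run, with localhost:9200 standing in for one of my nodes:

# list the logstash indices with their document counts and on-disk sizes
curl 'localhost:9200/_cat/indices/logstash-*?v'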

I'm using the following index template (settings plus mappings):

{
  "template": "logstash-*",
  "settings": {
    "index": {
      "refresh_interval": "5s",
      "cache": {
        "query": {
          "enable": "true"
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "strings": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping":   { "type": "string", "doc_values": true, "index": "not_analyzed" }
          }
        }
      ],
      "_all":    { "enabled": false },
      "_source": { "enabled": true },
      "properties": {
        "@timestamp":         { "type": "date",    "doc_values": true }
      }
    }
  }
}
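
In case it's relevant, I install that template roughly like this (the template name and file name are just what I happen to use):

curl -XPUT 'localhost:9200/_template/logstash' -d @logstash-template.json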

I'm mostly interested in analytics, so I use aggregations heavily. Most of them are date_histogram aggregations on @timestamp, but I also run various terms aggregations on string fields (let's call them A, B and C). It's also worth mentioning that I want full-year queries to complete in acceptable times; those pull data from 12 indexes.
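
To be concrete, a typical dashboard request looks roughly like the following (Kibana builds the real ones; the date range, the interval and field A are placeholders):

curl -XPOST 'localhost:9200/logstash-2015-*/_search' -d '
{
  "size": 0,
  "query": {
    "filtered": {
      "query":  { "match_all": {} },
      "filter": {
        "range": { "@timestamp": { "gte": "2015-01-01", "lt": "2016-01-01" } }
      }
    }
  },
  "aggs": {
    "over_time": {
      "date_histogram": { "field": "@timestamp", "interval": "month" },
      "aggs": {
        "top_A": { "terms": { "field": "A", "size": 10 } }
      }
    }
  }
}'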

Currently, I'm using a 5-node cluster where each node has 2 cores and 7.5 GB of RAM, so in total I have 10 cores and 37.5 GB of RAM.

My main issue is that I'm getting high response times (around 30 seconds, after which the request times out) and seeing 100% CPU usage when trying to display a full-year dashboard with a few charts, and I want to understand whether I'm simply hitting a CPU bottleneck because of the amount of data being queried, or whether something is misconfigured.

These are some observations and some questions:

  1. Only about 25% of the RAM is being used. I thought aggregations relied on RAM to run faster, so I was expecting much more data in the heap (although I know I'm using doc_values, which are supposed to keep fielddata on disk).
  2. When I use the cat API to show which fields have fielddata loaded on each node and how much RAM they're using (the exact commands are after this list), I can see A, B and C listed there, but @timestamp is not. Is that normal? Shouldn't @timestamp be listed there, too?
  3. I tried the following: I requested a one-year dashboard in Kibana and then used the cat API to list the thread pools (again, command after the list). Each node had 6 active threads in the search pool, with around 100 pending tasks queued on each. AFAIK each thread processes one shard, so does that mean that if I reduce the number of shards per index I would have fewer pending tasks in the pool and therefore better response times, or would that just reduce the pending tasks but make each one run longer?
  4. The docs state in several places that it's okay to overallocate shards. Doesn't that cause performance issues like the one pointed out in point 3? Could reducing the number of shards per index improve query response times?
  5. Is there any way of knowing why a query takes so long to process, i.e. where the time is being spent? Keep in mind that I don't have access to the log files, so I can't use the slow log.
  6. For full-year queries I keep seeing in Kibana that the query time is acceptable (on the order of 2-3 seconds) but the request time is much higher (20-25 seconds). Any idea what the reason could be?
  7. Is my cluster just too small to handle this kind of query over that amount of data?
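
For reference, these are the cat API calls I mentioned in points 2 and 3, run against one of the nodes:

# per-node, per-field fielddata memory (this is where I see A, B and C but not @timestamp)
curl 'localhost:9200/_cat/fielddata?v'

# per-node thread pools (this is where I see 6 active search threads and ~100 queued tasks)
curl 'localhost:9200/_cat/thread_pool?v'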

I'd also be very interested in any other guideline/tip/recommendation you could give me about what else I could check to learn more about the cluster, to understand how queries work, and to figure out where the bottleneck could be.

Thanks!