Cluster configuration for log storage: 140 GB/day


(Vlad Nashekin) #1

Good day, everyone!
First-time user here.

I'm trying to configure the ELK stack to index website logs.
In the development environment everything works well, but I have problems in production.

Our system:

140 GB of logs/day
500 million log records/day ~ 600 GB index size
Indexing rate: 10,000 records/s

Cluster specifications:

Two Windows Server 2012 machines:

  • 96 GB RAM
  • Intel Xeon CPU X5675 @ 3.07 GHz, 12 cores
  • HDD: IBM ServeRAID M5015
  • CPU usage: 20%
  • JVM heap: 31 GB max
  • One Elasticsearch node on each server

Every day I create a new index, for example filebeat-2017.10.13

Logs are stored for the last 6 days

Total documents: 3.5 billion

Total index size: 4 TB

Fields in index: 100

Problem:
Queries are very slow.
For example, a simple count aggregation over all 6 days of data took 5 minutes:

GET _search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "_exists_:aggregate_final",
            "analyze_wildcard": false
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": 1507208538462,
              "lte": 1507813338462,
              "format": "epoch_millis"
            }
          }
        }
      ],
      "must_not": []
    }
  },
  "_source": {
    "excludes": []
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "3h",
        "time_zone": "Asia/Baghdad",
        "min_doc_count": 1
      }
    }
  }
}
Other queries are slow too.

Questions:

  1. Is there enough capacity in the cluster? Do I need more nodes or more machines?
    Maybe I have violated some obvious best practices.
  2. Can SSDs help me?
  3. Should I check the mapping, given that the index is 5 times larger than the raw data?
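
As a rough sanity check of the figures quoted above, the ratios can be worked out directly. This is only a sketch: it assumes decimal gigabytes (1 GB = 10^9 bytes) and takes the 600 GB/day figure at face value, without knowing whether it includes replica shards.

```python
# Figures quoted in the post above; nothing here is measured independently.
RAW_GB_PER_DAY = 140
INDEX_GB_PER_DAY = 600
DOCS_PER_DAY = 500_000_000
RETENTION_DAYS = 6

SECONDS_PER_DAY = 24 * 60 * 60


def sizing_summary():
    """Derive a few ratios from the quoted daily volumes."""
    return {
        # Average rate if records arrived evenly over the day.
        "avg_docs_per_second": DOCS_PER_DAY / SECONDS_PER_DAY,
        # How much larger the index is than the raw logs.
        "index_to_raw_ratio": INDEX_GB_PER_DAY / RAW_GB_PER_DAY,
        # On-disk bytes per document (assumes 1 GB = 1e9 bytes).
        "bytes_per_doc_on_disk": INDEX_GB_PER_DAY * 1e9 / DOCS_PER_DAY,
        # Index size kept on disk for the whole retention window, in TB.
        "retained_index_tb": INDEX_GB_PER_DAY * RETENTION_DAYS / 1000,
    }


if __name__ == "__main__":
    for name, value in sizing_summary().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the index is a little over 4 times the raw size, and the average arrival rate is well below the quoted 10,000/s, which would then be closer to a peak figure.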

Thanks for your response!


(David Pilato) #2

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

Some thoughts:

  • What exactly is the full query? I mean, is the query_string part inside a query or a filter?
  • analyze_wildcard: do you really intend to run queries like foo*bar? As the docs say, it's super slow.
  • Do you really want to compute a bucket for every 3 hours over the full 6 days? Don't you want to add a date filter and look only at the last 24 hours, for example?

What are the index settings? How many shards per day?

Also, in your use case _exists_:aggregate_final will most likely match all the documents. So you are most likely computing an aggregation over 3.5 billion docs, plus paying the cost of running the query, which could be faster with a match_all.
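
To make the query-versus-filter point concrete, here is a minimal Python sketch that builds the same search body with both clauses in the bool query's filter context, using an exists query instead of query_string. The function name and the optional require_field parameter are illustrative assumptions; whether the exists clause can be dropped depends on whether every document really has the field.

```python
import json


def build_histogram_query(start_ms, end_ms, require_field=None):
    """Build a search body whose clauses all run in filter context.

    start_ms / end_ms are epoch milliseconds, as in the original query.
    require_field is optional: pass a field name to keep the existence
    check, or leave it as None if every document has the field anyway.
    """
    filters = [
        {
            "range": {
                "@timestamp": {
                    "gte": start_ms,
                    "lte": end_ms,
                    "format": "epoch_millis",
                }
            }
        }
    ]
    if require_field is not None:
        # Structured equivalent of the query_string "_exists_:<field>".
        filters.append({"exists": {"field": require_field}})

    return {
        "size": 0,
        "query": {"bool": {"filter": filters}},
        "aggs": {
            "2": {
                "date_histogram": {
                    "field": "@timestamp",
                    "interval": "3h",
                    "time_zone": "Asia/Baghdad",
                    "min_doc_count": 1,
                }
            }
        },
    }


if __name__ == "__main__":
    body = build_histogram_query(1507208538462, 1507813338462, "aggregate_final")
    print(json.dumps(body, indent=2))
```

Filter clauses skip scoring and can be cached, which is why moving them out of must is usually worth trying first.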

One thing you can do is to run a query filtered per day and compute the agg only for that day. Then use a multisearch query to run 5 of them in parallel.
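
A sketch of what the per-day approach could look like: the Python function below builds the newline-delimited body expected by the _msearch endpoint, with one header line and one search body per day. The filebeat-YYYY.MM.DD index naming comes from the original post; the function name, the per_3h aggregation name, and the choice to target each daily index directly are assumptions made for illustration. Day boundaries here are in UTC.

```python
import json
from datetime import date, timedelta


def build_msearch_body(first_day, days, index_prefix="filebeat-"):
    """Return an NDJSON string for _msearch: one search per day."""
    lines = []
    for offset in range(days):
        day = first_day + timedelta(days=offset)
        next_day = day + timedelta(days=1)

        # Header line: which index this search targets.
        header = {"index": f"{index_prefix}{day:%Y.%m.%d}"}

        # Body line: filter to that day and aggregate only over it.
        body = {
            "size": 0,
            "query": {
                "bool": {
                    "filter": [
                        {
                            "range": {
                                "@timestamp": {
                                    "gte": day.isoformat(),
                                    "lt": next_day.isoformat(),
                                }
                            }
                        }
                    ]
                }
            },
            "aggs": {
                "per_3h": {
                    "date_histogram": {
                        "field": "@timestamp",
                        "interval": "3h",
                        "min_doc_count": 1,
                    }
                }
            },
        }
        lines.append(json.dumps(header))
        lines.append(json.dumps(body))

    # _msearch requires the body to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_msearch_body(date(2017, 10, 8), 5), end="")
```

The resulting string would be sent as the body of a POST to _msearch; the responses come back in the same order as the searches, so the per-day buckets can be concatenated on the client side.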

Can SSDs help me?

Yes.

Should I check the mapping, given that the index is 5 times larger than the raw data?

Yes. Remove _all, and remove unneeded keyword and text fields.
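
As an illustration only, the Python sketch below builds an index template body along those lines. It assumes Elasticsearch 5.x template syntax (a "template" pattern key and the _default_ mapping type), which matches the era of this thread but changed in later major versions. The function name and the field names in the usage example are hypothetical; the real list has to come from reviewing the actual mapping.

```python
import json


def build_trimmed_template(index_pattern, unneeded_fields):
    """Build a 5.x-style index template that drops _all and stops
    indexing the listed fields (they remain visible in _source)."""
    properties = {
        # Not searchable and not aggregatable: no inverted index,
        # no doc values, so the field costs far less disk.
        name: {"type": "keyword", "index": False, "doc_values": False}
        for name in unneeded_fields
    }
    return {
        # Assumption: 5.x syntax; 6.x and later use "index_patterns".
        "template": index_pattern,
        "mappings": {
            "_default_": {
                # Assumption: 5.x syntax; _all is deprecated from 6.0.
                "_all": {"enabled": False},
                "properties": properties,
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    template = build_trimmed_template("filebeat-*", ["debug_payload", "raw_headers"])
    print(json.dumps(template, indent=2))
```

A template only affects indices created after it is installed, so with daily indices the effect on disk usage would show up from the next day's index onward.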

If you are planning to query often on the existence of the aggregate_final field, maybe you should simply index that value as a boolean and filter by that.
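
A minimal sketch of the boolean-flag idea, written in Python for illustration. The field name has_aggregate_final and the helper names are hypothetical; in a real pipeline the flag would be set wherever documents are prepared before indexing.

```python
# Hypothetical name for the precomputed flag field.
FLAG_FIELD = "has_aggregate_final"

# Mapping fragment declaring the flag as a boolean.
FLAG_MAPPING = {"properties": {FLAG_FIELD: {"type": "boolean"}}}


def add_flag(doc):
    """Return a copy of doc with the flag set at index time."""
    flagged = dict(doc)
    # True only when the source field is present and not null.
    flagged[FLAG_FIELD] = doc.get("aggregate_final") is not None
    return flagged


def flag_filter(value=True):
    """Query that filters on the precomputed flag in filter context."""
    return {"bool": {"filter": [{"term": {FLAG_FIELD: value}}]}}


if __name__ == "__main__":
    print(add_flag({"message": "GET /", "aggregate_final": 42}))
    print(add_flag({"message": "GET /"}))
    print(flag_filter())
```

The trade-off is one extra small field per document in exchange for a cheap term filter at query time.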

Just some thoughts.


(Christian Dahlqvist) #3

Try to identify what is limiting performance. Is it disk I/O? If so, faster storage (e.g. SSDs) or scaling out the cluster with more nodes will help. Is it CPU or heap pressure? Then scaling out the cluster might be necessary.

With respect to the size the data takes up on disk, you can look at this blog post, which discusses the effects of enrichment and mappings on disk usage.


(Vlad Nashekin) #4

Thank you for the reply, Christian!
I looked at the I/O stats in Windows and see 100% disk load, mostly writes.
So my plan is to reduce the index size with a better mapping and to try to obtain more servers.
How many machines do I need for such a load?

Thanks.


(Christian Dahlqvist) #5

I do not know, as that depends on what your query load looks like (type of queries/aggregations, time period queried, number of concurrent queries) and how many resources the indexing load consumes.

If you can rate limit indexing and lower it until query performance is acceptable for your query mix/load, you may get an indication of how many nodes you need to add to also handle the full indexing load.
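
One very rough way to turn such an experiment into a number is a linear extrapolation, sketched below in Python. It assumes capacity scales linearly with node count, which real clusters only approximate, and the rates in the usage example are hypothetical placeholders rather than measurements from this cluster.

```python
import math


def estimate_nodes(current_nodes, acceptable_rate, full_rate):
    """Linearly extrapolate the node count needed for the full load.

    current_nodes:   nodes in the cluster during the experiment
    acceptable_rate: throttled indexing rate (docs/s) at which query
                     performance became acceptable
    full_rate:       indexing rate (docs/s) the cluster must sustain
    """
    if acceptable_rate <= 0:
        raise ValueError("acceptable_rate must be positive")
    # Assumption: each node contributes the same capacity, so the
    # needed node count grows in proportion to the indexing rate.
    return math.ceil(current_nodes * full_rate / acceptable_rate)


if __name__ == "__main__":
    # Hypothetical: queries are fine at 4,000 docs/s on 2 nodes,
    # and the target is the 10,000 docs/s mentioned in the thread.
    print(estimate_nodes(2, 4000, 10000))
```

The result is only an indication; the estimate should be rechecked against the real query mix once nodes are added.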

If disk I/O is as heavy as you say, I would expect SSDs to help a lot as well and allow you to utilise the rest of your resources better.

