Cluster configuration for log storage. 140Gb/day

Vlad_Nashekin · October 13, 2017, 11:52am

Good day, everyone!
First-time user here.

I'm trying to configure the ELK stack to index website logs.
In the development environment everything works great, but I have problems in production.

Our system:

140 GB of logs/day
500 million log records/day, ~600 GB index size
Indexing rate: 10,000/s

Cluster specifications:

Two Windows Server 2012 machines:

96 GB RAM
Intel Xeon CPU X5675 @ 3.07 GHz, 12 cores
HDD: IBM ServeRAID M5015
CPU usage: 20%
JVM heap: 31 GB max
One Elasticsearch node on each server

Every day I create a new index, for example filebeat-2017.10.13

Logs are stored for the last 6 days

Total documents: 3.5 billion

Total index size: 4 TB

Fields in index: 100
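
As a rough sanity check, the figures above can be cross-checked against each other. This is only a back-of-the-envelope sketch; it assumes decimal units and that every day looks like the quoted average day.

```python
# Figures quoted in the post (decimal units assumed).
raw_gb_per_day = 140
index_gb_per_day = 600
docs_per_day = 500_000_000
retention_days = 6

expansion = index_gb_per_day / raw_gb_per_day            # index size relative to raw logs
retained_tb = index_gb_per_day * retention_days / 1000   # disk used by 6 daily indices
retained_docs = docs_per_day * retention_days            # documents kept at any time
avg_docs_per_sec = docs_per_day / 86_400                 # average over a full day

print(f"expansion factor: {expansion:.1f}x")                      # 4.3x
print(f"retained index size: {retained_tb:.1f} TB")               # 3.6 TB
print(f"retained documents: {retained_docs / 1e9:.1f} billion")   # 3.0 billion
print(f"average indexing rate: {avg_docs_per_sec:.0f} docs/s")    # 5787 docs/s
```

The results are in the same ballpark as the stated 4 TB and 3.5 billion documents. The daily average of roughly 5,800 docs/s is below the quoted 10,000/s, which would be consistent with that number being a peak rate, although the post does not say so.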

Problem:
Queries are very slow.
For example, a simple count aggregation over all 6 days of data took 5 minutes:

GET _search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "_exists_:aggregate_final",
            "analyze_wildcard": false
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": 1507208538462,
              "lte": 1507813338462,
              "format": "epoch_millis"
            }
          }
        }
      ],
      "must_not": []
    }
  },
  "_source": {
    "excludes": []
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "3h",
        "time_zone": "Asia/Baghdad",
        "min_doc_count": 1
      }
    }
  }
}

Other queries are slow too.

Questions:

Is there enough capacity in the cluster? Do I need more nodes or more machines?
Maybe I have violated some obvious best practices?
Can SSD help me?
Should I check mapping because index size is 5 times bigger than raw size?

Thanks for your response!

dadoonet · October 13, 2017, 12:34pm

Please format your code using the </> icon as explained in this guide. It will make your post more readable.

Or use markdown style like:

```
CODE
```

Some thoughts:

What exactly is the full query? I mean, is the query_string part inside a query or a filter?
analyze_wildcard: do you really intend to run queries like foo*bar? As the docs say, it's super slow.
Do you really want to compute a bucket for every 3 hours but for the full 6 days? Don't you want to add a filter by date and just look at the last 24 hours for example?

What are the index settings? How many shards per day?

Also, in your use case _exists_:aggregate_final will most likely match all the documents. So you are most likely computing an aggregation on 3.5 billion docs, plus paying the cost of running the query, which could be faster with a match_all.
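
To illustrate the query versus filter point, here is a minimal sketch of the same request with both clauses moved into `bool.filter`, written as a Python dict so it can be dumped to JSON and pasted into the console. It swaps the `query_string` clause for a plain `exists` query and renames the aggregation from `2` to `per_3h`; both changes are my own choices, not something from the original request.

```python
import json

# Same time range as the original request, but every clause sits in
# bool.filter, so no relevance scores are computed and results can be cached.
query_body = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"exists": {"field": "aggregate_final"}},
                {"range": {"@timestamp": {
                    "gte": 1507208538462,
                    "lte": 1507813338462,
                    "format": "epoch_millis",
                }}},
            ]
        }
    },
    "aggs": {
        "per_3h": {  # hypothetical name, replaces "2"
            "date_histogram": {
                "field": "@timestamp",
                "interval": "3h",
                "time_zone": "Asia/Baghdad",
                "min_doc_count": 1,
            }
        }
    },
}

print(json.dumps(query_body, indent=2))
```

If every document really does contain `aggregate_final`, the `exists` clause can simply be dropped and only the range filter kept. Whether this makes a noticeable difference on a disk-bound cluster is something to measure rather than assume.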

One thing you can do is to run a query filtered per day and compute the agg only for that day. Then use a multisearch query to run 5 of them in parallel.
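
A possible way to build such a multisearch request is sketched below. It generates the newline-delimited body for `_msearch`, with one header line and one search line per daily index. The `filebeat-YYYY.MM.DD` naming comes from the original post; the helper name `build_msearch_body`, the aggregation name `per_3h`, and the choice to target each daily index instead of adding a date range are assumptions for illustration.

```python
import json
from datetime import date, timedelta


def build_msearch_body(last_day, days, index_prefix="filebeat-"):
    """Return an NDJSON _msearch body with one search per daily index."""
    lines = []
    for offset in range(days):
        day = last_day - timedelta(days=offset)
        # Header line: which index this search runs against.
        header = {"index": f"{index_prefix}{day:%Y.%m.%d}"}
        # Body line: the aggregation, restricted to that single index.
        search = {
            "size": 0,
            "query": {"bool": {"filter": [
                {"exists": {"field": "aggregate_final"}},
            ]}},
            "aggs": {"per_3h": {"date_histogram": {
                "field": "@timestamp",
                "interval": "3h",
                "time_zone": "Asia/Baghdad",
                "min_doc_count": 1,
            }}},
        }
        lines.append(json.dumps(header))
        lines.append(json.dumps(search))
    # The _msearch body must end with a newline.
    return "\n".join(lines) + "\n"


body = build_msearch_body(date(2017, 10, 13), days=6)
print(body)
```

The resulting string would be sent to `_msearch` with the `application/x-ndjson` content type, and the per-day buckets merged on the client side.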

Can SSD help me?

Yes.

Should I check mapping because index size is 5 times bigger than raw size?

Yes. Disable _all and remove unneeded keyword and text fields.

If you are planning to query often on the existence of the aggregate_final field, maybe you should simply index that value as a boolean and filter by that.
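
As a rough sketch of what that could look like, assuming an Elasticsearch 5.x-style mapping: the type name `doc` and the field name `has_aggregate_final` are hypothetical, and the flag would have to be set at ingest time (for example in the shipper or the application), which is not shown here.

```python
import json

# Assumed Elasticsearch 5.x-style mapping. The type name "doc" and the
# field name "has_aggregate_final" are hypothetical.
mapping = {
    "mappings": {
        "doc": {
            "_all": {"enabled": False},  # drop the catch-all _all field
            "properties": {
                "has_aggregate_final": {"type": "boolean"},
            },
        }
    }
}

# Filter on the boolean flag instead of checking field existence.
flag_filter = {
    "query": {
        "bool": {
            "filter": [{"term": {"has_aggregate_final": True}}]
        }
    }
}

print(json.dumps(mapping, indent=2))
print(json.dumps(flag_filter, indent=2))
```

The mapping part would normally live in an index template so that each new daily index picks it up; the exact template syntax depends on the Elasticsearch version in use.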

Just some thoughts.

Christian_Dahlqvist · October 13, 2017, 12:52pm

Try to identify what is limiting performance. Is it disk I/O? If so, faster storage (e.g. SSDs) or scaling out the cluster with more nodes will help. Is it CPU or heap pressure? Then scaling out the cluster might be necessary.

With respect to the size the data takes up on disk, you can look at this blog post, which discusses the effects of enrichment and mappings on disk usage.

Vlad_Nashekin · October 13, 2017, 1:28pm

Thank you for the reply, Christian!
I looked at the I/O stats in Windows and see 100% disk load, mostly writes.
So my plan is to reduce the index size with a better mapping and to try to obtain more servers.
How many machines do I need for such a load?

Thanks.

Christian_Dahlqvist · October 13, 2017, 1:40pm

I do not know, as that depends on what your query load looks like (type of queries/aggregations, time period queried, number of concurrent queries) and how many resources the indexing load consumes.

If you can rate limit indexing and lower it until query performance is acceptable for your query mix/load, you may get an indication of how many nodes you need to add to also handle the full indexing load.
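
One very crude way to turn such a measurement into a node count is a linear extrapolation, sketched below. Linear scaling is purely an assumption here, since real clusters rarely scale that cleanly. The function name `estimate_nodes` is hypothetical, and the 4,000 docs/s figure is a made-up example, not a measurement from this thread.

```python
import math


def estimate_nodes(current_nodes, full_rate, acceptable_rate):
    """Crude linear estimate of the nodes needed for the full indexing rate.

    acceptable_rate is the throttled indexing rate at which query
    performance was acceptable on the current cluster.
    """
    if acceptable_rate <= 0:
        raise ValueError("acceptable_rate must be positive")
    return math.ceil(current_nodes * full_rate / acceptable_rate)


# Hypothetical example: queries are fine at 4,000 docs/s on the 2 existing
# nodes, and the full load is 10,000 docs/s.
nodes = estimate_nodes(current_nodes=2, full_rate=10_000, acceptable_rate=4_000)
print(nodes)  # 5
```

Treat the result as a starting point for testing, not as a sizing guarantee.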

If disk I/O is as heavy as you say, I would expect SSDs to help a lot as well and allow you to utilise the rest of your resources better.

system · November 10, 2017, 1:40pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
