Best practices for indexing log data

(Mike Charnoky) #1

Hi, I'm new to the Elastic/ELK stack and am using it for log aggregation/analysis. Basically, the classic use case of using Filebeat to ship log files to Logstash and output the data to Elasticsearch.

My main problem is trying to understand: What are the best practices for indexing a wide variety of log files? I have no problems in the n=1 case where I have only one type of log file on one server. But how should indexes and data be structured for a wide variety of log types? Does it make sense for, say, apache logs and a Spring app's logs to be stored in different indexes? What about, say, apache logs from different departments or environments (test vs prod): use the same index or a different one?

According to this Elastic blog post, "Multiple types in the same index really shouldn't be used all that often." This gives me the impression that we should be splitting our indexes based on log type, where the fields are mostly the same, e.g. apache-YYYY.MM.dd, syslog-YYYY.MM.dd, spring_app1-YYYY.MM.dd, etc.

Yet, a lot of the examples I see posted (including from the official Logstash documentation) seem to run contrary to that advice. I often see this logstash config:
output {
  elasticsearch {
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}
The result is that an index of filebeat-YYYY.MM.dd is created for all log types, but the type field is set to the log type (apache, syslog, etc). Is this really a good practice? I would think that the index should be based on the document_type like this:
index => "%{[@metadata][type]}-%{+YYYY.MM.dd}"

Any advice is much appreciated!

(Christian Dahlqvist) #2

Whether different types of data should be sharing an index or not often comes down to the mapping and whether these can coexist and have the same retention period. It also depends on the data volumes you are expecting.

Splitting different types of logs and data into separate indices may sound appealing, but it is important to know that having lots of small indices and shards can be very inefficient. Sometimes you can offset this by e.g. using monthly indices to keep the index and shard count down.
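
As a rough illustration of the monthly-index idea, the date pattern in the Logstash output can simply drop the day component. This is only a sketch: the `hosts` value is an assumption, and the index name reuses the `[@metadata][beat]` pattern from the config quoted earlier in the thread.

```
output {
  elasticsearch {
    # Assumed address; point this at your own cluster.
    hosts => ["localhost:9200"]
    # %{+YYYY.MM} has no day part, so one index is created per month
    # instead of one per day.
    index => "%{[@metadata][beat]}-%{+YYYY.MM}"
  }
}
```

Whether monthly indices are appropriate still depends on the retention period, since a whole month of data is then removed at once when an index is deleted.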

(Mike Charnoky) #3

Thanks for the quick reply Christian!

I must be missing something here because it seems like I'm reading two contrary pieces of advice. Can you help clarify these apparently opposing viewpoints?

  1. Based on the Elastic blog post I previously cited, the takeaways were: "Multiple types in the same index really shouldn't be used all that often and one of the few use cases for types is parent child relationships." and "Sparsity should be avoided... Types almost always increase the sparsity of your data because different types have different fields." This leads me to believe that different log types (say, apache and a custom Spring app) should be stored in different indexes, since they will have different fields.

  2. Based on your advice and the Elastic blog post you cited, "having lots of small indices and shards can be very inefficient". This leads me to believe that different log types should be stored in the same index with a different "type" field... which is contrary to the previous advice.

What am I missing? Maybe I'm not understanding what a parent-and-child relationship looks like with respect to log data...

(Christian Dahlqvist) #4

You can store different types of logs in the same index within a single type as long as you do not have conflicting mappings, and if you add a field that indicates the type of record you can use this for filtering when querying. There is nothing that forces you to use different types.
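
A minimal sketch of that approach in Logstash might look like the following. The field name `log_type`, the index prefix `logs-`, the single type name `doc` and the `hosts` value are all hypothetical choices for illustration, and it assumes the shipper populates `[@metadata][type]` as in the config quoted earlier.

```
filter {
  mutate {
    # Copy the log type into a regular field so it is indexed with the
    # document. "log_type" is a hypothetical field name.
    add_field => { "log_type" => "%{[@metadata][type]}" }
  }
}

output {
  elasticsearch {
    # Assumed address; point this at your own cluster.
    hosts => ["localhost:9200"]
    # All log types share one daily index and one document type.
    index => "logs-%{+YYYY.MM.dd}"
    document_type => "doc"
  }
}
```

Queries can then restrict results to one kind of log by filtering on `log_type`, for example to the value `apache`.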

Storing multiple types of data in a single index can, however, lead to sparse fields, which in turn may increase the relative size on disk. Whether this is a severe problem or not depends on your mappings and data volumes. For a lot of use cases I come across this is not a problem at all.

Having a large number of small indices and shards is quite inefficient, and having too many shards in a cluster is a common cause of problems, which is why I wrote that blog post. If you really need to have separate indices per data type, I would strongly recommend adjusting the number of primary shards and the time period each index covers so that you do not end up with too small shards.

(Mike Charnoky) #5

Yes, I think the conflicting field mappings will be an issue if multiple types of logs are getting thrown into the same index. Does it make sense to have a hierarchy of field names in this case instead of putting all the field names at the top level? e.g. apache2.request, apache2.clientip, myapp.event, myapp.userid? Is this a common practice?
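
To make the question concrete, this kind of namespacing could be sketched in a Logstash filter roughly as follows. It assumes the events carry a `type` field with the values `apache` and `myapp`, and that the top-level field names are the ones listed above; none of this is taken from a real deployment.

```
filter {
  if [type] == "apache" {
    mutate {
      # Move top-level fields under an "apache2" object.
      rename => {
        "request"  => "[apache2][request]"
        "clientip" => "[apache2][clientip]"
      }
    }
  } else if [type] == "myapp" {
    mutate {
      # Move top-level fields under a "myapp" object.
      rename => {
        "event"  => "[myapp][event]"
        "userid" => "[myapp][userid]"
      }
    }
  }
}
```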

(Christian Dahlqvist) #6

I believe Metricbeat does that, but a specific field, e.g. clientip, must have a single mapping wherever it is in the hierarchy. If you are in control of mappings I would recommend having types of data with low daily volumes share an index. It all really depends on how much data you have coming in and how long you want to keep it for.
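
One way to sketch that split is a conditional in the Logstash output, giving a high-volume log type its own daily index while everything else shares a monthly one. Treating `apache` as the high-volume type, the index names and the `hosts` value are assumptions made purely for illustration.

```
output {
  if [@metadata][type] == "apache" {
    elasticsearch {
      # Assumed address; point this at your own cluster.
      hosts => ["localhost:9200"]
      # High-volume type gets its own daily index.
      index => "apache-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => ["localhost:9200"]
      # Low-volume types share one monthly index (hypothetical name).
      index => "logs-misc-%{+YYYY.MM}"
    }
  }
}
```

The shared index only works if the low-volume types have compatible mappings and the same retention period.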

(system) #7

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.