We are just starting out with Elasticsearch for centralized logging. Our backend is an integration platform that currently has, say, 100 integrations running.
My question has to do with index creation. What is the best practice for this? When should the decision be made to send data to a new index? In our case, should all integrations log to the same index, e.g. /integrations/log, or should they each have their own index, e.g. /integrations/<integration_xxx>/log?
Will it affect performance if you go for the same index?
I should add that we are sending directly to Elasticsearch and not via Logstash, since we are only sending JSON data.
If the formats are the same, then putting them in the same index is possible.
If you want different retention periods, then different indices make sense.
Don't mix different formats.
Don't mix different environments (e.g. prod, non-prod).
Use time-based indices, either date based or with rollover (see the sketch below).
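For the last point, here is a minimal sketch of date-based index naming, assuming the logs are sent straight to Elasticsearch over HTTP with Python's requests library. The host, index naming scheme and document fields are illustrative assumptions, not from this thread:

```python
# A minimal sketch (not from the thread): date-based index naming for the
# poster's integration logs. Host, index name and fields are assumptions.
import datetime
import requests

ES = "http://localhost:9200"

def index_log(doc: dict) -> None:
    # One index per environment and month, e.g. integrations-prod-2017.10,
    # so a whole month can later be dropped or snapshotted in one go.
    index = "integrations-prod-" + datetime.date.today().strftime("%Y.%m")
    resp = requests.post(f"{ES}/{index}/log", json=doc)  # 'log' = mapping type (pre-7.x)
    resp.raise_for_status()

index_log({
    "@timestamp": "2017-10-30T12:00:00Z",
    "integration": "integration_001",
    "level": "INFO",
    "message": "order exported",
})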
I did log aggregation in Elasticsearch for my applications using Fluentd rather than Logstash. You should have different indices for different types of logs. For example, you can have all access.log entries in one index and all error.log entries in another.
You can use Filebeat to send logs directly to Elasticsearch, or send them from Filebeat to Logstash and then push them on to Elasticsearch.
If you use Logstash, you can use grok filters in a pipeline to convert your logs to JSON format.
In what way do you mean don't mix different formats? Are you referring to the message structure, i.e. JSON vs. XML? Or do you mean the format of the JSON message itself? So if we have different JSON data, should it be indexed differently?
As warkolm pointed out, using a time reference, say -YYYY-MM for each month, helps with accessing recent data from recent indices and benefits performance.
Paying attention to mappings/templates for the log data fields can help avoid string fields/columns being analyzed when visualizing the data.
Also, grouping multiple indices under an alias helps you manage and access them better (both points are sketched below).
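A minimal sketch of both ideas via an index template, assuming Elasticsearch 6.x template syntax (on 5.x the `index_patterns` key was called `template`) and hypothetical index/alias names:

```python
# A minimal sketch (assumptions: ES 6.x template syntax, hypothetical
# index pattern and alias). Maps string fields to keyword so they are
# not analyzed, and attaches every matching index to a search alias.
import requests

template = {
    "index_patterns": ["integrations-prod-*"],
    "settings": {"number_of_shards": 1},
    "mappings": {
        "log": {  # mapping type, pre-7.x
            "dynamic_templates": [
                {
                    "strings_as_keywords": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "keyword"},
                    }
                }
            ]
        }
    },
    "aliases": {"integrations-search": {}},
}

resp = requests.put("http://localhost:9200/_template/integrations-logs", json=template)
resp.raise_for_status()
```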
I have experienced the two sides of index creation myself: one where I started with the one index to rule them all, and the current one where I have time-based indices with a rollover period of a month.
No Splitting
The initial setup, where I sent everything to one index, was simple to create but quickly became troublesome as the data grew. Searching took longer than needed at times, and sharing relevant pieces of the data was not easy since it was all in one bucket.
Splitting on time
The second setup is quite useful in that we have an automated setup which creates a new index every month and sets up a generic alias to it. So, for example, if we have an index-2017-10, we have an alias called current-index that points to it. When November comes along, we create an index called index-2017-11, and the alias current-index now points to this index. So any third-party programs that need to enter data only need to care about the alias. Behind the scenes, data is split very nicely across time frames. We also have another alias called search-pool which contains all the indices created and is used exclusively when searching for something.
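The monthly swap can be done atomically with the `_aliases` API. A minimal sketch using the index and alias names from above (the HTTP client and host are assumptions):

```python
# A minimal sketch of the monthly alias swap described above. The
# _aliases API applies all actions in one atomic step.
import requests

actions = {
    "actions": [
        # repoint the write alias to the new month's index
        {"remove": {"index": "index-2017-10", "alias": "current-index"}},
        {"add":    {"index": "index-2017-11", "alias": "current-index"}},
        # the new index also joins the pool used for searching
        {"add":    {"index": "index-2017-11", "alias": "search-pool"}},
    ]
}

resp = requests.post("http://localhost:9200/_aliases", json=actions)
resp.raise_for_status()
```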
I highly recommend splitting across a set time frame, as it makes it quite easy to understand your data and also to distribute it to interested parties or programs.
Splitting via application
Each index should represent a set domain. If you want to visualise your home internet speed, that's one domain, and hence a separate index compared to, say, visualising your home internet usage. That is another way of splitting the indices, and it has the benefit of making searching easier, since your search pool is now split not only across time but also across different use cases.
Splitting across source
As @warkolm mentioned, different environments should not be mixed up, as they have different SLAs, priorities and access rules. The people who should have access to development data should not necessarily have access to production data, and vice versa. Splitting up indices across sources is a very good idea.
Note
Different indices based on different needs are also useful when you have to set up authentication and authorisation rules for access.
Aliases are your friend. They make it easy to write programs to consume or insert data into indices without having to worry about underlying splitting structure.
Up until Elasticsearch 2.6, you could install the elasticsearch-head plugin, which was a very good way to visualise your index and alias setup, so you could have a go with that. I am not sure what current versions of Elasticsearch provide as a replacement.
This sounds very similar to what the rollover index API does, so you can use that instead and avoid having to manually create aliases.
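For reference, a minimal sketch of a rollover call, assuming a hypothetical write alias `logs_write` that points at exactly one index whose name ends in a number (e.g. logs-000001), so Elasticsearch can derive the next index name itself; the conditions are illustrative:

```python
# A minimal sketch of the rollover API (available since Elasticsearch 5.0).
# Alias name and conditions are assumptions for illustration.
import requests

body = {
    "conditions": {
        "max_age": "30d",       # roll over once the index is 30 days old
        "max_docs": 100000000,  # ...or once it holds this many documents
    }
}

resp = requests.post("http://localhost:9200/logs_write/_rollover", json=body)
resp.raise_for_status()
print(resp.json())  # reports whether a rollover happened and the new index name
```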
When you split across sources, also make sure that you do not do this at too fine-grained a level, creating too many small shards, as this can be very inefficient.
Well, is the rollover index needed in my use case?
Basically we have AppServer --> JSON message --> Elasticsearch.
The JSON messages are purely log messages. The structure may vary slightly. However, we don't intend to do queries via the API; any searching will be done via Kibana.
Of course, the index should only contain a month's worth of data; after that we would either remove the documents or back them up somewhere.
Thanks, good to know. Do you have any recommendation on how to clear the index of documents on a regular basis? Is this something we should do via the API, or can it be configured in some way? Say index X should only hold documents up to a month old, and then we do a backup to disk and clear the index.
Use time-based indices together with Curator to delete full indices once all the data in them has exceeded the retention period. Deleting full indices is far more efficient than deleting records individually from an index.
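A minimal sketch of the idea behind Curator's delete_indices action, as a hand-rolled stand-in using the `_cat` API rather than Curator itself (the index pattern and retention period are assumptions):

```python
# A minimal sketch (not Curator itself): drop whole monthly indices once
# they fall outside the retention window. Index pattern is an assumption.
import requests

ES = "http://localhost:9200"
KEEP = 2  # keep the current and previous month, delete older indices

resp = requests.get(f"{ES}/_cat/indices/integrations-prod-*?h=index&format=json")
resp.raise_for_status()

# Index names look like integrations-prod-2017.10; zero-padded dates sort
# lexicographically, so newest-first ordering is a plain reverse sort.
names = sorted((row["index"] for row in resp.json()), reverse=True)
for name in names[KEEP:]:
    requests.delete(f"{ES}/{name}").raise_for_status()
```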