Best practice for index creation

Hi,

We are just starting out with Elasticsearch for centralized logging. Our backend is an integration platform that currently runs around 100 integrations.

My question has to do with index creation. What is the best practice here? When should the decision be made to send data to a new index? In our case, should all integrations log to the same index, e.g. /integrations/log, or should each have its own index, /integrations/<integration_xxx>/log?

Will it affect performance if we go with the same index for everything?

I should add that we are sending directly to Elasticsearch rather than via Logstash, since we are only sending JSON data.

Thanks

If the formats are the same, then putting them in the same index is possible.
If you want different retention periods, then different indices make sense.
Don't mix different formats.
Don't mix different environments (e.g. prod, non-prod).

Use time-based indices, either date-based or with rollover.
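
For instance, a minimal sketch of date-based naming in Python using the requests library. The index name, document fields, and local cluster address are just assumptions for illustration:

```python
import requests
from datetime import datetime

ES = "http://localhost:9200"  # assumed cluster address

# Illustrative log event; the field names are made up for the example.
log_event = {
    "integration": "integration_042",
    "level": "INFO",
    "message": "order exported",
    "@timestamp": datetime.utcnow().isoformat(),
}

# Route the document to a monthly index, e.g. integrations-prod-2017-10.
index_name = "integrations-prod-" + datetime.utcnow().strftime("%Y-%m")
requests.post(f"{ES}/{index_name}/log", json=log_event).raise_for_status()
```

A new month simply produces a new index name, so old months stay untouched and can be dropped wholesale later.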


Hi Rash,

I do log aggregation in Elasticsearch for my applications using Fluentd rather than Logstash. You should have a different index for each type of log. For example, you can keep all access.log entries in one index and all error.log entries in another.

You can use Filebeat to send logs directly to Elasticsearch, or route them from Filebeat through Logstash and then on to Elasticsearch.

If you use Logstash, you can build a pipeline with grok filters to convert your logs to JSON format.
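
For a rough idea of what a grok filter does, here is the same trick sketched in Python; the sample line and the regex are illustrative, not real grok syntax:

```python
import re
import json

# A grok filter is essentially a named-group regex: it turns a raw
# access-log line into structured JSON before it reaches Elasticsearch.
LINE = '127.0.0.1 - - [10/Oct/2017:13:55:36 +0200] "GET /api/orders HTTP/1.1" 200 512'

PATTERN = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)'
)

match = PATTERN.match(LINE)
if match:
    # Each named group becomes a field in the resulting document.
    print(json.dumps(match.groupdict(), indent=2))
```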

Regards,
Balaji

In what way do you mean don't mix different formats? Are you referring to the message structure, i.e. JSON vs XML? Or do you mean the structure of the JSON message itself? So if we have differently structured JSON data, should it go to different indices?

Hi Balaji,

We are looking into Fluentd as the next step when we move towards Docker. Thanks for the tip.

Regards
Souciance

If you have 4 applications and they all have different log structures, you want 4 different indices.

Got it, thanks. We are looking into having a common log structure, so that should point us towards a single index.

As warkolm pointed out, using a time reference, say a -YYYY-MM suffix for each month, helps you access recent data from recent indices and benefits performance.
Paying attention to mappings/templates for your log data fields keeps string fields from being analyzed, which matters when visualizing the data.
Also, grouping multiple indices under an alias helps you manage and access them better. A template can take care of both, as sketched below.
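
For example, a sketch of such a template, assuming Elasticsearch 5.x, a local cluster, and illustrative index/alias names:

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

# Hypothetical template for indices named integrations-*: string fields are
# dynamically mapped as keyword, so they are not analyzed and Kibana can
# aggregate on them. The "template" key is the Elasticsearch 5.x form.
template = {
    "template": "integrations-*",
    "mappings": {
        "log": {
            "dynamic_templates": [
                {
                    "strings_as_keywords": {
                        "match_mapping_type": "string",
                        "mapping": {"type": "keyword"},
                    }
                }
            ]
        }
    },
    # Every index matching the pattern automatically joins this alias.
    "aliases": {"integrations-all": {}},
}
requests.put(f"{ES}/_template/integrations", json=template).raise_for_status()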

Hey!

I have experienced both sides of index creation myself: one where I started with the one index to rule them all, and now one where I have time-based indices with a rollover period of a month.

No Splitting

The initial setup, where I sent everything to one index, was simple to create but quickly became troublesome as the data increased. Searching took longer than necessary at times, and sharing specific pieces of relevant data was not easily possible since it was all in one bucket.

Splitting on time

The second setup is quite useful in that we have an automated job which creates a new index every month and points a generic alias to it. So, for example, if we have an index-2017-10, we have an alias called current-index that points to it. When November comes along, we create an index called index-2017-11, and the alias current-index now points to that index instead. Any third-party programs that need to enter data only need to care about the alias. Behind the scenes, the data is split very nicely across time frames. We also have another alias called search-pool which covers all the indices created and is used exclusively when searching.
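
The monthly flip itself is one atomic _aliases call. Roughly, in Python with requests (the cluster address is assumed, the index names follow my scheme above):

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

old_index, new_index = "index-2017-10", "index-2017-11"

# Create next month's index (mappings come from a template, if you have one).
requests.put(f"{ES}/{new_index}")

# One atomic _aliases call: repoint the write alias and grow the search pool.
actions = {
    "actions": [
        {"remove": {"index": old_index, "alias": "current-index"}},
        {"add": {"index": new_index, "alias": "current-index"}},
        {"add": {"index": new_index, "alias": "search-pool"}},
    ]
}
requests.post(f"{ES}/_aliases", json=actions).raise_for_status()
```

Because the swap happens in a single call, writers never see a moment where current-index points nowhere.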

I highly recommend splitting across a set time frame, as it makes it quite easy to understand your data and to distribute it to interested parties or programs.

Splitting via application

Each index should represent a set domain. If you want to visualise your home internet speed, that's one domain, and hence a separate index from, say, visualising your home internet usage. That is another way of splitting indices, and it has the benefit of making searching easier, since your search pool is now split not only across time but also across different use cases.

Splitting across source

As @warkolm mentioned, different environments should not be mixed up, as they have different SLAs, priorities and access rules. The people who have access to development data should not necessarily have access to production data, and vice versa. Splitting up indices across sources is a very good idea.

Note

  • Different indices based on different needs are also useful when you have to set up authentication and authorisation rules for access.
  • Aliases are your friend. They make it easy to write programs that consume or insert data into indices without having to worry about the underlying splitting structure.
  • Up until Elasticsearch 2.x you could install the elasticsearch-head plugin, which was a very good way to visualise your index and alias setup, so you could have a go with that. I am not sure what current versions of Elasticsearch provide as a replacement.

Hope that helps.


This sounds very similar to what the rollover index API does, so you can use that instead and avoid having to manually create aliases.
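
Roughly, the rollover flow looks like this (a Python sketch; the index/alias names, the condition, and the cluster address are assumptions):

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

# Bootstrap once: a concrete index plus a write alias (names illustrative).
requests.put(f"{ES}/logs-000001", json={"aliases": {"logs-write": {}}})

# Call periodically (e.g. from cron): if the condition is met, Elasticsearch
# creates logs-000002 and repoints the write alias for you.
resp = requests.post(
    f"{ES}/logs-write/_rollover",
    json={"conditions": {"max_age": "30d"}},
)
print(resp.json())  # reports whether a rollover happened and the new index
```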

When you split across sources, also make sure that you do not do this at too fine-grained a level, creating lots of small shards, as this can be very inefficient.

That's awesome. I so did not know that :) ... Thanks.

@Rashti, definitely do what @Christian_Dahlqvist is recommending w.r.t. the rollover API instead of manually curating like me. :)

Well, is the rollover index needed in my use case?

Basically we have AppServer --> JSON message --> Elasticsearch

The JSON messages are purely log messages. The structure may vary slightly. However, we don't intend to do queries via the API; any searching will be done via Kibana.

Of course, the index should only hold a month's worth of data; after that we would either remove the documents or back them up somewhere.

There is no need to use it - it is there as a convenience. Using standard time-based indices with year and month in the name will work as well.

Thanks, good to know. Do you have any recommendation on how to clear the index of documents on a regular basis? Is this something we should do via the API, or can it be configured in some way? Say index X should only hold documents up to a month old, and after that we back it up on disk and clear the index.

Use time-based indices together with Curator to delete full indices once all the data in them has exceeded the retention period. Deleting a full index is far more efficient than deleting records individually from an index.
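
Curator drives this from a YAML action file; here is the same logic sketched in Python for illustration (the index pattern, retention period, and cluster address are assumptions):

```python
import requests
from datetime import datetime, timedelta

ES = "http://localhost:9200"    # assumed cluster address
RETENTION = timedelta(days=31)  # assumed one-month retention

# List indices matching the monthly naming scheme, e.g. index-2017-10.
names = requests.get(f"{ES}/_cat/indices/index-*?h=index").text.split()

cutoff = datetime.utcnow() - RETENTION
for name in names:
    try:
        month = datetime.strptime(name, "index-%Y-%m")
    except ValueError:
        continue  # not one of our monthly indices
    # Drop the index once the whole month lies beyond the retention window.
    if month + timedelta(days=31) < cutoff:
        requests.delete(f"{ES}/{name}")  # one call removes the entire index
```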

