Strategy for dealing with a large index collection?


(Eric Deveaud) #1

Hi,

we would like to handle one year of logs from various sources, ranging from 80^6 events per day down to 1000 per day, with Kibana on top for reporting and activity dashboards across those different sources.

since we plan on several input sources, we will end up having to deal with a collection of 365*5 indices.
the whole index collection will be standardized as much as possible in terms of indexed fields
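
One way to enforce a standardized field layout across all those indices is an index template. A rough sketch (the template name, index pattern, and field names below are illustrative assumptions, using the pre-5.x mapping syntax):

```json
PUT /_template/logs
{
  "template": "logs-*",
  "settings": { "number_of_shards": 5 },
  "mappings": {
    "event": {
      "properties": {
        "@timestamp": { "type": "date" },
        "source":     { "type": "string", "index": "not_analyzed" },
        "message":    { "type": "string" }
      }
    }
  }
}
```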

as a POC we tried to index 6 months of logs from only one source and hit "memory heap" and "too many open files" errors, and encountered some latency in Kibana searches. :wink:

as far as I understand, ES + Kibana is mostly used for short-term analysis, not really for long-running log analysis.

is ES suitable for this kind of task, and will it support this kind of scaling?

and what would be the best architecture we could deploy to cover it?

best regards

Eric


(Mark Walkom) #2

You can do this, you just have to scale across a lot of nodes as you mention.
How many is up to your use case though.

Make sure you implement doc values as much as possible, that will help.
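
In the Elasticsearch versions current at the time of this thread, doc values were not yet on by default and had to be enabled per field in the mapping. A hypothetical fragment (the `event` type and `source` field are made-up examples, not from the original poster's setup):

```json
{
  "mappings": {
    "event": {
      "properties": {
        "source": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}
```

Doc values store field data on disk instead of in the JVM heap, which is why they help with the "memory heap" errors described above.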


(Eric Deveaud) #3

currently I am testing it on my desktop machine (16GB RAM, 8 CPUs) where I have set up a cluster with 4 nodes.
this is the initial setup for the toy study; to go further we will scale out over various VMs in order to set up a more robust cluster.
I am trying to push the toy case as far as possible and stress it to check its robustness.

sorry, can you expand on what you mean by this?

Eric


(Christian Dahlqvist) #4

How many shards do you have in the cluster and what is the average shard size? Each shard in Elasticsearch is a separate Lucene index and carries with it a certain amount of memory and file descriptor overhead. For logging use cases a reasonable shard size is often from a few GB to tens of GBs, although we generally recommend keeping it below 50GB as very large shards can have a negative impact on recovery. If your shards are quite small, it may make sense to have applications/streams share indices (assuming mappings allow this), reduce the number of shards for the indices or even go from daily indices to weekly or monthly. One of the benefits of using time-based indices is that you can change the number of shards for an index for the next period if volumes change.
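
To see why moving from daily to weekly indices matters, here is a back-of-the-envelope shard count. The shard and replica numbers are assumptions for illustration (the then-default of 5 primaries plus 1 replica), not figures from this thread:

```python
def total_shards(indices_per_year, sources, primaries=5, replicas=1):
    """Total shard count the cluster must keep open at once.

    Each index contributes primaries * (1 + replicas) shards,
    and each shard is a separate Lucene index with its own
    memory and file-descriptor overhead.
    """
    return indices_per_year * sources * primaries * (1 + replicas)

# 5 sources, one year of retention:
daily = total_shards(indices_per_year=365, sources=5)   # daily indices
weekly = total_shards(indices_per_year=52, sources=5)   # weekly indices

print(daily)   # 18250
print(weekly)  # 2600
```

With assumptions like these, daily indices mean tens of thousands of shards on a small cluster, which is consistent with the "too many open files" and heap errors in the original post; weekly indices cut the count by roughly a factor of 7.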


(Mark Walkom) #5

Doc values = https://www.elastic.co/guide/en/elasticsearch/guide/current/doc-values.html


(Eric Deveaud) #6

thanks

currently running tests with
an expanded number of shards, doc values, and weekly indices
I will let you know.
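
For reference, weekly indices can be produced at ingest time. If Logstash is in the pipeline, an output block along these lines writes one index per week (the `logs-` prefix is an assumption; `xxxx.ww` is the Joda week-year/week-of-year date format):

```conf
output {
  elasticsearch {
    index => "logs-%{+xxxx.ww}"
  }
}
```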

Eric

