Strategy for dealing with a large index collection?


(Eric Deveaud) #1

Hi,

we would like to handle one year of logs from various sources, ranging from 80^6 events per day down to 1000 per day, with Kibana on top for reporting and activity dashboards across those different sources.

since we plan on several input sources, we will end up having to deal with a collection of 365*5 indices.
the whole index collection will be standardized as much as possible in terms of indexed fields
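
One way to enforce a standardized field layout across all those indices is an index template. A rough sketch (the template name, index pattern, and field names below are illustrative assumptions, using the pre-5.x mapping syntax):

```json
PUT /_template/logs
{
  "template": "logs-*",
  "settings": { "number_of_shards": 5 },
  "mappings": {
    "event": {
      "properties": {
        "@timestamp": { "type": "date" },
        "source":     { "type": "string", "index": "not_analyzed" },
        "message":    { "type": "string" }
      }
    }
  }
}
```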

as a POC we tried to index 6 months of logs from only one source and hit "memory heap" and "too many open files" errors, and encountered some latency in Kibana searches. :wink:

as far as I understand, ES + Kibana is mostly used for short-term analysis, not really for long-running log analysis.

is ES suitable for this kind of task, and will it support this kind of scaling?

and what would be the best architecture we could deploy to cover it?

best regards

Eric


(Mark Walkom) #2

You can do this, you just have to scale across a lot of nodes as you mention.
How many is up to your use case though.

Make sure you implement doc values as much as possible, that will help.
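
In the Elasticsearch versions current at the time of this thread, doc values were not yet on by default and had to be enabled per field in the mapping. A hypothetical fragment (the `event` type and `source` field are made-up examples, not from the original poster's setup):

```json
{
  "mappings": {
    "event": {
      "properties": {
        "source": {
          "type": "string",
          "index": "not_analyzed",
          "doc_values": true
        }
      }
    }
  }
}
```

Doc values store field data on disk instead of in the JVM heap, which is why they help with the "memory heap" errors described above.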


(Eric Deveaud) #3

currently I am testing it on my desktop machine (16GB RAM, 8 CPUs) where I have set up a cluster with 4 nodes.
this is the initial setup for the toy study; to go further we will scale out over various VMs in order to set up a more robust cluster.
I am trying to push the toy case as far as possible and stress it to check its robustness.

sorry, can you expand on what you mean by this?

Eric


(Christian Dahlqvist) #4

How many shards do you have in the cluster and what is the average shard size? Each shard in Elasticsearch is a separate Lucene index and carries with it a certain amount of memory and file descriptor overhead. For logging use cases a reasonable shard size is often from a few GB to tens of GBs, although we generally recommend keeping it below 50GB as very large shards can have a negative impact on recovery. If your shards are quite small, it may make sense to have applications/streams share indices (assuming mappings allow this), reduce the number of shards for the indices or even go from daily indices to weekly or monthly. One of the benefits of using time-based indices is that you can change the number of shards for an index for the next period if volumes change.
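
To see why moving from daily to weekly indices matters, here is a back-of-the-envelope shard count. The shard and replica numbers are assumptions for illustration (the then-default of 5 primaries plus 1 replica), not figures from this thread:

```python
def total_shards(indices_per_year, sources, primaries=5, replicas=1):
    """Total shard count the cluster must keep open at once.

    Each index contributes primaries * (1 + replicas) shards,
    and each shard is a separate Lucene index with its own
    memory and file-descriptor overhead.
    """
    return indices_per_year * sources * primaries * (1 + replicas)

# 5 sources, one year of retention:
daily = total_shards(indices_per_year=365, sources=5)   # daily indices
weekly = total_shards(indices_per_year=52, sources=5)   # weekly indices

print(daily)   # 18250
print(weekly)  # 2600
```

With assumptions like these, daily indices mean tens of thousands of shards on a small cluster, which is consistent with the "too many open files" and heap errors in the original post; weekly indices cut the count by roughly a factor of 7.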


(Mark Walkom) #5

Doc values = https://www.elastic.co/guide/en/elasticsearch/guide/current/doc-values.html


(Eric Deveaud) #6

thanks

currently running tests with
an expanded number of shards, doc values, and weekly indices
I will let you know.
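
For reference, weekly indices can be produced at ingest time. If Logstash is in the pipeline, an output block along these lines writes one index per week (the `logs-` prefix is an assumption; `xxxx.ww` is the Joda week-year/week-of-year date format):

```conf
output {
  elasticsearch {
    index => "logs-%{+xxxx.ww}"
  }
}
```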

Eric

