Analyzing logs and document limit per shard

Hello,

I want to store my log files in Elasticsearch for further analysis. But as far as I understand, there is a limit of about 2 billion documents per shard.

My application generates about 1 billion log lines per day, so this limit would be reached quickly.

What is the solution for this (I think such a large volume of log lines is a fairly common situation)? I want to store at least 1 month of logs (30 billion lines), or even more (6 months or even 1 year).

Thanks in advance!

If you use time-based indices, or the rollover API, then you will be fine.
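As a minimal sketch of the rollover approach (index name `logs-000001`, alias `logs-write`, and the condition values are placeholders to adapt to your volume; `is_write_index` assumes a reasonably recent version, on older ones just point the alias at the single index):

```
# Bootstrap the first index behind a write alias
PUT logs-000001
{
  "aliases": {
    "logs-write": { "is_write_index": true }
  }
}

# Roll over to logs-000002 once a condition is met
POST logs-write/_rollover
{
  "conditions": {
    "max_age": "1d",
    "max_docs": 1000000000
  }
}
```

Your application keeps writing to `logs-write`; each rollover starts a fresh index, so no single shard ever approaches the 2 billion document limit.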

So in terms of search performance and efficiency it does not matter whether I query a single index or several indices at once, right?

What really matters is that you keep the physical shard size at or below the size of the process memory, and that this memory stays pinned to the ES Java process, so no swapping has to be carried out.

And make sure that you only use 50% of your physical memory for the ES process. The rest should be left to the OS layer, so Lucene can use it via the OS filesystem cache.
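A sketch of how that looks in the configuration, assuming a 64 GB machine (the heap size is an example value, not a recommendation for your hardware):

```
# jvm.options - heap at most ~50% of RAM, min and max set to the same value
-Xms30g
-Xmx30g

# elasticsearch.yml - lock the heap in RAM so the OS cannot swap it out
bootstrap.memory_lock: true
```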

Apart from this, you can query across several indices without any problems.
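For example, with daily indices a single request can target one day or a whole month via a wildcard; the index naming pattern and the `message` field below are just assumptions for illustration:

```
# One day
GET logs-2018.06.01/_search
{
  "query": { "match": { "message": "timeout" } }
}

# The whole month - the shards of every matching index are searched in parallel
GET logs-2018.06.*/_search
{
  "query": { "match": { "message": "timeout" } }
}
```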

In that case you'd suggest only one shard per node. Otherwise it'd have to swap out anyway, right?

Yes - usually it's the best utilization of the underlying I/O system.

Scale writes by the number of shards, and reads by the number of replicas per shard.
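A sketch of how you would set both knobs for all time-based indices at once, using the (legacy) index template API; the pattern and the counts are placeholders, and on very recent versions the composable `_index_template` API replaces this endpoint:

```
PUT _template/logs
{
  "index_patterns": ["logs-*"],
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1
  }
}
```

More primary shards spread indexing across more nodes; more replicas give you additional copies that can serve searches.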

You're basically stating that the max capacity of an ES node is ~30 GB. That seems highly inefficient, does it not?

I'm stating that you can easily write much more to a single node, but when you want to read it, you will run into trouble if you need to read multiple full shards into memory.

If you know what you are looking for and know your data, then it's usually not a problem, but it will be if you give a large group of data scientists access to Kibana. If you can manage with pre-configured dashboards, then you are also safe.

Honestly, I have never seen this be an issue to the extent you seem to be hinting at.
ES can handle TBs of data per node, with super fast response times.

Elasticsearch can be tuned in a number of ways for different use cases, and determining the ideal shard size and the number of shards a node can handle is no different. Having a small enough data set per node that all of it can be cached by the OS (which is what you seem to be recommending, if I read your post correctly) may be appropriate for search use cases with very high query rates and low latency requirements, but in my opinion it rarely makes sense for the vast majority of logging use cases.
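A quick way to check how much data each shard and node is actually holding, so the sizing discussion is grounded in numbers (the `logs-*` pattern is an assumption):

```
# Size of every shard in the logging indices, largest first
GET _cat/shards/logs-*?v&h=index,shard,prirep,store,node&s=store:desc

# Shard count and disk used vs. available per node
GET _cat/allocation?v
```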

I'll give you this one, as I mainly work with high-volume search use cases.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.