Analyzing logs and document limit per shard


I want to store my log files in Elasticsearch for further analysis. But as far as I understand, there is a limit of about 2 billion documents per shard.

My application generates about 1 billion log lines per day, so this limit would be reached quickly.

What is the solution for this? (I think such a large volume of log lines is a rather common situation.) I want to store at least one month of logs (30 billion lines), or even more (6 months or even a year).
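For reference, the 2 billion figure is Lucene's hard cap of 2^31 - 1 documents per shard; a quick back-of-the-envelope check using the numbers above:

```python
LUCENE_MAX_DOCS_PER_SHARD = 2**31 - 1   # hard Lucene limit, ~2.147 billion docs

docs_per_day = 1_000_000_000
days_retained = 30
total_docs = docs_per_day * days_retained          # 30 billion

# A single shard cannot hold this; at minimum the retention window
# must be spread over this many shards:
min_shards = -(-total_docs // LUCENE_MAX_DOCS_PER_SHARD)   # ceiling division
print(total_docs, min_shards)   # 30000000000 14
```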

Thanks in advance!

If you use time-based indices, or the rollover API, then you will be fine.
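A minimal sketch of what time-based indexing and a rollover-style condition look like; the index prefix and the 1-billion-document threshold here are illustrative assumptions, not fixed Elasticsearch defaults:

```python
from datetime import date, timedelta

def daily_index_name(d: date, prefix: str = "logs") -> str:
    """Time-based indices: one index per day, e.g. logs-2024.01.15."""
    return f"{prefix}-{d:%Y.%m.%d}"

def should_rollover(doc_count: int, max_docs: int = 1_000_000_000) -> bool:
    """Rollover-style condition: cut over to a fresh index once a
    threshold is reached (the real rollover API also supports
    max_age and max_size conditions)."""
    return doc_count >= max_docs

start = date(2024, 1, 1)
month_of_indices = [daily_index_name(start + timedelta(days=i)) for i in range(31)]
print(month_of_indices[0], month_of_indices[-1])  # logs-2024.01.01 logs-2024.01.31
```

Old daily indices can then be dropped or archived wholesale, which is far cheaper than deleting individual documents out of one huge index.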

So in terms of search performance and efficiency, it does not matter whether I query a single index or several indices at once, right?

What really matters is that you keep the physical shard size <= the size of process memory, and that this memory is pinned to the ES Java process so that no swapping has to be carried out.

And make sure that you only use 50% of your physical memory for the ES process. The rest should be left to the OS so that Lucene can use it via the filesystem cache.
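A sketch of that sizing rule; the 31 GB cap (to stay below the JVM's compressed-object-pointer threshold) is an extra guideline I'm assuming here, not something stated above:

```python
def recommended_heap_gb(physical_ram_gb: float, oops_cap_gb: float = 31.0) -> float:
    """Give the ES JVM at most half of physical RAM, and stay below the
    compressed-oops threshold; everything left over goes to the OS
    filesystem cache, which Lucene relies on."""
    return min(physical_ram_gb / 2, oops_cap_gb)

print(recommended_heap_gb(64))   # 31.0 -> OS cache gets the remaining 33 GB
print(recommended_heap_gb(32))   # 16.0 -> OS cache gets the remaining 16 GB
```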

Apart from this, you can query across several indices without any problems.

In that case you'd suggest only one shard per node. Otherwise it'd have to swap out anyway, right?

Yes - usually it's the best utilization of the underlying I/O system.

Scale writes by the number of shards, and reads by the number of replicas per shard.
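That rule of thumb corresponds to two index-level settings; the counts below are purely illustrative:

```python
# number_of_shards spreads the write (indexing) load across primaries;
# number_of_replicas adds full copies of each shard that can serve reads.
index_settings = {
    "settings": {
        "number_of_shards": 5,    # illustrative: 5 primaries to scale writes
        "number_of_replicas": 1,  # illustrative: 1 replica per shard to scale reads
    }
}

primaries = index_settings["settings"]["number_of_shards"]
replicas = index_settings["settings"]["number_of_replicas"]
total_shard_copies = primaries * (1 + replicas)
print(total_shard_copies)  # 10 shard copies the cluster must host
```

Note that `number_of_replicas` can be changed on a live index, while `number_of_shards` is fixed at index creation, which is another reason rollover-style index management is convenient.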

You're basically stating that the max capacity of an ES node is ~30 GB. That seems highly inefficient, does it not?

I'm stating that you can easily write much more to a single node, but when you want to read it, you will run into trouble if you need to read multiple full shards into memory.

If you know what you are looking for and know your data, then it's usually not a problem, but it will be if you give a large group of data scientists access to Kibana. If you can manage with pre-configured dashboards, then you are also safe.

Honestly, I have never seen this be an issue to the extent you seem to be hinting at.
ES can handle TBs of data per node, with super-fast response times.

Elasticsearch can be tuned in a number of ways for different use cases, and determining the ideal shard size and the number of shards a node can handle is no different. Having a small enough data set per node that all data can be cached by the OS (which it seems you are recommending, if I read your post correctly) may be appropriate for search use cases with very high query rates and low latency requirements, but in my opinion it rarely makes sense for the vast majority of logging use cases.

I'll give you this one, as I mainly work with high-volume search use cases.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.