For Efficient and High Performance Search of Logs


(Sushant Sood) #1

Hi, I have a scenario where I will have logs from a 100-node, horizontally scalable cluster. Which of the options below would be the better way to achieve efficient, high-performance search?

1) Create one daily index of logs, with 5 primary shards, per node of the cluster, meaning 100 daily indices for 100 nodes.

2) Create one daily index (5 primary shards) per subset of nodes, for example per 10 nodes of the cluster, meaning 10 daily indices for the 100-node cluster.
Logs would be searched across all the indices. Please suggest.


(Magnus Bäck) #2

The 100-node cluster you're talking about isn't your ES cluster, right? But rather the cluster whose logs you want to make searchable?

Any particular reason you want to put logs from each machine into an index of its own? The common practice is to store the hostname in a separate field and include that field in queries.
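
A minimal sketch of what such a query could look like, built as a plain Python dict in the Elasticsearch query DSL. The field names `host` and `message`, the hostname `node-042` and the helper `build_host_query` are hypothetical, and the `term` filter assumes the hostname field is indexed as an exact (not analyzed) value.

```python
import json


def build_host_query(hostname, text, host_field="host", message_field="message"):
    """Build a search body that matches `text` only in logs from one host.

    The field names are assumptions; adjust them to your own mapping.
    """
    return {
        "query": {
            "bool": {
                # Full-text match on the log message.
                "must": [{"match": {message_field: text}}],
                # Exact-value restriction to a single source host.
                "filter": [{"term": {host_field: hostname}}],
            }
        }
    }


body = build_host_query("node-042", "connection timeout")
print(json.dumps(body, indent=2))
```

Leaving out the `filter` clause searches the logs of all hosts in the same index, so one shared index serves both cases.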


(Sushant Sood) #3

That's correct, the 100-node cluster is not my ES cluster. I am already putting the hostname in a separate field to use it in queries. How should I index the logs from the 100 nodes? How should I create index buckets so that search stays scalable in the future? The index would be a daily index.


(Magnus Bäck) #4

With daily indexes each index will have a bounded size, but you may want to have a shard count > 1 to keep the size of each shard down (shards of up to a few tens of GB are generally okay). With a multi-node cluster that should also improve performance since multiple nodes can help out with queries affecting a particular day. On the other hand, queries that span more than one day will probably touch multiple nodes even if the shard count is 1.

The ideal number of shards depends on many factors, including the number of nodes in the ES cluster, the amount of data, and the query patterns. You may have to experiment.
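
As a starting point for that experimentation, the shard-size guideline can be turned into a rough estimate. This is only a sketch, not an official formula: the helper `estimate_primary_shards`, the 100 GB/day volume and the 20 GB target are hypothetical, and the size of the data once indexed can differ from the raw log volume.

```python
import math


def estimate_primary_shards(daily_gb, target_shard_gb):
    """Rough primary shard count for one daily index.

    daily_gb: expected size of one day's index in GB (assumed figure).
    target_shard_gb: shard size you are aiming for, in GB (assumed figure).
    """
    if daily_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    # Round up so that no shard exceeds the target, and never go below 1.
    return max(1, math.ceil(daily_gb / target_shard_gb))


shards = estimate_primary_shards(daily_gb=100, target_shard_gb=20)
print(f"Start with about {shards} primary shards and adjust after measuring")
```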


(Sushant Sood) #5

Thanks for the quick reply. I have a multi-node cluster with dedicated master-eligible nodes, but one question I have is: should I store logs from all 100 nodes in a single daily index?


(Sushant Sood) #6

Hi Magnus, I would really appreciate it if you could clarify whether to store the logs from 100 nodes in a single daily index or in multiple smaller index buckets, e.g. logs from a subset of 10 machines in one index.


(Christian Dahlqvist) #7

An index in Elasticsearch can handle large amounts of data, so storing data related to hundreds or thousands of servers/devices is generally not a problem at all. If the shards get too big (larger than a few tens of GB), you can increase the number of shards to handle more data. In the same way, you can reduce the number of shards from the default of 5 if you have small volumes.
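
The shard count is a per-index setting chosen when the index is created, so with daily indices a changed value takes effect from the next day's index onward. A minimal sketch of the settings body follows. `number_of_shards` and `number_of_replicas` are the actual setting names, while the helper `daily_index_settings` and the values used are hypothetical. How the body is applied (create-index request or index template) depends on your Elasticsearch version.

```python
import json


def daily_index_settings(primary_shards, replicas=1):
    """Build the settings body for a new daily index (values are examples)."""
    if primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    if replicas < 0:
        raise ValueError("replica count cannot be negative")
    return {
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        }
    }


settings_body = daily_index_settings(primary_shards=10)
print(json.dumps(settings_body, indent=2))
```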

The decision to use multiple indices is generally based more on the nature of the data being stored. If you have different types of data that are never queried together and have very different structure and possibly mapping conflicts, storing this in multiple indices might make sense.

Each index and shard has a certain amount of overhead, so having lots of small indices/shards is generally inefficient.
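
To make the overhead concrete, here is a small count of primary shards for the two layouts from the original question next to a single shared daily index. The 10-shard figure for the single index and the 30-day retention period are illustrative assumptions, and replicas are ignored.

```python
def primary_shards(indices_per_day, shards_per_index, retention_days=1):
    """Total primary shards held for the given number of days."""
    return indices_per_day * shards_per_index * retention_days


# (indices per day, primary shards per index)
layouts = {
    "one index per node": (100, 5),
    "one index per 10 nodes": (10, 5),
    "single daily index": (1, 10),  # shard count is an assumed example
}

RETENTION_DAYS = 30  # hypothetical retention period

for name, (indices, shards) in layouts.items():
    per_day = primary_shards(indices, shards)
    retained = primary_shards(indices, shards, RETENTION_DAYS)
    print(f"{name}: {per_day} per day, {retained} over {RETENTION_DAYS} days")
```

The same daily volume is spread over far more, and therefore far smaller, shards in the per-node layout.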


(Sushant Sood) #8

Hi Christian, thanks for your detailed response. That clears up my doubt.


(Sushant Sood) #9

Thanks, Magnus and Christian, for taking the time to provide your input.


(Sushant Sood) #10

Hi Christian, I will have 90-100 GB of logs per day, but the type of data will be the same. Would storing the 100 GB of logs in one index with 10 shards be fine for good search performance?


(Sushant Sood) #11

Hi Elastic team, please let me know if you have any suggestions on this question so that we can proceed.


(Magnus Bäck) #12

What results in the best query performance depends on a wide range of factors, but for time-series data like logs that you can't keep around forever, the best practice is to use time-series indexes (e.g. daily indexes) and shard each index as necessary to keep the shard size manageable. Splitting the data per source host (e.g. one index per host) is not recommended, but as Christian said earlier it could make sense to split per type (e.g. one index series for syslog and one for HTTP requests).
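
A minimal sketch of such a naming scheme, assuming one daily index per series named `<series>-YYYY.MM.DD`. The series names `syslog` and `http`, the dates and the helper `daily_index_names` are hypothetical.

```python
from datetime import date, timedelta


def daily_index_names(series, start, end):
    """List the daily index names of one series for an inclusive date range."""
    if end < start:
        raise ValueError("end must not be before start")
    names = []
    day = start
    while day <= end:
        # One index per day, e.g. "syslog-2016.05.30".
        names.append(f"{series}-{day:%Y.%m.%d}")
        day += timedelta(days=1)
    return names


syslog_indices = daily_index_names("syslog", date(2016, 5, 30), date(2016, 6, 1))
# Elasticsearch accepts a comma-separated list of index names in a search.
print(",".join(syslog_indices))
```

A search can then target just the days it needs, or a wildcard such as `syslog-*` for the whole series, and old data can be removed by deleting whole indices.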

