I have built a POC in which there are 4 data nodes, 3 master nodes and 2 load balancer nodes (one client node is defined on Kibana itself). All are in the same cluster.
What should be the number of shards in this scenario? Presently, I have defined it as 5.
It is usually faster to write when you have one shard on each node. So with replicas set to 1, giving you two-way replication, you will write fastest with 2 shards. It is generally faster to query as few shards as possible, so long as the shards aren't too large; 30GB is a fairly good size to shoot for. So maybe you should have 1 shard per index. But that only makes sense if you are querying more than one index, because otherwise your queries would only go to a single node and three of your data nodes wouldn't do anything.
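To make that concrete, here is a hedged sketch of creating an index with 2 primary shards and 1 replica (the index name is just a placeholder):

PUT /example-logs
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}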
For time-based data (logs, sensor measurements, stuff like that) you should investigate using 2 shards and then using the rollover and shrink APIs to shrink down to one. The advice for non-time-based data is more free-form: do performance testing and see.
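As a rough sketch of that flow (the alias and index names are hypothetical, and the exact rollover conditions available depend on your Elasticsearch version); note that the source index has to be made read-only, with all of its shard copies on one node, before it can be shrunk:

POST /logs_write/_rollover
{
  "conditions": {
    "max_age": "1d",
    "max_docs": 100000000
  }
}

POST /logs-000001/_shrink/logs-000001-shrunk
{
  "settings": {
    "index.number_of_shards": 1
  }
}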
I should point out that "load balancer nodes" generally aren't all that useful. Master-only nodes are great, especially if you can put them on sensibly small machines. But nodes that don't hold data and aren't master-eligible don't tend to buy you much.
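For reference, these roles are just flags in elasticsearch.yml (shown here in the 5.x style; a sketch, not a full configuration):

# dedicated master-eligible node
node.master: true
node.data: false
node.ingest: false

# coordinating-only ("client" / "load balancer") node
node.master: false
node.data: false
node.ingest: false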
I have created 3 indices, for Windows, Linux and network devices respectively, each month (each month a new index is made). Currently, for the Unix index, the shard store is 25.6 GB. I am facing a lag in the offset at Kafka when I input the Unix logs.
With the concept of 1 shard per index per node, there should be 3 shards. I have not created any replicas.
Why is it that I am still getting lag in the offset at Kafka? This lag first occurred when I made a query in Kibana, and it is increasing with time.
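For readers less familiar with Kafka, the lag being described is usually read off the consumer group, roughly like this (the group name and broker address are placeholders, and the exact tooling depends on the Kafka version):

kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group logstash

The LAG column in that output is the difference between the log end offset and the consumer's current offset, i.e. how far behind the consumer (here, Logstash) has fallen.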
I have no idea what you mean here. I don't know Kafka so I'm probably just missing vocabulary words. Are you saying that you can't index as fast as you'd like?
I have 3 indices, one each for Linux data, Windows data and network devices respectively. The events received at Kibana are about 500GB per day from about 5000 devices. This will increase to 25000 devices and 1TB of events per day.
Based on this it looks like you are wasting a lot of powerful hardware on dedicated master and client nodes that could better be used as additional data nodes. The specification for the master nodes is overkill, as they are expected to do little work. 2 CPU cores and 4GB RAM (3GB heap) is sufficient even for reasonably large clusters. As Nik pointed out, dedicated client nodes are also not necessarily required in many log analytics use cases. I suspect you would benefit from having 3 smaller master nodes and more data nodes.
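As a sketch, a small dedicated master node along those lines would simply have its heap capped in jvm.options (the 3GB figure is the example above, not a universal recommendation):

-Xms3g
-Xmx3g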
When doing heavy indexing in Elasticsearch CPU and RAM are important, but it is often the latency and throughput of the storage that can limit performance. Based on your description it is difficult to tell whether it is the indexing layer or Elasticsearch itself that is the bottleneck. Look at the cluster when it is indexing and try to identify what is limiting performance, e.g. CPU or long IO wait. If nothing stands out, try measuring the performance of your indexing pipeline without writing to Elasticsearch. If the throughput increases when removing Elasticsearch from the equation, it is Elasticsearch that is the bottleneck.
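From the Elasticsearch side, the node stats API gives a first look at per-node CPU, process and filesystem figures while indexing is running (a starting point, not an exhaustive diagnosis):

GET /_nodes/stats/os,process,fs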
Look at the following video for a discussion around sizing and benchmarking.
I wanted to know: right now I have 5 shards (no replicas) with 3 indices and 4 data nodes. My understanding is that when I run a query in Kibana (a simple search of all logs of a particular index), it searches 15 locations (5 shards * 3 indices = 15). Is this the reason there is always a lag in the logs (due to the larger number of shards), even if I restart the sending of logs after all the lag has been consumed?
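Incidentally, the _shards section of a search response shows exactly how many shards a given query touched; a quick sketch against one of the indices listed below:

GET /testunix_2016.11/_search
{
  "query": { "match_all": {} }
}

A query scoped to a single index like this would report a _shards.total of 5, since it only fans out to that index's own shards.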
Following is the output of the /_cat/shards?v API:
index shard prirep state docs store ip node
.kibana 0 p STARTED 5 31.6kb x.x.x.x es-dn-3
.kibana 0 r STARTED 5 31.6kb x.x.x.x es-dn-2
testwins_2016.11 4 p STARTED 1505859 601mb x.x.x.x es-dn-2
testwins_2016.11 3 p STARTED 1503570 595.1mb x.x.x.x es-dn-1
testwins_2016.11 2 p STARTED 1503781 596mb x.x.x.x es-dn-1
testwins_2016.11 1 p STARTED 1506562 598.1mb x.x.x.x es-dn-4
testwins_2016.11 0 p STARTED 1503442 594.5mb x.x.x.x es-dn-3
testunix_2016.11 4 p STARTED 71884293 25.8gb x.x.x.x es-dn-1
testunix_2016.11 3 p STARTED 71880403 25.5gb x.x.x.x es-dn-2
testunix_2016.11 2 p STARTED 71886145 26.1gb x.x.x.x es-dn-4
testunix_2016.11 1 p STARTED 71854318 25.9gb x.x.x.x es-dn-4
testunix_2016.11 0 p STARTED 71888839 25.9gb x.x.x.x es-dn-3
testnw_2016.11 4 p STARTED 2492 1mb x.x.x.x es-dn-1
testnw_2016.11 3 p STARTED 2477 1mb x.x.x.x es-dn-1
testnw_2016.11 2 p STARTED 2479 1mb x.x.x.x es-dn-2
testnw_2016.11 1 p STARTED 2578 1.1mb x.x.x.x es-dn-3
testnw_2016.11 0 p STARTED 2642 1mb x.x.x.x es-dn-4
Could you explain the resource utilization in this situation?
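Two cat APIs give a quick per-node picture of how data and resources are spread across the cluster (exact columns vary by version):

GET /_cat/allocation?v
GET /_cat/nodes?v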
I am not sure what you mean by lag. Can you please clarify? I would also recommend adding a replica to your indices, as this provides better resiliency as well as read throughput.
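Adding a replica to an existing index is a dynamic settings change, roughly like this (one of your index names is used as an example):

PUT /testunix_2016.11/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}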
This shows the logs lagging at Kafka (the LAG column) when I query in Kibana. After Kafka, there is a Logstash instance which applies a grok filter and sends the data to the Elasticsearch cluster.
When I stop data from the source and wait for a few minutes, the LAG goes to zero. If I resume the logs from the source, they are shown in Kibana but with a delay, as the logs are queued at Kafka. The LAG increases with time (even if I stop querying in Kibana).
Moreover, one thing I noted is that the LAG is zero when I start the logs all over again and run no queries in Kibana. Once I run my first query in Kibana, the LAG increases.
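For readers following along, a minimal sketch of the kind of Kafka → grok → Elasticsearch pipeline being described (the topic, group, grok pattern and hosts are all hypothetical placeholders, and option names vary between Logstash versions):

input {
  kafka {
    bootstrap_servers => "kafka:9092"
    topics => ["unix-logs"]
    group_id => "logstash"
  }
}
filter {
  grok {
    match => { "message" => "%{SYSLOGLINE}" }
  }
}
output {
  elasticsearch {
    hosts => ["http://es-dn-1:9200"]
    index => "testunix_%{+YYYY.MM}"
  }
}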
Indexing and querying compete for the same resources in the cluster, e.g. CPU and disk IO. Based on your description it sounds like your data nodes are able to keep up with indexing as long as indexing has all the resources at its disposal, but when querying also starts consuming resources the actual indexing rate drops.
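One quick way to see that contention is the thread pool cat API, which lists active, queued and rejected operations per node for the bulk and search pools, among others:

GET /_cat/thread_pool?v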
What is the specification of the data nodes with respect to CPU, RAM and type of storage? When you are indexing and querying at the same time, what do CPU and disk IO utilization look like?
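The hot threads API can also show what the data nodes are actually busy doing while indexing and querying overlap:

GET /_nodes/hot_threads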
Somehow the numbers you have provided do not add up for me. You stated that you receive 500GB per day, yet you appear to have monthly indices with 5 shards each, which at that ingest rate would get huge - well beyond the generally recommended max shard size of 50GB. In addition, you state that your 4 data nodes have 500GB of SSD each, which totals only about 2TB of storage, or roughly 4 days of data. If you can only keep a few days of data on disk, why use monthly indices?
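If you do switch to daily indices with fewer shards, that is typically just an index template plus a daily index pattern in the Logstash output. A hedged sketch (the template name, pattern and shard count are illustrative; 5.x templates use the "template" field, later versions use "index_patterns"):

PUT /_template/testunix
{
  "template": "testunix_*",
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}

The Logstash elasticsearch output's index option would then change to a daily pattern such as testunix_%{+YYYY.MM.dd}.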