Hello,
I am attempting to build a large-scale ELK deployment at work. Here is the
basic layout of what we have so far:
        Nodes (approx 150)
            [logstash]
                 |
                 |
          +------+------+
          |             |
      Indexer1      Indexer2
      [Redis]       [Redis]
      [Logstash]    [Logstash]
          |             |
          |             |
          +------+------+
                 |
                 |
             ES Master ---------- Kibana3
           [Master: yes]
           [Data: no]
                 |
                 |
      ES Data (4 data nodes)
           [Master: no]
           [Data: yes]
In case the formatting does not hold with the above, I've created a
paste here: https://baneofswitches.privatepaste.com/c8dfc2c30b
The Setup
- We have approximately 150 nodes configured to send to a "shuffled" Redis
  instance on either Indexer1 or Indexer2. A sanitized version of the node
  Logstash config is here: https://baneofswitches.privatepaste.com/345b94064d
  (a rough sketch of its shape also appears after this list).
- Each indexer is identical. They both run their own independent Redis
  service, and each runs a Logstash service that pulls events from Redis and
  pushes them to the ES Master using the http protocol. A sanitized version
  of their config is here: https://baneofswitches.privatepaste.com/e19eae690f
  (also sketched after this list).
- The ES Master is configured to be a master only, not a data node. It has
  32 GB of RAM.
- There are 4 ES data nodes, configured to be data nodes only and ineligible
  to be elected as master. They have 62 GB of RAM, and the ES storage is on
  SSDs. (The node-role settings are sketched after this list as well.)
- We have Kibana3 configured to search from the ES Master.
- The average rate of logs generated by all nodes combined is approximately
  7k/sec, with peaks up to about 16k/sec.
- Indexer throughput is good enough that one indexer can keep up on its own
  during normal usage.
- We are using the default 5 shards with 1 replica.
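For context, here is roughly the shape of the shipper config on each node.
This is a minimal sketch, not the real config from the paste: the file input,
hostnames, and list key are placeholders, but the shuffled output is the same
idea (the redis output's shuffle_hosts option randomizes which of the listed
Redis hosts gets used):

    input {
      file {
        path => "/var/log/messages"     # placeholder; real inputs vary per node
        type => "syslog"
      }
    }
    output {
      redis {
        host => ["indexer1.example.com", "indexer2.example.com"]  # hypothetical names
        shuffle_hosts => true
        data_type => "list"
        key => "logstash"               # assumed list key
      }
    }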
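And a similarly hedged sketch of the indexer side (hostname and the tuning
values are placeholders rather than our actual settings):

    input {
      redis {
        host => "127.0.0.1"
        data_type => "list"
        key => "logstash"               # must match the shipper key
      }
    }
    output {
      elasticsearch {
        host => "es-master.example.com" # hypothetical hostname
        protocol => "http"
        flush_size => 5000              # example bulk size, not necessarily ours
        workers => 2                    # example, not necessarily ours
      }
    }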
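The node roles are set the usual way in elasticsearch.yml (assumed form; the
real files obviously contain more than this):

    # elasticsearch.yml on the ES Master
    node.master: true
    node.data: false

    # elasticsearch.yml on the 4 data nodes
    node.master: false
    node.data: true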
The Problem
When this setup is under the load described above, we are noticing that some
logs are being dropped. We were able to test this by running something like:
seq 1 5000 | xargs -I{} -n 1 -P 40 logger "Testing unqString {} of 5000"
Sometimes we would see all 5000 show up in Kibana; other times only a subset
of them (for example, 4800 events).
Troubleshooting
We have taken a number of steps to eliminate possibilities. By monitoring
counts over many trials, we confirmed that logs are reliably transferred from
the nodes to Redis and from Redis through Logstash. The Redis -> Logstash leg
was tested by also writing events out to a file and comparing counts.
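Concretely, that check amounted to temporarily adding a file output alongside
the elasticsearch output on an indexer and counting the test marker on both
ends (hostname and path here are placeholders):

    output {
      elasticsearch { host => "es-master.example.com" protocol => "http" }
      file { path => "/var/tmp/logstash_debug.log" }   # temporary copy for counting
    }

Counting the marker in the file copy is then just:

    grep -c "Testing unqString" /var/tmp/logstash_debug.log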
That left the Logstash -> ES leg. We tested this by writing a script that
pushes fake events via the bulk API. We were unable to reproduce the problem
with a single request on its own. However, when the cluster is under load (we
let 'real' logs flow) and we push via the bulk API with our script, we
occasionally see partial loss of data.
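Our script is a bit more involved than this, but the shape of the test is
roughly the following (hostname, index name, type, and field names are all
made up for the sketch):

    # build a 5000-event bulk payload with a recognizable marker
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    for i in $(seq 1 5000); do
      echo '{"index":{"_index":"bulktest","_type":"testevent"}}'
      echo "{\"@timestamp\":\"$ts\",\"message\":\"bulk marker $i of 5000\"}"
    done > /tmp/bulk_payload

    # push it in a single bulk request and check the HTTP status
    curl -s -o /dev/null -w "%{http_code}\n" -XPOST \
      "http://es-master.example.com:9200/_bulk" --data-binary @/tmp/bulk_payload

    # refresh, then count what is actually searchable
    curl -s -XPOST "http://es-master.example.com:9200/bulktest/_refresh" > /dev/null
    curl -s "http://es-master.example.com:9200/bulktest/_count?q=message:marker"

The loss shows up as the final count coming back short of 5000 even though
the bulk request itself returned 200.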
It's important to note that "partial loss" here means the request succeeds
(200 return code) and most of the data in the bulk request is then
searchable, but not all of it. For example, if we put the cluster under load
and push a bulk request containing 5000 events, we might see only 4968 of the
5000 in a subsequent search.
We have tried increasing the bulk API thread pool as well as giving a greater
percentage (50%) of the heap to the indexing buffer. Neither has fixed the
issue.
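For reference, those two changes were along these lines in elasticsearch.yml
on the nodes doing the indexing (ES 1.x setting names; the bulk queue value
below is just an example, the 50% buffer is what we actually tried):

    threadpool.bulk.queue_size: 500           # example value; default is 50
    indices.memory.index_buffer_size: 50%     # default is 10%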
Conclusion
I am looking for feedback on how to troubleshoot this further and find the
cause. I am also interested in hearing whether anyone else out there is
handling this sort of incoming volume, and what they had to do to get their
setup working reliably. I appreciate all feedback.