Hello,
I am attempting to build a large-scale ELK deployment at work. Here is the
basic layout of what we have so far:
        Nodes (approx 150)
            [logstash]
                 |
                 |
          +------+------+
          |             |
      Indexer1      Indexer2
      [Redis]       [Redis]
      [Logstash]    [Logstash]
          |             |
          |             |
          +------+------+
                 |
                 |
             ES Master ---------- Kibana3
           [Master: yes]
           [Data: no]
                 |
                 |
      ES Data (4 data nodes)
           [Master: no]
           [Data: yes]
In case the formatting does not hold with the above, I've created a
paste here: https://baneofswitches.privatepaste.com/c8dfc2c30b
The Setup
- We have approximately 150 nodes configured to send to a "shuffled" Redis
  instance on either Indexer1 or Indexer2. A sanitized version of the node
  Logstash config is here: https://baneofswitches.privatepaste.com/345b94064d
  (a rough sketch of its shape also appears after this list).
- Each indexer is identical. They both run their own independent Redis
  service, and each runs a Logstash service that pulls events from Redis and
  pushes them to the ES Master using the http protocol. A sanitized version
  of their config is here: https://baneofswitches.privatepaste.com/e19eae690f
  (also sketched after this list).
- The ES Master is configured to be a master only, not a data node. It has
  32 GB of RAM.
- There are 4 ES data nodes, configured to be data nodes only and ineligible
  to be elected as master. They have 62 GB of RAM, and the ES storage is on
  SSDs. (The node-role settings are sketched after this list as well.)
- We have Kibana3 configured to search from the ES Master.
- The average rate of logs generated by all nodes combined is approximately
  7k/sec, with peaks up to about 16k/sec.
- Indexer throughput is good enough that one indexer can keep up on its own
  during normal usage.
- We are using the default 5 shards with 1 replica.
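For context, here is roughly the shape of the shipper config on each node.
This is a minimal sketch, not the real config from the paste: the file input,
hostnames, and list key are placeholders, but the shuffled output is the same
idea (the redis output's shuffle_hosts option randomizes which of the listed
Redis hosts gets used):

    input {
      file {
        path => "/var/log/messages"     # placeholder; real inputs vary per node
        type => "syslog"
      }
    }
    output {
      redis {
        host => ["indexer1.example.com", "indexer2.example.com"]  # hypothetical names
        shuffle_hosts => true
        data_type => "list"
        key => "logstash"               # assumed list key
      }
    }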
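And a similarly hedged sketch of the indexer side (hostname and the tuning
values are placeholders rather than our actual settings):

    input {
      redis {
        host => "127.0.0.1"
        data_type => "list"
        key => "logstash"               # must match the shipper key
      }
    }
    output {
      elasticsearch {
        host => "es-master.example.com" # hypothetical hostname
        protocol => "http"
        flush_size => 5000              # example bulk size, not necessarily ours
        workers => 2                    # example, not necessarily ours
      }
    }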
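The node roles are set the usual way in elasticsearch.yml (assumed form; the
real files obviously contain more than this):

    # elasticsearch.yml on the ES Master
    node.master: true
    node.data: false

    # elasticsearch.yml on the 4 data nodes
    node.master: false
    node.data: true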
The Problem
When this setup is under the load described above, we are noticing that some
logs are being dropped. We were able to test this by running something like:
seq 1 5000 | xargs -I{} -n 1 -P 40 logger "Testing unqString {} of 5000"
Sometimes we would see all 5000 show up in Kibana; other times only a subset
of them (for example, 4800 events).
Troubleshooting
We have taken a number of steps to eliminate possibilities. By monitoring
counts over many trials, we confirmed that logs are reliably transferred from
the nodes to Redis and from Redis through Logstash. The Redis -> Logstash leg
was tested by also writing events out to a file and comparing counts.
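Concretely, that check amounted to temporarily adding a file output alongside
the elasticsearch output on an indexer and counting the test marker on both
ends (hostname and path here are placeholders):

    output {
      elasticsearch { host => "es-master.example.com" protocol => "http" }
      file { path => "/var/tmp/logstash_debug.log" }   # temporary copy for counting
    }

Counting the marker in the file copy is then just:

    grep -c "Testing unqString" /var/tmp/logstash_debug.log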
That left the Logstash -> ES leg. We tested this by writing a script that
pushes fake events via the bulk API. We were unable to reproduce the problem
with a single request on its own. However, when the cluster is under load (we
let 'real' logs flow) and we push via the bulk API with our script, we
occasionally see partial loss of data.
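Our script is a bit more involved than this, but the shape of the test is
roughly the following (hostname, index name, type, and field names are all
made up for the sketch):

    # build a 5000-event bulk payload with a recognizable marker
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    for i in $(seq 1 5000); do
      echo '{"index":{"_index":"bulktest","_type":"testevent"}}'
      echo "{\"@timestamp\":\"$ts\",\"message\":\"bulk marker $i of 5000\"}"
    done > /tmp/bulk_payload

    # push it in a single bulk request and check the HTTP status
    curl -s -o /dev/null -w "%{http_code}\n" -XPOST \
      "http://es-master.example.com:9200/_bulk" --data-binary @/tmp/bulk_payload

    # refresh, then count what is actually searchable
    curl -s -XPOST "http://es-master.example.com:9200/bulktest/_refresh" > /dev/null
    curl -s "http://es-master.example.com:9200/bulktest/_count?q=message:marker"

The loss shows up as the final count coming back short of 5000 even though
the bulk request itself returned 200.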
It's important to note that "partial loss" here means the request succeeds
(200 return code) and most of the data in the bulk request is then
searchable, but not all of it. For example, if we put the cluster under load
and push a bulk request containing 5000 events, we might see only 4968 of the
5000 in a subsequent search.
We have tried increasing the bulk API thread pool as well as giving a greater
percentage (50%) of the heap to the indexing buffer. Neither has fixed the
issue.
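For reference, those two changes were along these lines in elasticsearch.yml
on the nodes doing the indexing (ES 1.x setting names; the bulk queue value
below is just an example, the 50% buffer is what we actually tried):

    threadpool.bulk.queue_size: 500           # example value; default is 50
    indices.memory.index_buffer_size: 50%     # default is 10%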
Conclusion
I am looking for feedback on how to troubleshoot this further and find the
cause. I am also interested in hearing whether anyone else out there is
handling this sort of incoming volume, and what they had to do to get their
setup working reliably. I appreciate all feedback.