We are having some problems with document loss when ingesting data into Elasticsearch using Filebeat. I'll describe our approach to data ingest.
We are running a process that generates data and sends it to Filebeat over UDP as a string containing a JSON structure. Filebeat is configured with a UDP input and an Elasticsearch output. The input specifies an Elasticsearch ingest pipeline and the index format. We have not considered using Logstash, since the only processors in the ingest pipeline are JSON and convert ones that parse the string and extract the documents' fields. The output host is set to 'localhost', since we are running an ingest Elasticsearch node on the same machine.
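For context, a simplified sketch of our Filebeat configuration (the port, pipeline and index names here are placeholders, not our real values):

```yaml
filebeat.inputs:
  - type: udp
    host: "0.0.0.0:9001"               # port where the generator process sends the JSON strings
    max_message_size: 10KiB

output.elasticsearch:
  hosts: ["localhost:9200"]            # ingest node running on the same machine
  pipeline: "my-json-pipeline"         # ingest pipeline with the JSON and convert processors
  index: "my-index-%{+yyyy.MM.dd}"

# Required when overriding the default index name
setup.template.name: "my-index"
setup.template.pattern: "my-index-*"
```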
When sending information from the generator process, we have observed that the number of documents received is lower than the number of sent documents. For example, when sending 18,000 documents from localhost to Filebeat, we get only 15,000 documents in the Elasticsearch index.
We are considering different causes, for example that the event rate (TPS) is too high, since we don't add any delay between documents. However, reading this same forum, we've seen rates of around 95k TPS reported.
We'd appreciate any insights/recommendations regarding this problem. Is it possible to ingest this amount of information without doc loss? Is there any improvement that we could make to the configuration?
The number of events per second a setup can sustain depends on many things, so it may be that in your case the event rate is part of the problem.
You said that when sending 18k documents you are only getting 15k documents in Elasticsearch, but does this happen every time, or do you sometimes not lose any documents?
Also, what do you have in the Filebeat logs when this happens?
Is the JSON structure always the same?
UDP can lead to data loss in some cases. Have you tried changing to TCP to see if things change?
One alternative would be to use a disk queue in Filebeat instead of the default memory queue.
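For the TCP input, something along these lines should be enough to test with (a rough sketch, the port is just an example):

```yaml
filebeat.inputs:
  - type: tcp                # was: udp
    host: "0.0.0.0:9001"
    max_message_size: 20MiB  # adjust if your JSON strings are larger
```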
It happens every time. The amount of loss is not always the same, but there is always some loss. The structure of the JSON is always the same; the only difference is that sometimes some fields are empty. We configured the ingest pipeline to ingest the document anyway. We have logs in warning mode.
We managed to avoid the loss by following the recommendations in this doc for prioritizing throughput, in particular setting bulk_max_size: 1600 (previously 800) and queue.mem.events: 12800. Do you consider this an appropriate configuration?
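For reference, the relevant part of the configuration now looks roughly like this (only bulk_max_size and queue.mem.events are the exact values we mentioned; the worker count and flush settings are illustrative):

```yaml
output.elasticsearch:
  hosts: ["localhost:9200"]
  bulk_max_size: 1600        # was 800
  worker: 4                  # example value

queue.mem:
  events: 12800              # roughly bulk_max_size * worker * 2
  flush.min_events: 1600     # flush batches sized to match bulk_max_size
  flush.timeout: 5s
```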
Well, if it is working for you, then I do not see any issues.
One thing is that UDP can always lead to data loss in some cases; TCP is more reliable if data loss is a big issue.
Another thing is this:

> The structure of the JSON is always the same, the only difference is that sometimes some fields are empty

Depending on which fields come in empty, Elasticsearch may reject the document, but if you are handling that in the ingest pipeline then I do not see any issue there either.
Your insight regarding TCP seems very reasonable considering the protocol definition. However, I have some concerns about the throughput. Are there any limitations in this regard?
It is a trade-off: UDP does not guarantee delivery, but it may be faster and give you higher throughput; TCP tries to guarantee delivery, but that adds some overhead and can mean lower throughput.
The only way to know is by testing as each case is individual.
I know, for example, that some Cisco devices have issues when sending logs over TCP, so in that case UDP is required and dealing with data loss is handled by putting some queue or other tool in between.
Also, most of the time the limitation on the throughput will be caused by your destination, in this case Elasticsearch.
For example, if Elasticsearch cannot keep up with the event rate sent by Filebeat, it will tell the Filebeat output to back off a little; Filebeat will then internally tell the input to back off, but this does not work for all inputs.
In that case it doesn't matter whether you are using TCP or UDP: if Filebeat needs to back off because Elasticsearch cannot keep up with the event rate, the queue will start filling up, and once the queue is full no more events will be accepted, which can lead to data loss regardless of the protocol.
To try to mitigate this you may need to use a persistent queue, so events are written to disk; it is usually easier to have spare disk than spare memory.
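A minimal sketch of enabling the disk queue (the size is just an example; the path defaults to a folder under the Filebeat data path):

```yaml
queue.disk:
  max_size: 10GB    # upper bound on disk space used by the queue
```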