We are having some problems with document loss when ingesting data into Elasticsearch using Filebeat. I'll describe our approach to data ingestion.
We are running a process that generates data and sends it to Filebeat over UDP as a string containing a JSON structure. Filebeat is configured with a UDP input and an Elasticsearch output; the input specifies an Elasticsearch ingest pipeline and the index format. We have not considered using Logstash, since the only processors in the ingest pipeline are a JSON one and some convert ones to parse the string and extract the documents' fields. The output host is 'localhost', since we run an ingest Elasticsearch node on the same machine.
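For reference, here is a minimal sketch of what that setup looks like; the port, pipeline name, and index pattern below are placeholders, not our actual values:

```yaml
# filebeat.yml (sketch, not our exact configuration)
filebeat.inputs:
  - type: udp
    host: "0.0.0.0:9001"           # listening port is an assumption
    max_message_size: 10KiB        # one JSON document per datagram

output.elasticsearch:
  hosts: ["localhost:9200"]        # ingest node runs on the same machine
  pipeline: "parse-json"           # ingest pipeline with json/convert processors
  index: "mydata-%{+yyyy.MM.dd}"   # example index format
  # (a custom index also requires matching setup.template settings in real use)
```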
When sending information from the generator process, we have observed that the number of documents received is lower than the number of sent documents. For example, when sending 18,000 documents from localhost to Filebeat, we get only 15,000 documents in the Elasticsearch index.
We are considering different causes, for example that the transactions-per-second rate is too high, since we don't add any delay between documents; however, reading this same forum, we've seen reported rates of around 95k tps.
We'd appreciate any insights/recommendations regarding this problem. Is it possible to ingest this amount of information without doc loss? Is there any improvement that we could make to the configuration?
It happens every time. The loss level varies, but there is always some loss. The structure of the JSON is always the same; the only difference is that sometimes some fields are empty, and we configured the ingest pipeline to ingest the document anyway. We have logs in warning mode.
We managed to avoid the loss by following this doc's recommendations for throughput priority, in particular setting bulk_max_size: 1600 (previously 800) and queue.mem.events: 12800. Do you consider this an appropriate configuration?
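In config form, the change amounts to something like the sketch below. The worker count is an assumption on our side: the throughput guidance sizes the memory queue at roughly bulk_max_size × workers × 2, which with 4 workers gives 1600 × 4 × 2 = 12800.

```yaml
# Throughput-oriented tuning (sketch)
queue.mem:
  events: 12800            # up from the default (4096 in recent versions)
  flush.min_events: 1600   # hand the output full bulk-sized batches
output.elasticsearch:
  hosts: ["localhost:9200"]
  bulk_max_size: 1600      # previously 800
  worker: 4                # assumption: 1600 * 4 * 2 = 12800
```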
It is a trade-off: UDP does not guarantee delivery, but it may be faster and give you higher throughput; TCP tries to guarantee delivery, but this adds overhead and can reduce throughput.
The only way to know is by testing as each case is individual.
I know, for example, that some Cisco devices have issues when sending logs over TCP, so in that case UDP is required, and the job of dealing with data loss is handled by putting a queue or other tool in between.
Also, most of the time the limitation on the throughput will be caused by your destination, in this case Elasticsearch.
For example, if your Elasticsearch cannot keep up with the event rate sent by Filebeat, it will tell the Filebeat output to back off a little; Filebeat will then internally tell the input to back off, but this does not work for all inputs.
In this case it doesn't matter whether you are using TCP or UDP: if Filebeat needs to back off because Elasticsearch cannot keep up with the event rate, it will start filling the queue, and when the queue is full no more events will be accepted. This can lead to data loss regardless of the protocol.
To address this you may need a persistent queue, so that events are written to disk; it is more common to have spare disk than spare memory.
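As a sketch, with Filebeat this can be done with the disk queue (available in recent versions, 7.10+ if I remember correctly); the size here is just an example:

```yaml
# Buffer events on disk so backpressure from Elasticsearch
# fills the disk queue instead of dropping events from memory.
queue.disk:
  max_size: 10GB   # example value; size it to the burst you need to absorb
```

Note that with UDP a datagram can still be lost before it ever reaches the queue, for example if the kernel socket buffer overflows, so this reduces the risk of loss but does not eliminate it.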