Hello - This is a question around best practices for streaming custom data to Elastic.
Currently I have a script that pulls one event from my API and then posts it to /indexname/_doc. This works great, but now I'm wondering: what is the best way to run this continuously?
One option I've considered is to pull the data down (in a script), save it to a JSON document on disk, and then configure Filebeat to ship that data to Elasticsearch, but I wanted to ask others for advice before I go down that route.
Obviously there is also the Bulk API, which could be coded straight into the script; however, since Filebeat uses the Bulk API under the covers anyway, why reinvent the wheel?
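To make the question concrete, this is roughly what I mean, sketched in Python for illustration rather than my actual script; the API endpoint, index name, and poll interval below are placeholders:

```python
import time
import requests

ES_URL = "http://localhost:9200"   # assumption: local cluster, no auth
INDEX = "indexname"
POLL_INTERVAL = 30                 # seconds between API pulls


def fetch_event():
    """Placeholder for the call that pulls one event from my API."""
    resp = requests.get("https://my.api.example/latest-event")  # hypothetical endpoint
    resp.raise_for_status()
    return resp.json()


while True:
    event = fetch_event()
    # Same as the manual POST to /indexname/_doc, just wrapped in a loop
    r = requests.post(f"{ES_URL}/{INDEX}/_doc", json=event)
    r.raise_for_status()
    time.sleep(POLL_INTERVAL)
```

In other words, just looping the single-event pull and POST; it's the "is there a better way than this" part I'm asking about.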
What kind of script is it? Java-based? A plain Unix shell script?
There are pros and cons to both methods, but you surely know that already.
Using Filebeat is a good option because you benefit from its retry mechanism in case of a failure on the Elasticsearch side, for example.
On the other hand, you will have to write/update a file and think about cleaning up the old files.
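For illustration, the writing side of that approach could be as simple as appending NDJSON lines to a dated file that Filebeat tails; the directory and file naming here are only assumptions:

```python
import json
from datetime import date, datetime, timezone


def append_event(event: dict, directory: str = "/var/log/myapi") -> None:
    """Append one event as a single NDJSON line; Filebeat tails these files.

    One file per day makes it easy to delete old files later.
    """
    event.setdefault("@timestamp", datetime.now(timezone.utc).isoformat())
    path = f"{directory}/events-{date.today():%Y-%m-%d}.ndjson"
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
```

A Filebeat filestream (or log) input pointed at that directory with JSON decoding enabled would then ship each line, and deleting files older than a few days covers the cleanup part.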
My gut feeling is that if I were using Java, for example, I'd use the BulkProcessor, accumulate documents in it, and let it flush to Elasticsearch. But in that case you need to deal with errors "manually".
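If the script is (or becomes) Python rather than Java, a rough equivalent of that accumulate-and-flush pattern is the client's streaming_bulk helper; the index name and connection details below are assumptions, and failed items come back as results you handle yourself:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")   # assumption: local cluster, no auth


def actions(events):
    """Turn raw events into bulk actions for a hypothetical 'indexname' index."""
    for event in events:
        yield {"_index": "indexname", "_source": event}


def index_batch(events):
    # raise_on_error=False so failed items are returned as results
    # and can be handled "manually" (log, retry, dead-letter, ...).
    for ok, item in streaming_bulk(es, actions(events), chunk_size=500,
                                   raise_on_error=False):
        if not ok:
            print("failed to index:", item)
```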
Another solution is to write your documents to a message queue (like Kafka, Redis, RabbitMQ...) and have Logstash read events from that queue.
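As a sketch of that last option, the producer side can be as small as pushing JSON strings onto a Redis list (connection details and key name are assumptions); a Logstash redis input reading the same key with data_type => "list" would then pick the events up and index them:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379)   # assumption: local Redis


def publish(event: dict, key: str = "myapi-events") -> None:
    """Push one event onto a Redis list for Logstash to consume."""
    r.rpush(key, json.dumps(event))
```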
Hi David, thank you for the feedback, that is helpful. I may try Filebeat first and see how that goes and what the management side of things looks like.
The script itself is Windows PowerShell, but this is more of a POC; I will likely move it to Python or use the Elastic Node.js client.
From an architecture perspective, is running a script either continuously or every 30/60 seconds via cron a good use case for containers?