I have multiple configurations defined in pipelines.yml. Each configuration pushes data from a JDBC source to Elasticsearch. I want to know the optimal settings for doing so, because sometimes one of the configurations takes more than 12 hours.
One likely reason is that you are setting document_id. With an explicit id, each insert into Elasticsearch first has to check your index user_log_index_%{+YYYY_MM_dd} for an existing document with that id, so as the index grows, that check before each write takes longer.
One way to avoid this problem is to partition your data into smaller indexes. Right now you are likely indexing everything into one index (assuming YYYY_MM_dd does not change during those 12 hours of indexing). You could instead use an indexing strategy that produces multiple smaller indexes, so that the id check on each write is faster. While doing so, keep in mind:
Your access patterns (how you are going to query your data).
That you do not end up creating a large number of very small indexes.
The other option is to remove the document_id setting from the output, if you don't care about having duplicate entries in Elasticsearch.
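As a rough illustration of the partitioning idea above, here is a minimal Python sketch of how a record's own timestamp could map to an index name at different granularities. The prefix user_log_index_ comes from the thread; the function name index_name and the granularity labels are hypothetical, and this is not how Logstash itself is implemented. In Logstash, the %{+YYYY_MM_dd} pattern is formatted from the event's @timestamp field, so the partitioning only follows the data if @timestamp reflects the record's time rather than the time it was read.

```python
from datetime import datetime

# Hypothetical helper: maps a record timestamp to an index name.
# The strftime patterns mirror the YYYY_MM_dd naming used in the thread.
_FORMATS = {
    "day": "%Y_%m_%d",
    "month": "%Y_%m",
    "year": "%Y",
}


def index_name(event_time: datetime,
               granularity: str = "day",
               prefix: str = "user_log_index_") -> str:
    """Return the index a record with this timestamp would be written to."""
    if granularity not in _FORMATS:
        raise ValueError(f"unknown granularity: {granularity!r}")
    return prefix + event_time.strftime(_FORMATS[granularity])


if __name__ == "__main__":
    ts = datetime(2021, 3, 7, 14, 30)
    print(index_name(ts, "day"))
    print(index_name(ts, "month"))
```

Finer granularity gives smaller indexes but more of them, which is the trade-off the two points above describe.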
To answer your question, I am using an alias to query this data. I want to avoid duplicates, which is why I have to use document_id.
A little info on the data: this is the audit log for one of our products, which is why it will be huge. I am thinking of partitioning the retrieval process by splitting it into multiple queries, as given below.
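The general idea of splitting one large retrieval into several bounded queries can be sketched as follows. This is only an illustration under assumptions: the table name audit_log, the column created_at, and the helper split_windows are hypothetical and are not taken from the actual configuration. Each window is half-open (start inclusive, end exclusive), so no row falls into two queries or gets skipped.

```python
from datetime import datetime, timedelta


def split_windows(start: datetime, end: datetime, step: timedelta):
    """Split [start, end) into consecutive half-open windows of at most `step`."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    windows = []
    current = start
    while current < end:
        upper = min(current + step, end)
        windows.append((current, upper))
        current = upper
    return windows


# Hypothetical query template; table and column names are assumptions.
QUERY = (
    "SELECT * FROM audit_log "
    "WHERE created_at >= '{lo}' AND created_at < '{hi}'"
)

if __name__ == "__main__":
    for lo, hi in split_windows(datetime(2021, 1, 1),
                                datetime(2021, 1, 4),
                                timedelta(days=1)):
        print(QUERY.format(lo=lo.isoformat(), hi=hi.isoformat()))
```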
The production system will have a dedicated machine for Logstash with 32GB RAM, an 8-core CPU and a minimum of 300GB HDD, but is this kind of load still possible with 2 CPU cores, 8GB RAM and 70GB HDD?
What would be considered an acceptable number of smaller indexes?
Also, I made further changes after reading around. Will these improve my chances of getting data into Elasticsearch faster without overwhelming and crashing either Logstash or Elasticsearch?