Hi there,
In our application we decided to use Elasticsearch to create a daily snapshot of some critical application data for visualizations.
The issue we are facing is missing records in our ETL process - for some reason not all the data we upload to Elasticsearch actually ends up in the index.
We have been running tests with various settings, and so far the following combination works best and does not produce missing records (a curl sketch of these settings follows the list):
Bulk indexing using curl from Java code (Talend)
Refresh interval disabled
Number of shards: 2
Replicas: 0
5 requests at a time with 300 documents per bulk request (this seems to work best)
With the refresh interval enabled at 30s and the document count dropped to 100 per request, we get missing records in the Elasticsearch index.
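For reference, the index settings above are applied with a curl call roughly like the following (a minimal sketch; the host and index name are placeholders, not our actual values):

```
# Sketch of the index settings described above; host and index name are placeholders.
# refresh_interval: -1 disables refresh; in the failing runs this was "30s" instead.
curl -s -H 'Content-Type: application/json' \
  -XPUT 'http://elasticsearch:9200/app-snapshot-index' \
  -d '{
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 0,
      "refresh_interval": "-1"
    }
  }'
```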
We wrote a Talend job to retrieve the data from the line-of-business system and use curl inside Talend to do bulk inserts of the documents into Elasticsearch.
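A single bulk request from the Talend job looks roughly like the sketch below (the host, index name and batch.ndjson file are placeholder names, not our exact code); the jq line shows one way to surface the per-item errors flag that the _bulk response carries:

```
# Minimal sketch of one bulk request (300 documents per batch in our working runs).
# batch.ndjson is newline-delimited JSON in the _bulk format, e.g.:
#   {"index":{"_id":"1"}}
#   {"field1":"value1"}
curl -s -H 'Content-Type: application/x-ndjson' \
  -XPOST 'http://elasticsearch:9200/app-snapshot-index/_bulk' \
  --data-binary @batch.ndjson \
  | jq '{errors: .errors, reasons: [.items[] | select(.index.error != null) | .index.error.reason]}'
```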
Can someone shed some light on why we are seeing missing records when changing the above settings? We do not get any errors throughout the process and all the files are being processed.
The rows highlighted in green are the only settings that produced no missing records during our tests.
The "split rows" column is the document count in each bulk insert request.
System information:
Elasticsearch deployed in Azure Kubernetes Service.
Node pool made up of 3 nodes of VM type Standard_B12ms (12 vCPUs and 48 GiB memory).
K8s resource: StatefulSet, 3-node cluster, version 8.1.0 Docker image
Pod cpu limit: 8 cpus
Pod memory limit: 16Gi
JVM settings: -Xms8g -Xmx8g
The Talend job runs as a CronJob in the same cluster.
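For completeness, the cluster state and heap can be confirmed from inside the cluster roughly like this (the service host is a placeholder for our in-cluster Elasticsearch service):

```
# Sketch of the checks run against the cluster; the service host is a placeholder.
curl -s 'http://elasticsearch:9200/_cluster/health?pretty'
curl -s 'http://elasticsearch:9200/_cat/nodes?v&h=name,heap.max,ram.max,cpu'
```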
Any insight into why we keep seeing missing records would be much appreciated.