I am using the scan_scroll API to re-index data with the Python client. The data totals 90 GB and contains 40 million documents. Since the re-indexing is query-based, I usually get fewer than 10,000 documents per query. The index and machine configuration is below.
Elasticsearch version: 1.4.2
No. of primary shards: 8
No. of replica shards: 8
No. of total segments: 16
There are two data nodes, each with 26 GB of RAM and an 8-core CPU. The cluster also has 3 master nodes and 1 client node.
My problem is that the scan_scroll API is not consistent at all. About 20% of the time it does not return the complete data for the same query. The same thing happens with the _count API: running the same query repeatedly to get the document count often returns different results.
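To make the problem easier to reproduce, here is a minimal sketch of how the two checks could be run side by side. It is not my exact re-indexing code. It assumes `client` behaves like an elasticsearch-py 1.x `Elasticsearch` instance (`count`, `search` with `search_type="scan"`, and `scroll`); the function names, the index name, and the query in the usage comment are hypothetical. It also adds up the `_shards.failed` value from each response, on the assumption that partial results might come from shards that failed to answer.

```python
from collections import Counter


def count_repeatedly(client, index, query, runs=10):
    """Run the same _count query `runs` times and tally the distinct results.

    Assumes client.count(index=..., body=...) returns a dict with a
    "count" key and a "_shards" key, as elasticsearch-py does.
    """
    counts = Counter()
    failed_shards = 0
    for _ in range(runs):
        resp = client.count(index=index, body={"query": query})
        counts[resp["count"]] += 1
        # Sum shard failures across runs; a non-zero total hints at partial results.
        failed_shards += resp.get("_shards", {}).get("failed", 0)
    return counts, failed_shards


def scan_all(client, index, query, scroll="5m", size=500):
    """Fetch all hits with a scan + scroll loop (Elasticsearch 1.x style).

    Returns (total reported by the initial scan request, list of hits fetched),
    so the two numbers can be compared.
    """
    resp = client.search(index=index, body={"query": query},
                         search_type="scan", scroll=scroll, size=size)
    expected = resp["hits"]["total"]
    scroll_id = resp["_scroll_id"]
    hits = []
    while True:
        resp = client.scroll(scroll_id=scroll_id, scroll=scroll)
        page = resp["hits"]["hits"]
        if not page:
            break  # an empty page means the scroll is exhausted
        hits.extend(page)
        scroll_id = resp["_scroll_id"]  # always continue with the latest scroll id
    return expected, hits


# Hypothetical usage against a real cluster:
#   from elasticsearch import Elasticsearch
#   es = Elasticsearch(["client-node:9200"])
#   print(count_repeatedly(es, "my_index", {"term": {"field": "value"}}))
#   expected, hits = scan_all(es, "my_index", {"term": {"field": "value"}})
#   print(expected, len(hits))
```

If the tally from `count_repeatedly` has more than one key, or `len(hits)` differs from `expected`, that matches the behaviour I am seeing.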
Has anyone faced this issue?
Please let me know if someone can help.