Issues with scan and scroll as well as count API

Hi,
I am using the scan/scroll API through the Python client to re-index data. The data set is about 90 GB and contains 40 million documents. Because the re-indexing is query based, each query usually returns fewer than 10,000 documents. The index and machine configuration is below.
Elasticsearch version : 1.4.2
No. of primary shards: 8
No. of replica shards: 8
No of total segments: 16
There are two data nodes with 26 GB of RAM and 8 CPU cores each. The cluster also has 3 master nodes and 1 client node.

My problem is that the scan/scroll API is not consistent. About 20% of the time it does not return the complete data for the same query. The same thing happens with the _count API: running the same query repeatedly often returns different counts.

Has anyone faced this issue?

Please let me know if someone can help.

Regards,
Bharvi

Hi Bharvi,
that makes me think that maybe some replicas got out of sync with their primaries at some point. What if you use the search API with search_type count and specify a preference, so that you always hit the same shards? Does the number of results still change from one request to the next?

Also I'm assuming that you aren't indexing while querying (although that would not be a problem with scan/scroll).
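
The check suggested above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: `repeated_counts` and `is_consistent` are hypothetical helper names, `es` is assumed to be an `elasticsearch.Elasticsearch` client from the 1.x Python client, and the response is assumed to carry the hit count in `hits.total` as it does for `search_type=count` in Elasticsearch 1.x.

```python
def repeated_counts(es, index, query, preference="_primary", runs=10):
    """Run the same count-only search several times with a fixed preference.

    `es` is assumed to be an elasticsearch.Elasticsearch client (1.x).
    Returns the list of totals, one per run.
    """
    totals = []
    for _ in range(runs):
        resp = es.search(
            index=index,
            body={"query": query},
            search_type="count",    # count only, no hits are fetched
            preference=preference,  # pin every request to the same shard copies
        )
        totals.append(resp["hits"]["total"])
    return totals


def is_consistent(totals):
    """True when every run returned the same total."""
    return len(set(totals)) <= 1
```

With a real cluster this might be called as `repeated_counts(es, "my_index", {"match_all": {}})`, where `my_index` is a placeholder. If the totals agree with `_primary` but vary without a preference, that would point towards replicas that differ from their primaries.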

Thanks Luca for responding. I haven't tried searching with preference yet.
I will try it soon.
And there is no indexing going on while searching. It's completely static data. I
had indexed it using only primary shards and replicated it to the other node
after the indexing was complete.
Earlier there were 334 segments in that index. But after optimizing there
is one segment per shard. Still no luck.

But using the preference parameter is a very valid point. I will let you know the
results. Thanks again for pointing it out.

You can also check how many docs you have on each shard using the indices stats API, e.g. to figure out whether some replicas are out of sync.
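
One way to do that comparison from Python is sketched below. It assumes the stats response was fetched with shard-level detail, e.g. `es.indices.stats(index="my_index", level="shards")`, and that each shard entry is a list of copies carrying `routing.primary` and `docs.count`, as in Elasticsearch 1.x. `replica_mismatches` is a hypothetical helper name and `my_index` is a placeholder.

```python
def replica_mismatches(stats, index):
    """Find shards whose copies report different document counts.

    `stats` is assumed to be the dict returned by the indices stats API
    with level=shards. Returns {shard_id: [(is_primary, doc_count), ...]}
    for every shard where the primary and replica counts disagree.
    """
    shards = stats["indices"][index]["shards"]
    mismatches = {}
    for shard_id, copies in shards.items():
        counts = [(c["routing"]["primary"], c["docs"]["count"]) for c in copies]
        # More than one distinct count means the copies are out of sync.
        if len({n for _, n in counts}) > 1:
            mismatches[shard_id] = counts
    return mismatches
```

An empty result would suggest that every replica holds the same number of documents as its primary, at least at the level of counts.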

Hi Luca,

I tried everything. I found out that my documents are not distributed
evenly, but there is no replication issue at all.
Here is the document distribution for the index:
Shard 0: 23596919
Shard 1: 23597019
Shard 2: 23593214
Shard 3: 23598522
Shard 4: 15684207
Shard 5: 7294415
Shard 6: 7293062
Shard 7: 7294274
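
To put the skew in numbers, here is a short calculation over the counts listed above. It only restates the posted figures; the variable names are illustrative.

```python
# Per-shard document counts exactly as posted above.
doc_counts = {
    0: 23596919,
    1: 23597019,
    2: 23593214,
    3: 23598522,
    4: 15684207,
    5: 7294415,
    6: 7293062,
    7: 7294274,
}

total = sum(doc_counts.values())
largest = max(doc_counts, key=doc_counts.get)
smallest = min(doc_counts, key=doc_counts.get)
# Ratio between the fullest and the emptiest shard.
skew = doc_counts[largest] / doc_counts[smallest]

for shard in sorted(doc_counts):
    share = 100.0 * doc_counts[shard] / total
    print("Shard %d: %d docs (%.1f%% of total)" % (shard, doc_counts[shard], share))
print("Total: %d, largest/smallest ratio: %.2f" % (total, skew))
```

Shards 0 to 3 each hold roughly three times as many documents as shards 5 to 7.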

I tried all the preference options (nodes, shards, primary), but no
luck. There are enough resources in the cluster,
but the results are still inconsistent.

Regards

Bharvi Dixit
Software Engineer