Issues with scan and scroll as well as count API

d_bharvi · September 10, 2015, 1:18pm

Hi,
I am using the scan/scroll API to re-index data with the Python client. The data totals 90 GB and contains 40 million documents. Because the re-indexing is query-based, I usually get fewer than 10,000 documents per query. Below are the index and machine configurations.
Elasticsearch version : 1.4.2
No. of primary shards: 8
No. of replica shards: 8
Total no. of segments: 16
There are two data nodes with 26 GB of RAM and an 8-core CPU each. The cluster also has 3 master nodes and 1 client node.

My problem is that the scan/scroll API is not consistent at all. About 20% of the time it does not return the complete data for the same query. The same thing happens with the _count API: running the same query to get the document count often returns different results.

Has anyone faced this issue?

Please let me know if someone can help.

Regards,
Bharvi
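
For reference, the scan/scroll loop described above can be sketched roughly as follows. This is only a sketch, not the code from the question: `scan_all` is a hypothetical helper, `es` is assumed to be an `elasticsearch.Elasticsearch` client from elasticsearch-py 1.x, and the scroll timeout and page size are arbitrary. It compares the number of documents actually retrieved with the total reported by the initial request, and adds up any shard failures reported along the way.

```python
def scan_all(es, index, query, scroll="5m", size=100):
    """Hypothetical helper: pull every hit for `query` via scan/scroll.

    Returns (docs, expected_total, failed_shards) so the caller can tell
    an incomplete run from a complete one.
    """
    # With search_type="scan" the first response carries no hits,
    # only the total and a scroll id.
    resp = es.search(index=index, body={"query": query},
                     search_type="scan", scroll=scroll, size=size)
    expected_total = resp["hits"]["total"]
    failed_shards = resp.get("_shards", {}).get("failed", 0)
    scroll_id = resp["_scroll_id"]

    docs = []
    while True:
        page = es.scroll(scroll_id=scroll_id, scroll=scroll)
        failed_shards += page.get("_shards", {}).get("failed", 0)
        hits = page["hits"]["hits"]
        if not hits:
            break  # an empty page means the scroll is exhausted
        docs.extend(hits)
        # Always continue with the most recent scroll id.
        scroll_id = page.get("_scroll_id", scroll_id)
    return docs, expected_total, failed_shards
```

If `len(docs)` is smaller than `expected_total`, or `failed_shards` is non-zero, that run was incomplete and the reported shard failures are the first thing to look at.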

javanna · September 10, 2015, 3:58pm

Hi Bharvi,
that makes me think that some replicas may have gotten out of sync with their primaries at some point. What if you use the search API with search_type=count and specify a preference, so that you always hit the same shards? Does the number of results still change every time?
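
A minimal sketch of that check with the Python client, assuming `es` is an `elasticsearch.Elasticsearch` instance from elasticsearch-py 1.x (the helper name `repeated_counts` and the number of attempts are made up for illustration). It runs the same count-only search several times with a fixed preference and tallies the distinct totals that come back.

```python
from collections import Counter


def repeated_counts(es, index, query, preference="_primary", attempts=10):
    """Hypothetical helper: repeat a count-only search on the same shard copies.

    Returns a Counter mapping (total, failed_shards) -> number of times seen.
    """
    seen = Counter()
    for _ in range(attempts):
        resp = es.search(
            index=index,
            body={"query": query},
            search_type="count",    # ES 1.x: return only the hit total
            preference=preference,  # e.g. "_primary", "_shards:0" or a custom string
        )
        failed = resp.get("_shards", {}).get("failed", 0)
        seen[(resp["hits"]["total"], failed)] += 1
    return seen
```

A consistent cluster should yield a single key. More than one distinct total with the preference pinned would suggest the difference is not between primaries and replicas.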

Also, I'm assuming that you aren't indexing while querying at the moment (although that would not be a problem for scan/scroll).

d_bharvi · September 10, 2015, 6:21pm

Thanks Luca for responding. I haven't tried searching with preference yet. I will try it soon.
And there is no indexing going on while searching. The data is completely static. I indexed it using only the primary shards and replicated it to the other node after indexing was complete.
Earlier there were 334 segments in that index, but after optimizing there is one segment per shard. Still no luck.

But using the preference parameter is a very valid point. I will let you know the results. Thanks again for pointing it out.

javanna · September 10, 2015, 6:27pm

You can also check how many docs you have on each shard, e.g. using the indices stats API, to figure out whether some replicas are out of sync.
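
One way to do that comparison from Python, as a rough sketch: the helper below takes the response of `es.indices.stats(index=..., level="shards")` and reports shards whose copies disagree on `docs.count`. The function name is made up, and the response layout used here (`indices.<index>.shards.<id>` holding one entry per shard copy) is what I would expect from ES 1.x but should be treated as an assumption and checked against a real response.

```python
def out_of_sync_shards(stats, index):
    """Hypothetical helper: find shards whose copies report different doc counts.

    `stats` is assumed to be the dict returned by
    es.indices.stats(index=index, level="shards").
    """
    shards = stats["indices"][index]["shards"]
    mismatched = {}
    for shard_id, copies in shards.items():
        # One entry per copy (primary and replicas) of this shard.
        counts = {copy["docs"]["count"] for copy in copies}
        if len(counts) > 1:
            mismatched[shard_id] = sorted(counts)
    return mismatched
```

An empty result means every replica reports the same document count as its primary.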

d_bharvi · September 11, 2015, 5:44am

Hi Luca,

I tried everything. I found that my documents are not distributed evenly across the shards, but there is no replication issue at all.
Here is the document distribution for the index:
Shard 0: 23596919
Shard 1: 23597019
Shard 2: 23593214
Shard 3: 23598522
Shard 4: 15684207
Shard 5: 7294415
Shard 6: 7293062
Shard 7: 7294274
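
To put numbers on the skew, the counts listed above can be summed and compared with a few lines of Python. This is only illustrative arithmetic on the figures from this post; the variable names are made up.

```python
# Per-shard document counts as listed above.
shard_docs = {
    0: 23596919, 1: 23597019, 2: 23593214, 3: 23598522,
    4: 15684207, 5: 7294415, 6: 7293062, 7: 7294274,
}

total_docs = sum(shard_docs.values())
# Share of all documents held by each shard, as a percentage.
shares = {shard: 100.0 * count / total_docs for shard, count in shard_docs.items()}
# How much larger the biggest shard is than the smallest one.
skew_ratio = max(shard_docs.values()) / float(min(shard_docs.values()))

for shard in sorted(shares):
    print("Shard %d: %.1f%% of documents" % (shard, shares[shard]))
print("Largest/smallest shard ratio: %.2f" % skew_ratio)
```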

I tried all the preference options (nodes, shards, primary), but no luck. There are enough resources in the cluster, but the results are still inconsistent.

Regards

Bharvi Dixit
Software Engineer
596, Udyog Vihar Phase V, Sector 19, Gurgaon 122016, India
Tel: +91 (124) 438 4534 Web: www.grownout.com
