Document count different based on shards

gtorrance · April 28, 2017, 2:34pm

I am seeing an odd issue. When I load a batch of documents to a 1-node ELK server via Filebeat, the count of documents appears to be correct when loading to indexes with 1 shard. But when indexes with 5 shards are created, the document counts are slightly higher. I've verified the counts for the 1 shard indexes to be correct by using Filebeat to write these same documents locally to a text file -- the counts match with what is in ES.

Here are some samples to give a sense of how the counts differ.

 index: logs-2017.03.01 / 1 shard: 19424 / 5 shard: 19488 / diff: 64 
 index: logs-2017.03.02 / 1 shard: 18104 / 5 shard: 18169 / diff: 65 
 index: logs-2017.03.03 / 1 shard: 18468 / 5 shard: 18537 / diff: 69 
 index: logs-2017.03.04 / 1 shard: 16210 / 5 shard: 16271 / diff: 61 
 index: logs-2017.03.05 / 1 shard: 21811 / 5 shard: 21881 / diff: 70 
 index: logs-2017.03.06 / 1 shard: 26363 / 5 shard: 26431 / diff: 68 
 index: logs-2017.03.07 / 1 shard: 24939 / 5 shard: 24998 / diff: 59 
 index: logs-2017.03.08 / 1 shard: 29150 / 5 shard: 29213 / diff: 63 
 index: logs-2017.03.09 / 1 shard: 16790 / 5 shard: 16872 / diff: 82 
 index: logs-2017.03.10 / 1 shard: 17516 / 5 shard: 17585 / diff: 69

Thoughts? Is there a technical reason why increasing the number of shards would affect counts?

Thanks,
Greg

jasontedor · April 28, 2017, 3:30pm

What version of Elasticsearch and Filebeat?

gtorrance · April 28, 2017, 3:35pm

@jasontedor Elasticsearch 5.3.0-1 and Filebeat 5.3.0. Thanks.

jasontedor · April 28, 2017, 3:39pm

Can you describe a little more here? Are you doing a refresh before counting? How are you counting in Elasticsearch?

gtorrance · April 28, 2017, 3:45pm

@jasontedor No problem. Sorry for not being clearer earlier. I did a number of full runs -- some with 1 shard indexes and one with 5 shard indexes. Between each run I deleted all my indexes in Elasticsearch and deleted the Filebeat registry to cause everything to re-process from scratch. The counts I'm using are taken from the results of curl 'localhost:9200/_cat/indices?v'. Thanks.

jasontedor · April 28, 2017, 3:53pm

No worries at all!

Can you also force a refresh (POST /_refresh) and do a match all search (GET /_search -d '{}') and a count (GET /_count)?

Do you have any fields in your documents that should be unique per document so you could do an aggregation to see which documents are duplicated?

gtorrance · April 28, 2017, 4:23pm

@jasontedor Unfortunately I don't have a field in my data that is unique. (It's log data, and I've checked and there are naturally some dups.) Is there a way to add a unique field (say during Logstash or Filebeat processing)? I can do so if you give me some guidance on how. What I've done so far is the result of hundreds of Google searches, so my understanding is hardly "well rounded".

Not totally sure what you're asking for with the GET /_search -d '{}'. That seems to bring back a flood of JSON. I can't make much sense of it.

I ran the other statements and here are the results. (I took the liberty of adjusting the count to limit it to my indexes, as I don't want any X-Pack monitoring indexes to possibly be included in the results.)

```
curl -XPOST 'localhost:9200/_refresh'
{"_shards":{"total":330,"successful":165,"failed":0}}

curl 'localhost:9200/logs-*/_count'
{"count":3170434,"_shards":{"total":142,"successful":142,"failed":0}}
```

I'm happy to try and run some more queries. Let me know.

Thanks,
Greg

jasontedor · April 28, 2017, 4:31pm

I'm sorry, my instructions were unclear and incomplete, and this is my fault.

For the search, I would like you to search just one of the indices, say logs-2017.03.01; the output should give you the number of hits, and we can compare this to the count from the cat indices API.

For the count call, we can do the same: hit the same index and compare that too.

No worries on not having a unique field; for log data that is often the case. I was only hoping we would be lucky here and it might help identify the source of this puzzle.
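The per-index comparison described above could be scripted roughly as follows. This is only a sketch, assuming Elasticsearch 5.x on localhost:9200 (where `hits.total` is a plain integer); `ES_URL`, `json_int`, and the other function names are hypothetical helpers, not part of any Elasticsearch tooling.

```shell
#!/bin/sh
# Sketch only: ES_URL and all function names here are hypothetical.
ES_URL="${ES_URL:-localhost:9200}"

# Print the integer that follows the last "<key>": on the first matching line.
# Crude sed-based parsing; adequate for small flat responses, not general JSON.
json_int() {
  sed -n "s/.*\"$1\":\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Make recently indexed documents visible to search before counting.
refresh_index() {
  curl -s -XPOST "$ES_URL/$1/_refresh" >/dev/null
}

# docs.count column from the cat indices API.
count_via_cat() {
  curl -s "$ES_URL/_cat/indices/$1?h=docs.count" | tr -d ' \n'
}

# Total hits of a match-all search; size=0 suppresses the flood of documents.
count_via_search() {
  curl -s "$ES_URL/$1/_search?size=0&filter_path=hits.total" | json_int total
}

# Result of the count API.
count_via_count_api() {
  curl -s "$ES_URL/$1/_count" | json_int count
}

# Print all three numbers for one index; succeed only if they agree.
compare_counts() {
  refresh_index "$1"
  a=$(count_via_cat "$1")
  b=$(count_via_search "$1")
  c=$(count_via_count_api "$1")
  echo "$1 cat=$a search=$b count=$c"
  [ "$a" = "$b" ] && [ "$b" = "$c" ]
}
```

For example, `compare_counts logs-2017.03.01` prints the three numbers side by side, so any mismatch between the cat, search, and count APIs is visible at a glance.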

gtorrance · April 28, 2017, 4:40pm

OK, I see now. Thanks for clarifying.

The data I currently have loaded is loaded in 1 shard indexes, and all counts (search hits, cat indexes API, and count) are the same for logs-2017.03.01: 19424. I'll have to do a reload now with 5 shard indexes and see if there are any differences. I'll let you know. Thanks.

jasontedor · April 28, 2017, 4:54pm

Thanks so much and sorry, I'm sure it's a pain to go through this, but I'm first trying to get a basic understanding of the situation! If this is reproducible it gives me hope that we can get to the bottom of this.

gtorrance · April 28, 2017, 5:19pm

Absolutely no need to apologize! I'm just grateful that there are experts out there so willing to help out the muddled masses like me.

I'm at a loss, though. I ran the above tests with 5 shard indexes (for logs-2017.03.01 through logs-2017.03.10). Everything matched exactly. (I even repeated the test to confirm. Worked perfectly.)

I don't get it. I know I wasn't making up the odd results I was seeing earlier. But for some reason I can't seem to repeat it. (I guess it must not have been related to 1 shard vs. 5 shards after all.)

If you have any further thoughts on things to try, I'm open to giving it a shot. But barring that, we should probably consider this "closed" for now.

Again, I appreciate the help!

Greg

jasontedor · April 28, 2017, 6:16pm

My initial thought when I saw your report is that something hiccuped somewhere and documents were retried; with auto-generated IDs we would expect to see duplicate documents in a situation like this. Where that hiccup would have been, I do not know.

If this does reproduce again, please feel free to ping me and we will try to diagnose it.
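To illustrate the retry point: with an auto-generated ID, resending the same document creates a second copy, whereas with an explicit ID the resend overwrites the first. The sketch below is an assumption-laden illustration, not how Filebeat or Logstash do it: the type name `log`, the function names, and the idea of deriving the ID from a file path and byte offset (Filebeat's `source` and `offset` fields are assumed here) are all hypothetical. In a real Filebeat/Logstash pipeline the ID would have to be set in the pipeline configuration rather than with curl.

```shell
#!/bin/sh
# Sketch only: ES_URL, the "log" type name, and all function names are hypothetical.
ES_URL="${ES_URL:-localhost:9200}"

# Derive a deterministic document ID from where the line came from
# (file path + byte offset), so naturally duplicated log lines stay distinct
# while a retried send of the same line maps to the same ID.
doc_id() {
  # $1 = source file path, $2 = byte offset
  if command -v sha1sum >/dev/null 2>&1; then
    printf '%s:%s' "$1" "$2" | sha1sum | cut -d' ' -f1
  else
    printf '%s:%s' "$1" "$2" | shasum | cut -d' ' -f1
  fi
}

# Auto-generated ID: every call, including a retry, adds a new document.
index_auto_id() {
  # $1 = index, $2 = JSON body
  curl -s -XPOST "$ES_URL/$1/log" -H 'Content-Type: application/json' -d "$2"
}

# Explicit ID: a retry with the same ID replaces the existing document
# instead of adding a duplicate.
index_with_id() {
  # $1 = index, $2 = document ID, $3 = JSON body
  curl -s -XPUT "$ES_URL/$1/log/$2" -H 'Content-Type: application/json' -d "$3"
}
```

Calling `index_with_id logs-2017.03.01 "$(doc_id /var/log/app.log 1024)" '{"message":"example"}'` twice should leave one document in the index, while calling `index_auto_id` twice with the same body leaves two.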

gtorrance · May 1, 2017, 10:51am

OK, will do. Thanks, again, for all your help.

Greg

jasontedor · May 1, 2017, 11:29am

You're welcome.

system · May 29, 2017, 11:33am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
