Document count different based on shards


(Greg T) #1

I am seeing an odd issue. When I load a batch of documents to a 1-node ELK server via Filebeat, the count of documents appears to be correct when loading to indexes with 1 shard. But when indexes with 5 shards are created, the document counts are slightly higher. I've verified the counts for the 1 shard indexes to be correct by using Filebeat to write these same documents locally to a text file -- the counts match with what is in ES.

Here are some samples to give a sense of how the counts differ.

 index: logs-2017.03.01 / 1 shard: 19424 / 5 shard: 19488 / diff: 64 
 index: logs-2017.03.02 / 1 shard: 18104 / 5 shard: 18169 / diff: 65 
 index: logs-2017.03.03 / 1 shard: 18468 / 5 shard: 18537 / diff: 69 
 index: logs-2017.03.04 / 1 shard: 16210 / 5 shard: 16271 / diff: 61 
 index: logs-2017.03.05 / 1 shard: 21811 / 5 shard: 21881 / diff: 70 
 index: logs-2017.03.06 / 1 shard: 26363 / 5 shard: 26431 / diff: 68 
 index: logs-2017.03.07 / 1 shard: 24939 / 5 shard: 24998 / diff: 59 
 index: logs-2017.03.08 / 1 shard: 29150 / 5 shard: 29213 / diff: 63 
 index: logs-2017.03.09 / 1 shard: 16790 / 5 shard: 16872 / diff: 82 
 index: logs-2017.03.10 / 1 shard: 17516 / 5 shard: 17585 / diff: 69 
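
For scale, the surplus can also be expressed relative to the 1-shard count. The snippet below is only a quick sketch over the figures listed above; the `surplus` helper is a name made up for illustration.

```python
# (1-shard count, 5-shard count) per daily index, copied from the listing above
counts = {
    "logs-2017.03.01": (19424, 19488),
    "logs-2017.03.02": (18104, 18169),
    "logs-2017.03.03": (18468, 18537),
    "logs-2017.03.04": (16210, 16271),
    "logs-2017.03.05": (21811, 21881),
    "logs-2017.03.06": (26363, 26431),
    "logs-2017.03.07": (24939, 24998),
    "logs-2017.03.08": (29150, 29213),
    "logs-2017.03.09": (16790, 16872),
    "logs-2017.03.10": (17516, 17585),
}


def surplus(one_shard, five_shard):
    """Return the absolute and the relative excess of the 5-shard count."""
    diff = five_shard - one_shard
    return diff, diff / one_shard


for index, (one, five) in sorted(counts.items()):
    diff, ratio = surplus(one, five)
    print("%s  diff=%d  (%.2f%%)" % (index, diff, ratio * 100))
```

On these figures the surplus stays at roughly 0.2% to 0.5% of each index.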

Thoughts? Is there a technical reason why increasing the number of shards would affect counts?

Thanks,
Greg


(Jason Tedor) #2

What version of Elasticsearch and Filebeat?


(Greg T) #3

@jasontedor Elasticsearch 5.3.0-1 and Filebeat 5.3.0. Thanks.


(Jason Tedor) #4

Can you describe a little more here? Are you doing a refresh before counting? How are you counting in Elasticsearch?


(Greg T) #5

@jasontedor No problem. Sorry for not being clearer earlier. I did a number of full runs -- some with 1 shard indexes and one with 5 shard indexes. Between each run I deleted all my indexes in Elasticsearch and deleted the Filebeat registry to cause everything to re-process from scratch. The counts I'm using are taken from the results of curl 'localhost:9200/_cat/indices?v'. Thanks.


(Jason Tedor) #6

No worries at all!

Can you also force a refresh (POST /_refresh) and do a match all search (GET /_search -d '{}') and a count (GET /_count)?

Do you have any fields in your documents that should be unique per document so you could do an aggregation to see which documents are duplicated?
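
As a rough sketch of what such an aggregation could look like: a terms aggregation with `min_doc_count` set to 2 returns only values that occur in more than one document. The field name `event_id` is hypothetical, the helper names are made up for illustration, and the field would need to be mapped as a keyword (not analyzed text) for this to work.

```python
import json


def duplicate_query(field, max_buckets=100):
    """Build a search body that lists values of `field` seen in 2+ documents."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {
            "duplicates": {
                "terms": {
                    "field": field,
                    "min_doc_count": 2,
                    "size": max_buckets,
                }
            }
        },
    }


def duplicated_values(response):
    """Map each duplicated value to its document count, given a search response."""
    buckets = response["aggregations"]["duplicates"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


if __name__ == "__main__":
    # "event_id" is a hypothetical field name; substitute a real unique field.
    print(json.dumps(duplicate_query("event_id"), indent=2))
```

The printed body can be sent to a single index's `_search` endpoint; an empty bucket list would mean no value of that field is duplicated.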


(Greg T) #7

@jasontedor Unfortunately I don't have a field in my data that is unique. (It's log data, and I've checked and there are naturally some dups.) Is there a way to add a unique field (say during Logstash or Filebeat processing)? I can do so if you give me some guidance on how. What I've done so far is the result of hundreds of Google searches, so my understanding is hardly "well rounded" :slight_smile:

Not totally sure what you're asking for with the GET /_search -d '{}'. That seems to bring back a flood of JSON. Can't make much sense of it.

I ran the other statements and here are the results. (I took the liberty of adjusting the count to limit it to my indexes, as I don't want any X-Pack monitoring indexes to possibly be included in the results.)

```
curl -XPOST 'localhost:9200/_refresh'

{"_shards":{"total":330,"successful":165,"failed":0}}

curl 'localhost:9200/logs-*/_count'

{"count":3170434,"_shards":{"total":142,"successful":142,"failed":0}}
```

I'm happy to try and run some more queries. Let me know.

Thanks,
Greg


(Jason Tedor) #8

I'm sorry, my instructions were unclear and incomplete, and this is my fault.

For the search, I would like you just search one of the indices, say logs-2017.03.01; the output should give you the number of hits and we can compare this to the count from cat indices API.

For the count calls, we can do the same, just hit the same index again and compare this too.
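
A minimal sketch of that three-way comparison, using only the Python standard library. It assumes an unauthenticated node at `localhost:9200`; the helper names (`fetch_json`, `collect_counts`, `counts_agree`) are made up for illustration.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node without authentication


def fetch_json(path, method="GET"):
    """Send one request to Elasticsearch and decode the JSON response."""
    request = urllib.request.Request(BASE_URL + path, method=method)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def collect_counts(index, fetch=fetch_json):
    """Refresh one index, then read its document count in three ways."""
    fetch("/%s/_refresh" % index, method="POST")
    search = fetch("/%s/_search?size=0" % index)
    count = fetch("/%s/_count" % index)
    cat = fetch("/_cat/indices/%s?format=json" % index)

    total = search["hits"]["total"]
    if isinstance(total, dict):  # later versions wrap the total in an object
        total = total["value"]

    return {
        "search_hits": total,
        "count": count["count"],
        "cat_docs_count": int(cat[0]["docs.count"]),
    }


def counts_agree(counts):
    """True when all three readings report the same number."""
    return len(set(counts.values())) == 1


if __name__ == "__main__":
    result = collect_counts("logs-2017.03.01")
    print(result, "match" if counts_agree(result) else "MISMATCH")
```

If the three numbers agree after a refresh, the count itself is consistent and any surplus would have to come from extra documents actually being indexed.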

No worries on not having a unique field; for log data that is often the case. I was only hoping we would be lucky here and it might help identify the source of this puzzle.


(Greg T) #9

OK, I see now. Thanks for clarifying.

The data I currently have loaded is loaded in 1 shard indexes, and all counts (search hits, cat indexes API, and count) are the same for logs-2017.03.01: 19424. I'll have to do a reload now with 5 shard indexes and see if there are any differences. I'll let you know. Thanks.


(Jason Tedor) #10

Thanks so much, and sorry; I'm sure it's a pain to go through this, but I'm first trying to get a basic understanding of the situation! If this is reproducible, it gives me hope that we can get to the bottom of this.


(Greg T) #11

Absolutely no need to apologize! I'm just grateful that there are experts out there so willing to help out the muddled masses like me :slight_smile:

I'm at a loss, though. I ran the above tests with 5 shard indexes (for logs-2017.03.01 through logs-2017.03.10). Everything matched exactly. (I even repeated the test to confirm. Worked perfectly.)

I don't get it. I know I wasn't making up the odd results I was seeing earlier. But for some reason I can't seem to repeat it. (I guess it must not have been related to 1 shard vs. 5 shards after all.)

If you have any further thoughts on things to try, I'm open to giving it a shot. But barring that, we should probably consider this "closed" for now.

Again, I appreciate the help!

Greg


(Jason Tedor) #12

My initial thought when I saw your report is that something hiccuped somewhere and documents were retried; with auto-generated IDs we would expect to see duplicate documents in a situation like this. Where that hiccup would have been, I do not know.
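
To illustrate the mechanism with a toy model (this is not how Filebeat or Elasticsearch is implemented): when a resent event gets a fresh random ID it is stored a second time, whereas an ID derived from the event's content makes the resend overwrite the first copy. The `source`, `offset`, and `message` field names are assumed here, and `index_event` is an in-memory stand-in for an index request.

```python
import hashlib
import uuid


def content_id(event):
    """Derive a stable ID from the file, the position in it, and the line text."""
    key = "%s|%s|%s" % (event["source"], event["offset"], event["message"])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def index_event(store, event, doc_id=None):
    """Toy stand-in for indexing: a dict keyed by document ID."""
    if doc_id is None:
        doc_id = uuid.uuid4().hex  # mimics an auto-generated ID
    store[doc_id] = event
    return doc_id


def simulate_retry(events, use_content_id):
    """Index all events, resend the first one, and return the stored count."""
    store = {}
    for event in events + events[:1]:  # the extra item models one retried send
        doc_id = content_id(event) if use_content_id else None
        index_event(store, event, doc_id)
    return len(store)


if __name__ == "__main__":
    # Two identical log lines at different offsets are legitimate repeats.
    events = [
        {"source": "/var/log/app.log", "offset": 0, "message": "connection reset"},
        {"source": "/var/log/app.log", "offset": 17, "message": "connection reset"},
    ]
    print("auto-generated IDs:", simulate_retry(events, use_content_id=False))
    print("content-derived IDs:", simulate_retry(events, use_content_id=True))
```

Including the file and offset in the key keeps naturally repeated log lines as separate documents while still collapsing a genuine resend of the same line.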

If this does reproduce again, please feel free to ping and we will try to diagnose it.


(Greg T) #13

OK, will do. Thanks, again, for all your help.

Greg


(Jason Tedor) #14

You're welcome.


(system) #15

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.