Why does ttf have different values for same term in same index?


(David Steiner) #1

Using the termvectors endpoint, I get different ttf values for the same term, depending on which document I request.

For doc #1, I get:
"death": {
"doc_freq": 2069,
"ttf": 14447,
"term_freq": 42

Document #2 in the list is:
"death": {
"doc_freq": 1961,
"ttf": 12227,
"term_freq": 14

And for yet another document, I get:
"death": {
"doc_freq": 1989,
"ttf": 12851,
"term_freq": 8,

So, why isn’t ttf the same for every document in a given index?

I'd like to find out if there's a way to get a count for a term within a document and a count for a term within the index, so if termvectors won't do that, is there something that will?
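
For reference, here is a minimal Python sketch of the request and response involved. It assumes the `term_statistics` option of the termvectors API is switched on (that is what adds `doc_freq` and `ttf` to each term). The field name `text` is made up, since the post doesn't name the field. The sketch only builds a request body and reads a trimmed response; it doesn't talk to a cluster.

```python
import json


def build_termvectors_body(field):
    """Request body asking for per-term statistics on one field."""
    return {
        "fields": [field],
        "term_statistics": True,   # adds doc_freq and ttf to each term
        "field_statistics": True,  # adds doc_count, sum_doc_freq, sum_ttf
        "positions": False,
        "offsets": False,
    }


def term_counts(response, field, term):
    """Pull the three counters for `term` out of a termvectors response."""
    entry = response["term_vectors"][field]["terms"][term]
    return {
        "term_freq": entry["term_freq"],    # occurrences in this document
        "doc_freq": entry.get("doc_freq"),  # docs containing the term, as seen by the shard
        "ttf": entry.get("ttf"),            # total occurrences, as seen by the shard
    }


# Trimmed response shaped like the first fragment in this post.
# "text" is a hypothetical field name.
sample = {
    "term_vectors": {
        "text": {
            "terms": {
                "death": {"doc_freq": 2069, "ttf": 14447, "term_freq": 42}
            }
        }
    }
}

print(json.dumps(build_termvectors_body("text")))
print(term_counts(sample, "text", "death"))
```

Here `term_freq` is the count within the document, while `doc_freq` and `ttf` are whatever the shard answering the request knows about.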

Thanks!


(Alexander Reelsen) #2

Hey,

Is it possible that those documents live on different shards? Have you tried those numbers on an index with one shard? Do they differ there?

Keep in mind that each shard is its own Lucene index and an index is just a collection of shards.
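
To illustrate the point about shards, here is a toy model in Python. It is not how Elasticsearch routes or stores documents (the modulo routing and the four sample documents are invented for the example). It only shows that when statistics are kept per shard, the ttf reported for a term depends on which shard holds the document you ask about.

```python
from collections import Counter


def shard_for(doc_id, num_shards):
    # Toy routing. Elasticsearch hashes a routing value instead, but any
    # fixed document-to-shard mapping shows the same effect.
    return doc_id % num_shards


def shard_stats(docs, num_shards):
    """Return one (doc_freq, ttf) Counter pair per shard."""
    stats = [(Counter(), Counter()) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        doc_freq, ttf = stats[shard_for(doc_id, num_shards)]
        tokens = text.lower().split()
        ttf.update(tokens)            # every occurrence counts
        doc_freq.update(set(tokens))  # each document counts once per term
    return stats


def ttf_seen_by(doc_id, term, docs, num_shards):
    """ttf as a shard-local lookup would report it for this document."""
    _, ttf = shard_stats(docs, num_shards)[shard_for(doc_id, num_shards)]
    return ttf[term]


docs = {
    0: "death and taxes",
    1: "death death death",
    2: "life after death",
    3: "no mention here",
}

two_shards = [ttf_seen_by(i, "death", docs, 2) for i in (0, 1)]
one_shard = [ttf_seen_by(i, "death", docs, 1) for i in (0, 1)]
print(two_shards)  # [2, 3]
print(one_shard)   # [5, 5]
```

With two shards the two documents report different totals; with one shard they agree.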

--Alex


(David Steiner) #3

It's possible, there appear to be 5 shards (0, 1, 2, 3 and 4). I do not have any indexes that have not been created this way - I didn't do anything specific to specify the number of shards when I created them, so that must be some kind of default.
Is there a way to get a count across all shards?


(Alexander Reelsen) #4

Yes, 5 shards and one replica is the default.

There is no way to get a count across all shards. Also keep in mind that deleted documents are still included in the term vector numbers.

--Alex


(David Steiner) #5

OK. I haven't deleted any documents, but that is definitely interesting. Thanks.


(David Steiner) #6

I created an index with 1 shard, then used _reindex to populate it. Now the _termvectors interface tells me that there are 489760 documents, when I know that there are only 424,883. So, I guess this is a result of using the _reindex interface? And I guess that the extra 64877 documents are actually "deleted" documents, since the total count from a "Discover" search for "*" on the index in Kibana shows the correct number: 424,883.
Am I now "stuck" with these extra document counts (and term counts)?


(Alexander Reelsen) #7

Hey,

First, use one of the cat APIs (not the count API; the indices API is a good candidate) to check your assumptions.

You could then run the forcemerge API to expunge deletes (note that this is not meant to be run with every index operation or all the time, hence its name).
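
The two calls above can be sketched as request paths. This Python snippet only builds the strings; it assumes the cat indices columns `docs.count` and `docs.deleted` and the `only_expunge_deletes` flag of the force merge API, and `my-index` is a placeholder name.

```python
from urllib.parse import urlencode


def cat_indices_path(index):
    # v adds the header row; h picks the columns to print
    query = urlencode(
        {"v": "true", "h": "index,pri,docs.count,docs.deleted"}, safe=","
    )
    return f"/_cat/indices/{index}?{query}"


def forcemerge_path(index):
    # only_expunge_deletes limits the merge to reclaiming deleted documents
    query = urlencode({"only_expunge_deletes": "true"})
    return f"/{index}/_forcemerge?{query}"


print("GET", cat_indices_path("my-index"))
print("POST", forcemerge_path("my-index"))
```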

--Alex


(David Steiner) #8

Hey Alex,

Indeed there were deleted documents, even though there weren't any in the index before I reindexed from 5 shards to 1:
health status index     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   index1    MbWtduzaQnusompMjG0Urg   5   1     424883            0      3.7gb          3.7gb
yellow open   index1_v1 t987tbkFSWOeORF5u62SSw   1   1     424883        64877      4.2gb          4.2gb

However, there appear to be 1616 deleted documents that won't go away with the forcemerge:
POST /index1_v1/_forcemerge?only_expunge_deletes=true

health status index     uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   index1    MbWtduzaQnusompMjG0Urg   5   1     424883            0      3.7gb          3.7gb
yellow open   index1_v1 t987tbkFSWOeORF5u62SSw   1   1     424883         1616      3.7gb          3.7gb
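
As a cross-check of these numbers, here is a small Python sketch that parses the two listings and adds `docs.count` to `docs.deleted`. It assumes the whitespace-separated layout shown here and nothing else about the cat API. Before the forcemerge the sum for index1_v1 matches the 489760 documents that _termvectors reported earlier.

```python
def parse_cat_indices(text):
    """Parse whitespace-separated cat indices output (with header) into dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def docs_including_deleted(row):
    # Live documents plus documents marked deleted but not yet merged away
    return int(row["docs.count"]) + int(row["docs.deleted"])


before = """
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open index1 MbWtduzaQnusompMjG0Urg 5 1 424883 0 3.7gb 3.7gb
yellow open index1_v1 t987tbkFSWOeORF5u62SSw 1 1 424883 64877 4.2gb 4.2gb
"""

after = """
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open index1 MbWtduzaQnusompMjG0Urg 5 1 424883 0 3.7gb 3.7gb
yellow open index1_v1 t987tbkFSWOeORF5u62SSw 1 1 424883 1616 3.7gb 3.7gb
"""

totals = {}
for label, text in (("before", before), ("after", after)):
    for row in parse_cat_indices(text):
        totals[(label, row["index"])] = docs_including_deleted(row)
        print(label, row["index"], totals[(label, row["index"])])
# before index1 424883
# before index1_v1 489760
# after index1 424883
# after index1_v1 426499
```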

Is there a reason why deleted documents wouldn't be deleted by forcemerge?
Thanks,
David

