Count words/tokens in a field in a document

Hi there,
Is there a convenient way to get the number of words/tokens in each field of a document?

I know I can use termvector and sum the term frequencies in each field to get this number, but I was wondering whether there is a faster way to do it.

For example, given this document:
curl -XPUT 'http://localhost:9200/twitter/tweet/1?pretty=true' -d '{
"fullname" : "John Doe",
"text" : "twitter test test test "
}'

The word counts I need from that document are:
fullname: 2
text: 4
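
The termvector approach mentioned above can be sketched roughly as follows. This is only a sketch, assuming the response has the usual term vectors shape (term_vectors -> field -> terms -> term -> term_freq). The helper name count_tokens_per_field and the hard-coded sample_response are hypothetical; the sample stands in for what the API might return for the example document with an analyzer that lowercases and removes nothing.

```python
def count_tokens_per_field(termvectors_response):
    """Sum term_freq over all terms of each field in a term vectors response."""
    counts = {}
    for field, vector in termvectors_response.get("term_vectors", {}).items():
        terms = vector.get("terms", {})
        counts[field] = sum(info.get("term_freq", 0) for info in terms.values())
    return counts


# Hypothetical response for the example document (not real server output).
sample_response = {
    "term_vectors": {
        "fullname": {
            "terms": {
                "john": {"term_freq": 1},
                "doe": {"term_freq": 1},
            }
        },
        "text": {
            "terms": {
                "twitter": {"term_freq": 1},
                "test": {"term_freq": 3},
            }
        },
    }
}

print(count_tokens_per_field(sample_response))  # {'fullname': 2, 'text': 4}
```

Since term vectors are built from the analyzed tokens, terms dropped by a stop filter should not appear in the response and so would not be counted, but this costs one request and one pass over the terms per document.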

One more thing: I need the total number of words that are actually stored (i.e. after filtering out the stopwords).

Thank you

Hi,

Have you looked at the token_count datatype? It looks like it might do what you are trying to do. To retrieve the values calculated for the count field, you might need to set "store" : true on it. If you follow the example in the reference, you can then retrieve the values with

GET my_index/_search?stored_fields=name.length

It also seems to support analyzers.
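
For illustration, here is a rough sketch of the mapping fragment this suggestion implies, built as a Python dict so it can be serialized into a request body. The field name "name" and sub-field "length" follow the query above. The helper token_count_properties is hypothetical, and how the fragment is wrapped (for example, under a mapping type name) depends on the Elasticsearch version, so treat the outer structure as an assumption.

```python
import json


def token_count_properties(field, analyzer="standard", store=True):
    """Build a 'properties' fragment with a token_count sub-field named 'length'."""
    return {
        field: {
            "type": "text",
            "fields": {
                "length": {
                    "type": "token_count",
                    "analyzer": analyzer,  # analyzer used to produce the tokens to count
                    "store": store,        # needed to fetch it via stored_fields
                }
            },
        }
    }


properties = token_count_properties("name")
# The body would go into the index mapping; the exact wrapping is version-dependent.
print(json.dumps({"properties": properties}, indent=2))
```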

Hi Christoph,
Thank you for your response. Yes, I tried token_count and it works, but it counts the original words, not the words left after the stopwords are filtered out.

Any idea how to get the count of words after removing stopwords?

Thank you
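
One possible direction, offered as an assumption rather than a confirmed answer: depending on the Elasticsearch version, token_count accepts an enable_position_increments parameter, and setting it to false is documented as not counting tokens removed by analyzer filters such as stop. The sketch below shows such a sub-field definition next to a plain Python approximation of the expected count. The function count_tokens_after_stopwords and its regex tokenizer are hypothetical and are not the Elasticsearch analyzer.

```python
import re


def count_tokens_after_stopwords(text, stopwords):
    """Approximate the token count left after lowercasing and dropping stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for token in tokens if token not in stopwords)


# Hypothetical sub-field definition; check that your version supports
# enable_position_increments before relying on it.
length_subfield = {
    "type": "token_count",
    "analyzer": "stop",                   # built-in analyzer that removes English stopwords
    "enable_position_increments": False,  # assumption: skips removed tokens when counting
    "store": True,
}

stopwords = {"this", "is", "a", "the"}
print(count_tokens_after_stopwords("twitter test test test ", stopwords))  # 4
print(count_tokens_after_stopwords("This is a test", stopwords))           # 1
```

Comparing the locally computed number with the stored name.length value for a few documents is a quick way to check whether the mapping behaves as hoped.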
