Word count/frequency per field


(M. Alsioufi) #1

Hi there,
Is there a convenient way to get the count of words/tokens in certain fields of a document?

For example:
curl -XPUT 'http://localhost:9200/twitter/tweet/1?pretty=true' -d '{
  "text1" : "twitter, test, test, test ",
  "text2" : "test, test, man, two "
}'

The word counts I need from that document would be something like:
"text1": {
  "twitter": 1,
  "test": 3
},
"text2": {
  "test": 2,
  "man": 1,
  "two": 1
}
or something similar

I know I can use the _termvector API, but I could not really understand how it can help me here.

Thank you


#2

Hey,

The _termvector API is the best way to access term statistics in Elasticsearch after your data has been indexed.

If you want to get the length of a field in tokens, you can use the token_count type in your mapping.
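As a sketch of that second option, reusing the twitter/tweet names from your example (the "length" sub-field name is just an illustration), a token_count multi-field could be mapped like this:

```shell
# Hypothetical mapping: "text1.length" stores the number of tokens the
# standard analyzer produces for "text1" at index time.
curl -XPUT 'http://localhost:9200/twitter' -d '{
  "mappings": {
    "tweet": {
      "properties": {
        "text1": {
          "type": "string",
          "fields": {
            "length": {
              "type": "token_count",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}'
```

After indexing, "text1.length" can be queried and aggregated like any numeric field, but note it gives you the total token count per field, not per-term frequencies.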

What problem are you trying to solve?


(M. Alsioufi) #3

Thanks for your answer.

I have tried the _termvectors API and it does almost what I expect. However, I have to run it on one specific document; I could not run it on my entire index. Following the same example I showed above, my request looks like:
GET my_index/doc/someid123/_termvectors
{
  "fields": ["text1"]
}
and the reply I get is:
{
  "_index": "my_index",
  "_type": "doc",
  "_id": "someid123",
  "_version": 2,
  "found": true,
  "took": 1,
  "term_vectors": {
    "text1": {
      "field_statistics": {
        "sum_doc_freq": 56,
        "doc_count": 54,
        "sum_ttf": 60
      },
      "terms": {
        "test": {
          "term_freq": 3,
          "tokens": [
            { "position": 0, "start_offset": 9,  "end_offset": 12 },
            { "position": 1, "start_offset": 15, "end_offset": 19 },
            { "position": 2, "start_offset": 21, "end_offset": 25 }
          ]
        },
        "twitter": {
          "term_freq": 1,
          "tokens": [
            { "position": 0, "start_offset": 0, "end_offset": 6 }
          ]
        }
      }
    }
  }
}

What I want is to get this functionality across my entire index, not just a single document, and to be able to show this data in a Kibana visualization.
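For anyone landing on this thread with the same need: index-wide per-term counts are normally computed with a terms aggregation rather than _termvectors, and a terms aggregation is also what Kibana's data table and bar chart visualizations are built on. A sketch, reusing the my_index and text1 names from above and assuming the analyzed field is aggregatable in your Elasticsearch version:

```shell
# Hypothetical query: top terms in "text1" across the whole index.
# Caveat: a terms aggregation reports doc_count (how many documents
# contain each term), not the summed term_freq that _termvectors shows.
curl -XGET 'http://localhost:9200/my_index/_search?pretty=true' -d '{
  "size": 0,
  "aggs": {
    "word_counts": {
      "terms": {
        "field": "text1",
        "size": 10
      }
    }
  }
}'
```

In Kibana this corresponds to a visualization with a "Terms" bucket on the text1 field.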