Elasticsearch term vector analysis over an entire field


(Matthew J Purcell) #1

Hello, I'm writing this question in the hope of getting some clarity on term vector analysis. I am looking to get a count of the unique tokens in a text/keyword field. For example, take the _termvectors query below:

GET /twitter/tweet/1/_termvectors
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

The response I receive is a count of the tokens within that one document. What I would like instead is a count of the distinct tokens over an entire field in the index, not a single document. So instead of this response:

{
  "_id": "1",
  "_index": "twitter",
  "_type": "tweet",
  "_version": 1,
  "found": true,
  "took": 6,
  "term_vectors": {
    "text": {
      "field_statistics": {
        "doc_count": 2,
        "sum_doc_freq": 6,
        "sum_ttf": 8
      },
      "terms": {
        "test": {
          "doc_freq": 2,
          "term_freq": 3,
          "tokens": [
            {
              "end_offset": 12,
              "payload": "d29yZA==",
              "position": 1,
              "start_offset": 8
            },
            {
              "end_offset": 17,
              "payload": "d29yZA==",
              "position": 2,
              "start_offset": 13
            },
            {
              "end_offset": 22,
              "payload": "d29yZA==",
              "position": 3,
              "start_offset": 18
            }
          ],
          "ttf": 4
        },
        "twitter": {
          "doc_freq": 2,
          "term_freq": 1,
          "tokens": [
            {
              "end_offset": 7,
              "payload": "d29yZA==",
              "position": 0,
              "start_offset": 0
            }
          ],
          "ttf": 2
        }
      }
    }
  }
}
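
To make the per-document scope concrete, here is a small Python sketch that reads the counts out of a response shaped like the one above. The function name summarize_term_vectors and the trimmed sample dict are made up for illustration; only the keys that appear in the response are used.

```python
def summarize_term_vectors(response: dict, field: str) -> dict:
    """Summarize the term vectors of ONE document for the given field."""
    terms = response["term_vectors"][field]["terms"]
    return {
        # number of distinct tokens in this document
        "unique_terms": len(terms),
        # total number of token occurrences in this document
        "total_tokens": sum(t["term_freq"] for t in terms.values()),
    }


# Trimmed copy of the response above (token positions and offsets left out).
sample = {
    "term_vectors": {
        "text": {
            "terms": {
                "test": {"doc_freq": 2, "term_freq": 3, "ttf": 4},
                "twitter": {"doc_freq": 2, "term_freq": 1, "ttf": 2},
            }
        }
    }
}

print(summarize_term_vectors(sample, "text"))
# {'unique_terms': 2, 'total_tokens': 4}
```

Both numbers describe document 1 only, which is exactly the limitation I am trying to get around.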

I would like just term counts over an entire index. In trying to figure this out I have loaded a field as both text and keyword (it's a list of addresses). What I'm looking for is a count of all unique terms within this Address field. My hope is the response from ES would be something like this:

{
  "index" : "addresses",
  "type" : "by_your_house",
  "terms" : {
    "ROAD" : {},
    "STREET" : {},
    "LANE" : {}
  }
}

I have tried using Kibana for this task, but it will not split up the address terms correctly. Instead, it shows aggregations of entire street addresses that are common. So instead of the above, what I see in Kibana is:

{
  "1007 Mountain Drive" : 99,
  "20 Ingram Street" : 55,
  "1938 Sullivan Lane" : 11
}

Thanks for any/all help with this.

  • Matt

(Ivan Brusic) #2

The only way to get the term vectors over the entire index is to use an
aggregation over the analyzed field. If you are on a "modern" version of
Elasticsearch, you would need to enable fielddata on the field for
aggregations to work:
https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html

Field data is expensive, so it is not something to use lightly.

If you just need the number of terms, without their values, you can use the
cardinality or value count aggregations:
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/search-aggregations-metrics-cardinality-aggregation.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-valuecount-aggregation.html
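
As a rough sketch of the two steps above, the Python below only builds the request bodies and prints them as JSON. The helper names, the aggregation name unique_terms, and the field name Address are placeholders, and the exact put-mapping endpoint depends on your Elasticsearch version, so it is left out. Note that the cardinality aggregation returns an approximate distinct count.

```python
import json


def fielddata_mapping(field: str) -> dict:
    """Body for a put-mapping request that enables fielddata on a text field."""
    return {"properties": {field: {"type": "text", "fielddata": True}}}


def cardinality_query(field: str) -> dict:
    """Search body asking only for the (approximate) number of distinct terms."""
    return {
        "size": 0,  # return no hits, only the aggregation result
        "aggs": {"unique_terms": {"cardinality": {"field": field}}},
    }


# "Address" is a placeholder field name for illustration.
print(json.dumps(fielddata_mapping("Address"), indent=2))
print(json.dumps(cardinality_query("Address"), indent=2))
```

The printed bodies can be pasted into a put-mapping request and a _search request respectively.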


(Matthew J Purcell) #3

Great, so this brings me about halfway there! I ran a simple aggs query after setting fielddata: true on the City field. The query was as follows:

GET /my_index/_search
{
  "size" : 0,
  "aggs" : {
    "states" : {
      "terms" : {
        "field" : "City"
      }
    }
  }
}

and the response:

  },
  "aggregations": {
    "states": {
      "doc_count_error_upper_bound": 117547,
      "sum_other_doc_count": 28080179,
      "buckets": [
        {
          "key": "san",
          "doc_count": 427575
        },
        {
          "key": "new",
          "doc_count": 414354
        },
        {
          "key": "chicago",
          "doc_count": 265567
        },
        {
          "key": "lake",
          "doc_count": 245682
        },
        {
          "key": "beach",
          "doc_count": 219230
        },
        {
          "key": "park",
          "doc_count": 215134
        },
        {
          "key": "york",
          "doc_count": 208854
        },
        {
          "key": "west",
          "doc_count": 188746
        },
        {
          "key": "portland",
          "doc_count": 188657
        },
        {
          "key": "fort",
          "doc_count": 183512
        }
      ]
    }
  }
}

What I would like is to get all the results, not just the first 10. Visualizing this in Kibana would probably work, but an aggs query might work just as well for finding the terms and their counts that I'm looking for. Is there a best way to go about this? Thanks for all your help.


(Ivan Brusic) #4

Bucket aggregations have a size parameter that controls the number of
results returned. If it is not set, the default is 10. Depending on the size
of your corpus, you might have thousands of unique tokens. Since even the
tenth bucket has a doc_count of about 183K, the number will probably be high.
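
A minimal sketch of the same query with the size parameter set, written as a Python helper that builds the search body. The helper name terms_query and the value 1000 are arbitrary choices for illustration; the aggregation name states and the field City are taken from the query earlier in the thread.

```python
import json


def terms_query(field: str, size: int) -> dict:
    """Search body for a terms aggregation returning up to `size` buckets."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    return {
        "size": 0,  # top-level size: return no hits
        "aggs": {
            "states": {
                # aggregation-level size: number of buckets (10 if omitted)
                "terms": {"field": field, "size": size}
            }
        },
    }


# 1000 is an arbitrary example value, not a recommendation.
print(json.dumps(terms_query("City", 1000), indent=2))
```

Very large size values make the aggregation more expensive, so it is worth checking the cardinality first to see how many buckets to expect.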

One thing you might have noticed is that the terms are the analyzed tokens,
which might be a limiting factor if you need the original text.
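
To illustrate the difference, here is a pure-Python approximation (no Elasticsearch involved) of what a terms aggregation counts on an analyzed text field versus an unanalyzed keyword field. The tokenizer below is a crude stand-in that lowercases and splits on non-alphanumeric characters; the real standard analyzer is more involved, and the sample city names are invented.

```python
import re
from collections import Counter


def analyzed_tokens(value: str) -> set:
    """Crude stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return {tok for tok in re.split(r"[^0-9a-z]+", value.lower()) if tok}


def doc_counts(values, analyzed: bool) -> Counter:
    """Count, per key, how many documents contain it (like doc_count)."""
    counts = Counter()
    for value in values:
        # analyzed text field: one key per distinct token in the document
        # keyword field: the whole original value is the only key
        keys = analyzed_tokens(value) if analyzed else {value}
        counts.update(keys)
    return counts


cities = ["San Diego", "San Jose", "New York"]  # invented sample data
print(doc_counts(cities, analyzed=True))
print(doc_counts(cities, analyzed=False))
```

With the analyzed variant, "san" is counted in two documents and the original strings are gone; with the keyword variant, each full city name is its own bucket, which is the behaviour seen in Kibana earlier in the thread.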

Cheers,

Ivan

