Term Vector With BiGrams


(Vishva Deepak Tewari) #1

Hi,
I have a document that contains a subdocument, so I have mapped the subdocument as a nested property of the document. Now I need to find term vectors for the subdocument. My terms can be unigrams or bigrams, so I created an analyzer with a shingle filter. The settings for the index are as follows:

{
  "settings": {
    "analysis": {
      "filter": {
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 3,
          "min_shingle_size": 2,
          "output_unigrams": "true",
          "filler_token": ""
        }
      },
      "analyzer": {
        "keyword_discovery_analyzer": {
          "tokenizer": "standard",
          "char_filter": [ "html_strip" ],
          "filter": [
            "lowercase",
            "filter_shingle",
            "light_english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "text"
        },
        "description": {
          "type": "text",
          "analyzer": "indexing_analyzer",
          "search_analyzer": "search_analyzer",
          "fields": {
            "termVec": {
              "type": "text",
              "term_vector": "yes",
              "store": true,
              "analyzer": "keyword_discovery_analyzer"
            }
          }
        },
        "subDoc": {
          "type": "nested",
          "properties": {
            "name": {
              "type": "text",
              "fields": {
                "termVec": {
                  "type": "text",
                  "term_vector": "yes",
                  "store": true,
                  "analyzer": "keyword_discovery_analyzer"
                }
              }
            },
            "description": {
              "type": "text",
              "fields": {
                "termVec": {
                  "type": "text",
                  "term_vector": "yes",
                  "store": true,
                  "analyzer": "keyword_discovery_analyzer"
                }
              }
            }
          }
        }
      }
    }
  }
}
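
To make concrete what terms the shingle settings above are expected to produce, here is a rough Python sketch. It is only an approximation and not the Lucene implementation: whitespace splitting stands in for the standard tokenizer, the html_strip char filter and the light_english stemmer are skipped, and the names shingles and analyze are hypothetical helpers introduced for illustration.

```python
def shingles(tokens, min_size=2, max_size=3, output_unigrams=True, sep=" "):
    """Approximate a shingle filter: emit each token, then the word n-grams
    of min_size..max_size words that start at that token."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(sep.join(tokens[i:i + n]))
    return out


def analyze(text):
    """Simplified stand-in for keyword_discovery_analyzer:
    lowercase, split on whitespace, then shingle (no stemming)."""
    return shingles(text.lower().split())


if __name__ == "__main__":
    for term in analyze("Quick brown fox"):
        print(term)
```

With min_shingle_size 2 and max_shingle_size 3, the sketch emits trigrams as well as bigrams, so stored term vectors for the termVec sub-fields would be expected to contain multi-word terms next to the unigrams.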

When I execute the request
GET /_termvectors
{
  "fields" : ["subDoc.name.termVec"],
  "offsets" : false,
  "payloads" : false,
  "positions" : false,
  "term_statistics" : true,
  "field_statistics" : true,
  "filter" : {
    "max_num_terms" : 4
  }
}

I get an empty result. However, if I run the following request instead,
GET /12631946/_termvectors
{
  "fields" : ["subDoc.name"],
  "offsets" : false,
  "payloads" : false,
  "positions" : false,
  "term_statistics" : true,
  "field_statistics" : true,
  "per_field_analyzer" : {
    "name": "keyword_discovery_analyzer"
  },
  "filter" : {
    "max_num_terms" : 15
  }
}

ES evaluates the term vectors on the fly and I get results, but none of them contains bigram terms; all are unigrams.

My analyzer is working correctly: when I put the same analyzer on my doc.name field, it gives me bigrams when the term vectors are computed and stored. However, for doc.name as well, if the term vectors are computed at runtime, it always returns unigrams.
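
One thing that may be worth ruling out: in the on-the-fly request, per_field_analyzer is keyed by "name" while fields lists "subDoc.name". The term vectors documentation shows per_field_analyzer keyed by the same field name that is requested, so a key mismatch could leave the field's default analyzer in effect. This is an assumption, not a confirmed cause. The Python sketch below only builds a request body that uses the full path for both; termvectors_body is a hypothetical helper, and nothing is sent to a cluster.

```python
import json


def termvectors_body(field, analyzer, max_num_terms=15):
    """Build a _termvectors request body in which per_field_analyzer is keyed
    by the same full field path that is listed in "fields" (assumption: the
    override is matched on the full path, e.g. "subDoc.name")."""
    return {
        "fields": [field],
        "offsets": False,
        "payloads": False,
        "positions": False,
        "term_statistics": True,
        "field_statistics": True,
        "per_field_analyzer": {field: analyzer},
        "filter": {"max_num_terms": max_num_terms},
    }


body = termvectors_body("subDoc.name", "keyword_discovery_analyzer")

if __name__ == "__main__":
    # Print the JSON body so it can be pasted into a _termvectors request.
    print(json.dumps(body, indent=2))
```

If the response still contains only unigrams with the full path as the key, the mismatch can be excluded as an explanation.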

Please let me know what I am doing wrong.

Thanks
Vishvadeepak Tewari


(Clinton Gormley) #2

Hmm, this looks like a bug to me. Want to open an issue?


(Vishva Deepak Tewari) #3

Yep, what do I need to do to open an issue for this?


(Clinton Gormley) #4

Just go to https://github.com/elastic/elasticsearch/issues/new and provide the info requested.


(Vishva Deepak Tewari) #5

Created an issue: https://github.com/elastic/elasticsearch/issues/25070

