Term Vector With BiGrams

Hi,
I have a document that contains a subdocument, so I have made the subdocument a nested property of the document. Now I need to find term vectors for the subdocument. My terms can be unigrams or bigrams, so I created an analyzer with a shingle filter. The settings for the index are as follows:

{
  "settings": {
    "analysis": {
      "filter": {
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 3,
          "min_shingle_size": 2,
          "output_unigrams": "true",
          "filler_token": ""
        }
      },
      "analyzer": {
        "keyword_discovery_analyzer": {
          "tokenizer": "standard",
          "char_filter": [ "html_strip" ],
          "filter": [
            "lowercase",
            "filter_shingle",
            "light_english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "text"
        },
        "description": {
          "type": "text",
          "analyzer": "indexing_analyzer",
          "search_analyzer": "search_analyzer",
          "fields": {
            "termVec": {
              "type": "text",
              "term_vector": "yes",
              "store": true,
              "analyzer": "keyword_discovery_analyzer"
            }
          }
        },
        "subDoc": {
          "type": "nested",
          "properties": {
            "name": {
              "type": "text",
              "fields": {
                "termVec": {
                  "type": "text",
                  "term_vector": "yes",
                  "store": true,
                  "analyzer": "keyword_discovery_analyzer"
                }
              }
            },
            "description": {
              "type": "text",
              "fields": {
                "termVec": {
                  "type": "text",
                  "term_vector": "yes",
                  "store": true,
                  "analyzer": "keyword_discovery_analyzer"
                }
              }
            }
          }
        }
      }
    }
  }
}

When I execute this request:
GET /_termvectors
{
  "fields": ["subDoc.name.termVec"],
  "offsets": false,
  "payloads": false,
  "positions": false,
  "term_statistics": true,
  "field_statistics": true,
  "filter": {
    "max_num_terms": 4
  }
}

I get an empty result. However, if I run the following query instead:
GET /12631946/_termvectors
{
  "fields": ["subDoc.name"],
  "offsets": false,
  "payloads": false,
  "positions": false,
  "term_statistics": true,
  "field_statistics": true,
  "per_field_analyzer": {
    "name": "keyword_discovery_analyzer"
  },
  "filter": {
    "max_num_terms": 15
  }
}

ES evaluates the term vectors on the fly and I get results, but none of them contain bigram terms; all are unigrams.

My analyzer is working correctly: when I put the same analyzer on my doc.name field, it gives me bigrams when the term vectors are computed and stored. However, for doc.name as well, if the term vectors are computed at runtime, only unigrams are returned.
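
For anyone wanting to check the analyzer separately from term vectors, the _analyze API can be run against the index. This is only a minimal sketch: the index name my_index and the sample text are placeholders, not taken from the setup above.

```
GET /my_index/_analyze
{
  "analyzer": "keyword_discovery_analyzer",
  "text": "quick brown fox"
}
```

If the shingle filter is applied, the returned token list should include multi-word tokens alongside the single words. The exact token text may differ from the input, because the stemmer runs after the shingle filter in this analyzer.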

Please let me know what I am doing wrong.

Thanks
Vishvadeepak Tewari

Hmm, this looks like a bug to me. Want to open an issue?

Yep, what do I need to do to open an issue for this?

Just go to https://github.com/elastic/elasticsearch/issues/new and provide the info requested.

Created an issue: https://github.com/elastic/elasticsearch/issues/25070
