Term Vector With BiGrams

vishva_deepak_tewari · May 24, 2017, 2:26am

Hi,
I have a document that contains a subdocument, so I have made the subdocument a nested property of the document. Now I need to find term vectors for the subdocument. My terms can be unigrams or bigrams, so I created an analyzer with a shingle filter. The settings for the index are as follows:

{
  "settings": {
    "analysis": {
      "filter": {
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 3,
          "min_shingle_size": 2,
          "output_unigrams": "true",
          "filler_token": ""
        }
      },
      "analyzer": {
        "keyword_discovery_analyzer": {
          "tokenizer": "standard",
          "char_filter": [ "html_strip" ],
          "filter": [
            "lowercase",
            "filter_shingle",
            "light_english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "name": {
          "type": "text"
        },
        "description": {
          "type": "text",
          "analyzer": "indexing_analyzer",
          "search_analyzer": "search_analyzer",
          "fields": {
            "termVec": {
              "type": "text",
              "term_vector": "yes",
              "store": true,
              "analyzer": "keyword_discovery_analyzer"
            }
          }
        },
        "subDoc": {
          "type": "nested",
          "properties": {
            "name": {
              "type": "text",
              "fields": {
                "termVec": {
                  "type": "text",
                  "term_vector": "yes",
                  "store": true,
                  "analyzer": "keyword_discovery_analyzer"
                }
              }
            },
            "description": {
              "type": "text",
              "fields": {
                "termVec": {
                  "type": "text",
                  "term_vector": "yes",
                  "store": true,
                  "analyzer": "keyword_discovery_analyzer"
                }
              }
            }
          }
        }
      }
    }
  }
}

When I execute this request:
GET /_termvectors
{
  "fields": ["subDoc.name.termVec"],
  "offsets": false,
  "payloads": false,
  "positions": false,
  "term_statistics": true,
  "field_statistics": true,
  "filter": {
    "max_num_terms": 4
  }
}

I get an empty result. However, if I run the following request instead:
GET /12631946/_termvectors
{
  "fields": ["subDoc.name"],
  "offsets": false,
  "payloads": false,
  "positions": false,
  "term_statistics": true,
  "field_statistics": true,
  "per_field_analyzer": {
    "name": "keyword_discovery_analyzer"
  },
  "filter": {
    "max_num_terms": 15
  }
}

Elasticsearch evaluates the term vectors on the fly and I get results, but none of them contain bigram terms; all of them are unigrams.

My analyzer is working correctly: when I put the same analyzer on my doc.name field, I get bigrams when the term vectors are computed and stored. However, for doc.name as well, if the term vectors are computed at runtime, only unigrams are returned.
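
For reference, the analyzer's output can also be checked independently of term vectors with the _analyze API. This is only a minimal sketch: the index name my_index and the sample text are placeholders invented for illustration, not taken from the setup above.

```
GET /my_index/_analyze
{
  "analyzer": "keyword_discovery_analyzer",
  "text": "quick brown fox"
}
```

If the shingle filter is applied, the returned tokens should include multi-word terms alongside the single words, which gives a baseline to compare against what _termvectors returns.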

Please let me know what I am doing wrong.

Thanks
Vishvadeepak Tewari

Clinton_Gormley · May 26, 2017, 9:07am

Hmm, this looks like a bug to me. Want to open an issue?

vishva_deepak_tewari · May 30, 2017, 6:22am

Yep, what do I need to do to open an issue for this?

Clinton_Gormley · May 30, 2017, 7:03am

Just go to https://github.com/elastic/elasticsearch/issues/new and provide the info requested.

vishva_deepak_tewari · June 6, 2017, 10:28am

Created an issue: https://github.com/elastic/elasticsearch/issues/25070

system · July 4, 2017, 10:28am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
