Efficient retrieval of stems mapped to original words

Patrick_Lam · September 13, 2015, 6:23am

Hi,

If I have a text field indexed in Elasticsearch with an analyzer applied, I know I can retrieve the analyzed word stems with the term vector API. Is there an efficient way to retrieve all the original words that are associated with a stem from ES?

For example, if the words "nationally" and "national" appear in various documents, an ES analyzer stems them to "nation", which I can then retrieve with the term vector API. What I want now is to take the stem "nation" and retrieve the original words "national" and "nationally" (and any other variants used within the index).

I know I can probably do this using the start and end offsets to try to map them back to the original words, but is there another more efficient way to do this?

Thanks!
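
For reference, the offset-based approach mentioned in the question could look roughly like the sketch below. It assumes the term vectors response for one document is already available as a Python dict with offsets enabled, and that the original field text is at hand. The function name `surface_forms_by_stem` and the sample response are hypothetical and hand-written for illustration, not output from a real index.

```python
from collections import defaultdict

def surface_forms_by_stem(text, term_vectors_response, field):
    """Map each stem to the original words found at its token offsets.

    Assumes offsets index into `text` as Python string positions
    (true for text without characters outside the Basic Multilingual Plane).
    """
    terms = term_vectors_response["term_vectors"][field]["terms"]
    forms = defaultdict(set)
    for stem, info in terms.items():
        for token in info.get("tokens", []):
            word = text[token["start_offset"]:token["end_offset"]]
            forms[stem].add(word.lower())
    return dict(forms)

# Hand-written sample shaped like a term vectors response (illustrative only).
text = "National parks are nationally funded."
sample_response = {
    "term_vectors": {
        "Body.snowball": {
            "terms": {
                "nation": {
                    "term_freq": 2,
                    "tokens": [
                        {"position": 0, "start_offset": 0, "end_offset": 8},
                        {"position": 3, "start_offset": 19, "end_offset": 29},
                    ],
                }
            }
        }
    }
}

forms = surface_forms_by_stem(text, sample_response, "Body.snowball")
```

This works one document at a time, so covering every variant in the index means requesting term vectors for every matching document, which is the cost the question is trying to avoid.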

Mark_Harwood · September 14, 2015, 9:27am

I'd normally caution against using the significant_terms aggregation on free-text fields, but if your index is small you could potentially use it to recover the typical unstemmed values.

GET test/_search
{
    "query": {
        "term": {
            "Body.snowball": "nation"
        }
    },
    "size": 0,
    "aggs": {
        "correlated": {
            "significant_terms": {
                "field": "Body.simple"
            }
        }
    }
}

I ran this on a dataset with the "simple" and "snowball" analyzers defined on the "Body" field in a fashion similar to this:

{
    "mappings": {
        "_default_": {
            "dynamic_templates": [
                {
                    "string_fields": {
                        "mapping": {
                            "index": "analyzed",
                            "type": "string",
                            "fields": {
                                "snowball": {
                                    "index": "analyzed",
                                    "analyzer": "snowball",
                                    "type": "string"
                                },
                                "simple": {
                                    "index": "analyzed",
                                    "analyzer": "simple",
                                    "type": "string"
                                }
                            }
                        },
                        "match": "*"
                    }
                }
            ]
        }
    }
}

The results were as follows:

{
	...
   "aggregations": {
	  "correlated": {
		 "doc_count": 1502,
		 "buckets": [
			{
			   "key": "national",
			   "doc_count": 1291,
			   "score": 30.456902115421787,
			   "bg_count": 1291
			},
			{
			   "key": "nations",
			   "doc_count": 144,
			   "score": 3.3972067425412362,
			   "bg_count": 144
			},
			{
			   "key": "eln",
			   "doc_count": 134,
			   "score": 3.16128960764254,
			   "bg_count": 134
			},
			...

The top results were the unstemmed values of interest, mixed with co-occurring terms such as "eln". You could trim the list to only those terms starting with "nation".
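
A minimal client-side sketch of that trimming step, assuming the aggregation buckets have already been parsed into Python dicts. The helper name `keep_prefixed` is hypothetical, and the bucket values are copied from the results above.

```python
def keep_prefixed(buckets, prefix):
    """Return only the buckets whose key starts with the given prefix."""
    return [b for b in buckets if b["key"].startswith(prefix)]

# Buckets copied from the significant_terms response shown above.
buckets = [
    {"key": "national", "doc_count": 1291, "bg_count": 1291},
    {"key": "nations", "doc_count": 144, "bg_count": 144},
    {"key": "eln", "doc_count": 134, "bg_count": 134},
]

trimmed_keys = [b["key"] for b in keep_prefixed(buckets, "nation")]
```

Filtering on the client only sees the buckets the aggregation returned, so variants that fall outside the returned top terms would still be missed.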

Mark_Harwood · September 15, 2015, 8:13am

This version with an include filter removes most of the false positives in my previous post:

GET test/_search
{
   "query": {
	  "match": {
		 "Body.snowball": "nation"
	  }
   },
   "size": 0,
   "aggs": {
	  "correlated": {
		 "significant_terms": {
			"field": "Body.simple",
			"include": "nation.*",
			"percentage": {},
			"shard_min_doc_count": 1,
			"min_doc_count": 1,
			"size": 20
		 }
	  }
   }
}

In the results, any score below a perfect 1 means the term is not a variant of the stem, because fewer than 100% of the documents containing it also match the stemmed form:

{
  ...
   "aggregations": {
	  "correlated": {
		 "doc_count": 1502,
		 "buckets": [
			{
			   "key": "nation",
			   "doc_count": 13,
			   "score": 1,
			   "bg_count": 13
			},
			{
			   "key": "nationality",
			   "doc_count": 17,
			   "score": 1,
			   "bg_count": 17
			},
			{
			   "key": "nationalities",
			   "doc_count": 5,
			   "score": 1,
			   "bg_count": 5
			},
			{
			   "key": "national",
			   "doc_count": 1291,
			   "score": 1,
			   "bg_count": 1291
			},
			{
			   "key": "nationals",
			   "doc_count": 44,
			   "score": 1,
			   "bg_count": 44
			},
			{
			   "key": "nations",
			   "doc_count": 144,
			   "score": 1,
			   "bg_count": 144
			},
			{
			   "key": "nationwide",
			   "doc_count": 1,
			   "score": 0.125,
			   "bg_count": 8
			},
			{
			   "key": "nationalist",
			   "doc_count": 1,
			   "score": 0.01639344262295082,
			   "bg_count": 61
			}
		 ]
	  }
   }
}
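
As a rough illustration of reading these results, the sketch below separates the buckets into variants and non-variants by checking whether every document containing the term is also in the query's result set (doc_count equal to bg_count, which is what a score of 1 reflects here). The function name `split_variants` is hypothetical, and the bucket values are copied from the response above.

```python
def split_variants(buckets):
    """Split bucket keys into (variants, others).

    A term counts as a variant when all documents containing it
    (bg_count) also matched the stemmed query (doc_count).
    """
    variants, others = [], []
    for b in buckets:
        if b["doc_count"] == b["bg_count"]:
            variants.append(b["key"])
        else:
            others.append(b["key"])
    return variants, others

# Buckets copied from the response shown above.
buckets = [
    {"key": "nation", "doc_count": 13, "bg_count": 13},
    {"key": "nationality", "doc_count": 17, "bg_count": 17},
    {"key": "nationalities", "doc_count": 5, "bg_count": 5},
    {"key": "national", "doc_count": 1291, "bg_count": 1291},
    {"key": "nationals", "doc_count": 44, "bg_count": 44},
    {"key": "nations", "doc_count": 144, "bg_count": 144},
    {"key": "nationwide", "doc_count": 1, "bg_count": 8},
    {"key": "nationalist", "doc_count": 1, "bg_count": 61},
]

variants, others = split_variants(buckets)
```

Comparing the integer counts avoids any floating-point comparison against the score field.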
