Efficient retrieval of stems mapped to original words


(Patrick Lam) #1

Hi,

If I have a text field indexed in Elasticsearch with an analyzer applied, I know I can retrieve the analyzed word stems with the term vector API. Is there an efficient way to retrieve all the original words that are associated with a stem from ES?

For example, if the words "nationally" and "national" appear in various documents, an ES analyzer stems them to "nation", which I then retrieve with the term vector API. What I want now is to take the stem "nation" and retrieve the original words "national" and "nationally" (and any other variants used within the index).

I know I could probably use the start and end offsets from the term vectors to map each stem back to the original words, but is there a more efficient way to do this?
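To make the offset-based approach concrete, here is a minimal Python sketch. It assumes the term vector response was requested with offsets enabled and has the usual `term_vectors` / `terms` / `tokens` layout. The field name `body`, the helper name `originals_by_stem`, and the sample response are all made up for illustration and are not real cluster output.

```python
from collections import defaultdict


def originals_by_stem(text, termvectors_response, field):
    """Group the original surface forms found in `text` by their analyzed term."""
    terms = termvectors_response["term_vectors"][field]["terms"]
    variants = defaultdict(set)
    for stem, info in terms.items():
        for token in info.get("tokens", []):
            # Offsets are character positions into the original field value.
            surface = text[token["start_offset"]:token["end_offset"]]
            variants[stem].add(surface.lower())
    return dict(variants)


# Hand-written response fragment (assumption), shaped like a term vector reply.
text = "National parks are funded nationally."
response = {
    "term_vectors": {
        "body": {
            "terms": {
                "nation": {
                    "term_freq": 2,
                    "tokens": [
                        {"position": 0, "start_offset": 0, "end_offset": 8},
                        {"position": 4, "start_offset": 26, "end_offset": 36},
                    ],
                }
            }
        }
    }
}

print(originals_by_stem(text, response, "body"))
```

This works one document at a time, so covering the whole index would mean one term vector request per document, which is why I am hoping for something cheaper.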

Thanks!


(Mark Harwood) #2

I'd normally caution against using the significant_terms aggregation on free-text fields, but if your index is small you could potentially use it to decode typical values.

GET test/_search
{
  "query": {
    "term": {
      "Body.snowball": "nation"
    }
  },
  "size": 0,
  "aggs": {
    "correlated": {
      "significant_terms": {
        "field": "Body.simple"
      }
    }
  }
}

I ran this on a dataset with the "simple" and "snowball" analyzers defined on the "Body" field in a fashion similar to this:

{
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "string_fields": {
            "mapping": {
              "index": "analyzed",
              "type": "string",
              "fields": {
                "snowball": {
                  "index": "analyzed",
                  "analyzer": "snowball",
                  "type": "string"
                },
                "simple": {
                  "index": "analyzed",
                  "analyzer": "simple",
                  "type": "string"
                }
              }
            },
            "match": "*"
          }
        }
      ]
    }
  }
}

The results were as follows:

{
	...
   "aggregations": {
	  "correlated": {
		 "doc_count": 1502,
		 "buckets": [
			{
			   "key": "national",
			   "doc_count": 1291,
			   "score": 30.456902115421787,
			   "bg_count": 1291
			},
			{
			   "key": "nations",
			   "doc_count": 144,
			   "score": 3.3972067425412362,
			   "bg_count": 144
			},
			{
			   "key": "eln",
			   "doc_count": 134,
			   "score": 3.16128960764254,
			   "bg_count": 134
			},
			...

The top results were the unstemmed values of interest, although unrelated terms such as "eln" also show up. You could trim the list to only those terms starting with "nation".
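As a rough client-side illustration of that trimming, the Python sketch below filters the three buckets shown above by prefix. The helper name `trim_to_prefix` is hypothetical, and the approach assumes every variant begins with the stem itself, which holds for this example but is not guaranteed for every stemmer output.

```python
def trim_to_prefix(buckets, prefix):
    """Keep only the bucket keys that start with the given prefix."""
    return [b["key"] for b in buckets if b["key"].startswith(prefix)]


# The three buckets visible in the response above.
buckets = [
    {"key": "national", "doc_count": 1291, "score": 30.456902115421787, "bg_count": 1291},
    {"key": "nations", "doc_count": 144, "score": 3.3972067425412362, "bg_count": 144},
    {"key": "eln", "doc_count": 134, "score": 3.16128960764254, "bg_count": 134},
]

print(trim_to_prefix(buckets, "nation"))  # ['national', 'nations']
```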


(Mark Harwood) #3

This version adds an include filter and switches to the percentage scoring heuristic, which removes most of the false positives from my previous post:

GET test/_search
{
   "query": {
	  "match": {
		 "Body.snowball": "nation"
	  }
   },
   "size": 0,
   "aggs": {
	  "correlated": {
		 "significant_terms": {
			"field": "Body.simple",
			"include": "nation.*",
			"percentage": {},
			"shard_min_doc_count": 1,
			"min_doc_count": 1,
			"size": 20
		 }
	  }
   }
}

In the results, any score below a perfect 1 means the word does not stem to "nation", because fewer than 100% of the documents containing it also match the stemmed form:

{
  ...
   "aggregations": {
	  "correlated": {
		 "doc_count": 1502,
		 "buckets": [
			{
			   "key": "nation",
			   "doc_count": 13,
			   "score": 1,
			   "bg_count": 13
			},
			{
			   "key": "nationality",
			   "doc_count": 17,
			   "score": 1,
			   "bg_count": 17
			},
			{
			   "key": "nationalities",
			   "doc_count": 5,
			   "score": 1,
			   "bg_count": 5
			},
			{
			   "key": "national",
			   "doc_count": 1291,
			   "score": 1,
			   "bg_count": 1291
			},
			{
			   "key": "nationals",
			   "doc_count": 44,
			   "score": 1,
			   "bg_count": 44
			},
			{
			   "key": "nations",
			   "doc_count": 144,
			   "score": 1,
			   "bg_count": 144
			},
			{
			   "key": "nationwide",
			   "doc_count": 1,
			   "score": 0.125,
			   "bg_count": 8
			},
			{
			   "key": "nationalist",
			   "doc_count": 1,
			   "score": 0.01639344262295082,
			   "bg_count": 61
			}
		 ]
	  }
   }
}
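The scores above are consistent with the percentage heuristic computing doc_count divided by bg_count for each bucket. A small Python sketch of the final filtering step follows, using the buckets from this response. The helper name `confirmed_variants` is hypothetical, and comparing the two counts directly (rather than testing the float score against 1) is my own choice to avoid rounding concerns.

```python
def confirmed_variants(buckets):
    """Return keys whose every occurrence co-occurs with the stemmed form."""
    return sorted(b["key"] for b in buckets if b["doc_count"] == b["bg_count"])


# Buckets copied from the response above (scores omitted, they are derived).
buckets = [
    {"key": "nation", "doc_count": 13, "bg_count": 13},
    {"key": "nationality", "doc_count": 17, "bg_count": 17},
    {"key": "nationalities", "doc_count": 5, "bg_count": 5},
    {"key": "national", "doc_count": 1291, "bg_count": 1291},
    {"key": "nationals", "doc_count": 44, "bg_count": 44},
    {"key": "nations", "doc_count": 144, "bg_count": 144},
    {"key": "nationwide", "doc_count": 1, "bg_count": 8},
    {"key": "nationalist", "doc_count": 1, "bg_count": 61},
]

# Recompute the percentage score for the two partial matches.
for b in buckets:
    if b["doc_count"] != b["bg_count"]:
        print(b["key"], b["doc_count"] / b["bg_count"])

print(confirmed_variants(buckets))
```

Here "nationwide" and "nationalist" are dropped: each appears in only one document that also matches the stem "nation", so they do not analyze to that stem themselves.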
