Can _termvector return only real English words and ignore everything else?

virtuman · July 28, 2015, 10:12pm

I'm trying to auto-generate tags relevant to a document based on all of the text known for that document.
Is there an analyzer or token filter that we can use to exclude any "non-real" English words, such as model names, partial words, and words that contain digits or special characters?

For example:

GET data_classification/categories/_termvector
{
  "doc": {
    "text": "wer muffin lear strappy sal made italy color material muffin"
  }
}

and I would want it to return only this subset of words:

muffin x 2
lear
italy
made
color
material
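
To make the expected behavior concrete, here is a rough client-side sketch in Python of the filtering I have in mind. It is not an Elasticsearch feature: `ENGLISH_WORDS` and `real_word_counts` are hypothetical names, and the tiny word set is only a stand-in for a real English dictionary file.

```python
import re
from collections import Counter

# Hypothetical stand-in for a real English dictionary (assumption: in practice
# this would be loaded from a word list file, not hard-coded).
ENGLISH_WORDS = {"muffin", "lear", "italy", "made", "color", "material"}


def real_word_counts(text, vocabulary=ENGLISH_WORDS):
    """Count only tokens that are purely alphabetic and present in vocabulary."""
    counts = Counter()
    for token in text.lower().split():
        # Drop tokens containing digits or special characters.
        if not re.fullmatch(r"[a-z]+", token):
            continue
        # Drop partial words and anything else not in the dictionary.
        if token in vocabulary:
            counts[token] += 1
    return dict(counts)


counts = real_word_counts(
    "wer muffin lear strappy sal made italy color material muffin"
)
for word, freq in counts.items():
    print(word, "x", freq)
```

Ideally the same effect would happen at analysis time inside Elasticsearch, so that _termvector only ever sees the dictionary words.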
