I'm trying to auto-generate tags for a document based on all of the text known for that document.
Is there an analyzer or token filter we can use to exclude any "non-real" English words, such as model names, partial words, and words that include numeric or special characters?
For example, given this request:
GET data_classification/categories/_termvector
{
  "doc": {
    "text": "wer muffin lear strappy sal made italy color material muffin"
  }
}
I would want it to return only this subset of words:
- muffin x 2
- lear
- italy
- made
- color
- material
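
To make the desired behaviour concrete, here is a minimal Python sketch of the filtering, done outside Elasticsearch. It is only an illustration under stated assumptions: `KNOWN_WORDS` and `keep_real_words` are hypothetical names, and the tiny word set stands in for a real English dictionary that would have to be loaded from somewhere.

```python
import re
from collections import Counter

# Hypothetical allow-list for illustration only; a real setup would load
# a full English dictionary instead of this hand-picked set.
KNOWN_WORDS = {"muffin", "lear", "italy", "made", "color", "material"}

# Accept only tokens made up entirely of the letters a-z, which drops
# anything containing digits or special characters.
ALPHA_ONLY = re.compile(r"^[a-z]+$")


def keep_real_words(text, known_words=KNOWN_WORDS):
    """Return term counts for tokens that are purely alphabetic and in known_words."""
    tokens = text.lower().split()
    kept = [t for t in tokens if ALPHA_ONLY.match(t) and t in known_words]
    return Counter(kept)


counts = keep_real_words(
    "wer muffin lear strappy sal made italy color material muffin"
)
for term, freq in counts.items():
    print(term, freq)
```

The sketch combines two checks: a pattern that rejects tokens with numeric or special characters, and a dictionary lookup that rejects partial or unknown words. The open question is whether an analyzer or filter can apply the same two checks at analysis time.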