Can _termvector return only real English words and ignore everything else?

I'm trying to auto-generate tags for a document based on all of the text known for that document.
Is there an analyzer or token filter we can use to exclude any "non-real" English words, such as model names, partial words, and words that contain digits or special characters?

For example, given this request:

GET data_classification/categories/_termvector
{
  "doc": {
    "text": "wer muffin lear strappy sal made italy color material muffin"
  }
}

I would want it to return only this subset of words:

  • muffin x 2
  • lear
  • italy
  • made
  • color
  • material
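
To make the expected behaviour concrete, here is a rough Python sketch of the filtering I have in mind. It is not Elasticsearch code: ENGLISH_WORDS is a hypothetical stand-in for a real English dictionary, and real_word_counts is just a name I made up for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for a real English dictionary.
# A real setup would load a full word list instead of these six entries.
ENGLISH_WORDS = {"muffin", "lear", "italy", "made", "color", "material"}


def real_word_counts(text, dictionary=ENGLISH_WORDS):
    """Count only tokens that are purely alphabetic and present in the dictionary."""
    # Split on whitespace, roughly like a whitespace tokenizer plus lowercasing.
    tokens = re.findall(r"\S+", text.lower())
    # Drop tokens with digits or special characters, then drop unknown words.
    return Counter(t for t in tokens if t.isalpha() and t in dictionary)


counts = real_word_counts(
    "wer muffin lear strappy sal made italy color material muffin"
)
for term, freq in counts.items():
    print(term, freq)
```

Here "wer", "strappy" and "sal" are dropped because they are not in the word list, and "muffin" is counted twice. I'm looking for an analyzer or filter that gives the equivalent term frequencies from _termvector.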