Can _termvector return only real English words and ignore everything else?


(Alex Smirnov) #1

I'm trying to auto-generate tags for a document from all of the text it contains.
Is there an analyzer or token filter that can exclude anything that is not a real English word, such as models, partial words, and words that contain digits or special characters?

For example:

GET data_classification/categories/_termvector
{
  "doc": {
    "text": "wer muffin lear strappy sal made italy color material muffin"
  }
}

I would want it to return only this subset of words:

  • muffin x 2
  • lear
  • italy
  • made
  • color
  • material
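
To make the desired result concrete, here is a rough Python sketch of the filtering I have in mind, done outside Elasticsearch. The `ENGLISH_WORDS` set and the `dictionary_term_counts` function are hypothetical names used only for illustration; a real setup would need a proper English dictionary, and this is not something `_termvector` does by itself.

```python
from collections import Counter

# Hypothetical stand-in for a real English dictionary (assumption for illustration only).
ENGLISH_WORDS = {"muffin", "lear", "italy", "made", "color", "material"}


def dictionary_term_counts(text, dictionary=ENGLISH_WORDS):
    """Count tokens that are purely alphabetic and present in the dictionary."""
    counts = Counter()
    for token in text.lower().split():
        # Drop tokens containing digits or special characters, then unknown words.
        if token.isalpha() and token in dictionary:
            counts[token] += 1
    return counts


counts = dictionary_term_counts(
    "wer muffin lear strappy sal made italy color material muffin"
)
```

With this input, `counts` holds the six words listed above, with `muffin` counted twice, while `wer`, `strappy`, and `sal` are dropped.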
