How to index a certain number of words from each text?

How can I index only the first two words of each text?

You can use an edge n-gram token filter. See Edge n-gram token filter | Elasticsearch Guide [7.16] | Elastic.
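For example, a quick check with _analyze (the text here is just a placeholder) shows what that filter produces:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 3
    }
  ],
  "text": "quick brown"
}

This returns the leading fragments of every token (q, qu, qui, b, br, bro), so you can match on word prefixes.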

I want to index whole words, not have the first two words broken into small pieces. With edge n-grams I do still get the first two words, but I don't need all the extra tokens the n-gram filter produces.

You probably need to create a custom analyser which discards everything after the first two words and then tokenizes, e.g. based on whitespace.
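For instance, a sketch (the index, filter, and analyser names are made up, and it assumes single-line text): a pattern_replace character filter could drop everything after the first two words before a whitespace tokenizer runs.

PUT my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "keep_first_two_words": {
          "type": "pattern_replace",
          "pattern": "^\\s*(\\S+\\s+\\S+).*$",
          "replacement": "$1"
        }
      },
      "analyzer": {
        "first_two_words": {
          "type": "custom",
          "char_filter": ["keep_first_two_words"],
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

If a text has fewer than two words, the pattern simply doesn't match and the text passes through unchanged.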

The Limit token count token filter should match this purpose.

There are many built-in tokenizers and filters; it's worth looking through them all at least once.

POST _analyze
{
  "tokenizer": "standard", 
  "filter":[
    {
      "type": "limit",
      "max_token_count": 2
    }
  ],
  "text": "aaa bbb ccc ddd"
}
{
  "tokens" : [
    {
      "token" : "aaa",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "bbb",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
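To use it at index time, the same filter can be wrapped into a custom analyzer, for example (the index and field names below are just placeholders):

PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "first_two_tokens": {
          "type": "limit",
          "max_token_count": 2
        }
      },
      "analyzer": {
        "first_two_words": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["first_two_tokens"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "first_two_words"
      }
    }
  }
}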

This filter indexes words from the beginning of the text. The first two words were just an example; I may want to index only the third to fifth words, for instance.

If that was just an example, clearly stating your actual requirement (is "only the third to fifth words" also an example?) would have led to a solution sooner, without bothering those who answer here.

If you need to filter tokens at arbitrary positions, I suppose it is better to filter on the client side before indexing, or to create an ingest pipeline with your own script.
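As a rough sketch (the pipeline name and the my_text field are made up, and it assumes whitespace-separated words), a script processor could keep only the 3rd to 5th words before indexing:

PUT _ingest/pipeline/keep_words_3_to_5
{
  "description": "Sketch: keep only the 3rd to 5th whitespace-separated words of my_text",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "def words = ctx.my_text.splitOnToken(' '); int to = words.length < 5 ? words.length : 5; StringBuilder sb = new StringBuilder(); for (int i = 2; i < to; i++) { if (sb.length() > 0) { sb.append(' '); } sb.append(words[i]); } ctx.my_text = sb.toString();"
      }
    }
  ]
}

Documents sent through this pipeline (e.g. via the index's default_pipeline setting or the ?pipeline= request parameter) would then have my_text reduced to those words before the normal analyzer runs.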

