How to index a certain number of words from each text?

How can I index only the first two words of each text?

You can use an edge n-gram token filter. See Edge n-gram token filter | Elasticsearch Guide [7.16] | Elastic.
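For example, a quick check with _analyze (the text here is just a placeholder) shows what that filter produces:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 3
    }
  ],
  "text": "quick brown"
}

This returns the leading fragments of every token (q, qu, qui, b, br, bro), so you can match on word prefixes.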

I want to index whole words, not have the first two words broken into small pieces. With edge n-grams I do still get the first two words, but I don't need all the extra tokens the n-gram filter produces.

You probably need to create a custom analyser which discards everything after the first two words and then tokenizes, e.g. based on whitespace.
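For instance, a sketch (the index, filter, and analyser names are made up, and it assumes single-line text): a pattern_replace character filter could drop everything after the first two words before a whitespace tokenizer runs.

PUT my-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "keep_first_two_words": {
          "type": "pattern_replace",
          "pattern": "^\\s*(\\S+\\s+\\S+).*$",
          "replacement": "$1"
        }
      },
      "analyzer": {
        "first_two_words": {
          "type": "custom",
          "char_filter": ["keep_first_two_words"],
          "tokenizer": "whitespace"
        }
      }
    }
  }
}

If a text has fewer than two words, the pattern simply doesn't match and the text passes through unchanged.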

The Limit token count token filter should match this purpose.

There are many built-in tokenizers and filters; it's worth looking through them all at least once.

POST _analyze
{
  "tokenizer": "standard", 
  "filter":[
    {
      "type": "limit",
      "max_token_count": 2
    }
  ],
  "text": "aaa bbb ccc ddd"
}
{
  "tokens" : [
    {
      "token" : "aaa",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "bbb",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
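To use it at index time, the same filter can be wrapped into a custom analyzer, for example (the index and field names below are just placeholders):

PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "first_two_tokens": {
          "type": "limit",
          "max_token_count": 2
        }
      },
      "analyzer": {
        "first_two_words": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["first_two_tokens"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "first_two_words"
      }
    }
  }
}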

This filter indexes words from the beginning of the text. The first two words were just an example; I may want to index only the third to fifth words, for instance.

If that was just an example, clearly stating your actual requirement (is "only the third to fifth words" also an example?) would have led to a solution sooner, without bothering those who answer here.

If you need to filter tokens at arbitrary positions, I suppose it is better to filter on the client side before indexing, or to create an ingest pipeline with your own script.
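As a rough sketch (the pipeline name and the my_text field are made up, and it assumes whitespace-separated words), a script processor could keep only the 3rd to 5th words before indexing:

PUT _ingest/pipeline/keep_words_3_to_5
{
  "description": "Sketch: keep only the 3rd to 5th whitespace-separated words of my_text",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "def words = ctx.my_text.splitOnToken(' '); int to = words.length < 5 ? words.length : 5; StringBuilder sb = new StringBuilder(); for (int i = 2; i < to; i++) { if (sb.length() > 0) { sb.append(' '); } sb.append(words[i]); } ctx.my_text = sb.toString();"
      }
    }
  ]
}

Documents sent through this pipeline (e.g. via the index's default_pipeline setting or the ?pipeline= request parameter) would then have my_text reduced to those words before the normal analyzer runs.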

