How to keep only longest token occupying the same positions

Valentin_Pletzer · February 17, 2023, 2:04pm

Is there a way to keep only the longest token if two or more tokens occupy the same positions?

E.g. if I define "fox" and "quick fox" as keep words, both are returned when analyzing the sentence "The quick fox jumped over the lazy dog." But since "quick fox" is more specific, I would like to keep only "quick fox" and discard "fox".

curl -X PUT "localhost:9200/my-index-000001?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "custom_shingle",
            "keepMe"
          ]
        }
      },
      "filter": {
        "keepMe": {
          "type": "keep",
          "keep_words": [ "quick fox", "fox" ]
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "3",
          "output_unigrams_if_no_shingles": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std_folded" 
      }
    }
  }
}
'

curl -X POST "localhost:9200/my-index-000001/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "std_folded",
  "text": "The quick fox jumped over the lazy dog."
}
'
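
To make the desired behaviour concrete, here is a minimal client-side sketch in Python. It is not an Elasticsearch feature or an existing token filter; it only post-processes a token list shaped like the `tokens` array of an `_analyze` response, using the `start_offset` and `end_offset` fields. The function name `keep_longest` and the sample tokens are hypothetical, and the offsets are written by hand for the sentence above rather than captured from a real run. Overlap is judged by character offsets rather than by `position`, on the assumption that a shingle and the unigram it contains may not report the same position.

```python
from typing import Dict, List


def keep_longest(tokens: List[Dict]) -> List[Dict]:
    """Drop every token whose character span lies inside an already kept, longer token."""
    # Visit the longest spans first so that a shorter token is always
    # compared against the longer tokens that might contain it.
    ordered = sorted(
        tokens,
        key=lambda t: (-(t["end_offset"] - t["start_offset"]), t["start_offset"]),
    )
    kept: List[Dict] = []
    for tok in ordered:
        covered = any(
            k["start_offset"] <= tok["start_offset"]
            and tok["end_offset"] <= k["end_offset"]
            for k in kept
        )
        if not covered:
            kept.append(tok)
    # Return the survivors in reading order.
    return sorted(kept, key=lambda t: t["start_offset"])


# Hand-written sample (an assumption, not real _analyze output):
# "quick fox" covers characters 4-13 and "fox" covers 10-13 of
# "The quick fox jumped over the lazy dog."
sample_tokens = [
    {"token": "quick fox", "start_offset": 4, "end_offset": 13},
    {"token": "fox", "start_offset": 10, "end_offset": 13},
]

print([t["token"] for t in keep_longest(sample_tokens)])  # ['quick fox']
```

Note that two tokens with exactly the same span count as covering each other, so only one of them survives. Since this runs outside the analyzer, it does not change what gets indexed; it only illustrates the filtering rule I am looking for.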

system · March 17, 2023, 2:05pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
