Word delimiter filter with preserve_original

I have the following case. My mapping is:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_length": {
          "type": "length",
          "min": 2
        },
        "my_word": {
          "type": "word_delimiter",
          "preserve_original": true
        }
      },
      "analyzer": {
        "std_folded": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "my_word",
            "my_length"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std_folded" 
      }
    }
  }
}

And I'm trying to analyze the string "W.1000":

GET my_index/_analyze 
{
  "analyzer": "std_folded", 
  "text":     ["W.1000"]
}

and I'm getting:

{
  "tokens" : [
    {
      "token" : "1000",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<NUM>",
      "position" : 1
    }
  ]
}

The question is: why am I not also receiving the w.1000 token?

Hey,

Without debugging too much, I think this could be caused by the standard tokenizer. Try this:

GET _analyze
{
  "text": [
    "W.1000"
  ],
  "tokenizer": "keyword",
  "filter": [
    { "type" : "lowercase" }, 
    { "type" : "asciifolding" },
    { "type": "length", "min": 2 },
    { "type" : "word_delimiter", "preserve_original": true }
  ]
}

Also, try switching the length and word_delimiter filters around for different results. However, the above would only work if w.1000 were the only string in your field, and I am not sure it is?

The standard tokenizer has already split w.1000 into separate tokens before the word_delimiter filter can kick in, so there is no original token left for preserve_original to preserve, and the length filter then removes the short w token. You could try playing around with a different tokenizer like whitespace or classic and see if that helps as well.
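You can see this in isolation by running the standard tokenizer on its own, without any filters. Assuming the same input as above, it should already emit two separate tokens (W and 1000), with no W.1000 token left to preserve:

GET _analyze
{
  "tokenizer": "standard",
  "text": ["W.1000"]
}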

The output of your solution is:

{
  "tokens" : [
    {
      "token" : "w.1000",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "w",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "1000",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    }
  ]
}

and I don't want the w token, only tokens of length 2 or more.
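In that case, try moving the length filter after word_delimiter, keeping the keyword tokenizer from above. The single-character w should then be dropped by the length filter, while w.1000 and 1000 survive. A sketch, still assuming the field holds a single term:

GET _analyze
{
  "text": ["W.1000"],
  "tokenizer": "keyword",
  "filter": [
    { "type": "lowercase" },
    { "type": "asciifolding" },
    { "type": "word_delimiter", "preserve_original": true },
    { "type": "length", "min": 2 }
  ]
}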