Word delimiter filter with preserve_original

I have the following case. My mapping is:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_length": {
          "type": "length",
          "min": 2
        },
        "my_word": {
          "type": "word_delimiter",
          "preserve_original": true
        }
      },
      "analyzer": {
        "std_folded": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "my_word",
            "my_length"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_text": {
        "type": "text",
        "analyzer": "std_folded" 
      }
    }
  }
}

And I'm trying to analyze the string "W.1000"

GET my_index/_analyze 
{
  "analyzer": "std_folded", 
  "text":     ["W.1000"]
}

and I'm getting:

{
  "tokens" : [
    {
      "token" : "1000",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<NUM>",
      "position" : 1
    }
  ]
}

The question is: why am I not also receiving a W.1000 token?

Hey,

Without debugging too much, I think this could be caused by the standard tokenizer. Try this:

GET _analyze
{
  "text": [
    "W.1000"
  ],
  "tokenizer": "keyword",
  "filter": [
    { "type" : "lowercase" }, 
    { "type" : "asciifolding" },
    { "type": "length", "min": 2 },
    { "type" : "word_delimiter", "preserve_original": true }
  ]
}

Also, try switching the length and word_delimiter filters around for different results. However, the above would only work if w.1000 is the only string in your field, which I am not sure it is.
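For example, the same request with the two filters swapped (just a sketch, I have not run it against your index) should keep w.1000 and 1000 but drop the single-character w, because the length filter now runs after word_delimiter:

GET _analyze
{
  "text": [
    "W.1000"
  ],
  "tokenizer": "keyword",
  "filter": [
    { "type" : "lowercase" }, 
    { "type" : "asciifolding" },
    { "type" : "word_delimiter", "preserve_original": true },
    { "type": "length", "min": 2 }
  ]
}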

The standard tokenizer would already have tokenized, and potentially removed parts of, the w.1000 even before the word_delimiter filter is able to kick in. You could try playing around with a different tokenizer like whitespace or classic and see if that helps as well.
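You can check that with the tokenizer on its own (again just a sketch): the standard tokenizer should split W.1000 into W and 1000, so there is no W.1000 token left for preserve_original to keep, while the whitespace tokenizer should keep W.1000 as a single token:

GET _analyze
{
  "tokenizer": "standard",
  "text": ["W.1000"]
}

GET _analyze
{
  "tokenizer": "whitespace",
  "text": ["W.1000"]
}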

The output of your solution is:

{
  "tokens" : [
    {
      "token" : "w.1000",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "w",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "1000",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    }
  ]
}

and I don't want the w token, only tokens with a length of 2 or more.

Have you tried changing the order of the filters?
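If you want to bake this into the index, something like the following might work (a sketch only, combining the whitespace tokenizer suggestion with your original filter order, not tested against your data; mappings stay as in your original request). The whitespace tokenizer does not break the field up on the dot, word_delimiter with preserve_original then emits w.1000, w and 1000, and the length filter drops the single-character w:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_length": {
          "type": "length",
          "min": 2
        },
        "my_word": {
          "type": "word_delimiter",
          "preserve_original": true
        }
      },
      "analyzer": {
        "std_folded": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "asciifolding",
            "my_word",
            "my_length"
          ]
        }
      }
    }
  }
}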
