Generate_number_parts not working as expected

Hi,

I have a requirement to include special characters in search. So for both indexing and search, I am trying to create a custom analyzer that retains the original characters as-is. I have set the tokenizer to whitespace and generate_number_parts to false, but when I check the analyzer output, it is not working as expected.

PUT /testing
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "word_delimiter_v3_filter": {
            "type": "word_delimiter",
            "generate_number_parts ": false,
            "split_on_numerics": false,
            "preserve_original": true
          }
        },
        "analyzer": {
          "searchword_v3_analyzer": {
            "filter": [
              "lowercase",
              "word_delimiter_v3_filter"
            ],
            "type": "custom",
            "tokenizer": "whitespace"
          }
        }
      }
    }
  },
  "mappings": {
    "testmap": {
      "properties": {
        "fullname": {
          "type": "text",
          "analyzer": "searchword_v3_analyzer",
          "search_analyzer": "searchword_v3_analyzer"
        }
      }
    }
  }
}

Using the above analyzer, if I enter "2-10" as the search text, I want it to be searched as-is.

GET testing/_analyze 
{
  "analyzer": "searchword_v3_analyzer", 
  "text":     "2-10"
}

But when I check the analyzer output, it's still splitting "2" and "10" into separate tokens, even though I am using the whitespace tokenizer and have set generate_number_parts to false and split_on_numerics to false.

I am running Elasticsearch 5.5.1 on Windows 2012.

{
  "tokens": [
    {
      "token": "2-10",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "10",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 1
    }
  ]
}

Thanks
askids

Why not do this instead:

DELETE testing
PUT /testing
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "searchword_v3_analyzer": {
            "filter": [
              "lowercase"
            ],
            "type": "custom",
            "tokenizer": "whitespace"
          }
        }
      }
    }
  },
  "mappings": {
    "testmap": {
      "properties": {
        "fullname": {
          "type": "text",
          "analyzer": "searchword_v3_analyzer",
          "search_analyzer": "searchword_v3_analyzer"
        }
      }
    }
  }
}
GET testing/_analyze 
{
  "analyzer": "searchword_v3_analyzer", 
  "text":     "2-10"
}

It gives:

{
  "tokens": [
    {
      "token": "2-10",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}
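
As a quick end-to-end check (just a sketch; the sample document and query below are made up for illustration), you can index a value containing the hyphen and confirm that a match query finds it as-is:

PUT testing/testmap/1?refresh
{
  "fullname": "2-10"
}

GET testing/_search
{
  "query": {
    "match": {
      "fullname": "2-10"
    }
  }
}

Since both the index and search analyzers keep "2-10" as a single token, the document should come back as a hit.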

PS: Thanks for the reproduction script. I wish everybody would provide one.

Thanks @dadoonet. I actually figured it out a day after posting that I was getting into unnecessary complexity with the word_delimiter filter, when simply using the whitespace tokenizer would meet my requirement. I came back today to update my post, but I see that you have already suggested the same. Thank you!!!

However, I have a secondary question for you. When I was using the whitespace tokenizer along with the word_delimiter filter, why did it generate more tokens? Even though the custom word_delimiter filter was redundant, it should have produced the same result. Instead, it effectively overrides the whitespace tokenizer's output, yet it doesn't follow all the rules set in the filter; the only rule that took effect was preserve_original. It looks like a bug to me.
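
If you want to see exactly which step produces which tokens (a sketch run against the original index definition from the first post, i.e. the one that still includes word_delimiter_v3_filter; output omitted here), the explain option of the _analyze API reports the token stream after the tokenizer and after each token filter:

GET testing/_analyze
{
  "analyzer": "searchword_v3_analyzer",
  "text": "2-10",
  "explain": true
}

The detail section of the response should show the whitespace tokenizer emitting a single "2-10" token, with the extra "2" and "10" tokens only appearing after the word_delimiter_v3_filter step.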

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.