Word_delimiter with split_on_numerics removes all tokens


#1

When analyzing the text alpha 1a beta, I want the resulting tokens to be [alpha, 1, a, beta]. Why does myAnalyzer, defined below, not produce them?

POST myindex
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "myAnalyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "split_on_numerics" ]
        }
      },
      "filter" : {
        "split_on_numerics" : {
          "type" : "word_delimiter",
          "split_on_numerics" : true,
          "split_on_case_change" : false,
          "generate_word_parts" : false,
          "generate_number_parts" : false,
          "catenate_all" : false
        }
      }
    }
  }
}

When I run

GET /myindex/_analyze?analyzer=myAnalyzer&text=alpha 1a beta

no tokens are returned at all. Why is that?


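(Jason Wee) #2

The only change from your settings is that generate_word_parts and generate_number_parts are set to true. With that change the analyzer returns the four tokens you expect:
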
curl -XPUT 'http://localhost:9200/myindex/?pretty' -d '
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "myAnalyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "split_on_numerics" ]
        }
      },
      "filter" : {
        "split_on_numerics" : {
          "type" : "word_delimiter",
          "split_on_numerics" : true,
          "split_on_case_change" : false,
          "generate_word_parts" : true,
          "generate_number_parts" : true,
          "catenate_all" : false
        }
      }
    }
  }
}'


curl -XGET 'localhost:9200/myindex/_analyze?pretty&analyzer=myAnalyzer' -d 'alpha 1a beta'
{
  "tokens" : [ {
    "token" : "alpha",
    "start_offset" : 0,
    "end_offset" : 5,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "1",
    "start_offset" : 6,
    "end_offset" : 7,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "a",
    "start_offset" : 7,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "beta",
    "start_offset" : 9,
    "end_offset" : 13,
    "type" : "<ALPHANUM>",
    "position" : 4
  } ]
}
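
To make the effect of the two generate_* flags easier to see, here is a rough Python sketch of the idea. It is a toy model, not Lucene's implementation: split_parts and analyze are hypothetical names, the split is ASCII-only, and a whitespace split stands in for the standard tokenizer. It also applies the same keep-or-drop rule to every token, which matches the behaviour reported in this thread but is an assumption about the filter's internals.

```python
import re

def split_parts(token, generate_word_parts, generate_number_parts):
    """Toy model: split one token at letter/digit boundaries, then keep
    only the part types whose generate_* flag is enabled."""
    # ASCII-only split into runs of letters or runs of digits,
    # standing in for split_on_numerics = true.
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    kept = []
    for part in parts:
        if part.isdigit():
            if generate_number_parts:
                kept.append(part)
        elif generate_word_parts:
            kept.append(part)
    return kept

def analyze(text, generate_word_parts, generate_number_parts):
    # Whitespace split stands in for the standard tokenizer on this input.
    tokens = []
    for token in text.split():
        tokens.extend(split_parts(token, generate_word_parts, generate_number_parts))
    return tokens

print(analyze("alpha 1a beta", False, False))  # []
print(analyze("alpha 1a beta", True, True))    # ['alpha', '1', 'a', 'beta']
```

In this model, turning off both flags leaves nothing to emit once the token has been split, while turning both on yields the word and number parts separately.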
