Analyzer unassigned when using 'integer' type


(missinglink) #1

When assigning an analyzer to an integer field, the analyzer definition is removed from the mapping.

In the example below, I would like to strip non-numeric characters from a house number and store the result as an integer type so that I can use a range filter (e.g. ["1A"] => [1] or ["apt 4"] => [4]).

When I post the mapping to ES, the analyzer is removed from the integer field but not from the string field.

This results in an unexpected error such as MapperParsingException[failed to parse [myInteger]]; nested: NumberFormatException[For input string: \"apartment 1A\"];.

I looked through the docs and couldn't find mention of this behaviour.

#!/bin/bash

################################################
# Analyzer unassigned when using 'integer' type
################################################

ES='localhost:9200';

# drop index
curl -XDELETE "$ES/address?pretty=true";

# create index
curl -XPUT "$ES/address?pretty=true" -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "numberify": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["convert_non_numeric_chars_to_spaces"]
        }
      },
      "char_filter": {
        "convert_non_numeric_chars_to_spaces": {
          "type": "pattern_replace",
          "pattern": "[^0-9]",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "housenumber": {
      "properties": {
        "myString": {
          "type": "string",
          "analyzer": "numberify"
        },
        "myInteger": {
          "type": "integer",
          "analyzer": "numberify"
        }
      }
    }
  }
}';

# retrieve index mapping
curl -XGET "$ES/address/_mapping?pretty=true"

# !!! analyzer has been removed from 'integer' field but not string field !!!

# "myInteger" : {
#   "type" : "integer"
# },
# "myString" : {
#   "type" : "string",
#   "analyzer" : "numberify"
# }

# index a doc
curl -XPOST "$ES/address/housenumber/1?pretty=true" -d'
{
  "myInteger": "apartment 1A",
  "myString": "apartment 1A"
}';

# {
#   "error" : "MapperParsingException[failed to parse [myInteger]]; nested: NumberFormatException[For input string: \"apartment 1A\"]; ",
#   "status" : 400
# }

# expected for input 'apartment 1A'
#
# "myInteger": 1,
# "myString": "1"

# Elasticsearch version info
{
  "status" : 200,
  "name" : "Mahkizmo",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.7.2",
    "build_hash" : "e43676b1385b8125d647f593f7202acbd816e8ec",
    "build_timestamp" : "2015-09-14T09:49:53Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.4"
  },
  "tagline" : "You Know, for Search"
}

(David Pilato) #2

Analyzers are only used on String types. They are used to parse an input String and provide a Token stream as an output.

You can't generate a number from "apartment 1A".

You need to do whatever transformation you want before sending your data to Elasticsearch.
For example, you can use Logstash and its Grok filter to extract the number from "apartment 1A" and generate something like:

{
  "label": "apartment 1A",
  "number": 1
}

Then send that to Elasticsearch.
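
As a rough illustration of that pre-processing step, here is a minimal sketch in plain shell rather than Logstash. It assumes the house number is the first run of digits in the input; the function name extract_house_number is hypothetical and not part of any Elasticsearch or Logstash API.

```shell
#!/bin/bash

# Hypothetical helper: print the first run of digits found in the input,
# or nothing at all if the input contains no digits.
extract_house_number() {
  printf '%s\n' "$1" | grep -oE '[0-9]+' | head -n 1
}

extract_house_number "apartment 1A"   # prints 1
extract_house_number "apt 4"          # prints 4
```

The extracted value can then be sent as the integer field, with the original text kept in a separate string field.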

Here Elasticsearch simply ignores the analyzer setting on the integer type.


(missinglink) #3

Agh, OK, thanks for your reply.

I assumed that inputs from the RESTful API arrived as strings, at which point you could manipulate them using string functions until they were put in the index, at which time coercion happens and incompatible types throw errors.

It seems like this is not the case. It sounds like if you set type:integer then you are no longer able to use custom analysis.

Is this documented somewhere? It stung me and would probably sting others.

Creating a mapping with type:integer and an analyzer set doesn't return any errors or log anything, so you can see why I might be surprised that my mapping entry was stripped.

Looks like we will have to do the field transformation logic in our application layer and send both the unmodified and the modified version to Elasticsearch, one for analysis and one for _source retrieval.
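
A minimal sketch of what that could look like, assuming a document with a string field "label" and an integer field "number" (the field names come from the example in the reply above; the helper build_doc is hypothetical). It also assumes the label contains no characters that need JSON escaping.

```shell
#!/bin/bash

# Hypothetical helper: build a document holding both the raw label and
# the number extracted from it. Assumes the label contains no characters
# that need JSON escaping (quotes, backslashes, control characters).
build_doc() {
  local label="$1"
  local number
  # Take the first run of digits, then drop leading zeros so the value
  # is a valid JSON number ("007" becomes "7", "0" stays "0").
  number=$(printf '%s\n' "$label" | grep -oE '[0-9]+' | head -n 1 \
    | sed 's/^0*\([0-9]\)/\1/')
  if [ -z "$number" ]; then
    # No digits found: omit the numeric field rather than send bad data.
    printf '{"label": "%s"}\n' "$label"
  else
    printf '{"label": "%s", "number": %s}\n' "$label" "$number"
  fi
}

build_doc "apartment 1A"

# To index the result (assumes an index whose mapping declares "label"
# as a string and "number" as an integer):
# build_doc "apartment 1A" | curl -XPOST "localhost:9200/address/housenumber/1?pretty=true" -d @-
```

Leaving the numeric field out when no digits are found avoids the NumberFormatException shown earlier, at the cost of those documents not matching any range filter.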

I'm guessing that requesting this functionality is probably a big no-no as it sounds like a fairly major change to how the numeric types work.

thanks again


(David Pilato) #4

You can read:

You can think about it like what you have with a database.

If you define a field to be an INTEGER, you can't really insert a String into it.

