Issues with custom mappings: language, stop word settings, and character replacement

ElasticSearch has been, and continues to be, a joy to use and explore. 100%
of problems have been mine, and I'm sure that my current issue will
continue that trend. But I'm finally stumped.

Up until now, I've successfully configured all string fields to be analyzed
and queried using snowball, and have disabled stop words. My initial
project stores names of people and businesses, and disables stop words
which just get in the way (for example, "A" is not a stop word in the name
"A J Foyt"). Default geo_point analysis and distance queries also worked
perfectly (and continue to work, so I've omitted their default mapping
configuration here).

My next step is to follow the recommended practice of explicitly defining
the mappings for each field in a specified type. I've addressed most of the
problems I had and gotten things to work almost the way I want them to.
Even geo_point distance queries continue to work well: In my new mappings,
I've specified the "pin" field instead of my previous default of "location"
as a geo_point so that I am sure that ElasticSearch queries are using the
custom mappings and not the configured defaults.

Note that in Finnish, W and V are considered to be equivalent for matching
(but not sorting); the same is true for Å (A with a ring above it, in
case the UTF-8 doesn't show up in your browser) and O.
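To make the intended equivalence concrete, here is a tiny Python sketch of the normalization that a character-mapping filter would perform before tokenization. This is an illustration only, not ElasticSearch code, and the translation table is my own rendering of the Finnish rules above:

```python
# Simulate the Å=>O and W=>V character equivalences that a
# character-mapping filter would apply before tokenization.
FINNISH_FOLD = str.maketrans({
    "Å": "O", "å": "o",  # Å and O match as equals
    "W": "V", "w": "v",  # W and V match as equals
})

def fold(text: str) -> str:
    """Normalize text for matching (not for sorting)."""
    return text.translate(FINNISH_FOLD)

print(fold("Åke Virtanen"))                  # -> "Oke Virtanen"
print(fold("Wirtanen") == fold("Virtanen"))  # -> True
```

After folding, "Wirtanen" and "Virtanen" index and query identically, which is exactly the matching (but not sorting) behavior described above.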

Now I have three questions that prevent me from finishing this effort. In
general:

  1. How to disable stop words in a general HTTP PUT _mapping?
  2. How to specify character equivalences in a general HTTP PUT _mapping?
  3. For the query, I am guessing that the "default" analyzer is no longer
    proper, but am not sure exactly which one I should be using.

Here is an excerpt from the default configuration in my elasticsearch.yml
file. Stop words are disabled, and the snowball analyzer is used for
stemming. Again, this has worked well for setting my preferred default
string matching behavior for the initial project:

index:
  analysis:
    analyzer:
      # set stemming analyzer with no stop words as the default
      default:
        type: snowball
        stopwords: none
    filter:
      stopWordsFilter:
        type: stop
        stopwords: none

For testing the custom mappings, I created a small set of documents with an
assortment of field types (including geo_point for nearby queries). I am
adding them as the "person" type in the "sgen" (schema generation) index.
Here is a subset of them, with the action_and_meta_data lines required by
the _bulk API:

{ "create" : { "_index" : "sgen", "_type" : "person", "_id" : "5" } }
{ "uid" : 5, "cn" : "Åke Virtanen", "fn" : "Åke Virtanen", "sex" : "M", "married" : false, "pin" : [ -117.033702, 32.733451 ], "text" : [ "Born in Tampere", "Lives in Lemon Grove, CA" ] }
{ "create" : { "_index" : "sgen", "_type" : "person", "_id" : "6" } }
{ "uid" : 6, "cn" : "Åsa Virtanen", "fn" : "Åsa Virtanen", "sex" : "F", "married" : true, "pin" : [ -116.910522, 32.804101 ], "text" : [ "Born in Helsinki", "Lives in Granite Hills, CA" ] }
{ "create" : { "_index" : "sgen", "_type" : "person", "_id" : "7" } }
{ "uid" : 7, "cn" : "Debbie Sunny", "fn" : "Debbie Sunny", "sex" : "F", "married" : false, "pin" : [ -117.033702, 32.733451 ], "text" : [ "Born in Tangiers", "Lives in a bungalow in Lemon Grove, CA" ] }

After loading these documents, but still using the default configured
analyzers and mappings, a phrase query for "debby" finds the last record, as
does a phrase query for "living in a bungalow". That's good. But...

To dive into the custom mappings, I deleted the sgen index, recreated it,
put the following mapping into _all indices, and then loaded the (small)
sample documents. The "fn" field is the Finnish name but with Finnish
language mapping rules (that's what I intended, anyway). The "text" field
is set up with multiple values, and it's really awesomely cool that
ElasticSearch's "position_offset_gap" setting keeps phrase matches from
spilling across values unless a large enough slop is specified.
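If it helps to see why the gap blocks cross-value phrase matches, here is a rough Python model. It uses naive whitespace tokenization and tracks positions only; it is a sketch of the idea, not of real Lucene internals, and the sample values are simplified:

```python
def index_positions(values, gap=4):
    """Assign token positions across a multi-valued field, leaving a
    position_offset_gap-style gap between consecutive values."""
    positions = {}
    pos = 0
    for value in values:
        for term in value.lower().split():
            positions.setdefault(term, []).append(pos)
            pos += 1
        pos += gap  # the gap keeps phrases from spanning values
    return positions

def phrase_match(positions, phrase):
    """Slop-0 phrase match: terms must sit at consecutive positions."""
    terms = phrase.lower().split()
    return any(
        all(start + i in positions.get(t, []) for i, t in enumerate(terms))
        for start in positions.get(terms[0], [])
    )

pos = index_positions(["Born in Tampere", "Lives in Lemon Grove"])
print(phrase_match(pos, "born in tampere"))  # -> True (within one value)
print(phrase_match(pos, "tampere lives"))    # -> False (gap in between)
```

With a gap of 4, "tampere" ends at position 2 and "lives" starts at position 7, so a slop-0 phrase can never bridge the two values; only a slop of at least the gap size would.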

{
  "person" : {
    "properties" : {
      "cn" : {
        "type" : "string",
        "analyzer" : "snowball",
        "language" : "English",
        "stopwords" : "none"
      },
      "fn" : {
        "type" : "string",
        "analyzer" : "snowball",
        "language" : "Finnish",
        "stopwords" : "none"
      },
      "married" : {
        "type" : "boolean"
      },
      "pin" : {
        "type" : "geo_point",
        "lat_lon" : true
      },
      "sex" : {
        "type" : "string",
        "analyzer" : "standard",
        "language" : "English",
        "stopwords" : "none"
      },
      "text" : {
        "type" : "string",
        "stopwords" : "none",
        "analyzer" : "snowball",
        "language" : "English",
        "position_offset_gap" : 4
      },
      "uid" : {
        "type" : "long"
      }
    }
  }
}

When I show the mappings for the sgen index, I get the following. I'm not
sure how much of what was omitted is intentionally left out, and how much
was ignored due to something I did wrong:

{
  "sgen" : {
    "person" : {
      "properties" : {
        "cn" : {
          "type" : "string",
          "analyzer" : "snowball"
        },
        "fn" : {
          "type" : "string",
          "analyzer" : "snowball"
        },
        "married" : {
          "type" : "boolean"
        },
        "pin" : {
          "type" : "geo_point",
          "lat_lon" : true
        },
        "sex" : {
          "type" : "string",
          "analyzer" : "standard"
        },
        "text" : {
          "type" : "string",
          "analyzer" : "snowball",
          "position_offset_gap" : 4,
          "search_quote_analyzer" : "snowball"
        },
        "uid" : {
          "type" : "long"
        }
      }
    }
  }
}

But now that phrase query no longer works, because it contains stop words.
And none of the queries against an individual stop word succeed either:

{
  "bool" : {
    "must" : {
      "match" : {
        "text" : {
          "query" : "living in a bungalow",
          "type" : "phrase",
          "analyzer" : "default",
          "slop" : 0
        }
      }
    }
  }
}

I am not sure how to disable stop words on a per-field basis.

I am not sure where to put the Finnish rules for character matching. The
examples show various snippets, but nothing all-inclusive or self-contained
within an HTTP PUT example.

I'm also not sure about the "analyzer" : "default" setting in my query (this
is generated by the Java API's toString method). I am guessing it should
match one of the "analyzer" names in the field's mapping definitions. But
while I'm here, I'd like a definitive answer instead of my wild guess!

Thanks in advance for any corrections and suggestions.

--

It's somewhat tricky. Let's start with your analyzer definitions:

index:
  analysis:
    analyzer:
      # set stemming analyzer with no stop words as the default
      default:
        type: snowball
        stopwords: none

What you did here is create an analyzer called "default" that is based on
the snowball analyzer but doesn't have any stopwords. So when a new field is
created dynamically, that field uses this default analyzer, which again is a
modified version of snowball. When you create a field using PUT mapping and
specify "snowball" as the analyzer, you are referring to the unmodified
snowball analyzer, with stopwords.

So, to fix your issue, you can either stop specifying "snowball" in the
mapping and use the "default" analyzer instead, or you can modify the
snowball analyzer itself by changing your setting to

index:
  analysis:
    analyzer:
      snowball:
        type: snowball
        stopwords: none

The first solution will work for all fields, including the "_all" field. The
second solution will work only for the fields that you explicitly add to the
mapping with the snowball analyzer.
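A toy illustration of the difference, in plain Python rather than ElasticSearch (the stop list here is a tiny assumption for demonstration, not the real Lucene English list):

```python
# Toy model: the stock "snowball" analyzer drops English stop words,
# while the stopword-free "default" analyzer keeps every token.
STOP = {"a", "an", "the", "in"}  # illustrative subset only

def analyze(text, stopwords=True):
    tokens = text.lower().split()
    return [t for t in tokens if not (stopwords and t in STOP)]

print(analyze("A J Foyt"))                   # stock behavior: ['j', 'foyt']
print(analyze("A J Foyt", stopwords=False))  # no stop words: ['a', 'j', 'foyt']
```

This is why the name "A J Foyt" loses its leading "A" under the unmodified snowball analyzer but survives intact under the stopword-free one.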

On Thursday, January 10, 2013 4:31:04 PM UTC-5, InquiringMind wrote:


--

Thanks, Igor!

It was tricky until I realized that I should customize the analyzer in the
analyzer's definition and not the mapping's definition. Your response
confirms that.

Now the issue I have is that I cannot seem to define a custom analyzer in
the index creation HTTP API and get my mappings to use any of those custom
analyzers.

My mappings can use the configured custom analyzers just fine. But I can't
seem to fine-tune the JSON to get ES to create analyzers that I can
reference from the API.

Note: This example incorrectly tries to set up the char_filter; that's been
addressed in your response to my other post, and I'll adjust it later. The
problem, though, is that if a field mapping references cn_analyzer instead
of my configured english_stemming, ES complains that it cannot find
cn_analyzer and fails to load the mappings:

{
  "analysis": {
    "char_filter": {
      "finnish_char_mapping": {
        "type": "mapping",
        "mappings": [
          "Å=>O",
          "å=>o",
          "W=>V",
          "w=>v"
        ]
      }
    },
    "analyzer": {
      "cn_analyzer": {
        "type": "snowball",
        "language": "English",
        "stopwords": "none"
      },
      "fn_analyzer": {
        "type": "snowball",
        "language": "Finnish",
        "filter": [
          "finnish_char_mapping"
        ],
        "stopwords": "none"
      },
      "text_analyzer": {
        "type": "snowball",
        "language": "English",
        "stopwords": "none"
      }
    }
  },
  "mappings": {
    "person": {
      "properties": {
        "cn": {
          "type": "string",
          "analyzer": "english_stemming"
        },
        "fn": {
          "type": "string",
          "analyzer": "finnish_stemming"
        },
        "married": {
          "type": "boolean"
        },
        "pin": {
          "type": "geo_point",
          "lat_lon": true
        },
        "sex": {
          "type": "string",
          "analyzer": "english_standard"
        },
        "text": {
          "type": "string",
          "analyzer": "english_stemming",
          "position_offset_gap": 4
        },
        "uid": {
          "type": "long"
        }
      }
    }
  }
}

On Monday, January 14, 2013 7:08:12 PM UTC-5, Igor Motov wrote:


--

You need to wrap the "analysis" section in {"settings": {"index": {...}}}.
So it should be like this:

{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "finnish_char_mapping": {
            "type": "mapping",
            "mappings": [
              "Å=>O",
              "å=>o",
              "W=>V",
              "w=>v"
            ]
          }
        },
        "analyzer": {
          "cn_analyzer": {
            "type": "snowball",
            "language": "English",
            "stopwords": "_none_"
          },
          "fn_analyzer": {
            "type": "snowball",
            "language": "Finnish",
            "filter": [
              "finnish_char_mapping"
            ],
            "stopwords": "_none_"
          },
          "text_analyzer": {
            "type": "snowball",
            "language": "English",
            "stopwords": "_none_"
          }
        }
      }
    }
  },
  "mappings": {
    "person": {
      "properties": {
        "cn": {
          "type": "string",
          "analyzer": "english_stemming"
        },
        "fn": {
          "type": "string",
          "analyzer": "finnish_stemming"
        },
        "married": {
          "type": "boolean"
        },
        "pin": {
          "type": "geo_point",
          "lat_lon": true
        },
        "sex": {
          "type": "string",
          "analyzer": "english_standard"
        },
        "text": {
          "type": "string",
          "analyzer": "english_stemming",
          "position_offset_gap": 4
        },
        "uid": {
          "type": "long"
        }
      }
    }
  }
}

On Tuesday, January 15, 2013 12:01:43 PM UTC-5, InquiringMind wrote:


--

That's it! Thank you so much for all of your patience and help!

On Tuesday, January 15, 2013 4:48:51 PM UTC-5, Igor Motov wrote:


--