ElasticSearch has been, and continues to be, a joy to use and explore. 100%
of the problems so far have been mine, and I'm sure my current issue will
continue that trend. But I'm finally stumped.
Up until now, I've successfully configured all string fields to be analyzed
and queried using snowball, with stop words disabled. My initial project
stores names of people and businesses, where stop words just get in the way
(for example, "A" is not a stop word in the name "A J Foyt"). Default
geo_point analysis and distance queries also worked perfectly (and continue
to work, so I've omitted their default mapping configuration here).
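For reference, the kind of distance query I mean looks roughly like the
following; the field name "location" is my old default, the distance and
coordinates are just placeholders, and I may not have every detail of the
filtered-query form exactly right:

{
  "query" : {
    "filtered" : {
      "query" : { "match_all" : {} },
      "filter" : {
        "geo_distance" : {
          "distance" : "10km",
          "location" : [ -117.033702, 32.733451 ]
        }
      }
    }
  }
}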
My next step is to follow the recommended practice of explicitly defining
the mappings for each field in a specified type. I've addressed most of the
problems I had and gotten things to work almost the way I want them to.
Even geo_point distance queries continue to work well: in my new mappings,
I've made "pin" the geo_point field instead of my previous default of
"location", so that I can be sure ElasticSearch queries are using the custom
mappings and not the configured defaults.
Note that in Finnish, W and V are considered to be equivalent for matching
(but not sorting); the same is true for Å (A with a ring above it, in case
the UTF-8 doesn't show up in your browser) and O.
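My best guess so far is that this belongs in the index analysis settings as
a "mapping" char_filter feeding a custom analyzer, something like the sketch
below. The names "finnish_equiv", "finnish_snowball", and "finnish_name" are
just ones I made up, and I'm not certain this is the right place for it (my
understanding is that, since it only affects the analyzed tokens, sorting on
a separate not_analyzed field would be unaffected):

{
  "analysis" : {
    "char_filter" : {
      "finnish_equiv" : {
        "type" : "mapping",
        "mappings" : [ "w => v", "W => V", "å => o", "Å => O" ]
      }
    },
    "filter" : {
      "finnish_snowball" : {
        "type" : "snowball",
        "language" : "Finnish"
      }
    },
    "analyzer" : {
      "finnish_name" : {
        "type" : "custom",
        "tokenizer" : "standard",
        "char_filter" : [ "finnish_equiv" ],
        "filter" : [ "lowercase", "finnish_snowball" ]
      }
    }
  }
}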
Now I have three questions that are preventing me from finishing this
effort. In general:
- How do I disable stop words in a general HTTP PUT _mapping?
- How do I specify character equivalences in a general HTTP PUT _mapping?
- For the query, I am guessing that the "default" analyzer is no longer
appropriate, but I am not sure exactly which one I should be using.
Here is an excerpt from the default configuration in my elasticsearch.yml
file. Stop words are disabled, and the snowball analyzer is used for
stemming. Again, this has worked well for setting my preferred default
string matching behavior for the initial project:
index:
  analysis:
    analyzer:
      # set stemming analyzer with no stop words as the default
      default:
        type: snowball
        stopwords: none
    filter:
      stopWordsFilter:
        type: stop
        stopwords: none
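As a sanity check that the default analyzer really keeps words like "A", I
run the index _analyze API against it with something along these lines (the
index name "people" is just a stand-in for my initial project's index):

curl -XGET 'http://localhost:9200/people/_analyze?text=A+J+Foyt&pretty'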
For testing the custom mappings, I created a small set of documents with an
assortment of field types (including geo_point for nearby queries). I am
adding them as the "person" type in the "sgen" (schema generation) index.
Here is a subset of them, with the action_and_meta_data lines required by
the _bulk API:
{ "create" : { "_index" : "sgen", "_type" : "person", "_id" : "5" } }
{ "uid" : 5, "cn" : "Åke Virtanen", "fn" : "Åke Virtanen", "sex" : "M",
"married" : false, "pin" : [ -117.033702, 32.733451 ], "text" : [ "Born in
Tampere", "Lives in Lemon Grove, CA" ] }
{ "create" : { "_index" : "sgen", "_type" : "person", "_id" : "6" } }
{ "uid" : 6, "cn" : "Åsa Virtanen", "fn" : "Åsa Virtanen", "sex" : "F",
"married" : true, "pin" : [ -116.910522, 32.804101 ], "text" : [ "Born in
Helsinki", "Lives in Granite Hills, CA" ] }
{ "create" : { "_index" : "sgen", "_type" : "person", "_id" : "7" } }
{ "uid" : 7, "cn" : "Debbie Sunny", "fn" : "Debbie Sunny", "sex" : "F",
"married" : false, "pin" : [ -117.033702, 32.733451 ], "text" : [ "Born in
Tangiers", "Lives in a bungalow in Lemon Grove, CA" ] }
After loading these documents, but still using the default configured
analyzers and mappings, a phrase query for "debby" finds the last record, as
does a phrase query for "living in a bungalow". That's good. But...
To dive into the custom mappings, I deleted the sgen index, recreated it,
put the following mapping into _all indices, and then loaded the (small)
sample documents. The "fn" field holds the Finnish name and is meant to use
Finnish-language mapping rules (that's what I intended, anyway). The "text"
field is set up with multiple values, and it's really awesomely cool that
ElasticSearch's "position_offset_gap" setting keeps phrase matches from
spilling across values unless a large enough slop is specified.
{
  "person" : {
    "properties" : {
      "cn" : {
        "type" : "string",
        "analyzer" : "snowball",
        "language" : "English",
        "stopwords" : "none"
      },
      "fn" : {
        "type" : "string",
        "analyzer" : "snowball",
        "language" : "Finnish",
        "stopwords" : "none"
      },
      "married" : {
        "type" : "boolean"
      },
      "pin" : {
        "type" : "geo_point",
        "lat_lon" : true
      },
      "sex" : {
        "type" : "string",
        "analyzer" : "standard",
        "language" : "English",
        "stopwords" : "none"
      },
      "text" : {
        "type" : "string",
        "stopwords" : "none",
        "analyzer" : "snowball",
        "language" : "English",
        "position_offset_gap" : 4
      },
      "uid" : {
        "type" : "long"
      }
    }
  }
}
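To see the position_offset_gap behavior mentioned above, I tried a phrase
that straddles two of the "text" values: with slop 0 it does not match, and
it only starts matching once the slop is roughly as large as the gap
(that's my understanding of it, anyway; the exact threshold may be off).
Something like:

{
  "match" : {
    "text" : {
      "query" : "Tangiers Lives",
      "type" : "phrase",
      "slop" : 5
    }
  }
}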
When I show the mappings for the sgen index, I get the output below. I'm not
sure how much of what is missing was intentionally omitted and how much was
ignored because of something I did wrong.
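(I'm retrieving the mappings with a request along these lines; the response
follows:)

curl -XGET 'http://localhost:9200/sgen/_mapping?pretty'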
{
  "sgen" : {
    "person" : {
      "properties" : {
        "cn" : {
          "type" : "string",
          "analyzer" : "snowball"
        },
        "fn" : {
          "type" : "string",
          "analyzer" : "snowball"
        },
        "married" : {
          "type" : "boolean"
        },
        "pin" : {
          "type" : "geo_point",
          "lat_lon" : true
        },
        "sex" : {
          "type" : "string",
          "analyzer" : "standard"
        },
        "text" : {
          "type" : "string",
          "analyzer" : "snowball",
          "position_offset_gap" : 4,
          "search_quote_analyzer" : "snowball"
        },
        "uid" : {
          "type" : "long"
        }
      }
    }
  }
}
But now that phrase query no longer works, because it contains stop words.
And none of the queries against an individual stop word succeed either:
{
  "bool" : {
    "must" : {
      "match" : {
        "text" : {
          "query" : "living in a bungalow",
          "type" : "phrase",
          "analyzer" : "default",
          "slop" : 0
        }
      }
    }
  }
}
I am not sure how to disable stop words on a per-field basis.
I am not sure where to put the Finnish rules for character matching. The
documentation shows various snippets, but nothing that is all-inclusive or
self-contained within a single HTTP PUT example.
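To make the question concrete, here is my best guess at what a
self-contained request might look like: create the index with an analyzer
that has stop words disabled (the character-equivalence settings sketched
earlier would go in the same "analysis" block), and then reference that
analyzer from the individual fields. The analyzer name "english_nostop" is
made up, and I've seen both "none" and "_none_" as the stop-word value in
examples, so I don't know which is correct here:

curl -XPUT 'http://localhost:9200/sgen' -d '{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "english_nostop" : {
          "type" : "snowball",
          "language" : "English",
          "stopwords" : "_none_"
        }
      }
    }
  },
  "mappings" : {
    "person" : {
      "properties" : {
        "text" : {
          "type" : "string",
          "analyzer" : "english_nostop",
          "position_offset_gap" : 4
        }
      }
    }
  }
}'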
I'm also not sure about the "analyzer" : "default" setting in my query (this
is being generated by the Java API's toString method). I am guessing that it
should match the analyzer named in the field's mapping definition, but while
I'm here, I'd like a definitive answer instead of my wild guess!
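In other words, I suspect the query should either omit the analyzer entirely
(so the analyzer from the field's mapping is used) or name that analyzer
explicitly, roughly like this, but I'd appreciate confirmation:

{
  "bool" : {
    "must" : {
      "match" : {
        "text" : {
          "query" : "living in a bungalow",
          "type" : "phrase",
          "slop" : 0
        }
      }
    }
  }
}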
Thanks in advance for any corrections and suggestions.