In general, I prefer to create the settings and mappings when I create the
index. Because they are generated from some Java classes I wrote, they are
easy to keep around and to specify when creating an index (often) or when
updating an index's settings and mappings (rarely).
I've found that there really is no one template configuration that makes
sense for me. But because Elasticsearch is just so freakin' amazingly
kick-butt awesome, it's just so very easy to keep many wildly different
indices around: Production mirror, production enhancement, different Big
Data tests, schema creation and update test, analysis tests, and so on.
Even all of them on my little ol' laptop. All active and open at once. Now
how cool is that?!?!
Some background information: The following settings and mappings are for a
very tiny index that I use to explore the analyzers and index creation and
update settings to support various languages. So for example, I define the
number of shards as 1 (to hold a whopping 16 records for testing) which
lets me verify that my Java code was able to set it to a value that is
different from the default of 5. (My initial "big data" index is created
with 16 shards to hold its 100M+ records; that's another great story for
another time.)
I also pass pretty-printed JSON strings to the Java API when creating and
updating an index. Since the API is largely undocumented, it helps
tremendously to keep the JSON easy to read and modify during the required
experimentation. Creating and updating indices is not something one does
100,000 times per second, so whatever overhead is involved in passing
pretty-printed JSON is not worth worrying about.
So first consider an initial set of filters, analyzers, and mappings for
the "person" and "place" types. I actually wrote some code to take a
simpler schema and automatically generate this and the other pieces. But
that's not important; what's important is what those pieces look like to
the Java API.
In particular, the following JSON string can be passed into the Java API's
CreateIndexRequestBuilder.setSource(String) method:
{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "refresh_interval" : "2s",
      "analysis" : {
        "char_filter" : {
          "finnish_char_mapper" : {
            "type" : "mapping",
            "mappings" : [ "Å=>O", "å=>o", "W=>V", "w=>v" ]
          }
        },
        "filter" : {
          "english_snowball_filter" : {
            "type" : "snowball",
            "language" : "English"
          },
          "finnish_snowball_filter" : {
            "type" : "snowball",
            "language" : "Finnish"
          },
          "Arabic_stemming_filter" : {
            "type" : "stemmer",
            "name" : "Arabic"
          }
        },
        "analyzer" : {
          "english_stemming_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase", "english_snowball_filter" ]
          },
          "finnish_stemming_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase", "finnish_snowball_filter" ],
            "char_filter" : [ "finnish_char_mapper" ]
          },
          "arabic_stemming_Arabic_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase", "Arabic_stemming_filter" ]
          },
          "english_standard_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase" ]
          }
        }
      }
    }
  },
  "mappings" : {
    "person" : {
      "properties" : {
        "cn" : {
          "type" : "string",
          "analyzer" : "english_stemming_analyzer"
        },
        "fn" : {
          "type" : "string",
          "analyzer" : "finnish_stemming_analyzer"
        },
        "an" : {
          "type" : "string",
          "analyzer" : "arabic_stemming_Arabic_analyzer"
        },
        "sex" : {
          "type" : "string",
          "analyzer" : "english_standard_analyzer"
        },
        "married" : {
          "type" : "boolean"
        },
        "pin" : {
          "type" : "geo_point",
          "lat_lon" : true
        },
        "text" : {
          "type" : "string",
          "analyzer" : "english_stemming_analyzer",
          "position_offset_gap" : 4
        },
        "uid" : {
          "type" : "long"
        }
      }
    },
    "place" : {
      "properties" : {
        "city" : {
          "type" : "string",
          "analyzer" : "english_stemming_analyzer"
        },
        "pin" : {
          "type" : "geo_point",
          "lat_lon" : true
        }
      }
    }
  }
}
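As a rough sketch, the call that consumes that string might look like the
following. This is only an outline, not code from the example above: it
assumes an already-connected Client named client and a placeholder index
name test_index, both of which are my own names.

```java
// Sketch only: "client" is assumed to be an already-connected
// org.elasticsearch.client.Client, and "test_index" is a placeholder name.
String createJson = /* the pretty-printed JSON string shown above */ "...";

client.admin().indices()
      .prepareCreate("test_index")
      .setSource(createJson)   // settings and mappings in one request
      .execute().actionGet();
```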
Here's the updated version of those settings and mappings if I were to
create a new index that includes the new "thing" type, which also contains
the new "tn" field that is analyzed with the Russian stemming rules of the
snowball filter.
{
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "refresh_interval" : "2s",
      "analysis" : {
        "char_filter" : {
          "finnish_char_mapper" : {
            "type" : "mapping",
            "mappings" : [ "Å=>O", "å=>o", "W=>V", "w=>v" ]
          }
        },
        "filter" : {
          "english_snowball_filter" : {
            "type" : "snowball",
            "language" : "English"
          },
          "russian_snowball_filter" : {
            "type" : "snowball",
            "language" : "Russian"
          },
          "finnish_snowball_filter" : {
            "type" : "snowball",
            "language" : "Finnish"
          },
          "Arabic_stemming_filter" : {
            "type" : "stemmer",
            "name" : "Arabic"
          }
        },
        "analyzer" : {
          "english_stemming_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase", "english_snowball_filter" ]
          },
          "russian_stemming_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase", "russian_snowball_filter" ]
          },
          "finnish_stemming_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase", "finnish_snowball_filter" ],
            "char_filter" : [ "finnish_char_mapper" ]
          },
          "arabic_stemming_Arabic_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase", "Arabic_stemming_filter" ]
          },
          "english_standard_analyzer" : {
            "type" : "custom",
            "tokenizer" : "standard",
            "filter" : [ "standard", "lowercase" ]
          }
        }
      }
    }
  },
  "mappings" : {
    "person" : {
      "properties" : {
        "cn" : {
          "type" : "string",
          "analyzer" : "english_stemming_analyzer"
        },
        "fn" : {
          "type" : "string",
          "analyzer" : "finnish_stemming_analyzer"
        },
        "an" : {
          "type" : "string",
          "analyzer" : "arabic_stemming_Arabic_analyzer"
        },
        "sex" : {
          "type" : "string",
          "analyzer" : "english_standard_analyzer"
        },
        "married" : {
          "type" : "boolean"
        },
        "pin" : {
          "type" : "geo_point",
          "lat_lon" : true
        },
        "text" : {
          "type" : "string",
          "analyzer" : "english_stemming_analyzer",
          "position_offset_gap" : 4
        },
        "uid" : {
          "type" : "long"
        }
      }
    },
    "place" : {
      "properties" : {
        "city" : {
          "type" : "string",
          "analyzer" : "english_stemming_analyzer"
        },
        "pin" : {
          "type" : "geo_point",
          "lat_lon" : true
        }
      }
    },
    "thing" : {
      "properties" : {
        "tn" : {
          "type" : "string",
          "analyzer" : "russian_stemming_analyzer"
        },
        "city" : {
          "type" : "string",
          "analyzer" : "english_stemming_analyzer"
        },
        "text" : {
          "type" : "string",
          "analyzer" : "english_stemming_analyzer",
          "position_offset_gap" : 4
        },
        "pin" : {
          "type" : "geo_point",
          "lat_lon" : true
        }
      }
    }
  }
}
But the JSON above is just for reference: what if you want to update the
existing index from the initial settings and mappings to the compatible
superset shown in the second example?
An existing index must be updated in parts. There apparently is no
UpdateIndexRequestBuilder with a companion setSource(String) method.
Instead, the settings (including the filters and analyzers) must be updated
in one call, and then the mappings for each type must be individually
updated in separate calls. And to be safe, the index must first be closed
before making any updates, and then it must be opened again after the
updates are made.
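In code, that close-then-update-then-open bracketing might look roughly
like this. This is only a sketch: it assumes a connected Client named
client and a placeholder index name test_index, both names being my own.

```java
// Sketch: bracket the updates with a close and an open.
// "client" is an assumed, already-connected org.elasticsearch.client.Client.
client.admin().indices().prepareClose("test_index").execute().actionGet();

// ... update the settings in one call, then each type's mapping ...

client.admin().indices().prepareOpen("test_index").execute().actionGet();
```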
So with that in mind, here is the pretty-printed JSON string of the new
index settings, filters, and analyzers. One thing missing is
number_of_shards: even though it isn't being changed, it is still an error
to specify it on an update. Note also that this update merely adds the
"russian_stemming_analyzer" and the "russian_snowball_filter" that it
references.
The following string is meant to be passed directly into the Java API's
UpdateSettingsRequestBuilder.setSettings(String) method:
{
  "index" : {
    "refresh_interval" : "2s",
    "analysis" : {
      "char_filter" : {
        "finnish_char_mapper" : {
          "type" : "mapping",
          "mappings" : [ "Å=>O", "å=>o", "W=>V", "w=>v" ]
        }
      },
      "filter" : {
        "english_snowball_filter" : {
          "type" : "snowball",
          "language" : "English"
        },
        "russian_snowball_filter" : {
          "type" : "snowball",
          "language" : "Russian"
        },
        "finnish_snowball_filter" : {
          "type" : "snowball",
          "language" : "Finnish"
        },
        "Arabic_stemming_filter" : {
          "type" : "stemmer",
          "name" : "Arabic"
        }
      },
      "analyzer" : {
        "english_stemming_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "standard", "lowercase", "english_snowball_filter" ]
        },
        "russian_stemming_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "standard", "lowercase", "russian_snowball_filter" ]
        },
        "finnish_stemming_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "standard", "lowercase", "finnish_snowball_filter" ],
          "char_filter" : [ "finnish_char_mapper" ]
        },
        "arabic_stemming_Arabic_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "standard", "lowercase", "Arabic_stemming_filter" ]
        },
        "english_standard_analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "standard", "lowercase" ]
        }
      }
    }
  }
}
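A sketch of the corresponding Java call (again assuming a connected Client
named client and the placeholder index name test_index; settingsJson stands
for the string above):

```java
// Sketch: push the settings update while the index is closed.
// "client" and "test_index" are assumed placeholder names;
// settingsJson holds the pretty-printed settings string shown above.
client.admin().indices()
      .prepareUpdateSettings("test_index")
      .setSettings(settingsJson)
      .execute().actionGet();
```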
Now (while the index is still closed), the mappings must be updated. Since
my example only adds the "thing" type and does not modify the "person" or
"place" types, only the mapping for the "thing" type is shown below. But
you can certainly iterate across all of your types and update their
mappings; Elasticsearch will verify whether the changes are acceptable (and
doesn't seem to mind if there are no changes at all).
So for the new "thing" type with its added "tn" (thing name) field, here is
the pretty-printed JSON string that is meant to be passed into the Java
API's PutMappingRequestBuilder.setSource(String) method:
{
  "thing" : {
    "properties" : {
      "tn" : {
        "type" : "string",
        "analyzer" : "russian_stemming_analyzer"
      },
      "city" : {
        "type" : "string",
        "analyzer" : "english_stemming_analyzer"
      },
      "text" : {
        "type" : "string",
        "analyzer" : "english_stemming_analyzer",
        "position_offset_gap" : 4
      },
      "pin" : {
        "type" : "geo_point",
        "lat_lon" : true
      }
    }
  }
}
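And a matching sketch of the mapping update call (same assumed client and
test_index placeholders; thingMappingJson stands for the string above):

```java
// Sketch: update the "thing" type's mapping while the index is closed.
// "client" and "test_index" are assumed placeholder names;
// thingMappingJson holds the pretty-printed mapping string shown above.
client.admin().indices()
      .preparePutMapping("test_index")
      .setType("thing")
      .setSource(thingMappingJson)
      .execute().actionGet();
```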
These strings can also be passed through the HTTP APIs, so it's easy to
test using the HTTP APIs and then write the Java code to match.
I hope this helps!
On Thursday, February 14, 2013 4:55:31 AM UTC-5, Per Ekman wrote:
Hi
I'm trying to figure out how to best configure my cluster as far as
settings for the index are concerned.
I want to have a pretty generic setup, something like this:
{
  "settings":{
    "index.number_of_shards":"5",
    "index.number_of_replicas":"1",
    "index.version.created":"200499",
    "index.analysis.analyzer.default.type":"standard",
    "index":{
      "analysis":{
        "analyzer":{
          "standard":{
            "type":"standard"
          }
        }
      }
    }
  }
}
Into this I will add language-specific analyzers that reference stop-word
files for different languages. Or do I have to do that? Can I create an
analyzer that could be something like this:
"analyzer":{
"standard_${type)":{
"type":"standard"
}
}
What I'm getting at is that I have a mapping that will be the same for
different language-dependent data types, each using a language-dependent
analyzer, and I want to minimize the mapping and settings configuration.
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.