"dynamic" settings/mappings references

Hi

I'm trying to figure out how to best configure my cluster as far as
settings for the index are concerned.

I want to have a pretty generic setup, something like this:

{
    "settings":{
        "index.number_of_shards":"5",
        "index.number_of_replicas":"1",
        "index.version.created":"200499",
        "index.analysis.analyzer.default.type":"standard",
        "index":{
            "analysis":{
                "analyzer":{
                    "standard":{
                        "type":"standard"
                    }
                }
            }
        }
    }
}

Into this I will add language-specific analyzers that reference
stop-word files for different languages. Or do I have to do that? Can I
create an analyzer that could be something like this:

            "analyzer":{
                "standard_${type}":{
                    "type":"standard"
                }
            }

What I'm getting at is that I have a mapping that will be the same for
different language-dependent data types, each using a language-dependent
analyzer, and I want to minimize the mapping and settings configuration.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I could of course make templates and handle this through java code at
indexing time, but I want to know if there is something in ES that supports
this kind of thing?

On Thursday, February 14, 2013 10:55:31 AM UTC+1, Per Ekman wrote:

[...]

In general, I prefer to create the settings and mappings when I create the
index. With some Java classes I wrote, these are easy to keep around and to
specify when creating an index (often) or updating the settings and
mappings of an existing index (rarely).

I've found that there really is no one template configuration that makes
sense for me. But because Elasticsearch is just so freakin' amazingly
kick-butt awesome, it's just so very easy to keep many wildly different
indices around: Production mirror, production enhancement, different Big
Data tests, schema creation and update test, analysis tests, and so on.
Even all of them on my little ol' laptop. All active and open at once. Now
how cool is that?!?!

Some background information: The following settings and mappings are for a
very tiny index that I use to explore the analyzers and index creation and
update settings to support various languages. So for example, I define the
number of shards as 1 (to hold a whopping 16 records for testing) which
lets me verify that my Java code was able to set it to a value that is
different from the default of 5. (My initial "big data" index is created
with 16 shards to hold its 100M+ records; that's another great story for
another time.)

I also pass pretty-printed JSON strings to the Java API when creating and
updating an index. Since the API is largely undocumented, it helps
tremendously to make the JSON as easy as possible to see and modify during
the required experimentation. Creating and updating indices is not
something one does 100,000 times per second, so whatever overhead is
involved with passing in pretty-printed JSON is not worth worrying about.

So first consider an initial set of filters, analyzers, and mappings for
the "person" and "place" types. I actually wrote some code to take a
simpler schema and automatically generate this and the other pieces. But
that's not important; what's important is what those pieces look like to
the Java API.

In particular, the following JSON string can be passed into the Java API's
CreateIndexRequestBuilder.setSource(String) method:

{
"settings" : {
"index" : {
"number_of_shards" : 1,
"refresh_interval" : "2s",
"analysis" : {
"char_filter" : {
"finnish_char_mapper" : {
"type" : "mapping",
"mappings" : [ "Å=>O", "å=>o", "W=>V", "w=>v" ]
}
},
"filter" : {
"english_snowball_filter" : {
"type" : "snowball",
"language" : "English"
},
"finnish_snowball_filter" : {
"type" : "snowball",
"language" : "Finnish"
},
"Arabic_stemming_filter" : {
"type" : "stemmer",
"name" : "Arabic"
}
},
"analyzer" : {
"english_stemming_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "english_snowball_filter"
]
},
"finnish_stemming_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "finnish_snowball_filter"
],
"char_filter" : [ "finnish_char_mapper" ]
},
"arabic_stemming_Arabic_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "Arabic_stemming_filter" ]
},
"english_standard_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase" ]
}
}
}
}
},
"mappings" : {
"person" : {
"properties" : {
"cn" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer"
},
"fn" : {
"type" : "string",
"analyzer" : "finnish_stemming_analyzer"
},
"an" : {
"type" : "string",
"analyzer" : "arabic_stemming_Arabic_analyzer"
},
"sex" : {
"type" : "string",
"analyzer" : "english_standard_analyzer"
},
"married" : {
"type" : "boolean"
},
"pin" : {
"type" : "geo_point",
"lat_lon" : true
},
"text" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer",
"position_offset_gap" : 4
},
"uid" : {
"type" : "long"
}
}
},
"place" : {
"properties" : {
"city" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer"
},
"pin" : {
"type" : "geo_point",
"lat_lon" : true
}
}
}
}
}

Here's the updated version of those settings and mappings if I were to
create a new index that includes the new "thing" type, which also contains
the new "tn" field that is analyzed with the Russian stemming rules of the
snowball filter.

{
"settings" : {
"index" : {
"number_of_shards" : 1,
"refresh_interval" : "2s",
"analysis" : {
"char_filter" : {
"finnish_char_mapper" : {
"type" : "mapping",
"mappings" : [ "Å=>O", "å=>o", "W=>V", "w=>v" ]
}
},
"filter" : {
"english_snowball_filter" : {
"type" : "snowball",
"language" : "English"
},
"russian_snowball_filter" : {
"type" : "snowball",
"language" : "Russian"
},
"finnish_snowball_filter" : {
"type" : "snowball",
"language" : "Finnish"
},
"Arabic_stemming_filter" : {
"type" : "stemmer",
"name" : "Arabic"
}
},
"analyzer" : {
"english_stemming_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "english_snowball_filter"
]
},
"russian_stemming_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "russian_snowball_filter"
]
},
"finnish_stemming_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "finnish_snowball_filter"
],
"char_filter" : [ "finnish_char_mapper" ]
},
"arabic_stemming_Arabic_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "Arabic_stemming_filter" ]
},
"english_standard_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase" ]
}
}
}
}
},
"mappings" : {
"person" : {
"properties" : {
"cn" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer"
},
"fn" : {
"type" : "string",
"analyzer" : "finnish_stemming_analyzer"
},
"an" : {
"type" : "string",
"analyzer" : "arabic_stemming_Arabic_analyzer"
},
"sex" : {
"type" : "string",
"analyzer" : "english_standard_analyzer"
},
"married" : {
"type" : "boolean"
},
"pin" : {
"type" : "geo_point",
"lat_lon" : true
},
"text" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer",
"position_offset_gap" : 4
},
"uid" : {
"type" : "long"
}
}
},
"place" : {
"properties" : {
"city" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer"
},
"pin" : {
"type" : "geo_point",
"lat_lon" : true
}
}
},
"thing" : {
"properties" : {
"tn" : {
"type" : "string",
"analyzer" : "russian_stemming_analyzer"
},
"city" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer"
},
"text" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer",
"position_offset_gap" : 4
},
"pin" : {
"type" : "geo_point",
"lat_lon" : true
}
}
}
}
}

But the JSON above is just for reference; what if you want to update the
existing index from the initial settings and mappings with the settings and
mappings in the second example's new compatible super-set?

An existing index must be updated in parts. There apparently is no
UpdateIndexRequestBuilder with a companion setSource(String) method.
Instead, the settings (including the filters and analyzers) must be updated
in one call, and then the mappings for each type must be individually
updated in separate calls. And, to be safe, the index must first be closed
before making any updates and then opened again after the updates are
made.

So with that in mind, here is the pretty-printed JSON string of the new
index settings, filters, and analyzers. One thing missing is the
number_of_shards; even though it isn't being changed it's still an error to
specify it on an update. Also this update only adds the
"russian_stemming_analyzer" and the "russian_snowball_filter" that it
references.
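As a minimal sketch of the number_of_shards point above: if the "index"
settings are held as a map before being serialized to JSON, the update
version can be derived by copying the map and dropping that one key. The
class and method names here are hypothetical, and how the map is turned
into the JSON string is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the Elasticsearch API.
class UpdatableSettings {
    // Returns a shallow copy of the "index" settings without number_of_shards.
    // The input map is left untouched; nested maps such as "analysis" are
    // shared with the copy rather than deep-copied.
    static Map<String, Object> forUpdate(Map<String, Object> indexSettings) {
        Map<String, Object> copy = new LinkedHashMap<>(indexSettings);
        copy.remove("number_of_shards");
        return copy;
    }

    public static void main(String[] args) {
        Map<String, Object> index = new LinkedHashMap<>();
        index.put("number_of_shards", 1);
        index.put("refresh_interval", "2s");
        System.out.println(forUpdate(index));
    }
}
```

This keeps a single source for the create-time settings, so the update
settings cannot drift away from them by accident.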

The following string is meant to be passed directly into the Java API's
UpdateSettingsRequestBuilder.setSettings(String) method:

{
"index" : {
"refresh_interval" : "2s",
"analysis" : {
"char_filter" : {
"finnish_char_mapper" : {
"type" : "mapping",
"mappings" : [ "Å=>O", "å=>o", "W=>V", "w=>v" ]
}
},
"filter" : {
"english_snowball_filter" : {
"type" : "snowball",
"language" : "English"
},
"russian_snowball_filter" : {
"type" : "snowball",
"language" : "Russian"
},
"finnish_snowball_filter" : {
"type" : "snowball",
"language" : "Finnish"
},
"Arabic_stemming_filter" : {
"type" : "stemmer",
"name" : "Arabic"
}
},
"analyzer" : {
"english_stemming_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "english_snowball_filter" ]
},
"russian_stemming_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "russian_snowball_filter" ]
},
"finnish_stemming_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "finnish_snowball_filter" ],
"char_filter" : [ "finnish_char_mapper" ]
},
"arabic_stemming_Arabic_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "Arabic_stemming_filter" ]
},
"english_standard_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase" ]
}
}
}
}
}

Now (while the index is still closed), the mappings must be updated. Since
my example only adds the "thing" type and does not modify the "person" or
"place" types, only the mapping for the "thing" type is shown below. But
you can certainly iterate across all of your types and update their
mappings; Elasticsearch will verify that the changes are OK or not (and
seems to not mind if there are no changes at all).

So for the new "thing" type with its added "tn" (thing name) field, here is
the pretty-printed JSON string that is meant to be passed into the Java
API's PutMappingRequestBuilder.setSource(String) method:

{
"thing" : {
"properties" : {
"tn" : {
"type" : "string",
"analyzer" : "russian_stemming_analyzer"
},
"city" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer"
},
"text" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer",
"position_offset_gap" : 4
},
"pin" : {
"type" : "geo_point",
"lat_lon" : true
}
}
}
}

These strings can also be passed through the HTTP APIs, so it's easy to
test using the HTTP APIs and then write the Java code to match.
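To make the order of operations concrete, here is a rough sketch that only
lays out the close / update settings / put mapping / open sequence as a
list of HTTP calls. It does not send anything. The index name "test", the
class names, and the exact endpoint paths are my assumptions (the paths are
the ones I'd expect from the HTTP API of this era), so check them against
your version before relying on them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner; it builds the request list but performs no I/O.
class IndexUpdatePlan {
    // One HTTP call: method, path, and an optional JSON body (null if none).
    static final class Step {
        final String method;
        final String path;
        final String body;

        Step(String method, String path, String body) {
            this.method = method;
            this.path = path;
            this.body = body;
        }

        @Override
        public String toString() {
            return method + " " + path + (body == null ? "" : " <json body>");
        }
    }

    // Orders the calls as described in the thread:
    // close, settings, one mapping call per type, open.
    static List<Step> plan(String index, String settingsJson,
                           String[] types, String[] mappingJsons) {
        List<Step> steps = new ArrayList<>();
        steps.add(new Step("POST", "/" + index + "/_close", null));
        steps.add(new Step("PUT", "/" + index + "/_settings", settingsJson));
        for (int i = 0; i < types.length; i++) {
            steps.add(new Step("PUT",
                    "/" + index + "/" + types[i] + "/_mapping", mappingJsons[i]));
        }
        steps.add(new Step("POST", "/" + index + "/_open", null));
        return steps;
    }

    public static void main(String[] args) {
        String[] types = { "thing" };
        String[] mappings = { "{ ... }" };  // placeholder for the mapping JSON
        for (Step s : plan("test", "{ ... }", types, mappings)) {
            System.out.println(s);
        }
    }
}
```

Keeping the plan separate from the code that sends it makes it easy to try
the same steps by hand over HTTP first and then mirror them in the Java
API calls.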

I hope this helps!


Thanks. I think I get what you are saying.

The thing for me is that my mappings (for the types "swe" and "de") will be
exactly the same, except that some fields use different analysers (for
Swedish or German stop-words). I started out defining mappings and settings
in json-files, and it seems I would have to make several mapping files
that are essentially the same but use different analysers (at the same
place in the mapping, though).

I was looking to have one mapping, use it for several data types, and then
just reference the analysers in a smart way. I can do this by writing
programs or scripts that generate the mappings for each type (with the
correct analyser reference), but I wanted to know if there is something in
ES for this.

I can generate them using the XContentBuilder, but that feels like
programming a json-file and is not as readable, so I was just checking
whether it was possible.
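For reference, the generate-per-type approach mentioned above can stay
fairly readable if the mapping is kept as one JSON template and the
placeholder is filled in by plain string replacement. This is only a
sketch: the ${type} placeholder is handled entirely by this code, not by
ES, and the "title" field and the "standard_<type>" analyser naming scheme
are made up for illustration. The analysers named this way would still
have to be defined in the index settings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical generator: one mapping template, one output per type.
class MappingTemplate {
    // ${type} is replaced by the type name ("swe", "de", ...).
    // The field "title" and the analyser names are illustrative only.
    static final String TEMPLATE =
        "{\n" +
        "  \"${type}\" : {\n" +
        "    \"properties\" : {\n" +
        "      \"title\" : {\n" +
        "        \"type\" : \"string\",\n" +
        "        \"analyzer\" : \"standard_${type}\"\n" +
        "      }\n" +
        "    }\n" +
        "  }\n" +
        "}";

    // Returns the template with every ${type} placeholder filled in.
    // String.replace does literal (non-regex) replacement.
    static String mappingFor(String type) {
        return TEMPLATE.replace("${type}", type);
    }

    // Builds one mapping string per type, keyed by type name.
    static Map<String, String> mappingsFor(String... types) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String type : types) {
            result.put(type, mappingFor(type));
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : mappingsFor("swe", "de").entrySet()) {
            System.out.println(e.getKey() + " ->\n" + e.getValue());
        }
    }
}
```

In practice the template would probably live in a json-file next to the
settings, so it stays as readable as a hand-written mapping.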


The prettyPrint method makes the JSON readable. For instance:

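// Imports this fragment needs:
// import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
// import java.io.IOException;
// import org.elasticsearch.common.xcontent.XContentBuilder;

try
{
  // prettyPrint() makes the generated JSON indented and readable.
  XContentBuilder cb = jsonBuilder().prettyPrint();

  cb.startObject();          // root object

  cb.startObject("index");   // "index" : { ... }
  ...
  cb.endObject();            // closes "index"

  cb.endObject();            // closes the root object
  return cb.string();
}
catch (IOException e)
{
  // Fall back to an empty JSON object if the builder fails.
  return "{}";
}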

Also, my examples did not define any stopwords because I deal mostly with
names, and with names I don't view any initials or names as stop words. But
adding the "stop" filter and defining it should be relatively easy; my
examples just don't have one. (I suppose that could be added for
completeness, though.)

On Thursday, February 14, 2013 12:13:29 PM UTC-5, Per Ekman wrote:

[...]

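By the way, this example includes the stop filter for the "stext" field.
The additional parts needed to enable it are the "english_stop_filter"
filter, the "english_stemming_stop_analyzer" analyzer that uses it, and
the "stext" field mapping in the "thing" type.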

The names of analyzers and filters are automatically generated to be
unique, and also to be as descriptive as possible.

{
"settings" : {
"index" : {
"number_of_shards" : 1,
"refresh_interval" : "2s",
"analysis" : {
"char_filter" : {
"finnish_char_mapper" : {
"type" : "mapping",
"mappings" : [ "Å=>O", "å=>o", "W=>V", "w=>v" ]
}
},
"filter" : {
"english_snowball_filter" : {
"type" : "snowball",
"language" : "English"
},
"russian_snowball_filter" : {
"type" : "snowball",
"language" : "Russian"
},
"finnish_snowball_filter" : {
"type" : "snowball",
"language" : "Finnish"
},
"Arabic_stemming_filter" : {
"type" : "stemmer",
"name" : "Arabic"
},
"english_stop_filter" : {
"type" : "stop",
"stopwords" : "_english_"
}
},
"analyzer" : {
"english_stemming_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "english_snowball_filter"
]
},
"russian_stemming_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "russian_snowball_filter"
]
},
"finnish_stemming_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "finnish_snowball_filter"
],
"char_filter" : [ "finnish_char_mapper" ]
},
"arabic_stemming_Arabic_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "Arabic_stemming_filter" ]
},
"english_standard_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase" ]
},
"english_stemming_stop_analyzer" : {
"type" : "custom",
"tokenizer" : "standard",
"filter" : [ "standard", "lowercase", "english_stop_filter",
"english_snowball_filter" ]
}
}
}
}
},
"mappings" : {
"person" : {
"properties" : {
"cn" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer"
},
"fn" : {
"type" : "string",
"analyzer" : "finnish_stemming_analyzer"
},
"an" : {
"type" : "string",
"analyzer" : "arabic_stemming_Arabic_analyzer"
},
"sex" : {
"type" : "string",
"analyzer" : "english_standard_analyzer"
},
"married" : {
"type" : "boolean"
},
"pin" : {
"type" : "geo_point",
"lat_lon" : true
},
"text" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer",
"position_offset_gap" : 4
},
"uid" : {
"type" : "long"
}
}
},
"place" : {
"properties" : {
"city" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer"
},
"pin" : {
"type" : "geo_point",
"lat_lon" : true
}
}
},
"thing" : {
"properties" : {
"tn" : {
"type" : "string",
"analyzer" : "russian_stemming_analyzer"
},
"city" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer"
},
"text" : {
"type" : "string",
"analyzer" : "english_stemming_analyzer",
"position_offset_gap" : 4
},
"stext" : {
"type" : "string",
"analyzer" : "english_stemming_stop_analyzer",
"position_offset_gap" : 4
},
"pin" : {
"type" : "geo_point",
"lat_lon" : true
}
}
}
}
}
