Specifying analyzer language at insert time

Kosta · March 5, 2011, 2:28pm

I looked at the docs for mapping types, and as far as I can see it's
possible to configure analyzers for index fields either in JSON config
files, or by running the PUT command with the appropriate config on
Elasticsearch itself. In one of the examples I saw the following:

"my_analyzer" : {
    "type" : "snowball",
    "language" : "English"
}

Which basically hardcodes the language of a particular field to
English. However, in my case I know the language at insert time, so I
would like to specify the analyzer language dynamically with each
insert. Would something like this be possible? Thanks for any
suggestions in advance!

Kosta · March 5, 2011, 4:35pm

Just stumbled across this:

Mapper: Dynamic Template Support · Issue #397 · elastic/elasticsearch · GitHub
(kimchy · opened 02:55PM, closed 09:56PM, 01 Oct 10 UTC · >feature v0.12.0)

Dynamic templates allow you to define mapping templates that will be applied when fields / objects are introduced dynamically. For example, we might want all fields to be stored by default, or all `string` fields to be stored, or `string` fields to always be indexed as `multi_field`, once `analyzed` and once `not_analyzed`. Here is a simple example:

```
{
    "person" : {
        "dynamic_templates" : [
            {
                "template_1" : {
                    "match" : "multi*",
                    "mapping" : {
                        "type" : "multi_field",
                        "fields" : {
                            "{name}" : {"type": "{dynamic_type}", "index" : "analyzed", "store" : "yes"},
                            "org" : {"type": "{dynamic_type}", "index" : "not_analyzed", "store" : "yes"}
                        }
                    }
                }
            },
            {
                "template_2" : {
                    "match" : "*",
                    "match_mapping_type" : "string",
                    "mapping" : {
                        "type" : "string",
                        "index" : "not_analyzed"
                    }
                }
            }
        ]
    }
}
```

The above mapping will create a `multi_field` mapping for all field names starting with `multi`, and will map all `string` types to be `not_analyzed`.

The `dynamic_templates` section can be placed only on the root object, and it is applied to all inner objects / fields. Dynamic templates are named to allow for simple merge behavior: a new mapping with just a new template can be "put" and that template will be added or, if it has the same name, will replace the existing template.

The `match` option defines matching on the field name. An `unmatch` option is also available to exclude fields that do match on `match`. The `match_mapping_type` option controls whether the template is applied only to dynamic fields of the specified type (as guessed from the JSON format).

All matching uses the `simple` format, which allows `*` as a matching element and supports simple patterns such as `xxx*`, `*xxx`, `xxx*yyy` (with an arbitrary number of pattern types), as well as direct equality. `match_pattern` can be set to `regex` to allow regular-expression-based matching.

The `mapping` element provides the actual mapping definition. The `{name}` keyword can be used and will be replaced with the actual dynamic field name being introduced. `{dynamic_type}` (or `{dynamicType}`) can be used and will be replaced with the mapping derived from the field type (or the derived type, like `date`). Completely generic settings can also be applied; for example, to have all mappings stored, just set:

```
{
    "person" : {
        "dynamic_templates" : [
            {
                "store_generic" : {
                    "match" : "*",
                    "mapping" : {
                        "store" : "yes"
                    }
                }
            }
        ]
    }
}
```

I suppose I could have a field representing the document's source
language, and the analyzer language would be set based on that dynamic
field, but that's a bit hackish.


kimchy · March 6, 2011, 4:25am

You can also use the analyzer field mapping, but note that it controls the index analyzer for all fields in the doc, unless some are explicitly set.

That raises an interesting question that I have been thinking about a bit lately: how can a custom boost / analyzer / other setting be set for a field per document? It's a bit tricky, since I would like to maintain the "domain"-driven aspect of the JSON document, but it should still be allowed somehow. Still thinking about how best to provide that, maybe something like this:

{
    "my_field" : {
        "value" : "text here",
        "_analyzer" : "..."
    }
}

But then this loses the "pureness" aspect of the indexed doc. On the other hand, it's good to have that option.
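
For reference, the analyzer field mapping mentioned at the top of this reply can be sketched roughly as below. This assumes the root-level `_analyzer` mapping with a `path` setting, as documented for Elasticsearch releases of this era (later versions dropped it). The type name `tweet`, the field name `lang_analyzer` and the analyzer name `my_spanish_analyzer` are made up for illustration. The sketch only builds and prints the request bodies; it does not talk to a cluster.

```python
import json

# Hypothetical names: the "tweet" type and "lang_analyzer" field are illustrative.
# Assumption: the root-level `_analyzer` mapping reads the analyzer name from
# the document field given in `path`.
mapping = {
    "tweet": {
        "_analyzer": {"path": "lang_analyzer"}
    }
}


def make_doc(message, analyzer_name):
    """Return a document that carries the name of its index-time analyzer."""
    return {"message": message, "lang_analyzer": analyzer_name}


# "my_spanish_analyzer" stands in for an analyzer already configured on the index.
doc = make_doc("Me gusta Elastic Search!", "my_spanish_analyzer")

print(json.dumps(mapping, indent=2))
print(json.dumps(doc, indent=2))
```

The value stored in the field has to name an analyzer that is already configured for the index, and as noted above it applies to every field in the doc that has no analyzer of its own.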

Kosta · March 6, 2011, 3:32pm

Yes, I suppose having metadata attached to a source document like that
would make it both harder to parse and, as you mentioned, make it
unpredictable in terms of what can be expected in the field object.
What seems natural to me is extracting the boost/analyzer metadata
into a separate object within an inserted document. For example:

$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
    "tweet" : {
        "user" : "kimchy",
        "post_date" : "2009-11-15T14:12:12",
        "message" : "Me gusta Elastic Search!"
    },
    "properties" : {
        "message" : { "_analyzer" : { "type" : "snowball", "language" : "spanish" } }
    }
}'

That way the "source" of the message would stay intact, and additional
properties could be set/overridden for each inserted document.


kimchy · March 7, 2011, 4:26am

Interesting. What I tried to do with the notion of _source is to have the ability to always just get it (and possibly other specific fields, like _routing and _parent, which can be provided separately), and reindex the data. Once something is "out" of the _source, you lose this capability.

Another option is to be able to specify those values by referencing other elements in the _source. For example, specifying in the mapping that the analyzer is taken from _source.message.language (or something similar). This gets tricky with things like array elements, but for simple cases it can work nicely, I guess. It will come with an overhead though...
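
To make the idea concrete, here is a rough client-side sketch of what "take the analyzer from `_source.message.language`" could mean. It is not an Elasticsearch feature or API: `resolve_path`, `pick_analyzer`, the `ANALYZERS` table and the analyzer names are all hypothetical. It also shows why arrays get tricky: a path that crosses a list has no single answer, so this sketch simply falls back to a default.

```python
def resolve_path(source, dotted_path):
    """Follow a dotted path through nested dicts; return None if it cannot be resolved."""
    node = source
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None  # missing key, or a list/scalar in the way
        node = node[key]
    return node


# Hypothetical table from a language code to a configured analyzer name.
ANALYZERS = {"es": "snowball_spanish", "en": "snowball_english"}


def pick_analyzer(source, dotted_path, default="standard"):
    """Choose an analyzer name based on a value stored elsewhere in the document."""
    language = resolve_path(source, dotted_path)
    if not isinstance(language, str):
        return default
    return ANALYZERS.get(language, default)


doc = {"message": {"text": "Me gusta Elastic Search!", "language": "es"}}
print(pick_analyzer(doc, "message.language"))  # prints "snowball_spanish"
```

With `{"message": [{"language": "es"}]}` the same call returns the default, which is the array ambiguity in miniature.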

Jasper_van_Wanrooy_C · August 25, 2011, 3:37pm

Did this email conversation have any follow-up? I'm in the process of indexing our objects in several languages. As far as I understand, I need to set up a separate analyzer for each specific language instead of passing a language flag at index/search time. Is that correct?

Thanks, Jasper
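
One way to picture the setup described here (one analyzer per language, selected by field rather than by a per-request flag) is sketched below. This is an illustration, not a confirmed answer from the thread. The `snowball` type and its `language` setting come from the example in the first post; the language list, the `snowball_<code>` analyzer names, the `title_<code>` field naming scheme and the `item` type name are assumptions. Nothing is sent to a cluster; the sketch only builds the request bodies.

```python
import json

# Assumed language list; the values are used as the snowball `language` setting.
LANGUAGES = {"en": "English", "es": "Spanish", "nl": "Dutch"}


def build_settings(languages):
    """One named snowball analyzer per language."""
    return {
        "analysis": {
            "analyzer": {
                f"snowball_{code}": {"type": "snowball", "language": name}
                for code, name in languages.items()
            }
        }
    }


def build_mapping(type_name, base_field, languages):
    """One string field per language, each bound to its own analyzer."""
    return {
        type_name: {
            "properties": {
                f"{base_field}_{code}": {
                    "type": "string",
                    "analyzer": f"snowball_{code}",
                }
                for code in languages
            }
        }
    }


def to_document(base_field, text, code):
    """Place the text in the field that matches its language."""
    return {f"{base_field}_{code}": text}


print(json.dumps(build_settings(LANGUAGES), indent=2))
print(json.dumps(build_mapping("item", "title", LANGUAGES), indent=2))
print(json.dumps(to_document("title", "Me gusta Elastic Search!", "es")))
```

Because the language is known at insert time, the client decides which field receives the text, and a search then targets the field for the language it cares about.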

