Specifying analyzer language at insert time


(Kosta) #1

I looked at the docs for mapping types, and as far as I can see it's
possible to configure analyzers for index fields either in JSON config
files, or by running the PUT command with the appropriate config on
Elasticsearch itself. In one of the examples I saw the following:

"my_analyzer" : {
"type" : "snowball",
"language" : "English"
}

Which basically hardcodes the language of a particular field to
English. However, in my case I know the language at insert time, so I
would like to specify the analyzer language dynamically with each
insert. Would something like this be possible? Thanks for any
suggestions in advance!


(Kosta) #2

Just stumbled across this:
https://github.com/elasticsearch/elasticsearch/issues/397

I suppose I could have a field representing the document's source
language, and based on that dynamic field the analyzer language would
be set, but that's a bit hackish :)


(Shay Banon) #3

You can also use the analyzer field mapping: http://www.elasticsearch.org/guide/reference/mapping/analyzer-field.html, but note that this controls the index analyzer for all fields in the doc, unless some are explicitly set.

That raises an interesting question that I have been thinking about a bit lately: how can a field have a custom boost, analyzer, or other setting per document? It's a bit tricky, since I would like to maintain the "domain"-driven aspect of the JSON document, but it should still be allowed somehow. Still thinking about how best to provide that, maybe something like this:

{
"my_field" : {
"value" : "text here",
"_analyzer" : "..."
}
}

But then this loses the "pureness" aspect of the indexed doc. On the other hand, it's good to have that option.
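
A minimal sketch of the analyzer field mapping mentioned above, written in Python only to build the JSON bodies (nothing is sent to a server). It assumes the legacy `_analyzer` mapping with a `path` setting as described on the linked reference page, which newer Elasticsearch releases no longer support. The `tweet` type, the `language` field, and the analyzer names are hypothetical and chosen for illustration.

```python
import json


def build_index_settings(languages):
    """Define one snowball analyzer per language, named after the language.

    The analyzer names (lowercased language names) are an assumption.
    """
    return {
        "analysis": {
            "analyzer": {
                lang.lower(): {"type": "snowball", "language": lang}
                for lang in languages
            }
        }
    }


def build_mapping(type_name, language_field):
    """Legacy `_analyzer` mapping: the index analyzer name is read from a field."""
    return {type_name: {"_analyzer": {"path": language_field}}}


settings = build_index_settings(["English", "Spanish"])
mapping = build_mapping("tweet", "language")

# The document names its own analyzer through the (hypothetical) "language" field.
doc = {
    "user": "kimchy",
    "message": "Me gusta Elastic Search!",
    "language": "spanish",
}

print(json.dumps({"settings": settings, "mappings": mapping}, indent=2))
print(json.dumps(doc))
```

As noted above, the analyzer picked this way applies to every field in the document that does not set its own analyzer explicitly.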


(Kosta) #4

Yes, I suppose having metadata attached to a source document like that
would make it both harder to parse and, as you mentioned, make it
unpredictable in terms of what can be expected in the field object.
What seems natural to me is extracting the boost/analyzer metadata
into a separate object within an inserted document. For example:

$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{
  "tweet" : {
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "Me gusta Elastic Search!"
  },
  "properties" : {
    "message" : { "_analyzer" : { "type" : "snowball", "language" : "spanish" } }
  }
}'

That way the "source" of the message would stay intact, and additional
properties could be set/overridden for each inserted document.
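
A rough sketch of how such a request body could be split on the receiving side, assuming the proposed layout with a `tweet` object and a sibling `properties` object. This is only an illustration of the proposal, not an existing Elasticsearch feature; `split_envelope` and its parameters are hypothetical names.

```python
import json


def split_envelope(body, source_key="tweet", meta_key="properties"):
    """Separate the untouched source from per-field indexing overrides."""
    source = body[source_key]
    overrides = body.get(meta_key, {})
    # Reject overrides that point at fields the source does not contain.
    unknown = sorted(set(overrides) - set(source))
    if unknown:
        raise ValueError("overrides for unknown fields: %s" % ", ".join(unknown))
    return source, overrides


body = {
    "tweet": {
        "user": "kimchy",
        "post_date": "2009-11-15T14:12:12",
        "message": "Me gusta Elastic Search!",
    },
    "properties": {
        "message": {"_analyzer": {"type": "snowball", "language": "spanish"}}
    },
}

source, overrides = split_envelope(body)
print(json.dumps(source))     # the document as the user wrote it
print(json.dumps(overrides))  # per-field settings for this insert only
```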


(Shay Banon) #5

Interesting. What I tried to do with the notion of _source is to make it possible to always just get it (and possibly other specific fields, like _routing and _parent, which can be provided separately) and reindex the data. Once something is "out" of the _source, you lose this capability.

Another option is to be able to specify those values by referencing other elements in the _source. For example, specifying in the mapping that the analyzer is taken from _source.message.language (or something similar). This gets tricky with things like array elements, but for simple cases it can work nicely, I guess. It will come with an overhead, though.
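
A small sketch of the path lookup idea, assuming a dotted path such as `message.language` and a fallback analyzer name. The function `resolve_analyzer` and the `standard` default are assumptions made for illustration; the sketch simply gives up on arrays, which is the tricky case mentioned above.

```python
def resolve_analyzer(source, path, default="standard"):
    """Follow a dotted path through the source and return the analyzer name.

    Falls back to `default` when the path is missing, passes through an
    array (ambiguous: which element should win?), or ends on a non-string.
    """
    node = source
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node if isinstance(node, str) else default


simple = {"message": {"text": "Me gusta Elastic Search!", "language": "spanish"}}
with_array = {"message": [{"text": "hola", "language": "spanish"}]}

print(resolve_analyzer(simple, "message.language"))
print(resolve_analyzer(with_array, "message.language"))
print(resolve_analyzer(simple, "message.missing"))
```

The lookup runs once per indexed document, which is where the overhead would come from.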


(Jasper van Wanrooy - Chatventure) #6

Did this email conversation have any follow-up? I'm in the process of indexing our objects in several languages. As far as I understand, I need to set up a separate analyzer for each specific language instead of passing a language flag at index/search time. Is that correct?

Thanks, Jasper
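
A minimal sketch of the setup described in this question, assuming one snowball analyzer and one field per language. The language codes, the `snowball_<code>` analyzer names, the `<field>_<code>` naming scheme, and the legacy `string` field type are all assumptions for illustration; the code only builds the JSON bodies.

```python
import json

# Hypothetical language table: short code -> snowball language name.
LANGUAGES = {"en": "English", "es": "Spanish", "nl": "Dutch"}


def build_analyzers(languages):
    """One snowball analyzer per language, e.g. snowball_es."""
    return {
        "snowball_%s" % code: {"type": "snowball", "language": name}
        for code, name in languages.items()
    }


def build_properties(base_field, languages):
    """One field per language, each bound to the matching analyzer."""
    return {
        "%s_%s" % (base_field, code): {
            "type": "string",
            "analyzer": "snowball_%s" % code,
        }
        for code in languages
    }


def localize(base_field, text, code):
    """Put the text into the field that belongs to its language."""
    if code not in LANGUAGES:
        raise KeyError("no analyzer configured for language %r" % code)
    return {"%s_%s" % (base_field, code): text}


analyzers = build_analyzers(LANGUAGES)
properties = build_properties("message", LANGUAGES)
doc = localize("message", "Me gusta Elastic Search!", "es")

print(json.dumps({"analysis": {"analyzer": analyzers}}, indent=2))
print(json.dumps({"properties": properties}, indent=2))
print(json.dumps(doc))
```

With this layout the language choice happens in application code at index and search time, by picking the field name, rather than through a per-request language flag.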

