Mapping question


(Eks Dev) #1

Hi,

I am trying to achieve the following:
The input document has two fields, A1 and A2.
I want to have them indexed, but I would also like to have a third
field, FULL, that is the union of A1 and A2.

The problem is this: I can map two or more fields to one indexed field
by using the "index_name" : "FULL" property. But when I use "type" :
"multi_field", they end up mapped into a "local namespace", i.e.
"A1.FULL" and "A2.FULL". Is there any way to specify an absolute "path"
(or whatever the appropriate terminology is here)?

curl -XPUT 'http://localhost:9201/ix/any/1' -d '{
"A1" : "aaaa",
"A2" : "bbbb"
}'

curl -XPUT 'http://localhost:9201/ix/any/2' -d '{
"A1" : "cccc",
"A2" : "dddd"
}'

with mapping like this:

curl -XPOST localhost:9200/ix -d '{
  "mappings" : {
    "any" : {
      "_source" : { "enabled" : true },
      "properties" : {
        "A1" : {
          "type" : "multi_field",
          "fields" : {
            "A1" : { "type" : "string", "index" : "analyzed" },
            "FULL" : { "type" : "string", "index" : "analyzed", "index_name" : "FULL" }
          }
        },
        "A2" : {
          "type" : "multi_field",
          "fields" : {
            "A2" : { "type" : "string", "index" : "analyzed" },
            "FULL" : { "type" : "string", "index" : "analyzed", "index_name" : "FULL" }
          }
        }
      }
    }
  }
}'


(Clinton Gormley) #2

Hi

I am trying to achieve the following:
The input document has two fields, A1 and A2.
I want to have them indexed, but I would also like to have a third
field, FULL, that is the union of A1 and A2.

Your suggested mapping won't achieve your goal. If you want 'FULL' to be
the union of A1 and A2, then you should store 'FULL' as a separate field
and set its value to A1 concatenated with A2, i.e. A1 + ' ' + A2.
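
As a minimal sketch of that client-side approach (the helper name
with_full and its parameters are made up for illustration, not part of
any Elasticsearch client), the document could be extended before it is
sent for indexing:

```python
def with_full(doc, fields=("A1", "A2"), target="FULL", sep=" "):
    """Return a copy of doc with `target` set to the joined values of `fields`."""
    out = dict(doc)
    # Skip missing or empty fields so no stray separators end up in FULL.
    out[target] = sep.join(str(doc[f]) for f in fields if doc.get(f))
    return out


doc = with_full({"A1": "aaaa", "A2": "bbbb"})
print(doc)
```

The resulting dictionary would then be serialized as the request body in
place of the original two-field document.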

Alternatively, by default, A1 and A2 are already stored in the _all
field (as are any other indexed fields, unless you explicitly disable
them). Perhaps this will suit your needs?

http://www.elasticsearch.org/guide/reference/mapping/all-field.html

clint


(Eks Dev) #3

Thanks Clint,

_all is not indexed, and is not helping here.
I need A1 + A2 searchable in one field (think of e.g. NGrams).
Searching on two separate fields is also not the same, as frequencies
are "per field"...
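
To illustrate the "per field" point, here is a small sketch with
made-up sample documents (not the ones from this thread) and plain
whitespace splitting standing in for real analysis. It only counts
document frequencies; actual scoring involves more than this.

```python
from collections import Counter

# Hypothetical sample data, chosen so that "aaaa" occurs in both fields.
docs = [
    {"A1": "aaaa", "A2": "bbbb"},
    {"A1": "cccc", "A2": "aaaa"},
]


def doc_freq(docs, fields):
    """Count, per term, how many documents contain it in any of `fields`."""
    df = Counter()
    for doc in docs:
        terms = set()
        for f in fields:
            terms.update(doc.get(f, "").split())
        df.update(terms)
    return df


df_a1 = doc_freq(docs, ["A1"])          # frequencies seen by field A1 alone
df_a2 = doc_freq(docs, ["A2"])          # frequencies seen by field A2 alone
df_full = doc_freq(docs, ["A1", "A2"])  # frequencies of the union field

print(df_a1["aaaa"], df_a2["aaaa"], df_full["aaaa"])
```

Each separate field sees "aaaa" in one document, while the union field
sees it in two, so statistics taken per field are not the statistics of
the combined field.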

Concatenating the fields on the client side would do the job, but this
is something I am trying to avoid, as it:

  • Doubles the size of messages (A1, A2 and A1 + A2)
  • Makes me inflexible, as I need to touch clients in order to change
    the way things get indexed
    ...

This mapping is possible in ES with "index_name" : "FULL", but it does
not work as expected if you use "multi_field".

The mapping below produces one "concatenated" field, FULL, from A1 and
A2, but does not index A1 and A2 as separate fields (I guess A1 and A2
end up being used as "synonyms"?). See the last query below.

I am not sure I understand how mapping works; I am still in learning
mode ...

Try this:
curl -XPOST localhost:9200/ix -d '{
  "mappings" : {
    "any" : {
      "_source" : { "enabled" : true },
      "properties" : {
        "A1" : { "type" : "string", "index" : "analyzed", "index_name" : "FULL" },
        "A2" : { "type" : "string", "index" : "analyzed", "index_name" : "FULL" }
      }
    }
  }
}'

curl -XPUT 'http://localhost:9201/ix/any/1' -d '{
"A1" : "aaaa",
"A2" : "bbbb"
}'

curl -XPUT 'http://localhost:9201/ix/any/2' -d '{
"A1" : "cccc",
"A2" : "dddd"
}'
curl -XPOST localhost:9200/_search -d '{
"query":{"term":{"FULL":"dddd"}}
}'

# Should be no match, but it is
curl -XPOST localhost:9200/_search -d '{
"query":{"term":{"A2":"aaaa"}}
}'
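
The surprising match is consistent with both A1 and A2 being written
to, and queried from, a single underlying field named FULL. The toy
model below is only an assumption-laden illustration of that idea (the
ToyIndex class is invented here and is not how Elasticsearch is
implemented): field names are resolved to an index name both when
indexing and when querying.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index in which several fields can share one index name."""

    def __init__(self, index_names):
        # Maps a document field name to the index field it is written under.
        self.index_names = index_names
        # index field -> term -> set of document ids
        self.postings = defaultdict(lambda: defaultdict(set))

    def _resolve(self, field):
        # Fields without an explicit index name keep their own name.
        return self.index_names.get(field, field)

    def add(self, doc_id, doc):
        for field, value in doc.items():
            for term in value.split():
                self.postings[self._resolve(field)][term].add(doc_id)

    def term_query(self, field, term):
        return sorted(self.postings[self._resolve(field)][term])


ix = ToyIndex({"A1": "FULL", "A2": "FULL"})
ix.add("1", {"A1": "aaaa", "A2": "bbbb"})
ix.add("2", {"A1": "cccc", "A2": "dddd"})

print(ix.term_query("FULL", "dddd"))
print(ix.term_query("A2", "aaaa"))
```

In this model the query on A2 for "aaaa" finds document 1, because A2
resolves to the same FULL postings that A1 was written into.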


(Clinton Gormley) #4

Hiya

_all is not indexed, and is not helping here.
I need A1 + A2 searchable in one field (think of e.g. NGrams).
Searching on two separate fields is also not the same, as frequencies
are "per field"...

The only way I can see for you to get the frequencies of terms in the
union of A1+A2 is to make that a distinct field, which means storing
A1, A2 and A1+A2. Or to use _all.

I don't know if anybody else has a better suggestion, but I think you're
stuck with this.

index_name won't help you here.

clint


(Shay Banon) #5

The index_name is not meant to concatenate several fields. It just controls the name of the field created (whether or not it is prefixed with the path to the object). And, most of the time, you don't have to change it.

Having the ability to create a field that is an aggregation of several other fields is valid, I agree (aside from _all). You can open an issue for that, though it's a bit tricky to implement (as always ;) ).


(Eks Dev) #6

Hi Shay,
It is worth spending some time discussing mapping. Far too many "on
top of Lucene" projects map Lucene analyzers 1-1. It looks to me like
ES goes further in that sense.

Another example:

"conditional analysis", mapping M to N fields:
Map(STREET, HNO)->{street_name, hno}

Example:

STREET = "Sleepy street 21"
HNO = "null"
should be mapped to
street_name = "sleepy street"
hno = "21"

OR

STREET = "22nd street"
HNO = "10"
should be mapped to
street_name = "22nd street"
hno = "10"
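
A rough sketch of such a conditional mapping follows. The function name
split_address and the rule "a trailing number, optionally followed by
one letter, is the house number" are assumptions made for illustration;
real address parsing needs far more care.

```python
import re

# Trailing house number such as "21" or "21a" at the end of the street string.
_TRAILING_HNO = re.compile(r"^(?P<street>.*\S)\s+(?P<hno>\d+[a-zA-Z]?)$")


def split_address(street, hno):
    """Map (STREET, HNO) to (street_name, hno)."""
    street = street.strip()
    if hno in (None, "", "null"):
        # No explicit house number: try to peel one off the end of STREET.
        m = _TRAILING_HNO.match(street)
        if m:
            return m.group("street").lower(), m.group("hno")
        return street.lower(), None
    # An explicit house number wins; STREET is kept whole.
    return street.lower(), hno


print(split_address("Sleepy street 21", "null"))
print(split_address("22nd street", "10"))
```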

If we were to support something like this, we would have to stretch ES
config capabilities quite a lot; maybe some "request mutators" would
be better.
Something like a Transform(Document) -> NEW_Document extensibility
plug-in, so that we are not pushing ES to provide all possible
transformations. We would just make it possible for users to "do
whatever they want" (something they could do in client code anyhow).

The problem with this is that NEW_Document must conform to some minimum
standards (e.g. have the same ID as the original Document, so as not to
screw up routing...).
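
A minimal sketch of how such a constraint could be enforced around a
user-supplied transformer. The function apply_transformer and the
(id, document) calling convention are hypothetical; nothing like this
is claimed to exist in ES.

```python
def apply_transformer(doc_id, doc, transformer):
    """Run a user transformer and check the result still belongs to doc_id."""
    # Pass a copy so the transformer cannot mutate the caller's document.
    new_id, new_doc = transformer(doc_id, dict(doc))
    if new_id != doc_id:
        raise ValueError(
            f"transformer changed the document ID: {doc_id!r} -> {new_id!r}"
        )
    if not isinstance(new_doc, dict):
        raise TypeError("transformer must return a dict as the new document")
    return new_id, new_doc


def add_full(doc_id, doc):
    # Example transformer: keeps the ID, adds a combined field.
    doc["FULL"] = doc["A1"] + " " + doc["A2"]
    return doc_id, doc


print(apply_transformer("1", {"A1": "aaaa", "A2": "bbbb"}, add_full))
```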

Thanks to the great dynamic mapping in ES, I think it is not a problem
to keep this relatively simple. At the end of the day, I think of it as
opening up the possibility to "push client code to ES", with all its
gotchas (the user should not be surprised if original fields get
changed by the "transformer")....

What I am trying to say is: instead of stretching ES mapping features
for this, maybe it would be better to provide "extension points" for
users to reformulate documents with their own "transformers".

Benefit:
My example with A1, A2 and FULL, and anything similar, would then be my
own concern. I could write a transformer that maps Transform({A1, A2})
-> {A1, A2, FULL, NGramFULL, whatever} as an intercepted operation...

And it would be my responsibility to know what fields I can use to
search.
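
For instance, a transformer of that shape might look like the sketch
below. The choice of character trigrams computed per token, and the
function names, are assumptions for illustration only.

```python
def char_ngrams(text, n=3):
    """Return the character n-grams of text in order; short text yields itself."""
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def add_full_fields(doc, n=3):
    """Transform({A1, A2}) -> {A1, A2, FULL, NGramFULL}."""
    out = dict(doc)
    full = " ".join(doc[f] for f in ("A1", "A2") if doc.get(f))
    out["FULL"] = full
    # N-grams are taken per whitespace token so they never span two values.
    out["NGramFULL"] = [g for tok in full.split() for g in char_ngrams(tok, n)]
    return out


print(add_full_fields({"A1": "aaaa", "A2": "bbbb"}))
```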

Sometimes these operations are quite complicated. Imagine some
extraction from unstructured documents where I need to index more
than one index document, like:
TransformBulkText("some longish text") -> should get transformed into
two or more documents (Named Entities, e.g.)

I do not think it is too hard to make such "API extensions" and make
ES even more usable.... but take this with a grain of salt, as I still
have to dig into ES.... I am just analyzing whether it could help with
problems I have already met.

Thanks,
Eks


