Word order in _all (or a combined field produced by multi-field mappings) and phrase queries

Hello,

I have a question regarding the order in which JSON source data is rolled
into the _all field or into a field combining data from multiple multifield
definitions. In general, JSON does not guarantee the order of properties in
an object. Will ES process JSON properties in the order they appear in the
source object? Is that guaranteed? If the processing order of JSON
properties is undefined, phrase queries against _all or a composite field
are not very meaningful beyond an individual JSON property, as there is no
way to guarantee that related properties (say first and last name, or a
person's name and their organization) will be indexed next to each other
and will be scored higher by phrase queries or other proximity-based queries.

I personally would love to have more control over how the _all field is
composed, and to be able to include various gaps between individual fields
or use some smart strategy for gap calculation. For example, if I want to
favor matches with data coming from the same JSON property, I could put a
gap between it and the next one equal to the number of tokens in the first
property plus the number of tokens in the second property (optionally
multiplied by some factor), which would effectively lower the score of a
match spread across those two properties.

Another thing I would love is a notation for passing an array of field
values with embedded gap values (think of a special synthetic, not stored
search field as part of my JSON, which puts data from my domain object
together in the way I think is best for phrase searching). This way I
could use my knowledge of the data to influence positional search quality
(big gaps, overlapping positions, or gaps designed to emphasize matches
within one element or neighboring elements).

What do you think?

Thank you,
Alex

--

I think a better approach would be to combine queries such as match,
multi_match, and query_string and explicitly configure those queries to
search the fields you want instead of relying on the _all field. Using this
approach you could even use boosting on hits within specific fields
(query_string and multi_match) to set up scoring exactly as you want.
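As a rough illustration of the suggestion above, the sketch below builds a multi_match request body with per-field boosts using the `field^boost` notation. The field names and boost values are hypothetical, chosen only for the example, and the helper `build_search_body` is not part of any client library.

```python
import json


def build_search_body(text, boosted_fields):
    """Build a multi_match query body.

    boosted_fields maps a field name to its boost; a boost of 1 is left
    implicit. Field names here are illustrative assumptions only.
    """
    fields = [
        name if boost == 1 else f"{name}^{boost}"
        for name, boost in boosted_fields.items()
    ]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


body = build_search_body(
    "john smith",
    {"person.first_name": 2, "person.last_name": 3, "organization.name": 1},
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the JSON body of a search request; tuning the boosts is how hits in specific fields would be weighted against each other.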

On Fri, Jan 25, 2013 at 1:51 PM, AlexR roytmana@gmail.com wrote:

--

Hi Matt,

Thanks for your suggestion. I have a pretty complex domain object: a
couple hundred pretty small fields, many 1-N and/or deeply nested (for
this particular domain object more than 30 DB tables are used, and another
part of the object I will model as parent/child later on).

One of the requirements is a Google-like "search on everything". I suspect
a complex query combined with multifield expansion (for a couple of
hundred fields) will be very slow.
Also, I have some doubts about the quality of scoring when executing
queries across many short fields.

The trouble is that I am fairly new to text searching and do not yet have
a good feel for what works well and what does not. I would appreciate some
architectural advice....

BTW I was thinking about your API. It is very good for direct access from
rich Ajax UIs (we use Ext heavily), but one concern I had was security:
you just can't open it to the user, since ES has no built-in security. I
put it behind Apache (another proxy could be used), which could give me
authentication and authorization per URL, but that is still pretty messy
as it needs to enforce rules based not only on URL pattern but also on HTTP
method. Finally I found the Jetty HTTP transport, which is more integrated
and a lot easier to manage. Now opening ES to browser JavaScript is less
questionable, but it will need browser single sign-on to be useful....

Alex

On Fri, Jan 25, 2013 at 5:12 PM, Matt Weber matt.weber@gmail.com wrote:

--

Hi Alex,

You are not tied to the _all field. By using 'index_name', you can
aggregate fields into a new field, and you can set whatever attributes you
like on the new field. Together with the 'multi_field' mapping, you have a
very powerful feature: custom '_all' fields.

Example

"my_all_field" : {
    "type" : "string",
    "analyzer" : "..."
},
"field1" : {
    "type" : "multi_field",
    "fields" : {
        "field1" : {
            "type" : "string",
            "analyzer" : "..."
        },
        "my_all_field1" : {
            "type" : "string",
            "index_name" : "my_all_field",
            "position_offset_gap" : 100,
            "include_in_all" : false
        }
    }
},
"field2" : {
    "type" : "multi_field",
    "fields" : {
        "field2" : {
            "type" : "string",
            "analyzer" : "..."
        },
        "my_all_field2" : {
            "type" : "string",
            "index_name" : "my_all_field",
            "position_offset_gap" : 100,
            "include_in_all" : false
        }
    }
},
...

Regarding phrase queries, you are right: a phrase can cross field borders
when searching on 'my_all_field' and produce false hits in rare cases, but
you can add position_offset_gap to avoid that.
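To make the effect of the gap concrete, here is a small simulation of how token positions might be laid out when two values are appended to one combined field. It is a simplified model, not Elasticsearch or Lucene code: tokenization is plain whitespace splitting, only in-order two-word phrases are considered, and the names `index_positions` and `phrase_matches` are made up for this sketch.

```python
def index_positions(values, gap):
    """Assign positions to tokens of several values appended to one field,
    leaving `gap` extra positions between consecutive values."""
    positions = []  # list of (token, position) pairs
    pos = 0
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # simulated position_offset_gap between values
        for token in value.lower().split():
            positions.append((token, pos))
            pos += 1
    return positions


def min_distance(positions, first, second):
    """Fewest positions separating `first` from a later `second`, or None."""
    firsts = [p for t, p in positions if t == first]
    seconds = [p for t, p in positions if t == second]
    dists = [s - f - 1 for f in firsts for s in seconds if s > f]
    return min(dists) if dists else None


def phrase_matches(positions, first, second, slop=0):
    """Simplified two-word phrase check: in-order terms within `slop`."""
    dist = min_distance(positions, first, second)
    return dist is not None and dist <= slop


values = ["Mary Smith", "John Adams"]  # e.g. field1 and field2 of one document

no_gap = index_positions(values, gap=0)
with_gap = index_positions(values, gap=100)

print(phrase_matches(no_gap, "smith", "john"))             # crosses the border
print(phrase_matches(with_gap, "smith", "john", slop=10))  # blocked by the gap
print(phrase_matches(with_gap, "mary", "smith"))           # within one value
```

In this model, a gap larger than any slop you intend to use keeps phrases from matching across the border, while matches inside a single value are unaffected.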

When being indexed, the fields are processed in the order they appear in
the JSON source document, top to bottom. So you are in full control of
the order in which field values are added to 'my_all_field'.

Multifields are not slow; they are as fast as other fields, or even
faster because they save you from indexing values twice or more. The
multifield attribute is applied at mapping time, preparing the effective
field names just before the indexer starts its work. At search time, the
mapping is always used for lookup, for multifield and non-multifield
fields alike.

I think the regular scoring works fine. If you want higher scoring for
short queries on short fields, search on 'field1', 'field2' and so on.
If you want lower scoring, search on 'my_all_field'. As you know,
scoring depends on term frequency, the term position, and the length of
the field where a term occurs, to boost or penalize a term (roughly
speaking). But you can also change the scoring algorithm (this is not
specific to the multifield/index_name feature; it is an advanced topic).
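The field-length effect mentioned above can be sketched with two components of Lucene's classic TF-IDF similarity, assuming that is the scoring in use: tf = sqrt(freq) and lengthNorm = 1/sqrt(number of terms in the field). Everything else (idf, query normalization, boosts, and the lossy one-byte encoding of norms) is left out, and the token counts are invented for the example.

```python
import math


def tf(freq):
    """Term-frequency factor of the classic similarity: sqrt(freq)."""
    return math.sqrt(freq)


def length_norm(num_terms):
    """Field-length factor of the classic similarity: 1 / sqrt(num_terms)."""
    return 1.0 / math.sqrt(num_terms)


# One occurrence of a term in a 2-token field (say, a name field) ...
short_field = tf(1) * length_norm(2)
# ... versus one occurrence in a 200-token combined field.
combined_field = tf(1) * length_norm(200)

print(round(short_field, 4), round(combined_field, 4))
```

Under these assumptions a hit in the short field contributes ten times more than the same hit in the long combined field, which is why searching 'field1' directly tends to score higher than searching 'my_all_field'.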

I have built long JSON mapping documents with multifield/index_name to
perform bibliographic search (believe me, with dozens of such
constructions), and I'm satisfied with the result.

I'm afraid I do not fully understand your question about array field
values. Fields are multivalued by default; just add a
'position_offset_gap' and send a JSON array to this field.

Best regards,

Jörg

On 26.01.13 01:53, Alex Roytman wrote:

--

Thank you Jörg!
Good news that JSON is processed in source order, so I will have
control over the placement of data.

I guess my terminology was not clear. When I said multifield queries may be
slow, I meant a query against multiple fields (in my case over a hundred),
not a query against a field produced via a multifield mapping. I understand
the concept of multifield mappings; I just mixed up the terminology.
Matt suggested not using _all or a custom aggregated field, but querying
against multiple source fields instead. My concern was that I have over a
hundred short fields I would need to include in my query using myrecord.*
(or an explicit list of fields), so I was worried that:

  • Performance will be bad
  • Scoring will be unreliable with so many small fields to calculate the
    score from, and proximity-based relevance will be questionable

On the plus side, I would get highlighting, which otherwise I would need to
do myself in my application without ES's help.

As for the "smart" gap between fields or array elements: Lucene, if I
recall, allows a different gap for each value of a multivalued field, but ES
does not have such semantics. My goal is not to prevent positional query
matches across fields/array elements, but rather to slightly emphasize
matches within a single field or neighboring fields over matches in
different areas of the field. To do so, I have this idea of calculating the
gap between values of a multivalued field from the number of tokens in the
previous and current values. Having such a gap would guarantee that two
words within a single value of any two sequential values always need fewer
position flips to make the phrase than if the phrase is spread across both
values. So this requires dynamic gap calculation by, I guess, a Filter.

With a multifield mapping onto a common "my_all" field, by specifying a
position gap I can inject a static gap (the same for all documents) between
fields. I can also specify a gap for array elements, but I am not sure how
to do it in a multifield mapping (the syntax "position_offset_gap": 100 for
the entire field and for array elements seems to clash).
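The dynamic gap idea described above can be sketched as follows. This is only a toy model of the proposal, not something ES or Lucene provides out of the box: the gap before each value is the token count of the previous value plus the token count of the current one, times an optional multiplier. The function name `dynamic_gap_positions` and the sample values are invented for illustration.

```python
def dynamic_gap_positions(values, multiplier=1):
    """Assign token positions where the gap before each value equals
    (tokens in previous value + tokens in current value) * multiplier."""
    positions = []  # list of (token, position) pairs
    pos = 0
    prev_len = 0
    for i, value in enumerate(values):
        tokens = value.lower().split()
        if i > 0:
            pos += (prev_len + len(tokens)) * multiplier
        for token in tokens:
            positions.append((token, pos))
            pos += 1
        prev_len = len(tokens)
    return positions


positions = dynamic_gap_positions(["Mary Ann Smith", "Acme Consulting"])
lookup = dict(positions)

# Widest separation inside the first value: "mary" ... "smith".
within_value = lookup["smith"] - lookup["mary"] - 1
# Narrowest separation across the two values: "smith" ... "acme".
across_values = lookup["acme"] - lookup["smith"] - 1

print(positions)
print(within_value, across_values)
```

Because the gap is at least as large as either value's token count, the closest pair of words straddling two values is always farther apart than the most distant pair inside one of them, which is the property the proposal relies on.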

Also, a few specific questions, if you do not mind:

  • Is the built-in _all a multivalued field or just an aggregation of
    tokens from all fields?
  • Is there any way/tool to inspect the position offsets of tokens of a
    field in an already indexed document?
  • Since my source object has so many fields, during development I load my
    data, then get the default mapping and customize it. Then I re-create my
    index with the customized mapping.
    Is there any way (a template) to influence the default mapping, for
    example to make all string fields called "code" not analyzed, make all
    string fields multifield (so I can easily add a second field), make all
    fields called ID not included in _all, etc.?

Thank you,

Alex

On Sat, Jan 26, 2013 at 7:48 AM, Jörg Prante joergprante@gmail.com wrote:

--