Hi Alex,
you are not tied to the _all field. With 'index_name', you can
aggregate several fields into a new field and give that new field
whatever attributes you like. Combined with the 'multi_field' mapping,
this gives you a very powerful feature: custom '_all' fields.
Example:
  "my_all_field" : {
    "type" : "string",
    "analyzer" : "..."
  },
  "field1" : {
    "type" : "multi_field",
    "fields" : {
      "field1" : {
        "type" : "string",
        "analyzer" : "..."
      },
      "my_all_field1" : {
        "type" : "string",
        "index_name" : "my_all_field",
        "position_offset_gap" : 100,
        "include_in_all" : false
      }
    }
  },
  "field2" : {
    "type" : "multi_field",
    "fields" : {
      "field2" : {
        "type" : "string",
        "analyzer" : "..."
      },
      "my_all_field2" : {
        "type" : "string",
        "index_name" : "my_all_field",
        "position_offset_gap" : 100,
        "include_in_all" : false
      }
    }
  },
  ...
Regarding phrase queries, you are right: a phrase can cross field borders
when you search on 'my_all_field' and produce false hits in rare cases,
but you can set position_offset_gap to avoid that.
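To illustrate the effect of the gap, here is a minimal Python sketch of token positions. It is a simplified toy model, not Elasticsearch or Lucene code: the helper names `index_positions` and `phrase_matches` and the sample values are made up for illustration, and tokenization is plain whitespace splitting.

```python
def index_positions(values, gap=0):
    """Assign a position to each token, adding `gap` between successive values."""
    positions = []
    pos = 0
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # extra distance between two values of the same field
        for token in value.lower().split():
            positions.append((token, pos))
            pos += 1
    return positions


def phrase_matches(positions, phrase):
    """True if the phrase terms occur at consecutive positions (no slop)."""
    terms = phrase.lower().split()
    where = {}
    for token, pos in positions:
        where.setdefault(token, set()).add(pos)
    return any(
        all(start + i in where.get(term, set()) for i, term in enumerate(terms))
        for start in where.get(terms[0], set())
    )


# Hypothetical values of field1 and field2, both copied into 'my_all_field'.
values = ["Acme Steel", "Works Ltd"]
print(phrase_matches(index_positions(values, gap=0), "steel works"))    # True: false hit
print(phrase_matches(index_positions(values, gap=100), "steel works"))  # False
print(phrase_matches(index_positions(values, gap=100), "acme steel"))   # True
```

With a gap of 100, a phrase only spans two values if the query allows a slop of roughly that size, so phrases inside one value still match while cross-value phrases do not.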
When being indexed, the fields are processed in the order they appear in
the JSON source document, top to bottom. So you are in full control of
the order in which field values are added to 'my_all_field'.
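On the client side this means the serializer decides the order. As a small sketch, assuming the document is built in Python 3.7 or later (where dicts keep insertion order and `json.dumps` does not sort keys unless asked to), the keys are written in the order they were added. The field names and values are hypothetical.

```python
import json

# Add the properties in the order they should be indexed.
doc = {}
doc["first_name"] = "John"
doc["last_name"] = "Smith"
doc["organization"] = "Acme Steel Works"

# json.dumps keeps insertion order as long as sort_keys is not set.
body = json.dumps(doc)
print(body)
# {"first_name": "John", "last_name": "Smith", "organization": "Acme Steel Works"}
```

Other client libraries may reorder keys (for example when they are backed by an unordered map), so it is worth checking the serialized output once.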
Multifields are not slow; they are as fast as other fields, or even
faster, because they save you from indexing values twice or more. The
multifield attribute is applied at mapping time, preparing the effective
field names just before the indexer starts its work. At search time, the
mapping is always used for lookup, for multifield and non-multifield
fields alike.
I think the regular scoring works fine. If you want higher scoring for
short queries on short fields, search on 'field1', 'field2' and so on.
If you want lower scoring, search on 'my_all_field'. As you know,
scoring depends on term frequency, the term position, and the length of
the field where a term occurs, which boost or penalize a term (roughly
speaking). You can also change the scoring algorithm (this is not
specific to the multifield/index_name feature; it is an advanced topic).
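A rough numeric sketch of the field-length effect, assuming the classic Lucene TF/IDF similarity, where the term frequency factor is sqrt(freq) and the length norm is 1/sqrt(number of terms in the field). It ignores idf, boosts, query normalization and the lossy storage of norms, so the numbers are illustrative only. The field sizes are made up.

```python
import math


def tf(freq):
    """Term frequency factor of the classic similarity."""
    return math.sqrt(freq)


def length_norm(num_terms):
    """Shorter fields get a larger norm."""
    return 1.0 / math.sqrt(num_terms)


# One occurrence of the term in a short field vs. in a long combined field.
short_field = tf(1) * length_norm(2)    # e.g. 'field1' with 2 terms
long_field = tf(1) * length_norm(200)   # e.g. 'my_all_field' with 200 terms
print(short_field > long_field)         # True
```

This is why the same term tends to score higher on 'field1' than on a long 'my_all_field'.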
I have built long JSON mapping documents with multifield/index_name to
perform bibliographic search (believe me, with dozens of such
constructions), and I'm satisfied with the result.
I'm afraid I do not fully understand your question about array field
values. Fields are multivalued by default; just add a
'position_offset_gap' and send a JSON array to this field.
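As a sketch of such a document, built in Python only to show the JSON shape (the values are hypothetical, the field names are taken from the mapping example):

```python
import json

# 'field1' carries several values as a JSON array; each element is indexed
# as a separate value of the field, 'field2' has a single value.
doc = {
    "field1": ["first value", "second value", "third value"],
    "field2": "a single value",
}
body = json.dumps(doc, indent=2)
print(body)
```

With a 'position_offset_gap' on the field, the gap is applied between the array elements.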
Best regards,
Jörg
On 26.01.13 01:53, Alex Roytman wrote:
Hi Matt,
Thanks for your suggestion. I have a pretty complex domain object: a
couple hundred fairly small fields, many of them 1-N and/or deeply
nested (for this particular domain object more than 30 DB tables are
used, and another part of the object I will model as parent/child later
on). One of the requirements is a Google-like "search on everything". I
suspect some complex queries, when combined with multifield expansion
(for a couple hundred fields), will be very slow.
Also, I have some doubts about the quality of scoring when executing
queries across many short fields.
The trouble is that I am fairly new to text searching and do not yet
have a good feel for what works well and what does not. I would
appreciate some architectural advice....
BTW, I was thinking about your API. It is very good for direct access
from rich Ajax UIs (we use Ext heavily), but one concern I had was
security: you just can't open it to the user, since ES has no built-in
security. I put it behind Apache (another proxy could be used), which
could give me authentication and authorization per URL, but that is
still pretty messy, as it needs to enforce rules based not only on URL
pattern but also on HTTP method. Finally I found the Jetty HTTP
transport, which is more integrated and a lot easier to manage. Now
opening ES to browser JavaScript is less questionable, but it will need
browser single sign-on to be useful....
Alex
On Fri, Jan 25, 2013 at 5:12 PM, Matt Weber <matt.weber@gmail.com> wrote:
I think a better approach would be to combine queries such as
match, multi_match, and query_string and explicitly configure
those queries to search the fields you want vs. relying on the
_all field. Using this approach you could even use boosting
on hits within specific fields (query_string and multi_match) to
set up scoring exactly as you want.
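A minimal sketch of such a request body, built as a Python dict and printed as JSON. The query text, the field names and the boost of 2 are assumptions for illustration; the `^` suffix is the per-field boost notation of multi_match.

```python
import json

# Search two explicit fields; hits in 'field1' count twice as much.
request = {
    "query": {
        "multi_match": {
            "query": "john smith",
            "fields": ["field1^2", "field2"],
        }
    }
}
print(json.dumps(request, indent=2))
```

The printed JSON would be sent as the body of a search request; the transport is left out here.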
On Fri, Jan 25, 2013 at 1:51 PM, AlexR <roytmana@gmail.com> wrote:
Hello,
I have a question regarding the order in which JSON source data is
rolled into the _all field or a field combining data from multiple
multifield definitions. In general, JSON does
not guarantee the order of properties in an object. Will ES
process JSON properties in the order they appear in the source
object? Is it guaranteed? If the order of processing of JSON
properties is undefined, phrase queries against _all or a
composite field are not very meaningful beyond an individual JSON
property, as there is no way to guarantee that related
properties (say first and last name, or person name and his
organization, etc.) will be indexed next to each other and will
be scored higher by phrase queries or other proximity-based
queries.
I personally would love to have more control over how the _all
field is being composed and be able to include various gaps
between individual fields, or use some smart strategies for gap
calculation. For example, if I want to favor matches with data
coming from the same JSON property, I could have a gap between
it and the next one equal to the number of tokens in the first
property plus the number of tokens in the second property
(optionally multiplied by a certain number), which will
effectively lower the score of a match spread across those two
properties.
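The gap rule proposed here can be written down as a small Python sketch. This only illustrates the idea; it is not an existing ES feature, and the function name `proposed_gap`, the whitespace tokenization and the sample values are made up.

```python
def proposed_gap(first_value, second_value, multiplier=1):
    """Gap = (tokens in first property + tokens in second property) * multiplier."""
    tokens = len(first_value.split()) + len(second_value.split())
    return tokens * multiplier


print(proposed_gap("John Smith", "Acme Steel Works"))                # 5
print(proposed_gap("John Smith", "Acme Steel Works", multiplier=2))  # 10
```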
Another thing I would love is a notation for
passing an array of field values (think of a special synthetic,
not stored search field as part of my JSON, which puts data from my
domain object together in the way I think is best for phrase
searching) with embedded gap values. This way I could use my
knowledge of the data to influence position search quality (big
gaps, overlapping positions, or gaps designed to emphasize
matches within one element or neighboring elements).
What do you think?
Thank you,
Alex