Searching on _all vs bool query on a (large) number of fields

The _all field is super-convenient for doing a google-style search across
all fields of a document. The mapping makes it easy to include/exclude
fields from _all in a declarative fashion - including boost levels etc.
It is a pattern I have been using for many years - by hand using Lucene
(before elastic), using zzz_all (or whatever it was called) in Compass - and
it has served me and my users well.

However, one significant shortcoming is that you cannot do any
highlighting. (And there are probably some other shortcomings I am not
aware of...)

I read in the
documentation: Elasticsearch Platform — Find real-time answers at scale | Elastic

TIP: The _all field is a useful feature while you are getting started with a
new application. Later, you will find that you have more control over your
search results if you query specific fields instead of the _all field.
When the _all field is no longer useful to you, you can disable it, as
explained in Metadata: _all field
(http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/root-object.html#all-field).

And this led me to rethink how it might be done.

I can turn the problem inside-out. Instead of using _all, I could instead
create a more complicated query across all of the fields that are included
in _all. I can dynamically generate the query from my field metadata (an
enum).

My question is what kind of effect that will have on performance, given
that I am indexing structured data (not text, per se).
To give an example, one of my documents has 115 (!) fields.

So my query will go from:

{
  "query" : {
    "prefix" : {
      "_all" : {
        "prefix" : "moo"
      }
    }
  }
}

To

{
  "query" : {
    "bool" : {
      "should" : [
        {"prefix" : { "field1" : "moo" }},
        {"prefix" : { "field2" : "moo" }},
        {"prefix" : { "field3" : "moo" }},
        {"prefix" : { "field4" : "moo" }},
        {"prefix" : { "field5" : "moo" }},
        {"prefix" : { "field6" : "moo" }},
        .
        .
        .
        {"prefix" : { "field114" : "moo" }},
        {"prefix" : { "field115" : "moo" }}
      ]
    }
  }
}
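
Generating that query from field metadata could be sketched roughly as follows in Python. The field names (field1 to field115) and the helper name build_prefix_bool_query are placeholders for illustration, not taken from a real mapping, and the result is wrapped in a top-level "query" key like the _all version.

```python
import json


def build_prefix_bool_query(fields, prefix):
    """Build a bool query with one 'should' prefix clause per field."""
    should = [{"prefix": {field: prefix}} for field in fields]
    return {"query": {"bool": {"should": should}}}


# Hypothetical field names standing in for the real field metadata.
fields = ["field%d" % i for i in range(1, 116)]
body = build_prefix_bool_query(fields, "moo")

# Serialise the body as it would be sent to the search endpoint.
payload = json.dumps(body)
```

The clause list grows linearly with the number of fields, so the request body stays easy to generate even if it is unpleasant to read.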

Ouch. No?
What kind of performance impact will this have? Diabolical?

The upside is that I now have the flexibility to get some meaningful
highlighting .... but at what cost?
Anyone have experience with this?

Cheers,
-N

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/512274e5-249f-482c-bf9c-27baa879a253%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

I would not say "Diabolical". Perhaps not optimal based on Lucene's
internal design.

But I do something similar with table-based synonyms. In other words, when
matching a synonym of a word, I do not pre-build the database index with
synonyms. Instead, I maintain a table (index/type) of words and their
synonyms, query that table, retrieve the synonyms, and then create the
second and final query that basically does an OR search across the word and
its synonyms. (It's basically a group of should clauses, just like yours).
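
The second step described above can be sketched roughly as follows in Python. This is an illustration only, not the actual code behind this setup: the field name "text", the helper name build_synonym_query, the use of match clauses, and the sample synonym list are all assumptions.

```python
def build_synonym_query(field, word, synonyms):
    """OR the original word and its synonyms together as 'should' clauses."""
    terms = [word] + [s for s in synonyms if s != word]
    should = [{"match": {field: term}} for term in terms]
    # minimum_should_match=1 means any single term is enough to match.
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


# Hypothetical result of the first query against the synonym table.
synonyms = ["large", "huge"]
body = build_synonym_query("text", "big", synonyms)
```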

I find that performance is fine, and accuracy and usefulness are superior.
For example, a user query for synonyms of the wildcarded BIG* might find
BIG, LARGE, HUGE and also BIGHORN, SHEEP. And so on; some of the synonym
lists are rather long, and with multiple words there are many should clauses
in the final query.

And even with the multiple queries (first to resolve the synonyms, and the
second to OR across them), performance is remarkably fast. It might be
pushing Lucene a little, but I like the improved accuracy, and the ability
to easily and regularly modify my synonym lists without any need to rebuild
the hundreds of millions of documents that I am querying.

So for your question, my suggestion is to go for it and it should perform
well enough.


> one significant shortcoming is that you cannot do any highlighting.

Not necessarily true - see this feature which is primarily for the use case
of searching on an "all" type field but highlighting results using detailed
fields:

Elasticsearch Platform — Find real-time answers at scale | Elastic

Bear in mind that when searching across multiple fields, Lucene's natural
tendency to favour rare terms will do odd things, like ranking the most
bizarre interpretation of field choice highest, e.g. firstName:Smith ranks
higher than lastName:Smith.
The multi-match query has some special features built into the cross-field
matching modes to counteract this tendency.

Elasticsearch Platform — Find real-time answers at scale | Elastic
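
A minimal sketch of such a query body, built in Python, might look like this. The field names and the helper name build_cross_fields_query are made up for illustration; "cross_fields" is the multi_match type that treats the listed fields as one combined field and blends their term statistics.

```python
def build_cross_fields_query(fields, text):
    """Search several fields as if they were one combined field."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "cross_fields",
                "fields": list(fields),
                # Every term must appear in at least one of the fields.
                "operator": "and",
            }
        }
    }


# Hypothetical field names, matching the firstName/lastName example above.
body = build_cross_fields_query(["firstName", "lastName"], "john smith")
```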

Cheers
Mark

On Thursday, September 25, 2014 11:08:44 AM UTC+1, mooky wrote:

[...]

>> one significant shortcoming is that you cannot do any highlighting.
>
> Not necessarily true - see this feature which is primarily for the use
> case of searching on an "all" type field but highlighting results using
> detailed fields:

Hm, OK. That looks interesting, but what exactly do I have to do? I can't
seem to get any highlighting...
I added highlighted field "*" (i.e. all of them; to be sure, I also tried
just one known field with no wildcarding and queried on a value within
it).
I set require_field_match to false (and to true, in case I had
misunderstood).
I set force_source to true (because I am not storing any fields).
But I get no highlighting. What have I missed?
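
For reference, the request described above would look roughly like the following when built in Python. It is a reconstruction from the description, not the exact request that was sent, and it is not offered as a fix; the helper name build_highlight_request is hypothetical.

```python
def build_highlight_request(query, fields=("*",), require_field_match=False):
    """Attach a highlight section to an existing query."""
    return {
        "query": query,
        "highlight": {
            # False lets fields other than the queried one be highlighted.
            "require_field_match": require_field_match,
            # force_source highlights from _source rather than stored fields.
            "fields": {name: {"force_source": True} for name in fields},
        },
    }


# Query on _all, highlight on every field.
query = {"prefix": {"_all": "moo"}}
body = build_highlight_request(query)
```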

> The multi-match query has some special features built into the
> cross-field matching modes to counter-act this tendency.

Ah. Multi-match looks much better than constructing a big bool query.
Thanks for the pointer.

Thanks.
-N

On Thursday, 2 October 2014 10:06:39 UTC+1, Mark Harwood wrote:

[...]