On Wed, Dec 10, 2014, Dror Atariah wrote:
Assume that I want to be able to flag documents in an index according to
their attributes: isFoo and isBar [1]. As far as I understand, there are
two approaches:
(1) Use dedicated fields for the flags: if the document is a Foo, then add a
field named isFoo. Similarly for isBar.
(2) Use a "flags" field that is an array of strings. In this case, if the
document is a Foo, then "flags" will contain the string "isFoo".
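To make the two layouts concrete, here is a minimal Python sketch of what an indexed document might look like under each approach. The helper names and the "title" field are hypothetical, and storing True as the value of a dedicated field is an assumption; the question only says that the field is added.

```python
def as_dedicated_fields(doc, flags):
    """Approach (1): one field per flag, present only when the flag is set."""
    out = dict(doc)
    for flag in sorted(flags):
        out[flag] = True  # assumed value; only the field's presence matters
    return out


def as_flags_array(doc, flags):
    """Approach (2): a single "flags" field holding the flag names as strings."""
    out = dict(doc)
    if flags:
        out["flags"] = sorted(flags)  # field is omitted when no flag is set
    return out


base = {"title": "example"}  # hypothetical document body
print(as_dedicated_fields(base, {"isFoo", "isBar"}))
print(as_flags_array(base, {"isFoo", "isBar"}))
```

In both layouts an unflagged document simply lacks the field, which is what makes "exists" filters usable in the queries discussed next.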
What are the pros and cons in terms of space and runtime complexity?
Bear in mind the following query examples. Consider the case where one
wants to check the attributes of the documents in the index. In particular,
if I want to find the documents that are either Foo or Bar, I can:
(a) In case (1): Use a Boolean "should" filter that surrounds two "exists"
filters checking whether either isFoo or isBar exists.
(b) In case (2): Use a single "exists" filter that checks the existence of
the field "flags".
A different case is if I want to find the documents that are both Foo and Bar:
(a) In case (1): Like before, but replace the "should" with a "must".
(b) In case (2): Surround two "term" filters with a Boolean "must" one.
Lastly, finding the documents that are Foo but not Bar.
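The three queries above can be sketched as filter bodies built in Python. This is a sketch under assumptions, not a definitive implementation: it uses the filter syntax of the Elasticsearch 1.x line that was current when this thread was written (a "filtered" query wrapping "bool", "exists" and "term" filters), the helper names are made up, and matches() is a toy in-memory evaluator that only imitates those filters so the two layouts can be compared without a cluster.

```python
def exists(field):
    return {"exists": {"field": field}}


def term(field, value):
    return {"term": {field: value}}


def bool_filter(must=(), should=(), must_not=()):
    clauses = {"must": list(must), "should": list(should), "must_not": list(must_not)}
    return {"bool": {k: v for k, v in clauses.items() if v}}


def search_body(filter_body):
    # Wrap a filter the way an Elasticsearch 1.x "filtered" query expects.
    return {"query": {"filtered": {"filter": filter_body}}}


# Case (1): dedicated isFoo / isBar fields.
DEDICATED = {
    "foo_or_bar": bool_filter(should=[exists("isFoo"), exists("isBar")]),
    "foo_and_bar": bool_filter(must=[exists("isFoo"), exists("isBar")]),
    "foo_not_bar": bool_filter(must=[exists("isFoo")], must_not=[exists("isBar")]),
}

# Case (2): a single "flags" array of strings.
FLAGS_ARRAY = {
    # Equivalent to "Foo or Bar" only while isFoo and isBar are the only flags.
    "foo_or_bar": exists("flags"),
    "foo_and_bar": bool_filter(must=[term("flags", "isFoo"), term("flags", "isBar")]),
    "foo_not_bar": bool_filter(must=[term("flags", "isFoo")],
                               must_not=[term("flags", "isBar")]),
}


def matches(filter_body, doc):
    """Toy evaluator for the three filter kinds used above (not Elasticsearch)."""
    kind, arg = next(iter(filter_body.items()))
    if kind == "exists":
        return doc.get(arg["field"]) not in (None, [])
    if kind == "term":
        field, value = next(iter(arg.items()))
        stored = doc.get(field, [])
        return value in (stored if isinstance(stored, list) else [stored])
    if kind == "bool":
        should = arg.get("should", [])
        return (all(matches(c, doc) for c in arg.get("must", []))
                and (not should or any(matches(c, doc) for c in should))
                and not any(matches(c, doc) for c in arg.get("must_not", [])))
    raise ValueError("unsupported filter: " + kind)


def select(filter_body, docs):
    return sorted(doc_id for doc_id, doc in docs.items() if matches(filter_body, doc))


# The same four documents, stored under each layout.
docs_1 = {"a": {"isFoo": True}, "b": {"isBar": True},
          "c": {"isFoo": True, "isBar": True}, "d": {}}
docs_2 = {"a": {"flags": ["isFoo"]}, "b": {"flags": ["isBar"]},
          "c": {"flags": ["isFoo", "isBar"]}, "d": {}}

for name in DEDICATED:
    print(name, select(DEDICATED[name], docs_1), select(FLAGS_ARRAY[name], docs_2))
```

Note that if more flag values are added later, the "exists" filter on "flags" in case (2) no longer means "Foo or Bar"; a "terms" filter listing the two values would keep that query precise.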
The bottom line: in case (1), all queries boil down to a mixture of
Boolean, exists, and missing filters. In case (2), one has to process the
strings in the array of strings named "flags". My intuition is that it is
faster to use method (1). In terms of space complexity, I believe there is
no difference.
On Wed, Dec 10, 2014, Itamar Syn-Hershko wrote:
For Lucene / Elasticsearch, the difference is pretty much insignificant as
long as you use filters. You should prefer not_analyzed fields with string
values to represent those flags over dedicated boolean fields if you will
have more than a few such flags.
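As an illustration of the not_analyzed suggestion, the sketch below builds a mapping body in Python, assuming the Elasticsearch 1.x string mapping that was current at the time. The type name "doc" and the helper name are hypothetical. Later Elasticsearch versions express the same idea with a "keyword" field type, so check the version in use.

```python
import json


def flags_mapping(type_name="doc"):
    """Mapping that indexes each string in "flags" as a single, unanalyzed term."""
    return {
        "mappings": {
            type_name: {  # hypothetical type name
                "properties": {
                    "flags": {"type": "string", "index": "not_analyzed"}
                }
            }
        }
    }


print(json.dumps(flags_mapping(), indent=2))
```

Because the field is not analyzed, a "term" filter for "isFoo" is compared against the stored string exactly as written, including its case.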
There is a difference in space complexity. The more fields you use in a
search, the more heavy lifting Lucene must do, and the bigger the filter
caches you need.
Solution (2), with one field, is more compact and therefore faster.
Jörg
On Wed, Dec 10, 2014, Dror Atariah wrote:
@Itamar: Can you please elaborate on the matter? Why and how is the number
of fields relevant here?
On Wed, Dec 10, 2014, Itamar Syn-Hershko wrote:
Basically, you will have to maintain more filters. Also, Lucene supports
only up to a certain number of fields; it wasn't designed to handle an
unlimited number of them.
On Wed, Dec 10, 2014, Dror Atariah wrote:
Are there any differences or implications if aggregations are also
needed?
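The thread does not answer this, so the following is only a sketch of how the two layouts might differ for per-flag counts, again assuming Elasticsearch 1.x request syntax and hypothetical function and aggregation names. With a single "flags" field, one "terms" aggregation can count every flag; with dedicated fields, each flag needs its own "filter" aggregation.

```python
def counts_with_flags_array():
    # One terms aggregation buckets documents by each value found in "flags".
    return {"aggs": {"per_flag": {"terms": {"field": "flags"}}}}


def counts_with_dedicated_fields(flag_fields):
    # One filter aggregation per dedicated field; the field list must be known up front.
    return {
        "aggs": {
            name: {"filter": {"exists": {"field": name}}} for name in flag_fields
        }
    }


print(counts_with_flags_array())
print(counts_with_dedicated_fields(["isFoo", "isBar"]))
```

The practical difference in this sketch is that the request for dedicated fields grows with the number of flags, while the request for the "flags" array stays the same.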