More on Solr vs ES faceting

On Sep 9, 2011, at 12:11 AM, Jason Rutherglen wrote:

We need more data to answer that. Solr intersects [cached] bit sets
which can be very fast! I think ES uses a field cache mechanism, I
don't know if it implements bit sets. Per-segment faceting is
possible with bit sets. It's just software.
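
To make the bit-set idea concrete, here is a minimal sketch in Python. It is not Solr's code: it only illustrates counting facet values by intersecting the query's bit set with one cached bit set per facet value. Python integers stand in for the bit sets, and the data and function names are made up for illustration.

```python
def to_bitset(doc_ids):
    """Pack an iterable of doc ids into a Python int used as a bit set."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def facet_counts(query_bits, term_bitsets):
    """For each facet value, count the matching docs that carry it."""
    return {
        term: bin(query_bits & term_bits).count("1")
        for term, term_bits in term_bitsets.items()
    }


# Built once and cached (hypothetical data): the docs carrying each value.
term_bitsets = {
    "en": to_bitset([0, 1, 4, 5]),
    "de": to_bitset([2, 3, 5]),
    "it": to_bitset([6]),
}

# The docs matching the current query.
query_hits = to_bitset([1, 2, 5, 6])
counts = facet_counts(query_hits, term_bitsets)
```

The cost is one intersection per facet value, so this approach is attractive for fields with few distinct values and gets expensive for fields with very many.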

This was my original hypothesis: Solr's faceting algorithms are more efficient.
On my desktop PC I have tested Solr with 11M docs, and it does faceting in about a second even with huge record sets.
In that scenario Solr works with 5G of RAM, while ES with 7G keeps giving OOM errors.
Also, memory usage in ES during faceting is larger than in Solr, and I presume that the "values" of the facets are loaded and intersected.

I presume re-implementing how ES does faceting isn't an easy task, and I like the ES features a lot.
If we were doing only simple searches I would switch from Solr to ES, because it's so easy to handle a cluster, distribute searches, create new databases and configure everything via "software" instead of via XML files.
I hope this discussion will help improve ES.
Maybe Yonik, one of the fathers of Solr, can give Shay some ideas about how to make ES faceting better.

Ciao

On Thu, Sep 8, 2011 at 6:07 PM, Andy selforganized@gmail.com wrote:

Solr does have some per-segment faceting capabilities. They are not used by
default because they are slower unless you are rapidly updating the index.

So is ES's per-segment faceting the reason why Dario found it to be so
much slower than Solr once indexing is finished (4-7 seconds for ES
vs. sub-second for Solr)?

Is there any way to tune ES to speed that up?

Dario Rigolin
drigolin@gmail.com

just my two cents:

Maybe Yonik, one of the fathers of Solr, can give Shay some ideas about how to make ES faceting better.

Shay needs a reproduction of this scenario, otherwise it will be really
hard to track down the problem(s) and it's just "guessing in the
dark". At the moment it could simply mean that your data or your
config is a special case where Solr is 'better' than ES.

@Dario maybe you can use some public data and a bit of curl or Java code
to reproduce this "ES vs Solr" behaviour? That would be really
great :) !

Regards,
Peter.

--

http://jetsli.de news reader for geeks

On Sep 9, 2011, at 11:02 AM, Karussell wrote:

just my two cents:

Maybe Yonik, one of the fathers of Solr, can give Shay some ideas about how to make ES faceting better.

Shay needs a reproduction of this scenario, otherwise it will be really
hard to track down the problem(s) and it's just "guessing in the
dark". At the moment it could simply mean that your data or your
config is a special case where Solr is 'better' than ES.

Yes I agree.

@Dario maybe you can use some public data and a bit of curl or Java code
to reproduce this "ES vs Solr" behaviour? That would be really
great :) !

I have sent Shay a full backup of my installation.

Regards,
Peter.

--

http://jetsli.de news reader for geeks

Dario Rigolin
drigolin@gmail.com

I have sent Shay a full backup of my installation.

cool, this is great! Thanks!

Regards,
Peter.

Let me try to explain it again; it does not seem that what Jason said was
understood.

All elasticsearch facet computations are geared towards what is called
near real time search. Most Solr facets are not. The ones that are (based on
a quick glance at the code) are very similar in implementation to
how elasticsearch does it (there is no real magic here).

Facets that do not work in a "near real time" (per-segment) manner can be
optimized in ways that per-segment ones cannot. They will use less
memory (assuming a static index), and will generally execute faster (again,
assuming a static index). On the other hand, if the index keeps changing, they
will be problematic when it comes to performance, and very problematic when
it comes to memory usage (in the long run).

Obviously, things can always be improved, but a good comparison would be
to compare against the facets Solr has that are implemented for near real time,
and see how to match them.
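
A rough sketch of the per-segment approach described above, in Python. It is not the elasticsearch or Solr implementation; it assumes a single-valued field and made-up data, and only shows the trade-off: each segment gets its own small cache of term ordinals, counts are merged by term at query time, and a newly added segment needs only its own cache built.

```python
from collections import Counter


def build_segment_cache(segment_values):
    """Per-segment cache: a sorted term list plus one ordinal per doc."""
    terms = sorted(set(segment_values))
    ordinal = {term: i for i, term in enumerate(terms)}
    return terms, [ordinal[value] for value in segment_values]


def facet_segment(cache, hits):
    """Count ordinals for the matching docs of one segment."""
    terms, ords = cache
    counts = [0] * len(terms)
    for doc in hits:
        counts[ords[doc]] += 1
    return Counter({terms[i]: c for i, c in enumerate(counts) if c})


def facet_index(caches, hits_per_segment):
    """Merge per-segment counts by term string (the extra query-time cost)."""
    total = Counter()
    for cache, hits in zip(caches, hits_per_segment):
        total.update(facet_segment(cache, hits))
    return total


# Two segments with a hypothetical "language" value per doc.
segments = [["en", "de", "en"], ["it", "en"]]
caches = [build_segment_cache(values) for values in segments]
result = facet_index(caches, [[0, 1, 2], [0, 1]])

# A new segment arrives: only its own cache is built, the others are reused.
segments.append(["de", "de"])
caches.append(build_segment_cache(segments[-1]))
result_after = facet_index(caches, [[0, 1, 2], [0, 1], [0, 1]])
```

A non-per-segment variant would keep one cache for the whole index and skip the merge step, which is faster on a static index but has to be rebuilt in full whenever the index changes.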

On Fri, Sep 9, 2011 at 9:26 AM, Dario Rigolin drigolin@gmail.com wrote:

On Sep 9, 2011, at 12:11 AM, Jason Rutherglen wrote:

We need more data to answer that. Solr intersects [cached] bit sets
which can be very fast! I think ES uses a field cache mechanism, I
don't know if it implements bit sets. Per-segment faceting is
possible with bit sets. It's just software.

This was my original hypothesis: Solr's faceting algorithms are more
efficient.
On my desktop PC I have tested Solr with 11M docs, and it does faceting in
about a second even with huge record sets.
In that scenario Solr works with 5G of RAM, while ES with 7G keeps
giving OOM errors.
Also, memory usage in ES during faceting is larger than in Solr, and I presume
that the "values" of the facets are loaded and intersected.

I presume re-implementing how ES does faceting isn't an easy task, and I
like the ES features a lot.
If we were doing only simple searches I would switch from Solr to ES, because
it's so easy to handle a cluster, distribute searches, create new
databases and configure everything via "software" instead of via XML files.
I hope this discussion will help improve ES.
Maybe Yonik, one of the fathers of Solr, can give Shay some ideas
about how to make ES faceting better.

Ciao

On Thu, Sep 8, 2011 at 6:07 PM, Andy selforganized@gmail.com wrote:

Solr does have some per-segment faceting capabilities. They are not
used by default because they are slower unless you are rapidly updating
the index.

So is ES's per-segment faceting the reason why Dario found it to be so
much slower than Solr once indexing is finished (4-7 seconds for ES
vs. sub-second for Solr)?

Is there any way to tune ES to speed that up?

Dario Rigolin
drigolin@gmail.com

Shay,

Is ES using DocTermOrds for faceting?

On Fri, Sep 9, 2011 at 8:33 AM, Shay Banon kimchy@gmail.com wrote:

Let me try to explain it again; it does not seem that what Jason said was
understood.
All elasticsearch facet computations are geared towards what is called
near real time search. Most Solr facets are not. The ones that are (based on
a quick glance at the code) are very similar in implementation to
how elasticsearch does it (there is no real magic here).
Facets that do not work in a "near real time" (per-segment) manner can be
optimized in ways that per-segment ones cannot. They will use less
memory (assuming a static index), and will generally execute faster (again,
assuming a static index). On the other hand, if the index keeps changing, they
will be problematic when it comes to performance, and very problematic when
it comes to memory usage (in the long run).
Obviously, things can always be improved, but a good comparison would be
to compare against the facets Solr has that are implemented for near real time,
and see how to match them.

Hi Dario,

I have played with some facet queries on our ES nodes.

I gave some thought to which fields are selected for faceting.
Facet fields have the attributes "index":"not_analyzed",
"omit_term_freq_and_positions":true, "omit_norms":true. They should
contain only controlled vocabularies, e.g. date facets are integers
(roughly 300 values for the era of issued printed material), and language
codes (ISO 639) should not have more than 200-500 values; in my sample
they are an accidental mix of ISO 639-1 and ISO 639-2 values.
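
As an illustration, the attributes above could be written as a mapping along these lines. This is only a sketch built in Python: the three attributes are the ones quoted in this post, while "type": "string", the type name "title_record" and the field names "language" and "issued_year" are assumptions made up for the example.

```python
import json


def facet_field():
    """Settings for a facet field; "type": "string" is an assumption."""
    return {
        "type": "string",
        "index": "not_analyzed",
        "omit_norms": True,
        "omit_term_freq_and_positions": True,
    }


# "title_record", "language" and "issued_year" are hypothetical names.
mapping = {
    "title_record": {
        "properties": {
            "language": facet_field(),
            # Date facets are kept as plain integers, as described above.
            "issued_year": {"type": "integer"},
        }
    }
}

# JSON body that could be sent when creating the mapping.
body = json.dumps(mapping, indent=2)
```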

With such constraints, I can create faceted results from over 18
million hits in 2-3 seconds for most common use cases.

I need to emphasize that I performed a quick'n'dirty test. I never
took time to optimize the configuration, this is just "out of the
box".

In short, what I observed is that the more facets I add, the slower ES
responds. If I add facets with author names, which contain several
thousands or even millions(!) of different values, ES easily takes
5-10 seconds for the first, initiating query. That is not surprising to
me; it is equivalent to the response times I obtained from a former
commercial search engine product we used. Caching will improve the
speed a little, but I will never get down to 2-3 seconds. If I remove
such "heavy facets", I almost instantly get back to response times of
2-3 seconds.

Each node is an AMD Opteron Processor 6172 server with 2 x 12 cores (=
24 cores per node) and 16 GB RAM. The ES JVM is 64-bit, with
set.default.ES_MIN_MEM=1024 and set.default.ES_MAX_MEM=8192. I configured
10 shards and 1 replica, which is kind of low but enough for us right
now. And this is ES 0.16.1. I never encountered OOMs.

BTW our users will certainly never see ten facets, maybe three or
four. On our presentation system the facet displays will be limited by
screen design and triggered by context. I'm always stressing the
importance of the overall search index design and the facet
cardinalities, of reducing complex library classifications in the index,
etc. I put a lot of work into simplifying library catalog data. It is
obvious that nobody can successfully navigate complex classifications
with too many selectable values.

Jörg

On Fri, Sep 9, 2011 at 8:33 AM, Shay Banon kimchy@gmail.com wrote:

All elasticsearch facet computations are geared towards what is called
near real time search. Most Solr facets are not. The ones that are (based on
a quick glance at the code) are very similar in implementation to
how elasticsearch does it (there is no real magic here).
Facets that do not work in a "near real time" (per-segment) manner can be
optimized in ways that per-segment ones cannot. They will use less
memory (assuming a static index), and will generally execute faster (again,
assuming a static index). On the other hand, if the index keeps changing, they
will be problematic when it comes to performance, and very problematic when
it comes to memory usage (in the long run).

One statistic that is missing is how accurate are the facet counts
while indexing is occurring? Of course this statistic would be hard to
capture, but perhaps a miscount might be obvious.

A configurable setting for non-real-time/real-time faceting would be
interesting, but probably impossible.

Ivan

One statistic that is missing is how accurate are the facet counts
while indexing is occurring? Of course this statistic would be hard to
capture, but perhaps a miscount might be obvious

Why would it be inaccurate? The IndexReader is static. Even with
realtime search / LUCENE-2312, each reader will be static and we can
implement truly realtime faceting using the same system (on top of
Lucene) as we have currently (in ES, Solr, or anything else).

On Fri, Sep 9, 2011 at 4:27 PM, Ivan Brusic ivan@brusic.com wrote:


One statistic that is missing is how accurate are the facet counts
while indexing is occurring? Of course this statistic would be hard to
capture, but perhaps a miscount might be obvious.

A configurable setting for non-real-time/real-time faceting would be
interesting, but probably impossible.

Ivan

No (it's part of the upcoming 4.0), though it won't help much based on a quick
review... (at least not perf-wise).

On Fri, Sep 9, 2011 at 3:59 PM, Jason Rutherglen jason.rutherglen@gmail.com wrote:

Shay,

Is ES using DocTermOrds for faceting?


I think there are ways to increase the performance of the un-inverted
model used by DocTermOrds. E.g., encode the doc ids in a compressed bit
form that does not incur the decoding cost of variable-length integer
compression. It would use more RAM but be far faster.
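
To illustrate the trade-off, here is a small Python sketch. It is not the DocTermOrds layout: it simply stores the same doc ids once as delta-encoded variable-length integers, which must be decoded byte by byte on every read, and once as a fixed-size bit set, where a membership test is a single index and mask. All names and data are made up for the example.

```python
def encode_vints(sorted_doc_ids):
    """Delta-encode doc ids, 7 bits per byte, high bit marks continuation."""
    out = bytearray()
    previous = 0
    for doc_id in sorted_doc_ids:
        delta = doc_id - previous
        previous = doc_id
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_vints(data):
    """Every read of the compact form pays for this byte-by-byte loop."""
    doc_ids, value, shift, previous = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            previous += value
            doc_ids.append(previous)
            value, shift = 0, 0
    return doc_ids


def to_bitset(doc_ids, max_doc):
    """Fixed-size bit form: one bit per doc in the index."""
    bits = bytearray((max_doc + 7) // 8)
    for doc_id in doc_ids:
        bits[doc_id >> 3] |= 1 << (doc_id & 7)
    return bits


def contains(bits, doc_id):
    """Membership test without any decoding step."""
    return bool(bits[doc_id >> 3] & (1 << (doc_id & 7)))


doc_ids = [3, 130, 131, 20000]
packed = encode_vints(doc_ids)
bitset = to_bitset(doc_ids, max_doc=20001)
```

For this sparse example the variable-length form is a handful of bytes while the bit set needs one bit for every doc in the index, which is the "more RAM" side of the trade-off.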

On Sat, Sep 10, 2011 at 6:10 PM, Shay Banon kimchy@gmail.com wrote:

No (it's part of the upcoming 4.0), though it won't help much based on a quick
review... (at least not perf-wise).

