More on Solr vs ES faceting


(drigolin) #1

Over the last week I did a lot of testing of faceting on medium-sized document sets (2M bibliographic records) using both Solr and ES.
After some issues at the beginning, I spent some time configuring the index and mapping in ES to match the Solr schema definition.
We used 3 kinds of queries:

  1. Faceting on small result sets
  2. Faceting on all records
  3. Faceting on large result sets (40-60% of all collection)

I tested many aspects and here are my results:

  1. Indexing speed is 20%-30% higher on ES, and this speed stays constant throughout indexing.
  2. Simple searching during indexing: ES is 50% faster than Solr. Search on both is subsecond, which is acceptable for our use case.
  3. Faceting during indexing is the key point for us. All our queries use faceting and we need subsecond query response times. ES is as fast as Solr during indexing, and after 1.5M records ES is 30-40% faster than Solr: ES replies in 5-7 secs and Solr in 6-10 secs. Solr is faster only on query type 1. Memory usage on ES is double that of Solr: ES uses 23% of RAM and Solr 13%.

When indexing is finished, all 2M docs are indexed and both indexes are optimized, search response times change completely: Solr replies to all queries in under a second while ES still replies in 2-7 secs. We tested different queries against Solr and disabled the HTTP cache, but the faceting algorithms and caching in Solr seem more memory and speed efficient. If we send one new doc every second, we see that Solr needs 4-13 secs and ES replies in 4-7 secs.

In the Solr world, slow faceting (many seconds!) during index updates is a common problem, because all caches are discarded at every commit. For that reason we run a second instance working on the same Lucene index in "read only" mode: one instance does the indexing and the other serves queries. Once in a while (every hour or more) we send a fake "commit" to the read-only instance to force it to reload the updated index. In this scenario faceting is subsecond most of the time, but in the meantime we don't see changes.
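
The "fake commit" step can be sketched roughly as below. This is only an illustration, not our actual script: the base URL and port of the read-only instance are hypothetical, and it assumes the standard Solr XML update handler at /update, which accepts a `<commit/>` message. The request is built but not sent, so the snippet runs without a Solr server.

```python
import urllib.request


def build_commit_request(solr_base_url):
    """Build (but do not send) a POST request asking a Solr instance to commit.

    On the read-only instance a commit with no pending documents only
    forces a new searcher to be opened on the shared index.
    """
    url = solr_base_url.rstrip("/") + "/update"
    return urllib.request.Request(
        url,
        data=b"<commit/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


# Hypothetical address of the read-only instance (an assumption, not from this thread).
READ_ONLY_SOLR = "http://localhost:8984/solr"

req = build_commit_request(READ_ONLY_SOLR)
print(req.get_method(), req.full_url)
# To actually send it from a cron job or scheduler:
# urllib.request.urlopen(req)
```

Running this every 30-60 minutes trades freshness for warm caches: queries stay fast between commits, but new documents only become visible after the next one.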

The issue with ES that I'm trying to solve is this very high time (2-7 secs) for faceting on 2M docs. For libraries (our customers) we have to handle indexes of 2-11M docs, and at the moment with 2 Solr instances we get very good performance. Using SSDs we can reduce the initial cache reload penalty.
I'm waiting to see how ES works on SSD and whether this affects faceting speed.

One last note...
After a fresh restart of both, memory usage is 13% for ES and 6% for Solr.
ES replies to the first type 2 query in 26 secs and Solr in 10 secs.
For subsequent queries of type 1 and 3, ES replies in 3-5 secs and Solr in 0.1-0.6 secs!

I'm looking for a way to get ES to answer a faceting query on our record sets in under a second.

ciao.

Dario Rigolin
drigolin@gmail.com


(Shay Banon) #2

Heya,

I'll try again, though it does not seem like I manage to get answers based
on previous mails you sent. I would like to see how you use elasticsearch,
so information on:

  1. Index settings and configuration.
  2. Mapping.
  3. Number of nodes.
  4. Sample documents.
  5. Number of possible values for the field you facet on.
  6. Flow of the test, including how do you optimize the index in es?
  7. Which client lib do you use to talk to elasticsearch?

On Wed, Sep 7, 2011 at 1:40 PM, Dario Rigolin drigolin@gmail.com wrote:


(drigolin) #3

On Sep 7, 2011, at 5:45 PM, Shay Banon wrote:

Heya,

I'll try again, though it does not seem like I manage to get answers based on previous mails you sent. I would like to see how you use elasticsearch, so information on:

  1. Index settings and configuration.

One shard, no replicas. index_analyzer and search_analyzer do word splitting, lowercasing and asciifolding.

  2. Mapping.

No _source and fields excluded from _all.
Template mapping using my analyzer.
Facet fields are strings, not analyzed.
Facet fields one and two are multivalued; the third is single-valued.

  3. Number of nodes.

Single node.

  4. Sample documents.

I will provide one... The collection used for testing is 2M records out of an 11M collection.
They are bibliographic records.

  5. Number of possible values for the field you facet on.

One field (book authors) has 1M different values, the second (book subjects) has 300K different values and the last one (book types) has 10 different values.

  6. Flow of the test, including how do you optimize the index in es?

I inserted the docs into ES using a revised version of the Solr indexer script, and the doc structure is the same as in Solr. Same fields and same analyzer.
I used the "optimize" command to do the optimization.

  7. Which client lib do you use to talk to elasticsearch?

I'm using the Elastica PHP client lib.

I have compressed the elastica installation into a 1.3G tarball and I will send you the Dropbox link when the upload is finished.

Dario Rigolin
drigolin@gmail.com


(Shay Banon) #4

On Wed, Sep 7, 2011 at 7:36 PM, Dario Rigolin drigolin@gmail.com wrote:

On Sep 7, 2011, at 5:45 PM, Shay Banon wrote:

Heya,

I'll try again, though it does not seem like I manage to get answers
based on previous mails you sent. I would like to see how you use
elasticsearch, so information on:

  1. Index settings and configuration.

One shard no replica. index_analyzer and search_analyzer to do word split
lowercase and asciifolding

Can you gist it?

  2. Mapping.

No _source and excluding from _all
template mapping using my analyzer
facet fields are strings not analyzed.
facet field one and two are multivalue third is single valued.

Can you gist it?

  3. Number of nodes.

Single node

  4. Sample documents.

I will provide one... Collection size used for testing 2M of a 11M
collection.
They are bibliographic records

Waiting for it.

  5. Number of possible values for the field you facet on.

One field (book's authors) has 1M different values the second (books
subjects) has 300K different values and last one (book types) has 10
different values

And the facet stats you posted, on which field did you run them?

  6. Flow of the test, including how do you optimize the index in es?

I had inserted docs in ES using a revised version of solr indexer script
and doc structure is the same as solr. Same fields and same analyzer.
I used "optimize" command to do optimization.

What's the command you executed to optimize? To do a full optimization in
elasticsearch, you need to specify max_num_segments and set it to 1:
http://www.elasticsearch.org/guide/reference/api/admin-indices-optimize.html

  7. Which client lib do you use to talk to elasticsearch?

I'm using Elastica PHP client lib.

I have compressed the elastica installation in a tarball of 1.3G and I will
send you the dropbox link when upload is finished.

The elastica installation? Maybe you mean the data directory of
elasticsearch?


(drigolin) #5

On Sep 7, 2011, at 7:00 PM, Shay Banon wrote:

On Wed, Sep 7, 2011 at 7:36 PM, Dario Rigolin drigolin@gmail.com wrote:

On Sep 7, 2011, at 5:45 PM, Shay Banon wrote:

Heya,

I'll try again, though it does not seem like I manage to get answers based on previous mails you sent. I would like to see how you use elasticsearch, so information on:

  1. Index settings and configuration.

One shard no replica. index_analyzer and search_analyzer to do word split lowercase and asciifolding

Can you gist it?

I don't think the analyzer and tokenizer can have that much impact on faceting.
git://gist.github.com/1201204.git

  2. Mapping.

No _source and excluding from _all
template mapping using my analyzer
facet fields are strings not analyzed.
facet field one and two are multivalue third is single valued.

Can you gist it?

  3. Number of nodes.

Single node

  4. Sample documents.

I will provide one... Collection size used for testing 2M of a 11M collection.
They are bibliographic records

Waiting for it.
git://gist.github.com/1201232.git

  5. Number of possible values for the field you facet on.

One field (book's authors) has 1M different values the second (books subjects) has 300K different values and last one (book types) has 10 different values

And the facets stats you posted, on which field did you run it?

One field, named facets_author, has 1M different values. I ran faceting on queries like "*:*", on others with frequent words (milano, roma), and on others with less frequent words like "cane", "gatto", etc.
Another field is named facets_subject and has 300K different values.
The last field is facets_bibtype and has 10 different values.
All my queries requested faceting on all 3 of those fields.
I was unable to get replies from ES in under 3 secs.
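
To make the shape of these requests concrete, here is a minimal sketch of a search body that asks for a terms facet on each of the three fields. It assumes the `facets` section of the search API as it existed in Elasticsearch at the time of this thread (later versions replaced it with aggregations). The facet size of 10 and the use of `query_string` are assumptions for illustration, not taken from the actual test.

```python
import json

# Field names as given in the post above.
FACET_FIELDS = ["facets_author", "facets_subject", "facets_bibtype"]


def build_facet_search(query_text=None, size=10):
    """Build a search body with one terms facet per facet field.

    query_text=None means "all records" (query type 2); otherwise the
    text is passed to a query_string query (types 1 and 3, depending on
    how frequent the word is).
    """
    if query_text is None:
        query = {"match_all": {}}
    else:
        query = {"query_string": {"query": query_text}}
    facets = {
        field: {"terms": {"field": field, "size": size}}
        for field in FACET_FIELDS
    }
    return {"query": query, "facets": facets}


body = build_facet_search("milano")
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of a search request against the index; `build_facet_search()` with no argument gives the "all records" variant.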

  6. Flow of the test, including how do you optimize the index in es?

I had inserted docs in ES using a revised version of solr indexer script and doc structure is the same as solr. Same fields and same analyzer.
I used "optimize" command to do optimization.

Whats the command you executed to optimize? To do full optimization in elasticsearch, you need to specify the max_num_segments and set it to 1: http://www.elasticsearch.org/guide/reference/api/admin-indices-optimize.html.

Only a plain optimize, without specifying the number of segments:

curl -XPOST http://localhost:9200/sbn/_optimize (used default values).
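
For comparison, a small sketch of the two variants of the call: the default one used above and the full optimization suggested in the quoted reply, which passes `max_num_segments` as a request parameter. The helper name is made up for illustration; the host and index name are the ones from the curl command.

```python
from urllib.parse import urlencode


def optimize_url(base_url, index, max_num_segments=None):
    """Return the _optimize URL for an index, optionally forcing a segment count."""
    url = f"{base_url.rstrip('/')}/{index}/_optimize"
    if max_num_segments is not None:
        url += "?" + urlencode({"max_num_segments": max_num_segments})
    return url


default_url = optimize_url("http://localhost:9200", "sbn")
full_url = optimize_url("http://localhost:9200", "sbn", max_num_segments=1)

print(default_url)  # what was run in the test
print(full_url)     # merges the index down to a single segment
```

Either URL is used with a POST, as in the curl command above.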

  7. Which client lib do you use to talk to elasticsearch?

I'm using Elastica PHP client lib.

I have compressed the elastica installation in a tarball of 1.3G and I will send you the dropbox link when upload is finished.

The elastica installation? maybe you mean the data directory of elasticsearch?

Conf and data: the whole elasticsearch directory, with all its folders.

Dario Rigolin
drigolin@gmail.com


(Jörg Prante) #6

Hi Dario,

thank you for your numbers comparing Solr and ES.

On an ES cluster of three nodes at a local site I have indexed ~18
million bibliographic records of a German library union catalog. So I
think our use cases are very similar.

If you can give some samples of the documents, DSL queries, and the
mapping you use, I can maybe offer some additional hints.

Best regards,

Jörg


(Jörg Prante) #7

Hi Dario,

On Sep 7, 7:45 pm, Dario Rigolin drigo...@gmail.com wrote:

https://gist.github.com/1201204

what I can quickly guess from that gist is

  • you declare ten facets, all of those facets will "sum up" and slow
    down the single ES node. Do you really need ten facets or do they
    replicate same data? Will you need to present all of the ten facets to
    the user at once? E.g. biblevel_full and class_desc look like
    candidates for removal.
  • one shard is much too few if you have a multi core cpu, think about
    offering "at least one shard per core", it's my rule of thumb. Then,
    the facet computing resource consumption will spread over the cores
    more easily.

More analysis is surely possible with some statistics about the facets
(result length, values, cardinalities), and the documents and queries
you use.

Best regards,

Jörg


(drigolin) #8

On Sep 7, 2011, at 8:40 PM, jprante wrote:

Hi Dario,

On Sep 7, 7:45 pm, Dario Rigolin drigo...@gmail.com wrote:

https://gist.github.com/1201204

what I can quickly guess from that gist is

  • you declare ten facets, all of those facets will "sum up" and slow
    down the single ES node. Do you really need ten facets or do they
    replicate same data? Will you need to present all of the ten facets to
    the user at once? E.g. biblevel_full and class_desc look like
    candidates for removal.

We usually need more than those 10.
We fully index every UNIMARC subfield and we create sort and facet fields.
I cannot remove them. We also cannot use stop words, because librarians need to find books with titles like "The and or not"...

Jörg, you indexed 18M records on 3 ES nodes: what's the speed of a facet query on author fields for queries like match_all or "berlin"?
What are your nodes' hardware configurations?

  • one shard is much too few if you have a multi core cpu, think about
    offering "at least one shard per core", it's my rule of thumb. Then,
    the facet computing resource consumption will spread over the cores
    more easily.

Using 5 shards I was running out of memory in my previous tests.
I can try 2, as the CPU is a dual core.
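
As a rough sketch of what trying 2 shards would involve: the shard count is part of the index settings and is fixed when the index is created, so the records have to be indexed again into a new index. The helper below only builds the settings body; its name is an assumption for illustration, and zero replicas matches the single-node setup described earlier in the thread.

```python
import json


def index_settings(shards, replicas=0):
    """Build the settings body for creating an index with a given shard count."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }


# Dual-core CPU, so following the "one shard per core" rule of thumb: 2 shards.
body = index_settings(shards=2)
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of the index creation request, before indexing starts.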

More analysis is surely possible with some statistics about the facets
(result length, values, cardinalities), and the documents and queries
you use.

My tests were very simple, but I wanted to get numbers on ES performance compared to Solr.
I know that faceting on large sets is a very memory- and CPU-intensive task, and that caching is key to good performance.
I was expecting ES to be as fast as Solr at faceting. Given the other good things ES is able to do, we were planning to move from Solr to ES in our OPAC application, but the faceting performance on medium record sets (> 1.5M) makes me think carefully.
ES scaling is very nice: I can add more nodes and performance increases (in Solr this cannot be done so easily). But comparing pure "single node" performance makes me think that:

  1. I need to understand better how ES faceting works and how it can be optimized.
  2. At the moment 11M records are easily handled by a single Solr node with 8G RAM. If moving to ES means adding more hardware and more RAM, this is not a simple process for us.

Dario Rigolin
drigolin@gmail.com


(Andy-2) #9

Dario,

Which version of Solr did you use for your test? Were you using the
Solr trunk that has near real time support or Solr 3.x that doesn't
have NRT?


(drigolin) #10

On Sep 8, 2011, at 8:29 AM, Andy wrote:

Dario,

Which version of Solr did you use for your test? Were you using the
Solr trunk that has near real time support or Solr 3.x that doesn't
have NRT?

I'm using 3.3, with no NRT for the moment. We usually have two cores, one for writing and one for reading, and we send fake commits to the read core every 30-60 minutes.

Dario Rigolin
drigolin@gmail.com


(Andy-2) #11

I'm using 3.3, with no NRT for the moment. We usually have two cores, one for writing and one for reading, and we send fake commits to the read core every 30-60 minutes.

I see. So when you said "Simple Searching during indexing ES is 50%
faster than Solr" was Solr really doing indexing and searching at the
same time?

It just seems that ES should be more than 50% faster than Solr 3.3
(without NRT) in a concurrent searching/indexing use case.


(drigolin) #12

On Sep 8, 2011, at 9:39 AM, Andy wrote:

I'm using 3.3, with no NRT for the moment. We usually have two cores, one for writing and one for reading, and we send fake commits to the read core every 30-60 minutes.

I see. So when you said "Simple Searching during indexing ES is 50%
faster than Solr" was Solr really doing indexing and searching at the
same time?

Yes. Simple search, no faceting. I have "cache autowarming = 0" on Solr and the search response is good, but ES is faster.

It just seems that ES should be more than 50% faster than Solr 3.3
(without NRT) in a concurrent searching/indexing use case.

Yes. That was in my simple tests, with 2M docs in the collection and one new doc added every second.

I'm wondering whether the ES "missing" and "other" calculations on facets are expensive or not. Maybe I can disable them?

Dario Rigolin
drigolin@gmail.com


(Andy-2) #13

So your test is similar to this one:
http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/

According to the 3rd graph in that test, when 1 document was being
added every 3 seconds to an existing index, simple (no facet) search on
ES was at least 30-40 times faster than Solr. Yet you measured only
50% faster in your test.

I wonder why there is such a huge discrepancy between your test and that one.


(drigolin) #14

Andy, I ran some simple queries, not a real workload, and not on a dedicated server. I ran 10 queries and looked at the response times.
I think that for the simple search scenario this guy gathered better statistics than I did.
I didn't see a 30-40 times difference, but the numbers were very low, and Solr spikes when committing bigger batches may have an impact on the overall benchmark.
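
A small harness makes such ad hoc runs easier to compare. The sketch below is only an assumption about how one might do it: `run_query` is a hypothetical callable that issues one search, and reporting the median next to the maximum lets commit-time spikes show up instead of being averaged away.

```python
import statistics
import time


def time_queries(run_query, repeats=10, warmup=1):
    """Time `run_query` several times and summarise the latencies."""
    for _ in range(warmup):
        run_query()  # discard cold-cache runs
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        latencies.append(time.perf_counter() - start)
    return {
        "runs": repeats,
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }


def fake_query():
    """Stand-in for a real search call against Solr or ES."""
    sum(range(10_000))


report = time_queries(fake_query, repeats=10)
```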

Dario Rigolin
drigolin@gmail.com


(Jason Rutherglen) #15

Solr doesn't do per-segment faceting. So any rapid committing / flushing
is going to completely overwhelm the heap space and eventually the
GC.



(Yonik Seeley) #16

bq. Solr doesn't do per-segment faceting.

Solr does have some per-segment faceting capabilities. They are not used by
default because they are slower unless you are rapidly updating the index.
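
The trade-off can be pictured with a toy model. This is not Solr's or ES's implementation, just a sketch under simplifying assumptions: each segment is a small dict of doc id to facet value, and the cache is keyed by segment, so a commit that adds a segment only forces one new load while every query pays for a merge across segments.

```python
from collections import Counter


class Segment:
    """An immutable slice of the index: doc id -> facet value."""

    def __init__(self, name, docs):
        self.name = name
        self.docs = dict(docs)


class PerSegmentFacets:
    """Caches facet values per segment instead of per whole index."""

    def __init__(self):
        self._cache = {}
        self.loads = 0  # how many segments had to be loaded

    def _values(self, segment):
        if segment.name not in self._cache:
            # Stands in for an expensive field cache load.
            self._cache[segment.name] = segment.docs
            self.loads += 1
        return self._cache[segment.name]

    def count(self, segments, matching_ids):
        counts = Counter()
        for segment in segments:  # one pass per segment, merged as we go
            for doc_id, value in self._values(segment).items():
                if doc_id in matching_ids:
                    counts[value] += 1
        return counts


facets = PerSegmentFacets()
old = Segment("seg0", {1: "book", 2: "map", 3: "book"})
first = facets.count([old], {1, 2, 3})
loads_before_commit = facets.loads

new = Segment("seg1", {4: "book"})  # a commit adds one small segment
second = facets.count([old, new], {1, 2, 3, 4})
loads_after_commit = facets.loads
```

After the simulated commit only `seg1` is loaded; `seg0` stays cached. On a static, optimized index there is a single segment, so the per-segment merge buys nothing and only adds overhead.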

Does elasticsearch's per-segment faceting yield 100% accurate results
guaranteed, or does it have some of the same issues that distributed
faceting does?

-Yonik
http://www.lucene-eurocon.com - The Lucene/Solr User Conference


(Shay Banon) #17

On Thu, Sep 8, 2011 at 8:16 PM, Yonik Seeley yseeley@gmail.com wrote:

bq. Solr doesn't do per-segment faceting.

Solr does have some per-segment faceting capabilities. They are not used by
default because they are slower unless you are rapidly updating the index.

Does elasticsearch's per-segment faceting yield 100% accurate results
guaranteed, or does it have some of the same issues that distributed
faceting does?

It gives 100% accuracy at the shard level, and all facets are "segment
based". Some facets (like terms) don't give 100% accuracy when executed in a distributed manner.
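
Why a distributed terms facet can be inexact is easy to show with a generic top-N merge. This is an illustrative sketch, not ES's actual merging code: each shard reports only its own top `n` terms, so counts for terms that just miss the cut on a shard never reach the coordinating node. The shard data is invented.

```python
from collections import Counter


def shard_top_n(counts, n):
    """What one shard sends back: only its n most frequent terms."""
    return dict(Counter(counts).most_common(n))


def distributed_top_n(shards, n):
    """Merge the per-shard top-n lists, then cut to n again."""
    merged = Counter()
    for counts in shards:
        merged.update(shard_top_n(counts, n))
    return merged.most_common(n)


def exact_top_n(shards, n):
    """Ground truth: merge the full counts before cutting."""
    merged = Counter()
    for counts in shards:
        merged.update(counts)
    return merged.most_common(n)


shards = [
    {"a": 10, "b": 6, "c": 5},
    {"c": 9, "b": 6, "a": 1},
]

approximate = distributed_top_n(shards, 1)
exact = exact_top_n(shards, 1)
```

Here the merged answer is `a` with 10, while the true top term is `c` with 14. Asking each shard for more terms than the final `n` narrows the gap but does not remove it in general.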



(Jason Rutherglen) #18

Solr does have some per-segment faceting capabilities

SOME is the key word. E.g., the user doesn't know when it's used
and when it's not. It's something that can be fixed; e.g., distributed
facets (between servers) seem to work, no?



(Andy-2) #19

Solr does have some per-segment faceting capabilities. They are not used by
default because they are slower unless you are rapidly updating the index.

So is ES's per-segment faceting the reason why Dario found it to be so
much slower than Solr once indexing is finished (4-7 seconds for ES
vs. sub-second for Solr)?

Is there any way to tune ES to speed that up?


(Jason Rutherglen) #20

We need more data to answer that. Solr intersects [cached] bit sets,
which can be very fast! I think ES uses a field cache mechanism; I
don't know if it implements bit sets. Per-segment faceting is
possible with bit sets. It's just software.
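
To make the bit set idea concrete, here is a minimal sketch that uses Python integers as bit sets. It only illustrates the general technique of intersecting a query's document set with one cached set per facet term; it is not taken from Solr or ES, and the doc ids and terms are invented.

```python
def bitset(doc_ids):
    """Pack doc ids into an int, one bit per document."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def facet_counts(query_bits, term_bitsets):
    """Count, per term, the docs that match both the query and the term."""
    return {
        term: bin(query_bits & bits).count("1")
        for term, bits in term_bitsets.items()
    }


# Cached once per term, reused by every query until the index changes.
term_bitsets = {
    "book": bitset([0, 1, 4, 5]),
    "map": bitset([2, 3]),
}

query_bits = bitset([1, 2, 4])  # docs matching the current query
counts = facet_counts(query_bits, term_bitsets)
```

Each facet value costs one AND plus a population count, which is cheap while the cached sets stay valid. Keeping one set per term per segment would be one way to combine this with per-segment faceting.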
