Over the last week I did a lot of testing on faceting over a medium-sized document set (2M bibliographic records), using both Solr and ES.
After some initial issues I spent some time configuring the ES index and mapping to match the Solr schema definition.
We used 3 kinds of queries:
Faceting on small result sets
Faceting on all records
Faceting on large result sets (40-60% of all collection)
I tested many aspects and here are my results:
Indexing speed is 20%-30% higher on ES, and this speed stays constant throughout indexing.
For simple searching during indexing, ES is 50% faster than Solr. Search on both is sub-second and acceptable for our use case.
Faceting during indexing is the key point for us. All our queries use faceting and we need sub-second query response times. ES is as fast as Solr during indexing, and after 1.5M records ES is 30-40% faster than Solr: ES replies in 5-7 sec and Solr in 6-10 sec. Solr is faster only on query type 1. Memory usage is double on ES compared to Solr: ES uses 23% RAM and Solr 13%.
When indexing is finished, all 2M docs are indexed and both indexes are optimized, the search response changes completely: Solr replies to all queries in sub-second time while ES still replies in 2-7 secs. We tested different queries against Solr and disabled the HTTP cache, but Solr's faceting algorithms and caching seem more memory and speed efficient. If we send one new doc every second, we see that Solr needs 4-13 secs and ES replies in 4-7 secs.
In the Solr world, slow faceting (many seconds!) during index updates is a common problem because all caches are discarded at every commit. For that reason we usually run a second instance in "read only" mode on the same Lucene index: one instance does the indexing and the other serves queries. Once in a while (every hour or more) we send a fake "commit" to the read-only instance to force it to reload the updated index. In this scenario faceting is sub-second most of the time, but in the meantime we don't see the changes.
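(Just to make the mechanism concrete: the "fake commit" is nothing more than an empty update request with commit=true sent to the read-only instance, roughly like the line below; the host, port and core path are placeholders, not our real setup.)

# Force the read-only Solr instance to commit / reopen its searcher so it
# picks up the segments written by the indexing instance (host is a placeholder).
curl 'http://readonly-host:8983/solr/update?commit=true'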
The issue with ES that I'm looking to solve is this very high response time (2-7 secs) for faceting on 2M docs. For libraries (our customers) we have to handle indexes of 2-11M docs, and at the moment with 2 Solr instances we get very good performance. Using an SSD we can reduce the initial cache-reload penalty.
I'm waiting to see how ES works on an SSD and whether this affects faceting speed.
Last note...
After a fresh restart of both, memory usage is 13% for ES and 6% for Solr.
ES replies to the first query of type 2 in 26 secs and Solr in 10 secs.
For subsequent queries of types 1 and 3, ES replies in 3-5 secs and Solr in 0.1-0.6 secs!
I'm looking for a way to have ES answer a faceting query on our record sets in sub-second time.
I'll try again, though it does not seem like I managed to get answers based on the previous mails you sent. I would like to see how you use elasticsearch, so information on:
Index settings and configuration.
Mapping.
Number of nodes.
Sample documents.
Number of possible values for the field you facet on.
Flow of the test, including how you optimize the index in ES?
Which client lib do you use to talk to elasticsearch?
I'll try again, though it does not seem like I managed to get answers based on the previous mails you sent. I would like to see how you use elasticsearch, so information on:
Index settings and configuration.
One shard, no replica. index_analyzer and search_analyzer do word splitting, lowercasing, and ASCII folding.
Mapping.
No _source, and fields are excluded from _all.
A template mapping using my analyzer.
Facet fields are strings, not analyzed.
Facet fields one and two are multi-valued; the third is single-valued. (A rough sketch of this setup is at the end of this message.)
Number of nodes.
Single node
Sample documents.
I will provide one... The collection size used for testing is 2M out of an 11M collection.
They are bibliographic records.
Number of possible values for the field you facet on.
One field (book authors) has 1M different values, the second (book subjects) has 300K different values, and the last one (book types) has 10 different values.
Flow of the test, including how you optimize the index in ES?
I inserted docs into ES using a revised version of the Solr indexer script, and the doc structure is the same as in Solr. Same fields and same analyzers.
I used the "optimize" command to do the optimization.
Which client lib do you use to talk to elasticsearch?
I'm using the Elastica PHP client lib.
I have compressed the elastica installation into a 1.3G tarball and I will send you the Dropbox link when the upload is finished.
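To make the setup described above concrete, here is a rough sketch of the kind of index creation request I mean. It is not our exact configuration: the index name "catalog", the type name "record" and the "title" field are only illustrative, and the facet field names are the ones I mention later in this thread.

# Sketch only: one shard, no replicas, a lowercase + asciifolding analyzer,
# _source disabled, and not_analyzed facet fields excluded from _all.
curl -XPUT 'http://localhost:9200/catalog' -d '{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "folding": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "record": {
      "_source": { "enabled": false },
      "index_analyzer": "folding",
      "search_analyzer": "folding",
      "properties": {
        "title":          { "type": "string", "include_in_all": true },
        "facets_author":  { "type": "string", "index": "not_analyzed", "include_in_all": false },
        "facets_subject": { "type": "string", "index": "not_analyzed", "include_in_all": false },
        "facets_bibtype": { "type": "string", "index": "not_analyzed", "include_in_all": false }
      }
    }
  }
}'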
I'll try again, though it does not seem like I managed to get answers based on the previous mails you sent. I would like to see how you use elasticsearch, so information on:
Index settings and configuration.
One shard, no replica. index_analyzer and search_analyzer do word splitting, lowercasing, and ASCII folding.
Can you gist it?
Mapping.
No _source, and fields are excluded from _all.
A template mapping using my analyzer.
Facet fields are strings, not analyzed.
Facet fields one and two are multi-valued; the third is single-valued.
Can you gist it?
Number of nodes.
Single node
Sample documents.
I will provide one... The collection size used for testing is 2M out of an 11M collection.
They are bibliographic records.
Waiting for it.
Number of possible values for the field you facet on.
One field (book authors) has 1M different values, the second (book subjects) has 300K different values, and the last one (book types) has 10 different values.
And the facet stats you posted, which field did you run them on?
Flow of the test, including how you optimize the index in ES?
I inserted docs into ES using a revised version of the Solr indexer script, and the doc structure is the same as in Solr. Same fields and same analyzers.
I used the "optimize" command to do the optimization.
What's the command you executed to optimize? To do a full optimization in elasticsearch, you need to specify max_num_segments and set it to 1.
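Something along these lines should do it (the index name is just a placeholder):

# Full optimization: merge the index down to a single segment.
curl -XPOST 'http://localhost:9200/catalog/_optimize?max_num_segments=1'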
Which client lib do you use to talk to elasticsearch?
I'm using the Elastica PHP client lib.
I have compressed the elastica installation into a 1.3G tarball and I will send you the Dropbox link when the upload is finished.
The elastica installation? Maybe you mean the data directory of elasticsearch?
I'll try again, though it does not seem like I managed to get answers based on the previous mails you sent. I would like to see how you use elasticsearch, so information on:
Index settings and configuration.
One shard, no replica. index_analyzer and search_analyzer do word splitting, lowercasing, and ASCII folding.
Can you gist it?
I don't think the analyzer and tokenizer can have that much impact on the faceting.
git://gist.github.com/1201204.git
Mapping.
No _source, and fields are excluded from _all.
A template mapping using my analyzer.
Facet fields are strings, not analyzed.
Facet fields one and two are multi-valued; the third is single-valued.
Can you gist it?
Number of nodes.
Single node
Sample documents.
I will provide one... The collection size used for testing is 2M out of an 11M collection.
They are bibliographic records.
Waiting for it.
git://gist.github.com/1201232.git
Number of possible values for the field you facet on.
One field (book authors) has 1M different values, the second (book subjects) has 300K different values, and the last one (book types) has 10 different values.
And the facet stats you posted, which field did you run them on?
One field, named facets_author, has 1M different values, and I did faceting on a match-all query, on queries with frequent words like "milano" and "roma", and on queries with less frequent words like "cane", "gatto", etc.
Another field, named facets_subject, has 300K different values,
and the third field, facets_bibtype, has 10 different values.
All my queries requested faceting on all three of those fields.
I was unable to get replies from ES in under 3 secs.
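Just to make the shape of those requests concrete, they look roughly like this (the index name, facet names and sizes are placeholders, and the actual query part varied as described above):

# Sketch of a faceting request: match_all query with terms facets on the three fields.
curl -XPOST 'http://localhost:9200/catalog/_search' -d '{
  "size": 10,
  "query": { "match_all": {} },
  "facets": {
    "authors":  { "terms": { "field": "facets_author",  "size": 10 } },
    "subjects": { "terms": { "field": "facets_subject", "size": 10 } },
    "bibtypes": { "terms": { "field": "facets_bibtype", "size": 10 } }
  }
}'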
Flow of the test, including how you optimize the index in ES?
I inserted docs into ES using a revised version of the Solr indexer script, and the doc structure is the same as in Solr. Same fields and same analyzers.
I used the "optimize" command to do the optimization.
Which client lib do you use to talk to elasticsearch?
I'm using the Elastica PHP client lib.
I have compressed the elastica installation into a 1.3G tarball and I will send you the Dropbox link when the upload is finished.
The elastica installation? Maybe you mean the data directory of elasticsearch?
The conf and data directories; the whole elasticsearch dir, all folders.
On an ES cluster of three nodes at a local site I have indexed ~18 million bibliographic records of a German library union catalog, so I think our use cases are very similar.
If you can give some samples of the documents, DSL queries, and the
mapping you use, I can maybe offer some additional hints.
You declare ten facets, and all of those facets will "sum up" and slow down the single ES node. Do you really need ten facets, or do they replicate the same data? Will you need to present all ten facets to the user at once? E.g. biblevel_full and class_desc look like candidates for removal.
One shard is much too few if you have a multi-core CPU; think about offering "at least one shard per core", which is my rule of thumb. Then the resource consumption of the facet computation will spread over the cores more easily.
More analysis is surely possible with some statistics about the facets
(result length, values, cardinalities), and the documents and queries
you use.
You declare ten facets, and all of those facets will "sum up" and slow down the single ES node. Do you really need ten facets, or do they replicate the same data? Will you need to present all ten facets to the user at once? E.g. biblevel_full and class_desc look like candidates for removal.
We usually need more than those 10.
We fully index every UNIMARC subfield and we create sort and facet fields.
I cannot remove them. We also cannot use stop words, because librarians need to find books with titles like "The and or not"...
Jörg, you indexed 18M records on 3 ES nodes; what's the speed of a facet query on the author field, for a query like match_all or "berlin"?
What's your nodes' hardware configuration?
One shard is much too few if you have a multi-core CPU; think about offering "at least one shard per core", which is my rule of thumb. Then the resource consumption of the facet computation will spread over the cores more easily.
Using 5 shards I was running out of memory in my previous tests.
I can try 2, as the CPU is a dual core.
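(Since the shard count is fixed when an index is created, trying 2 shards means creating a new index and reindexing into it; a sketch, with a placeholder index name:)

# Sketch: create a new index with 2 primary shards and reindex into it;
# the shard count cannot be changed on an existing index.
curl -XPUT 'http://localhost:9200/catalog_2shards' -d '{
  "settings": { "number_of_shards": 2, "number_of_replicas": 0 }
}'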
More analysis is surely possible with some statistics about the facets
(result length, values, cardinalities), and the documents and queries
you use.
My tests were very simple, but I was looking for numbers on ES performance compared to Solr.
I know that faceting on large sets is a very memory- and CPU-intensive task and that caching is a key point for good performance.
I was expecting ES to be as fast as Solr at faceting, and looking at the other good things ES can do, we were planning to move from Solr to ES in our OPAC application; but the faceting performance on medium record sets (> 1.5M) makes me think carefully.
ES scaling is very nice: I can add more nodes and performance increases (in Solr this cannot be done so easily), but comparing pure single-node performance makes me think that I need to understand better how ES faceting works and how it can be optimized.
At the moment 11M records are handled easily by a single Solr node with 8G of RAM. If moving to ES means adding more hardware and more RAM, that is not a simple process for us.
According to the 3rd graph in that test, when 1 document was being added every 3 seconds to an existing index, simple (no facet) search on ES was at least 30-40 times faster than Solr. Yet you only measured 50% faster in your test.
I wondered why there is such a huge discrepancy between your test and his test.
Andy, I ran some simple queries, not a real workload, and not on a dedicated server. I ran 10 queries and looked at the result times.
I think that in the simple-search scenario that guy produced better statistics than I did.
I didn't see a 30-40x difference, but the numbers were very low, and Solr spikes when committing bigger batches may have impacted the overall benchmark.
Solr does have some per-segment faceting capabilities. They are not used by
default because it's slower unless you are rapidly updating the index.
Does elasticsearch's per-segment faceting yield 100% accurate results
guaranteed, or does it have some of the same issues that distributed
faceting does?
It gives 100% accuracy at the shard level, and all facets are "segment based". Some facets (like terms) don't when executed in a distributed manner.
Solr does have some per-segment faceting capabilities
SOME is the keyword. E.g., it's unknown to the user when that's used and when it's not. It's something that can be fixed; e.g., distributed facets (between servers) seem to work, no?
Solr does have some per-segment faceting capabilities. They are not used by
default because it's slower unless you are rapidly updating the index.
So is ES's per-segment faceting the reason why Dario found it to be so
much slower than Solr once indexing is finished (4-7 seconds for ES
vs. sub-seconds for Solr)?
We need more data to answer that. Solr intersects [cached] bit sets, which can be very fast! I think ES uses a field cache mechanism; I don't know if it implements bit sets. Per-segment faceting is possible with bit sets. It's just software.