Hi Lukáš,
first of all, thanks for your reply.
Lukáš Vlček wrote:
I do not know anything about Sensei, but given how fast Elasticsearch
and Solr are changing and developing, I would not expect there to be
any easy answer to this question. Also, I am not sure Solr and
Elasticsearch are using the exact same version of Lucene all the time
(and this can be quite important).
If there were an easy answer, I wouldn't have asked for advice or help.
According to Solr's pom.xml [1], the latest-stable release of Solr
(v1.4) uses Lucene v2.9.1.
According to Elasticsearch's pom.xml [2], the latest release of
Elasticsearch (v0.6.0) uses Lucene v3.0.1.
According to Sensei's ivy.xml [3], the development version of
Sensei uses Lucene v3.0.0.
The fact that these projects are using different (sometimes
only slightly different) versions of Lucene matters for the
"fairness" of a comparison, but it does not diminish
the value for users of having a "common", easy way to benchmark
these projects.
Are there big differences, in terms of performance, between
Lucene v2.9.1 and Lucene v3.0.1?
Users could also use the Lucene benchmark contrib to compare
different versions of Lucene and therefore have an idea of
the effect that this might have on the benchmark for Solr,
Elasticsearch and Sensei.
Finally, if the benchmark is easy to use, we could use it to
benchmark latest/stable releases as well as development
trunks/versions.
[1]
http://repo1.maven.org/maven2/org/apache/solr/solr-core/1.4.0/solr-core-1.4.0.pom
[2]
http://oss.sonatype.org/content/repositories/releases/org/elasticsearch/elasticsearch/0.6.0/elasticsearch-0.6.0.pom
[3] http://github.com/javasoze/sensei/blob/master/ivy.xml
And performance of the
search server should not be the only criterion (how about deployment and
maintenance of the server!). I think it is very hard to produce simple,
fair and general metrics to compare all three search servers. A feature
matrix can make sense, but again I think it is not all just about features.
I didn't claim performance should be the only or main criterion.
But I think it's valuable for users to be able to easily compare
performance, if they want/need to. I'd like to be able to do so.
What columns would you put on a feature matrix?
If you need to find the best solution for your project then I would
recommend doing some evaluation, i.e.: take your data (or a sample of
it), index it, put the expected load on your search server, try moving
the index from one server to another (or a similar emulation of
production use cases), try restoring the index from source data
(emulation of a crash), etc., and then you can easily find what best
suits your needs.
Everybody needs to find the best solution for their projects.
Everybody will use different criteria.
It would be good to have a common/sharable/public dataset that others
can (re)use to run the same benchmark and, if they want, compare
different software on the same hardware or the same software on
different hardware.
Would email archives from Apache Software Foundation or W3C be a good
dataset? I could use Tika for parsing mbox files.
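To make the dataset idea a bit more concrete, here is a minimal sketch of turning an mbox archive into flat documents ready for indexing. It uses Python's standard `mailbox` module rather than Tika, so it is a stand-in for the Tika approach, not an example of it. The function name `mbox_to_documents` and the field names (`id`, `from`, `subject`, `date`, `body`) are assumptions made up for illustration; a real schema would have to be agreed for all three engines.

```python
import mailbox


def mbox_to_documents(path):
    """Yield one flat dict per message found in the mbox file at `path`.

    The field names are hypothetical; they are not taken from any of the
    three projects' schemas.
    """
    # create=False: fail on a missing file instead of creating an empty one.
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            body = ""
            # Keep only the first text/plain part; attachments are ignored.
            for part in message.walk():
                if part.get_content_type() != "text/plain":
                    continue
                payload = part.get_payload(decode=True) or b""
                charset = part.get_content_charset() or "utf-8"
                try:
                    body = payload.decode(charset, errors="replace")
                except LookupError:
                    # Unknown charset name in the message: fall back to UTF-8.
                    body = payload.decode("utf-8", errors="replace")
                break
            yield {
                "id": str(message.get("Message-ID", "")),
                "from": str(message.get("From", "")),
                "subject": str(message.get("Subject", "")),
                "date": str(message.get("Date", "")),
                "body": body,
            }
    finally:
        box.close()
```

Each yielded dict could then be converted to whatever input format each engine expects, so that all three index exactly the same content.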
I don't disagree with anything you wrote, but I still think that providing
people an easy way to run the same benchmark over Solr, Elasticsearch
and Sensei is valuable.
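As a rough illustration of what "the same benchmark" could mean, here is a small timing harness sketch. It assumes each engine is wrapped in a plain callable `search(query)`; that adapter, the function name `run_benchmark`, and the reported statistics are all hypothetical choices, not an existing tool or any project's API. It only measures sequential client-side latency, so it is no substitute for a real load test with something like JMeter.

```python
import math
import statistics
import time


def run_benchmark(search, queries, repeats=5, warmup=1):
    """Time `search(query)` for every query and summarise latencies in ms.

    `search` is a hypothetical adapter: any callable that sends one query
    to one engine and blocks until the response has arrived.
    """
    if not queries or repeats < 1:
        raise ValueError("need at least one query and one repeat")

    # Warm-up passes, so cold caches and JIT compilation skew timings less.
    for _ in range(warmup):
        for query in queries:
            search(query)

    latencies = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            search(query)
            latencies.append((time.perf_counter() - start) * 1000.0)

    latencies.sort()
    # Nearest-rank 95th percentile.
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "requests": len(latencies),
        "mean_ms": statistics.mean(latencies),
        "median_ms": statistics.median(latencies),
        "p95_ms": latencies[p95_index],
        "max_ms": latencies[-1],
    }


def fake_search(query):
    # Stand-in for a real engine adapter; it only does a trivial computation.
    return sum(len(word) for word in query.split())


print(run_benchmark(fake_search, ["lucene", "solr faceting", "sensei"], repeats=3))
```

Running the same query list through one adapter per engine, on the same hardware and the same indexed data, would give numbers that are at least comparable with each other.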
So, I am still looking for advice, suggestions and help.
Thanks again for your reply,
Paolo
Regards,
Lukas
On Thu, Apr 15, 2010 at 10:10 AM, Paolo Castagna
<castagna.lists@googlemail.com>
wrote:
Is there a fair and unbiased comparison, in terms of features and
performance, between Solr [1], ElasticSearch [2] and Sensei [3]?
Is anyone interested in helping (just advice on what would
need to be done is fine!) to construct a "common" benchmark, perhaps
using JMeter, which would allow people to easily and quickly compare
the performance of Solr, ElasticSearch and Sensei using the same hardware?
More specifically, I am looking for help and advice on:
- what dataset(s), publicly available, I could use
- what fields/schema (tokenizers, analyzers, etc.) I should use
- a common set of queries
- help with tools (JMeter?, something else?)
- help with configuration (in particular with Sensei, since
it's the one I am least familiar with)
- ...
More general questions:
Is there a simple and pragmatic benchmark for "information retrieval"
systems (please, don't point me at TREC, see: simple and pragmatic)?
Since all these projects are using Lucene, could the Lucene benchmark
contrib [4] be used/adapted to test Solr, ElasticSearch and Sensei?
Sorry for the crossposting, but, in this case, I think it's appropriate.
Thanks,
Paolo
[1] http://lucene.apache.org/solr/
[2] http://www.elasticsearch.com/
[3] http://sna-projects.com/sensei/
[4]
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/contrib/benchmark/