Is there a "common" benchmark for Solr, ElasticSearch and Sensei?

Is there a fair, unbiased comparison of the features and performance of
Solr [1], ElasticSearch [2] and Sensei [3]?

Is there someone interested in helping (even just advice on what would
need to be done is fine!) to construct a "common" benchmark, perhaps
using JMeter, which would allow people to easily and quickly compare
Solr, ElasticSearch and Sensei performance on the same hardware?

More specifically, I am looking for help and advice on:

  • what dataset(s), publicly available, I could use
  • what fields/schema (tokenizers, analyzers, etc.) I should use
  • a common set of queries
  • help with tools (JMeter?, something else?)
  • help with configuration (in particular with Sensei, since
    it's the one I'm least familiar with)
  • ...

More general questions:

Is there a simple and pragmatic benchmark for "information retrieval"
systems (please don't point me at TREC; see: simple and pragmatic)?

Since all these projects use Lucene, could the Lucene benchmark
contrib [4] be used/adapted to test Solr, ElasticSearch and Sensei?

Sorry for the crossposting, but, in this case, I think it's appropriate.

Thanks,
Paolo

[1] http://lucene.apache.org/solr/
[2] http://www.elasticsearch.com/
[3] http://sna-projects.com/sensei/
[4]
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/contrib/benchmark/

Hi,

I don't know anything about Sensei, but given how fast Elasticsearch and
Solr are changing and developing, I would not expect there to be any easy
answer to this question. Also, I am not sure Solr and Elasticsearch use
the exact same version of Lucene at any given time (and this can be quite
important). And performance of the search server should not be the only
criterion (what about deployment and maintenance of the server!). I think
it is very hard to produce simple, fair and general metrics to compare all
three search servers. A feature matrix can make sense, but again I think
it is not all just about features.

If you need to find the best solution for your project then I would
recommend doing some evaluation, i.e. taking your data (or a sample of it),
indexing it, putting the expected load on your search server, trying to
move the index from one server to another (or a similar emulation of
production use cases), trying to restore the index from the source data
(emulating a crash) ... etc ... and then you can easily find what best
suits your needs.

Regards,
Lukas


Hi Lukáš,
first of all, thanks for your reply.

Lukáš Vlček wrote:

I don't know anything about Sensei, but given how fast Elasticsearch and
Solr are changing and developing, I would not expect there to be any easy
answer to this question. Also, I am not sure Solr and Elasticsearch use
the exact same version of Lucene at any given time (and this can be quite
important).

If there were an easy answer, I wouldn't have asked for advice or help. :-)

According to Solr's pom.xml [1], the latest-stable release of Solr
(v1.4) uses Lucene v2.9.1.

According to Elasticsearch's pom.xml [2], the latest release of
Elasticsearch (v0.6.0) uses Lucene v3.0.1.

According to Sensei's ivy.xml [3], the development version of
Sensei uses Lucene v3.0.0.

The fact that these projects are using different (sometimes
only slightly different) versions of Lucene matters in terms
of "fairness" for a comparison, but it does not diminish
the value for users of having a "common", easy way to benchmark
these projects.

Are there big differences, in terms of performance, between
Lucene v2.9.1 and Lucene v3.0.1?

Users could also use the Lucene benchmark contrib to compare
different versions of Lucene and therefore have an idea of
the effect that this might have on the benchmark for Solr,
Elasticsearch and Sensei.

Finally, if the benchmark is easy to use, we could use it to
benchmark latest/stable releases as well as development
trunks/versions.

[1]
http://repo1.maven.org/maven2/org/apache/solr/solr-core/1.4.0/solr-core-1.4.0.pom
[2]
http://oss.sonatype.org/content/repositories/releases/org/elasticsearch/elasticsearch/0.6.0/elasticsearch-0.6.0.pom
[3] http://github.com/javasoze/sensei/blob/master/ivy.xml

And performance of the search server should not be the only
criterion (what about deployment and maintenance of the server!).
I think it is very hard to produce simple, fair and general metrics
to compare all three search servers. A feature matrix can make sense,
but again I think it is not all just about features.

I didn't claim performance should be the only or main criterion.

But I think it's valuable for users to be able to easily compare
performance, if they want or need to. I'd like to be able to do so.

What columns would you put on a feature matrix?

If you need to find the best solution for your project then I would
recommend doing some evaluation, i.e. taking your data (or a sample of
it), indexing it, putting the expected load on your search server, trying
to move the index from one server to another (or a similar emulation of
production use cases), trying to restore the index from the source data
(emulating a crash) ... etc ... and then you can easily find what best
suits your needs.

Everybody needs to find the best solution for their own project.
Everybody will use different criteria.

It would be good to have a common, sharable, public dataset that others
can (re)use to run the same benchmark and, if they want, compare
different software on the same hardware, or different hardware
using the same software.

Would the email archives of the Apache Software Foundation or the W3C be
a good dataset? I could use Tika to parse the mbox files.
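
Something along these lines, I imagine (just a rough, untested sketch; I
am assuming Tika's AutoDetectParser copes with mbox files, otherwise its
dedicated MboxParser would be needed, and class names may differ between
Tika versions):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    // Untested sketch: extract the plain text of an mbox file with Tika,
    // as a first step towards turning mail archives into benchmark documents.
    public class MboxToText {
      public static void main(String[] args) throws Exception {
        InputStream in = new FileInputStream(args[0]); // path to a .mbox file
        try {
          BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
          Metadata metadata = new Metadata();
          new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
          System.out.println(handler.toString());
        } finally {
          in.close();
        }
      }
    }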

I don't disagree with anything you wrote; I still think that providing
people with an easy way to run the same benchmark over Solr, Elasticsearch
and Sensei is valuable.

So, I am still looking for advice, suggestions and help.

Thanks again for your reply,
Paolo


On Thu, Apr 15, 2010 at 11:15 AM, Paolo Castagna
castagna.lists@googlemail.com wrote:

It would be good to have a common, sharable, public dataset that others
can (re)use to run the same benchmark and, if they want, compare
different software on the same hardware, or different hardware
using the same software.

Paolo, you may want to take a look at the recently publicly released
Yahoo Firehose:
http://developer.yahoo.net/blog/archives/2010/04/yahoo_updates_firehose.html

Hope that helps,
Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Sergio Bossa wrote:


Paolo, you may want to take a look at the recently publicly released
Yahoo Firehose:
http://developer.yahoo.net/blog/archives/2010/04/yahoo_updates_firehose.html

Thank you Sergio, it's interesting, but I don't see how I could use
that service to build a dataset that I could then use for a benchmark.

I am not even sure the "Terms of Use" would allow me to download
a large chunk of it and/or re-publish it, as it is or in a different
format, somewhere else.

Ideally, we would need a stable and re-publishable dataset.

In a previous email, I proposed the email archives of the Apache Software
Foundation simply because they are available in mbox format and should
be freely usable and re-publishable.

I'll wait for other ideas/suggestions before deciding on the dataset.

Once we have a dataset, we can discuss and agree on
fields/schemas/tokenizers/analyzers etc., then queries,
then the tools to generate the load and drive the benchmark,
then the actual operations we want to perform/benchmark.
Initially, I was thinking of these broad categories: searches with
a small number of results, searches with a large number of results,
bulk/batch indexing into an empty index, and index updates.
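
To make the query side a bit more concrete, the driver could start as
something as simple as this (a rough, untested sketch; the endpoint below
is just a placeholder, not the actual Solr/Elasticsearch/Sensei API, and
each server would get its own query template):

    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    // Rough sketch of a query driver: fire N HTTP queries and report the mean latency.
    public class SimpleQueryDriver {
      public static void main(String[] args) throws Exception {
        String endpoint = "http://localhost:8983/solr/select?q=";  // placeholder URL
        String[] queries = { "lucene", "benchmark", "mbox" };      // placeholder query terms
        int runs = 100;
        long totalNanos = 0;
        byte[] buf = new byte[4096];
        for (int i = 0; i < runs; i++) {
          String q = URLEncoder.encode(queries[i % queries.length], "UTF-8");
          long start = System.nanoTime();
          HttpURLConnection conn =
              (HttpURLConnection) new URL(endpoint + q).openConnection();
          InputStream in = conn.getInputStream();
          while (in.read(buf) != -1) {
            // drain the response so the full round trip is measured
          }
          in.close();
          totalNanos += System.nanoTime() - start;
        }
        System.out.println("mean latency (ms): " + (totalNanos / runs) / 1000000.0);
      }
    }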


Thanks for your suggestion,
Paolo


On Thu, Apr 15, 2010 at 11:15 AM, Paolo Castagna <
castagna.lists@googlemail.com> wrote:


Are there big differences, in terms of performance, between
Lucene v2.9.1 and Lucene v3.0.1?

I am not that familiar with Lucene internals, but there can be performance
gains, as well as new features that in earlier versions had to be
implemented in the search server layer.
http://lucene.apache.org/java/3_0_0/changes/Changes.html
http://lucene.apache.org/java/3_0_1/changes/Changes.html
I just wanted to point out that you need to take this into account if you
want to compare apples to apples.

Users could also use the Lucene benchmark contrib to compare
different versions of Lucene and therefore have an idea of
the effect that this might have on the benchmark for Solr,
Elasticsearch and Sensei.

Finally, if the benchmark is easy to use, we could use it to
benchmark latest/stable releases as well as development
trunks/versions.

I would welcome such a benchmark and I believe search server developers
would welcome it as well. Indexing the Apache mailing lists is not a bad
idea, I think.


What columns would you put on a feature matrix?

Hmm... thinking about it more, it seems to me that search servers
(especially those based on Lucene) will all converge to the same basic
feature set. What will differentiate them is how the distributed side of
some features is implemented and the amount of work one has to spend on
maintenance, updates, fixes, development... and of course, when it comes to
"enterprise business", whether you can buy reliable support for it.


I don't disagree with anything you wrote; I still think that providing
people with an easy way to run the same benchmark over Solr, Elasticsearch
and Sensei is valuable.

I agree.

How would you like to implement these benchmarks? A new project somewhere
on Google Code or GitHub containing prebuilt instances of the individual
search servers, a copy of the benchmark dataset (mbox data, if taken from
Apache) and a wiki page with results contributed by volunteers testing
this on various OS and hardware configurations?


Lukáš Vlček wrote:

I am not that familiar with Lucene internals, but there can be performance
gains, as well as new features that in earlier versions had to be
implemented in the search server layer.
http://lucene.apache.org/java/3_0_0/changes/Changes.html
http://lucene.apache.org/java/3_0_1/changes/Changes.html
I just wanted to point out that you need to take this into account if you
want to compare apples to apples.

I agree.

Hopefully, with time, Solr, Elasticsearch and Sensei will all use the
same (i.e. the latest stable) Lucene release.

Upgrading Lucene from v2.9.x to v3.0.x is more expensive (since
deprecated things have been removed) but once a project is using
v3.0.x it should not be that difficult to upgrade.


I would welcome such a benchmark and I believe search server developers
would welcome it as well. Indexing the Apache mailing lists is not a bad
idea, I think.

Good.

What columns would you put on a feature matrix?

Hmm... thinking about it more, it seems to me that search servers
(especially those based on Lucene) will all converge to the same basic
feature set. What will differentiate them is how the distributed side of
some features is implemented and the amount of work one has to spend on
maintenance, updates, fixes, development... and of course, when it comes to
"enterprise business", whether you can buy reliable support for it.

So, if I need to extract some column names from what you are saying...

  • basic features
    • ...
  • distribution/clustering
  • maintenance cost
  • upgrade/update cost
  • commercial support
  • ...

How would you like to implement these benchmarks? A new project somewhere
on Google Code or GitHub containing prebuilt instances of the individual
search servers, a copy of the benchmark dataset (mbox data, if taken from
Apache) and a wiki page with results contributed by volunteers testing
this on various OS and hardware configurations?

Infrastructure (e.g. Google Code or GitHub) is not a problem.

The benchmark should be easy to use. So, ideally, a user should be able to
check out the benchmark with Solr, Elasticsearch and Sensei already
configured and ready to run.

The benchmark must include a copy of the dataset.

Ideally, the benchmark should produce reports in the same format so
that people can easily share and compare results.
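
Even something as simple as one CSV line per run would do; for example,
a header along these lines (the column names here are just a strawman,
not a decision):

    dataset,server,server_version,lucene_version,operation,clients,docs,elapsed_ms,mean_ms,p99_ms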

Paolo

On Thu, Apr 15, 2010 at 4:18 PM, Paolo Castagna <
castagna.lists@googlemail.com> wrote:


The benchmark should be easy to use. So, ideally, a user should be able to
check out the benchmark with Solr, Elasticsearch and Sensei already
configured and ready to run.

Running the benchmark on just a single machine is probably not that useful.
What would be more important is the ability to run the benchmark on more
nodes in parallel (like distributed search). I am not sure how hard it
would be to deliver easy-to-deploy artefacts for the purpose of a
distributed benchmark. Also, the benchmark should report as much
information about the network as possible.


Lukáš Vlček wrote:


Running the benchmark on just a single machine is probably not that useful.

Agree.

What I usually do is to have a sort of "template" distribution to put on
each node of my cluster, and then a separate configuration to set up nodes
differently if necessary.

What would be more important is the ability to run the benchmark on more
nodes in parallel (like distributed search). I am not sure how hard it
would be to deliver easy-to-deploy artefacts for the purpose of a
distributed benchmark.

Agree. Not trivial, but possible.

If there are Puppet gurus around, we could even automate deployment and
configuration using Puppet. Or perhaps simple rsync scripts are fine to
start with.

Also, the benchmark should report as much information about the network as possible.

True, often (in particular when results are cached in RAM and/or disks
are not used) network transfer is a significant part of the response
time.

But, although important, I would leave monitoring network traffic out
for the moment, for simplicity. Having a benchmark to run is the first
step.

Paolo

Hi,

if you want to benchmark search solutions over big data, you may also want
to consider the following:

Lucandra uses Cassandra as the persistence layer for Lucene. The following
two attempts have ported Lucandra to use HBase:

http://github.com/thkoch2001/lucehbase
http://github.com/akkumar/hbasene

Both of the above are only proofs of concept for now, but may quickly
become production ready. In the end you only need something like 5 classes
to glue Lucene to Cassandra or HBase.

Best regards,

Thomas Koch, http://www.koch.ro

+1 from me. Very curious how distributing the storage behind
Lucene performs, compared with distributing multiple Lucene
indexes themselves.


This solution is problematic, IMO, in how it works, especially given how
Lucene works. With this solution, there is only a single Lucene IndexWriter
that you can open, so your writes don't scale, regardless of the number of
machines you add.

Also, Lucene caches a lot of information per reader/searcher (field cache,
terms info, and so on). With large indices, you have a single reader
working against a very large cluster/index, and your client won't cope
with it...

You can't get around it: you need to shard a Lucene index into many small
Lucene indices running on different machines, but then you need to write a
distributed Lucene solution. And hey, I think someone already built one :-)

(P.S. I am not even mentioning all the many other features elasticsearch
provides over this very low level Lucene solution.)

Shay


My curiosity was really whether I could open many read-only index readers
to scale the reads, I guess.


This will work up to the point where your index gets big. In that case,
your search JVM might not be able to hold all the information needed (for
example, it won't be able to load the term info and field cache since they
are too big, and you will get either OOMs or GC thrashing).

cheers,
shay.banon


Thanks Shay. Curiosity squashed.


Hope elasticsearch helps in the area of the squashing :). To be honest, I
followed a similar path way back, when I tried to build a distributed
Lucene Directory on top of GigaSpaces/Coherence/Terracotta. From a design
perspective, it's not very different from what Lucandra or the HBase ports
do, and in the end they suffer from the same limitations that I noted...

cheers,
shay.banon


Thanks Shay for making this point in the thread.

Given all the threads around benchmarking: is there some kind of test that
you, or someone else, has done using Elasticsearch and can share numbers
from regarding scale and performance?

Things I would like to get a sense of are:

  1. number of documents in the index.
  2. how many instances on how many machines.
  3. at this size, how many writes/s.
  4. how many reads/s.

If you have a link to a previously published benchmark it would be
appreciated, or you can share your own numbers.

Thanks
Ori

--
http://olahav.typepad.com

Benchmark numbers are very subjective, from where you run the benchmark to
how you run it. If someone would like to create an unbiased benchmark, I
would be happy to help. As for your specific use case, I suggest you write
your own and check. If you want, within the elasticsearch repo there are
some JMeter scripts that I use for testing.

cheers,
shay.banon


Shay Banon wrote:

If someone would like to create an unbiased benchmark, I would be happy to help.

Would you suggest using JMeter as the tool for benchmarking?

Today I was looking at the Lucene benchmark contrib; have you ever used
it? I was wondering if I could adapt it to Elasticsearch and Solr, or
whether it's better to start with JMeter.
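
To give an idea of what I was looking at: the contrib is driven by a small
".alg" script, which can also be run programmatically; something roughly
like this (untested, and the property/task names are what I gathered from
a quick look at the 2.9/3.0 contrib, so treat them as assumptions rather
than a working recipe):

    import java.io.StringReader;

    import org.apache.lucene.benchmark.byTask.Benchmark;

    // Untested sketch: run a tiny algorithm with the Lucene benchmark contrib
    // and print its summary report. Property/task names may need adjusting
    // (e.g. a content source and a query maker) depending on the Lucene version.
    public class RunTinyAlg {
      public static void main(String[] args) throws Exception {
        String alg =
            "analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer\n" +
            "directory=RAMDirectory\n" +
            "# content.source / query.maker properties would go here\n" +
            "ResetSystemErase\n" +
            "CreateIndex\n" +
            "{ AddDoc } : 1000\n" +
            "CloseIndex\n" +
            "OpenReader\n" +
            "{ Search } : 100\n" +
            "CloseReader\n" +
            "RepSumByName\n";
        new Benchmark(new StringReader(alg)).execute();
      }
    }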

With JMeter, I am not sure it can produce useful reports that people can
then share and exchange.

Thanks,
Paolo

If you plan to use the HTTP interface of both products, then JMeter can be
the way to go.
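
Running it in non-GUI mode also writes the raw results to a file that is
easy to share and post-process, for example (assuming a test plan saved as
plan.jmx):

    jmeter -n -t plan.jmx -l results.jtl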

cheers,
shay.banon
