I've posted a reply (currently awaiting moderation), but his benchmark is
severely flawed. For example, he wasn't actually indexing what he thought
he was indexing.
With a few simple changes, I got much better performance out of ES than
he was getting.
On a side note, it seems refresh_interval is not being respected in
0.15.2, which would also decrease raw indexing speed.
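For context, refresh_interval is an index-level setting; a sketch of how it is typically configured (the value here is illustrative, not from the benchmark):

```json
{
    "index": {
        "refresh_interval": "30s"
    }
}
```

Raising it trades search-time freshness for raw indexing throughput, which is why it matters for a benchmark like this.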
{"add":{"doc":{ "id":"1582039702", "field1_s":"1184645701" }}} in the
case of Solr, compared to
{"index": {"_index":"test", "_type":"type1", "_id":"1582039702",
"field1":"1184645701" }} for ES.
He can't be serious; it's also unclear how the fields were treated
and configured, as no config options were stated.
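For reference (this is not from the benchmark post): the standard elasticsearch bulk API puts the action metadata and the document source on separate newline-delimited lines, so an equivalent request body would look roughly like:

```json
{"index": {"_index": "test", "_type": "type1", "_id": "1582039702"}}
{"field1": "1184645701"}
```

If the document fields are instead folded into the action line, as in the snippet quoted above, they are not treated as the document source, which would fit the observation that he wasn't actually indexing what he thought he was indexing.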
From my own ES usage I know ES can index 1,500 docs of 45 fields each
(some very long language fields with up to 10,000 chars) in under
0.6 seconds. That is 1,500 * 45 = 67,500 fields per 0.6 seconds; at
2 fields per doc, that works out to roughly 33,750 of his 2-field demo
docs per 0.6 seconds, i.e. about 56,000 docs/sec, without any problems.
I wouldn't pay much attention to that post/benchmark. A good
benchmark needs to publish a lot more details than the above, starting
with basic stuff like -Xmx. I'm also of the opinion that if you are
going to publish a benchmark comparing 2 pieces of software then you
better invite experts from both sides and let them tune and optimize
things.
In order to fully compare the two in terms of indexing overhead, at least for this very simple doc, the _source and _all fields need to be disabled.
The Solr type used for field1 corresponds, in ES terms, to index set to not_analyzed and omit_norms set to true; ES should be configured the same way.
Also, ES indexes two additional fields, _id and _type. To really compare, they should have index set to no. When doing so, the only thing one loses is the ability to query them at search time (this is in master).
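A sketch of a type mapping along those lines (field names follow the benchmark; exact syntax may vary across ES versions):

```json
{
    "type1": {
        "_source": {"enabled": false},
        "_all": {"enabled": false},
        "properties": {
            "field1": {
                "type": "string",
                "index": "not_analyzed",
                "omit_norms": true
            }
        }
    }
}
```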
I posted a sample as a comment on Clinton's post.
Some more aspects to how ES works differently than Solr:
When you index data, it's there. If you "kill -9" ES (even with a single server) and start it back up, all data indexed up until that point will still be there with the local gateway (this is not done by committing Lucene on each change, as that would not scale). Solr, on the other hand, will lose all changes since the last commit. This durability does come with a (small) overhead.
The bulk API format for elasticsearch is more optimized for distributed execution, where the request needs to be sliced and diced in order to route each bulk item to the correct shard. This comes with some overhead compared to a single big JSON body parsed and processed in a single-shard scenario, but it proves crucial when working with several shards.
-shay.banon
On Monday, April 18, 2011 at 5:56 AM, Otis wrote:
Hi,
I wouldn't pay much attention to that post/benchmark. A good
benchmark needs to publish a lot more details than the above, starting
with basic stuff like -Xmx. I'm also of the opinion that if you are
going to publish a benchmark comparing 2 pieces of software then you
better invite experts from both sides and let them tune and optimize
things.
Hi Guys,
What do you think of this article: http://dmurphy747.wordpress.com/2011/04/02/solr-vs-elasticsearch-deat...
where elasticsearch and solr are compared with regard to the indexing
speed?
A quote from the article: "I ran each test 4 times, killing the JVM
and removing the data directory for both Solr and elasticsearch. The
final averaged results expressed as throughputs were 43204 docs/sec
for Solr, 44052 docs/sec for Solr direct streaming, and 9823 docs/sec
for elasticsearch."
PS: Don't get me wrong, I know that it is only one (partial) test,
and that some features in elasticsearch make it unique!