Thanks, updated below...
On Wed, Apr 6, 2011 at 9:43 PM, Shay Banon shay.banon@elasticsearch.com wrote:
Heya, great you jumped in :), answers below:
On Thursday, April 7, 2011 at 12:02 AM, tjake wrote:
On Apr 6, 7:21 am, Shay Banon shay.ba...@elasticsearch.com wrote:
Yes, agreed, Solandra and Lucandra are pretty cool in terms of concepts,
but I think they lack when it comes to how they use Lucene (putting the solr
aspect aside).
The problem (and it's based on my last check of how the Lucene-on-top-of-Cassandra implementation was done) is how the reader/searcher works on top of Cassandra. First, it bypasses a lot of the optimizations Lucene has in how it stores/fetches data in its reader/searcher implementation, and uses Cassandra to load that info.
You mean in terms of file format, correct? It's true Lucene's format is much better here, but Cassandra will get to this.
Yes, the intimacy with how Lucene works with the data structures it generates and then searches on is a great boon for performance. Cassandra is blazing fast, but it's hard to believe it can catch up to a system that creates the data structures it knows it is going to consume.
I guess the limitation is if you can't represent the data in a column family style. I don't see anything in the Lucene file format that isn't equally representable. With Solandra on a single node, search performance for Solr with facets and sorting is basically the same.
In the meantime, the benefit is the data being truly masterless and scalable (compared to a tech like Solr).
Agreed, that's a real benefit. The main problem here, though, is that Solandra still needs to work with Solr's distributed search support.
It's not that bad; the binary format is OK. The only problem is that when you want to jump to result 100k, it resends all those results. (Not sure if 3.1 has addressed this...) Also, there is no reason why Solandra has to use Solr only. I could picture an ES service layer on Cassandra as well. It seems much of what you've built is similar to Cassandra in many ways, with write consistency, live scaling, and replication. But my goal right now is to provide true distributed search to all the Solr users out there...
The more problematic part is how Lucene uses things like FieldCache and caches based on readers, which is a problem when using the reader/searcher based on Lucandra (sorting and facets, for example). This can cause very severe performance/memory problems.
Solandra breaks the index into manageable shards so this is less of a
problem than it was in Lucandra.
Right, but how big can a single shard be? The more you have, the more
problems you run into with how Solr executes distributed search (blocking
IO, http, and all the other bits). And with big shards, having to reload all
the caches once you want to see changes is very costly.
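A toy sketch of the reload cost being discussed (simplified and hypothetical; real Lucene keys its FieldCache differently, and the names here are made up). A cache keyed by the whole reader goes cold every time the index changes and the reader is reopened, forcing a full rebuild:

```python
# Toy model: a field cache keyed by the *whole* reader. Every reopen of the
# reader (i.e. every time you want to see changes) is a full cache rebuild.
field_cache = {}  # reader_key -> {doc_id: field_value}
loads = 0         # counts full rebuilds

def get_field_values(reader_key, docs):
    """Return cached per-doc field values, rebuilding on a cache miss."""
    global loads
    if reader_key not in field_cache:
        loads += 1  # whole-index reload on every reopen
        field_cache[reader_key] = {d: d * 10 for d in docs}
    return field_cache[reader_key]

docs = range(1000)
get_field_values("reader-v1", docs)  # first search: builds the cache
get_field_values("reader-v1", docs)  # same reader: cache hit
get_field_values("reader-v2", docs)  # index changed, reader reopened: rebuild
```

The bigger the shard, the more expensive each of those rebuilds gets, which is the "very costly" part above.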
The long-term solution for this comes when Cassandra trigger support comes in; then it will be able to update the FieldCaches without invalidating them (since Solandra allocates a fixed space of documents that can be filled in over time).
FieldCache is one (very good) example. Others include filter caching, or any reader-cache-based constructs. Those are used heavily in advanced search systems and become harder to solve with a trigger-based approach.
Right, but this is the plan of attack to handle not needing to invalidate
the caches as often.
I haven't taken a deep look at how ES handles this problem, but I'd love a brief description if you have a sec, Shay.
Sure, no secret sauce here: it basically uses the same logic as the Lucene FieldCache, using the segment cache key to cache the data (like a field-level cache). Those are immutable (not evicted because of deletes, mind you).
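A minimal sketch of that per-segment idea (a toy model, not elasticsearch's actual code): because segments are immutable, their cached data never needs eviction; reopening a reader only loads caches for segments that are new since the last reopen:

```python
# Toy model of per-segment caching: the cache is keyed by the segment, not the
# reader, so a reopen reuses all old segments' caches untouched.
segment_cache = {}  # segment_key -> loaded field data
loads = []          # which segments actually hit storage

def load_segment(segment_key, docs):
    """Load a segment's field data once; segments are immutable."""
    if segment_key not in segment_cache:
        loads.append(segment_key)
        segment_cache[segment_key] = list(docs)
    return segment_cache[segment_key]

def search(segments):
    """A reader is just a list of (segment_key, docs) pairs."""
    return [v for key, docs in segments for v in load_segment(key, docs)]

search([("seg_0", [1, 2]), ("seg_1", [3, 4])])
# Reopen after indexing more docs: only the new segment is loaded.
search([("seg_0", [1, 2]), ("seg_1", [3, 4]), ("seg_2", [5])])
```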
Again, this is based on my overview of the code. If I am missing something,
I would love to be corrected. Not here to spread FUD or anything.
I don't see any FUD here. Simply that Solandra is taking a more fundamental approach to handling distributed search by dealing with it at the file format level... Elasticsearch has built distributed search on top of Lucene and added a number of great features and a service layer.
It's certainly a cool way to try and solve it. I have been there, trying to do the same, starting from custom Directory implementations to custom readers mainly optimized to use data grids (with collocation). elasticsearch, at least for me, is the next step that I took from there: mainly letting Lucene do what it does best, and trying to wrap it in the best way possible :).
Term-based vs. document-based partitioning has been a great question to ask when trying to build distributed search engines. I think it has basically been proven, at least in the practical sense, that document-based partitioning is the way to go (watch Google doing it: Challenges in Building Large-Scale Information Retrieval Systems - VideoLectures.NET, at the 9:44 mark).
This is, by the way, why riak search, though another really cool technology, is very very problematic. They do term-based partitioning, and it's a slippery road from there (think of implementing something like facets, which need per-doc data, in a collocated manner with term-based partitioning).
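A sketch of why document-based partitioning makes things like facets easy (toy data, illustrative names): each shard holds whole documents, so it can compute facet counts purely locally, and the coordinator's merge is just a sum. With term-based partitioning, one document's fields are scattered across nodes and no shard can do this count alone:

```python
# Document-based partitioning: route whole docs to shards by doc id, facet
# locally per shard, merge by summing counts. Data and names are made up.
from collections import Counter

def shard_for(doc_id, num_shards):
    return hash(doc_id) % num_shards  # document-based partitioning

docs = {f"doc{i}": {"color": "blue" if i % 3 == 0 else "red"}
        for i in range(9)}

num_shards = 3
shards = [{} for _ in range(num_shards)]
for doc_id, doc in docs.items():
    shards[shard_for(doc_id, num_shards)][doc_id] = doc

# Each shard counts its own facets; the coordinator just adds Counters.
per_shard = [Counter(d["color"] for d in s.values()) for s in shards]
merged = sum(per_shard, Counter())
```

However the docs land on shards, the merged facet counts come out the same, which is exactly the property term partitioning loses.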
Though Solandra does not strictly use term-based partitioning, it still replicates based on terms (I think?).
Lucandra partitioned on terms, and this was flawed. Solandra uses document partitions, so a chunk of documents is available locally to search from; this minimizes the cross-node IO. A later feature will be partitioned indexes based on a field, like time windows...
Solandra's strongest use case is for those who need potentially millions of smaller indexes, since it's not creating a directory per index under the hood but rather using composite keys in Cassandra.
Agreed, that's a boon. In elasticsearch there is no reason why you could not create millions of indices, but the overhead of each shard, Lucene-wise, is substantial (starting with file handles...).
At least for those cases, elasticsearch-wise, the idea that you can control routing when indexing and searching allows you to create an index with N shards, index all those docs with a qualifier (that also controls routing), and be able to filter out only the ones you want.
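The routing idea above can be sketched like this (illustrative names and a toy hash, not elasticsearch's actual routing function): hashing a shared qualifier instead of each doc id sends all docs carrying that qualifier to one shard, so a search that supplies the same routing value only needs to hit a single shard:

```python
# Toy model of custom routing: shard is picked from the routing qualifier,
# not the doc id, so all docs sharing a qualifier co-locate on one shard.
NUM_SHARDS = 8

def route(routing_value):
    return hash(routing_value) % NUM_SHARDS

# Indexing: 100 different docs, all routed by the qualifier "customer-42".
shards_hit = {route("customer-42") for doc_num in range(100)}

# Searching with routing="customer-42" would query only that one shard,
# then a filter on the qualifier returns just that customer's docs.
```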
-Jake
-shay.banon
On Wednesday, April 6, 2011 at 3:14 PM, Karussell wrote:
Just as a side note.
Maybe it is a bit unfair to compare Solandra with Elasticsearch.
Solandra looks pretty cool and interesting.
3 months ago I did a minor evaluation for my project and Solandra was
in a useful state.
BUT as it turns out, performance and memory usage of ES (or pure Solr) were far better on a single machine than Solandra's. Also, Solandra had some bugs (JVM crashes, facets didn't work), which I think have since been fixed.
Also, ES is a lot cleaner, both in terms of the technologies involved and in terms of written unit tests. If you look into the test folder of Solandra, it is pretty much empty.
Also, Solandra is one year old, whereas Elasticsearch was built by an author with the concepts of the well-known, several-years-old project 'Compass' in mind.
Regards,
Peter.
On 6 Apr., 11:08, Shay Banon shay.ba...@elasticsearch.com wrote:
It's not what was asked for. The question was for something to automatically index changes done to Cassandra. Solandra tries to build Solr on top of Cassandra to make it better distributed, but it then combines two problems instead of one: using Solr (distributed execution) and trying to hack Lucene to work on top of Cassandra, which is problematic and probably bad in most cases.
On Wednesday, April 6, 2011 at 6:41 AM, Paul Smith wrote:
It's not based on Elasticsearch, but tjake has the Solandra integration:
GitHub - tjake/Solandra: Solandra = Solr + Cassandra
On 5 April 2011 20:44, Shay Banon shay.ba...@elasticsearch.com wrote:
Not that I am aware... .
On Monday, April 4, 2011 at 8:12 PM, Drew wrote:
Hi Everyone,
Has anyone done/attempted to integrate Elasticsearch with Cassandra
like was described at http://... (Elasticsearch Platform — Find real-time answers at scale | Elastic)?
I don't mean using Cassandra as a Gateway for ES, but using ES to
query the data in Cassandra automatically.
Thanks,
Drew
--
http://twitter.com/tjake