I'm new to the enterprise search world and am beginning a new project that
deals with listings and full-text search. The data store I will be using
is Cassandra for lightning-fast insertions, but I will definitely need help
on the searching/querying side.

The application will be both write and read/search intensive, with
reads/searches projected to outnumber writes by roughly 600x, so the index
will be updated constantly while serving real-time searches. There will be faceted
searches on multiple attributes and also full-text searches on certain
attributes. Everything needs to be real-time: when a row is persisted
in Cassandra, it should be immediately and consistently replicated,
indexed, and searchable in the search system. The entire stack will be
deployed on AWS, so a more cloud-oriented solution like ElasticSearch does

While exploring prospective technologies, the main ones that stood out were
Solr/Solandra, ElasticSearch, and Sphinx. Going with a Lucene-based
solution leaves me with Solr/Solandra and ElasticSearch. I understand
that DataStax Enterprise now offers enterprise search that integrates
Solr and Cassandra within the same JVM, but I would like to stay with open
source and avoid being locked into a specific vendor. After research,
ElasticSearch seems like the more scalable solution, since it was designed
for the cloud (distributed and elastic), and the more performant one in
real-time applications. Is there any reason I should look more into
Solr/Solandra (Solandra development is dead now that DataStax has brought
its developer on board)? I'm sure DataStax chose to integrate with Solr for
a reason...

Am I looking at the right technology for my use case? Research shows me
that ES may perform better than Solr in my use case, but the user base for
Solr seems a lot more mature. ES does seem to have a very active community

What is the best way for me to get started with this project?

How difficult will it be to integrate Cassandra with ES? I see that it
has been done before, but how substantial will the engineering effort be?
Should I be looking at DSE instead to avoid that overhead? Is there any
further documentation or existing engineering work that bridges this gap?
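
To make the integration question concrete, the naive approach I have in mind is a dual write from the application: persist the row, then index the same document. Below is a minimal sketch of that idea only. Both classes are in-memory stand-ins I made up for illustration (they are not the Cassandra or ES client APIs), and the field names `title`, `description`, and `city` are assumptions.

```python
from collections import defaultdict


class FakeRowStore:
    """In-memory stand-in for the Cassandra table (hypothetical, not a real client)."""

    def __init__(self):
        self.rows = {}

    def put(self, key, row):
        self.rows[key] = dict(row)


class FakeSearchIndex:
    """In-memory stand-in for the search index (hypothetical, not a real client)."""

    def __init__(self):
        self.docs = {}
        self.postings = defaultdict(set)  # token -> keys of listings containing it

    def index(self, key, doc, text_fields):
        self.remove(key)  # re-indexing must drop tokens from the old version
        self.docs[key] = dict(doc)
        for field in text_fields:
            for token in str(doc.get(field, "")).lower().split():
                self.postings[token].add(key)

    def remove(self, key):
        if self.docs.pop(key, None) is not None:
            for keys in self.postings.values():
                keys.discard(key)

    def search(self, query, facets=None):
        tokens = query.lower().split()
        if not tokens:
            return []
        # Full-text part: every query token must appear in the listing.
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        # Facet part: exact match on each requested attribute.
        for field, value in (facets or {}).items():
            hits = {k for k in hits if self.docs[k].get(field) == value}
        return sorted(hits)


def save_listing(store, index, key, row, text_fields=("title", "description")):
    """Dual write: persist the row first, then index the same document."""
    store.put(key, row)
    # If the process dies here, the row exists but is not searchable.
    index.index(key, row, text_fields)
```

The gap between the two calls in `save_listing` is exactly what worries me about "immediately and consistently" searchable, and it is why I am asking whether an existing integration handles this better than application-level code would.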

From this link I saw that there is some kind of integration between
Cassandra and ES, but looking at the GitHub repository, the last commit was
two years ago. What kind of integration does this entail and should I even be

Thanks in advance.