Elasticsearch and Cassandra integration?

Hello!

I am looking for an integration between Elasticsearch and Cassandra, so that I
can index and search the data sitting in my Cassandra cluster.
I found a bunch of plugins for ES but none for Cassandra. Is there a
reason why no one has attempted to write such a plugin?

Existing integrations:

  1. I found a two-year-old version of ES that has a Cassandra plugin:
    https://github.com/gistinc/elasticsearch/tree/cassandra/plugins/cassandra
    I am not sure how well this will work out.
  2. I also found
    http://architects.dzone.com/articles/big-data-quadfecta-cassandra but I see
    a lot of moving pieces there (Storm, Kafka, etc.), and the Cassandra Bolt
    still has some open issues due to a Storm bug, as pointed out in the article.

So my questions are:

  1. Has anyone ever tried integrating ES with Cassandra? Is it a good
    idea?
  2. Any pointers on how to get started?

Thanks,
-Utkarsh

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi,

In
https://github.com/gistinc/elasticsearch/tree/cassandra/plugins/cassandra you
have found old gateway code, that is, code showing how Cassandra could be
used to persist ES index data.

Cassandra, as a distributed NoSQL DBMS, is quite similar to ES in terms
of JVM resource demands and workloads (and I think ES has inherited the
./bin/cassandra -f notion of starting a node in the foreground), but the
underlying concepts are different.

To index data in ES, you have to decide how to model your documents in a
JSON representation. In Cassandra, you have columnar data. If you find a
method to design documents from the columnar data, then you can index
them into ES. That's the theory.
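
As an illustration of what "designing documents from columnar data" could look like, here is a minimal Python sketch. It is not taken from any existing plugin. It assumes a row arrives as a key plus a flat mapping of column names to values, and that composite column names use ":" as a separator. Both the separator and the function name `row_to_document` are made up for this example.

```python
import json

def row_to_document(row_key, columns):
    """Turn one Cassandra row (a key plus column name/value pairs) into an ES-style document.

    Assumption for this sketch: composite column names such as "address:city"
    should become nested JSON objects, and no column is both a scalar and a prefix.
    """
    doc = {}
    for name, value in columns.items():
        parts = name.split(":")
        target = doc
        # Walk (and create) the nested objects for every prefix part.
        for part in parts[:-1]:
            target = target.setdefault(part, {})
        target[parts[-1]] = value
    return {"_id": row_key, "_source": doc}

row = {"name": "Alice", "address:city": "Berlin", "address:zip": "10115"}
document = row_to_document("user-42", row)
print(json.dumps(document["_source"], sort_keys=True))
# prints {"address": {"city": "Berlin", "zip": "10115"}, "name": "Alice"}
```

The real mapping depends on how the column family is laid out, so the nesting rule above is only one possible choice.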

The Cassandra Bolt and the Elasticsearch Bolt for Storm, developed by
Brian O'Neill, one of the leading Cassandra developers, are an elegant way
of pushing Cassandra data through a distributed system into
Elasticsearch. The approach is similar to a distributed changes stream. You
can see a changes stream in action in the ES CouchDB river.

The Storm issue mentioned by Brian reminds me of a similar issue with
Elasticsearch plugins: they are also not classpath-isolated from each
other. I think this is not fatal. I would not get too discouraged; it is
more of a packaging/upgrade/compatibility issue that can be solved,
especially in open source projects (though maybe not quickly or easily).

Jörg


Thanks for the update, Jörg.
I understand that I need to model the columnar data as JSON. So, just to
summarize the options I have:

  1. Manually write a script that reads data from Cassandra and pushes it to
    Elasticsearch, say a Python script running as a cron job.
  2. Write a Cassandra river for ES, which will encapsulate the logic of
    transforming the data in Cassandra into JSON for ES.
  3. Use Brian's Cassandra-Storm-ES technique (which needs the Cassandra Bolt
    and the Elasticsearch Bolt for Storm).

From my perspective, the easiest thing to do is to go with option 1
(writing a script that handles the data import for me).
Option 2 also looks interesting, but I will need to spend some time
understanding how to write an ES river. The only reason I am not
inclined towards Storm is that we don't have it in our stack and don't want
to add an additional software management burden.

So, what do you think about this approach? How does writing a river differ
from manually pushing data from Cassandra to ES?

Thanks,
-Utkarsh


A river has some benefits. It is a plugin, which means tight integration
with ES: just put the JAR into the plugin path and it will be picked up.
The river is managed as a singleton, so if the node the river is running
on fails, the cluster will activate the river on another node. And,
what I would like most, a river can be shared with the community here, with
those who are interested in a configurable method for moving data from
Cassandra to ES. You can invite others to improve the code.

On the other hand, by pushing your data manually, you can implement a
solution that is an optimal fit for your requirements, but there are extra
things you must take care of, most importantly the cluster node switchover
in case of node downtime or failure. And it may not be attractive for the
ES community to join your effort if the approach is not a generic one.

Jörg

On 22.03.13 at 21:28, Utkarsh Sengar wrote:

How does writing a river differ from manually pushing data from
Cassandra to ES?


I agree with you. I am also inclined towards implementing a plugin, given
the lack of Elasticsearch and Cassandra integration. I have been looking at
the JDBC and RSS rivers, and they certainly help in understanding the anatomy
of an ES river.

However, I have some questions about Elasticsearch plugin development:

  1. These plugins have some nicely written tests whose test suites are
    defined in XML files under test/resources. How can I debug these tests in
    Eclipse?
  2. Say I have a working prototype of the plugin and I manually install it
    in my local Elasticsearch instance by placing the plugin project in the
    plugins folder. What is the best way to debug the plugin in ES, other than
    logging the output, of course?
  3. Documentation about plugin development is lacking, but the sample RSS
    river code helps. Can I safely assume that I can use the RSS river as a
    boilerplate project for a Cassandra river? Or is there a way to create a
    plugin project for ES?

Any pointers from you about ES plugin development will help :)

Thanks,
-Utkarsh


  1. I use IntelliJ (previously NetBeans) and mvn on the command line, but
    using TestNG in Eclipse is documented here:
    http://testng.org/doc/eclipse.html

  2. Debugging running plugins works like debugging a running ES node.
    Besides extensive logging, I use tools like jvisualvm to analyze runtime
    behaviour.

  3. I think it is best to start from an existing river as boilerplate
    code. It helps to examine the river sources documented at
    http://www.elasticsearch.org/guide/reference/modules/plugins.html

Jörg


Thanks for the answer! I was able to write a simple river for Cassandra
which pulls data periodically (similar to CouchDB's river).

Which leads to some questions:

  1. I saw that EsExecutors exists, but there is no implementation of
    ScheduledExecutorService. So, is there any reason why EsExecutors is
    implemented, other than for having a custom name and priority? Can I use
    ScheduledExecutorService inside a river without any performance issues?

  2. What I am doing for now is this: I have one thread which wakes up every
    x hours and moves all the data from Cassandra to ES, every time. It's not
    very performant if there is a lot of data (I will add some kind of
    batching of records). So I wanted to know: are there any standard
    practices for pushing data to ES?

The implementation is just one day old and very raw. I will put it up on
GitHub soon!
I loved the simple APIs, and it was very easy to get started (except for
the lack of documentation, but the reference implementations helped)!

Thanks,
-Utkarsh


  1. EsExecutors is a helper for the ES-internal thread pools. You can of
    course use other Java classes inside a river; there is no restriction.

  2. You can index in bulk with the classes in
    org.elasticsearch.action.bulk, and you can also bulk index in parallel.
    I recommend using a concurrent request threshold.
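
The batching and the concurrent request threshold can be sketched as follows. This is a Python illustration rather than the Java classes in org.elasticsearch.action.bulk: it builds newline-delimited bodies in the format accepted by the REST `_bulk` endpoint and caps the number of requests in flight with a semaphore. The names `chunked`, `bulk_body`, `index_all` and the `send` callable are hypothetical. `send` stands for whatever actually posts a body to the cluster.

```python
import json
import threading
from concurrent.futures import ThreadPoolExecutor

def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def bulk_body(index, doc_type, docs):
    """Build a newline-delimited _bulk body from (id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        # Action line first, then the document itself on the next line.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline

def index_all(docs, send, index="cassandra", doc_type="row",
              batch_size=500, max_concurrent=4):
    """Send all docs in batches, with at most `max_concurrent` requests in flight."""
    slots = threading.BoundedSemaphore(max_concurrent)

    def run(body):
        try:
            return send(body)
        finally:
            slots.release()  # free the slot even if the request failed

    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = []
        for batch in chunked(docs, batch_size):
            slots.acquire()  # blocks the reader while the threshold is reached
            futures.append(pool.submit(run, bulk_body(index, doc_type, batch)))
        return [f.result() for f in futures]
```

Acquiring the semaphore before submitting means the loop that reads from Cassandra pauses when the cluster is busy, instead of queueing an unbounded number of pending batches in memory.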

Jörg


Isn't this duplicating the whole data though?


I think there are some issues with the river approach.

  1. A river runs on only one node, so we can't leverage the power of the
    cluster.
  2. Data is passed around the cluster. First, data is loaded from a node of
    the C* cluster, then it is passed to the node where the river is running,
    which prepares the index, and then the index is distributed to the node
    that should store it, depending on the ES shard/replica strategy.

On Friday, May 24, 2013 at 04:14:13 UTC+7, hans robert wrote:

Isn't this duplicating the whole data though?


I'm told the next version of Cassandra will have triggers, so you can
implement some custom action when new data is inserted. This might be a
better approach than the river, assuming the custom action would execute on
whatever node in the ring received the new data. On the other hand, you
can easily run multiple instances of the same river by giving them
different names, and a simple bucketing algorithm would ensure they don't
ingest the same data. It feels a little hacky, but then again I'm not sure
I understand the need for the restriction in Elasticsearch in the first
place.
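
One way such a bucketing scheme might look is sketched below in Python. It assumes every river instance knows its own number and the total number of instances, and that rows are identified by a string key. The function names are hypothetical and nothing here is part of the river API.

```python
import zlib

def bucket_for(row_key, bucket_count):
    """Map a row key to a stable bucket number in [0, bucket_count)."""
    # crc32 gives the same value in every process, unlike Python's built-in
    # hash() for strings, which is randomized per interpreter run.
    return zlib.crc32(row_key.encode("utf-8")) % bucket_count

def keys_for_instance(row_keys, instance, bucket_count):
    """Return only the keys that the given river instance should ingest."""
    return [k for k in row_keys if bucket_for(k, bucket_count) == instance]

keys = ["user-%d" % i for i in range(10)]
for instance in range(3):
    print(instance, keys_for_instance(keys, instance, 3))
```

Each key lands in exactly one bucket, so instances never ingest the same row. Changing the number of instances reassigns most keys, which matters if the instances keep any per-bucket state.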

On Friday, June 7, 2013 5:05:09 AM UTC-4, Baotq wrote:

I think there are some issues with the river approach.

  1. A river runs on only one node, so we can't leverage the power of the
    cluster.
  2. Data is passed around the cluster. First, data is loaded from a node of
    the C* cluster, then it is passed to the node where the river is running,
    which prepares the index, and then the index is distributed to the node
    that should store it, depending on the ES shard/replica strategy.

Vào 04:14:13 UTC+7 Thứ sáu, ngày 24 tháng năm năm 2013, hans robert đã
viết:

Isn't this duplicating the whole data though?

On Friday, March 22, 2013 11:56:32 PM UTC-4, Utkarsh Sengar wrote:

I agree with you. I am also inclined towards implementing a plugin due
to the lack of Elasticsearch and Cassandra integration. I have been looking at
the jdbc and rss rivers, and they surely help to understand the anatomy of an
ES river.

I have some questions about Elasticsearch plugin development, though:

  1. These plugins have some nicely written tests whose test suites are
    defined in XML files under test/resources. How can I debug these tests via
    Eclipse?
  2. Say I have a working prototype of the plugin and I manually install
    it in my local Elasticsearch instance by placing the plugin project in the
    plugins folder. What is the best way to debug the plugin in ES, other than
    logging the output, of course?
  3. Documentation about plugin development is lacking, but the sample rss
    river code helps. Can I safely assume that I can use the rss river as a
    boilerplate project for a Cassandra river? Or is there a way to create a
    plugin project for ES?

Any pointers from you about ES plugin development will help :)

Thanks,
-Utkarsh

On Fri, Mar 22, 2013 at 5:15 PM, Jörg Prante joerg...@gmail.com wrote:

A river has some benefits. It is a plugin, which means tight
integration with ES. Just put the JAR into the plugin path and it will be
picked up. The river will be managed as a singleton, so if the node the
river is running on fails, the cluster will activate the river on another node.
And, what I would like most, rivers can be shared with the community here,
who are interested in a configurable method to move data from Cassandra to
ES. You can invite others to improve the code.

On the other hand, by manually pushing your data, you can implement a
solution that is optimal for your requirements, but there are extra
things you must take care of, most importantly the cluster node switchover in
case of node downtime or failure. And it may not be attractive to the ES
community to join your effort if the approach is not a generic one.

Jörg

On 22.03.13 at 21:28, Utkarsh Sengar wrote:

How does writing a river differ from manually pushing data from
cassandra to ES?

Hi, Utkarsh

Are you still working on the cassandra river?

Thanks
Vanz

On Monday, March 25, 2013 10:46:50 PM UTC-3, Utkarsh Sengar wrote:

Thanks for the answer! I was able to write a simple river for Cassandra
which pulls data periodically (similar to CouchDB's river).

Which leads to some questions:

  1. I saw that EsExecutors exists but there is no implementation of
    ScheduledExecutorService. So, is there any reason why EsExecutors is
    implemented other than having a custom name and priority? Can I use
    ScheduledExecutorService inside a river without any performance issues?

  2. What I am doing for now is this: I have one thread which wakes up every x
    hours and moves all the data from Cassandra to ES, every time. It's not very
    performant if there is a lot of data (I will add some kind of batching of
    records). So I wanted to know: are there some standard practices for pushing
    data to ES?
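
A minimal sketch of the record batching mentioned in point 2, under stated assumptions. It splits rows into fixed-size chunks and renders each chunk as a newline-delimited body in the shape the Elasticsearch bulk API accepts (one action line followed by one source line per document). The index name, the (row_id, document) tuple layout, and the helper names are hypothetical. Older Elasticsearch versions also expect a `_type` field in the action line. Sending the body over HTTP is left out.

```python
import json
from itertools import islice


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def bulk_body(chunk, index):
    """Render (row_id, document) pairs as a bulk request body."""
    lines = []
    for row_id, doc in chunk:
        # Action line: tells ES to index the next line under this id.
        lines.append(json.dumps({"index": {"_index": index, "_id": row_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Stand-in for rows read from Cassandra (hypothetical data).
    rows = [(str(i), {"name": f"row-{i}"}) for i in range(5)]
    for chunk in batches(rows, 2):
        print(bulk_body(chunk, "cassandra_rows"), end="")
```

Each chunk becomes one bulk request, so the batch size trades request count against request size.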

The implementation is just 1 day old and very raw. I will put it up on GitHub
soon!
I loved the simple APIs, and it was very easy to get started (except for the
lack of documentation, but the reference implementations helped)!

Thanks,
-Utkarsh

On Sat, Mar 23, 2013 at 2:31 AM, Jörg Prante <joerg...@gmail.com> wrote:

  1. I use IntelliJ (previously NetBeans) and mvn on the command line, but
    Eclipse TestNG use is documented here: TestNG - Eclipse

  2. Debugging running plugins works like debugging a running ES node.
    Besides extensive logging, I use tools like jvisualvm to analyze runtime
    behaviour.

  3. I think it is best to start from an existing river as boilerplate
    code. It helps to examine the river sources documented at
    Elasticsearch Platform — Find real-time answers at scale | Elastic

Jörg

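I am not actively working on the Elasticsearch Cassandra river now, but I am
always open to pull requests! :)
GitHub - eBay/cassandra-river: Cassandra river for Elastic search.

I found another fork of the project:
GitHub - srecon/elasticsearch-cassandra-river

-Utkarsh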

On Thu, Oct 16, 2014 at 7:10 AM, José Guilherme Vanz <guilherme.sft@gmail.com> wrote:

Hi, Utkarsh

Are you still working on the cassandra river?

Thanks
Vanz

Hi Utkarsh

I want to back up old data from Elasticsearch; the options are Cassandra and
Hadoop. I want to know which one is better in terms of integration,
scalability and performance.
With Cassandra, do we only need to install a plugin, or are there other pieces
of code that we may need to write?

On Thursday, October 16, 2014 9:07:06 PM UTC+5:30, Utkarsh Sengar wrote:

I am not actively working on the Elasticsearch Cassandra river now, but I am
always open to pull requests! :)
GitHub - eBay/cassandra-river: Cassandra river for Elastic search.

I found another fork of the project:
GitHub - srecon/elasticsearch-cassandra-river

-Utkarsh
