Just Pushed: Indexer Support + Twitter indexer plugin

Hi,

Both the indexer support and an initial indexer implementation for the
twitter sample stream are now on master. Indexer information can be found
here: http://github.com/elasticsearch/elasticsearch/issues/closed#issue/377.
The twitter indexer plugin is here:
http://github.com/elasticsearch/elasticsearch/issues/closed#issue/378.

I encourage you to head over and read it. I had a great Eureka moment
where I decided to store the indexer's meta and state information as another
index in elasticsearch (did someone say eating your own dog food? :wink: ), which
really means the indexer is very open and has a good state persistence API
(the elasticsearch API).

The indexer itself is pretty open; here is the twitter indexer (pretty
simple):
http://github.com/elasticsearch/elasticsearch/blob/master/plugins/indexer/twitter/src/main/java/org/elasticsearch/indexer/twitter/TwitterIndexer.java

Some of my plans are to provide for polyglot indexers (write your own
indexer in groovy, ruby), but since that will probably require the
elasticsearch Client API, I would love to first get proper support for the
JVM languages (as is already the case with groovy). Also, I have some ideas
for more indexer implementations, like wikipedia, rabbitmq, JMS, redis,
couchdb, and others.

As a side note, the sample twitter stream API is pretty slow (in
elasticsearch terms), so you can easily run it on your laptop without it
breaking a sweat at all ;). Would have loved to test it with the firehose...

-shay.banon

Hi Shay,

I'm not sure I get the difference between indexing stuff by using the
new indexer feature and by normally sending data through the client APIs
(or curl or whatever): could you elaborate more?

Thanks,
Cheers,

Sergio B.

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Shay can explain the details, but one thing that is very interesting for us
is that it looks like the indexer gives us automatic failover for the client.

"Indexers are singletons within the cluster. They get allocated
automatically to one of the nodes and run. If that node fails, an indexer
will be automatically allocated to another node"

When you send the data through the client APIs, you're responsible for the
client program's availability, etc. The indexer runs as part of the cluster,
and hence can have better availability. It can eliminate the code we have to
monitor the client (our clients run continuously).
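
To make the quoted behaviour concrete, here is a minimal toy sketch of "singleton within the cluster, reallocated on node failure". It is not elasticsearch code: the `ToyCluster` class, the node names, and the "fewest indexers wins" placement rule are all assumptions made up for illustration.

```python
# Toy model only: this is not elasticsearch code, and all names are hypothetical.

class ToyCluster:
    """Keeps each named indexer allocated to exactly one live node."""

    def __init__(self, nodes):
        self.live_nodes = list(nodes)
        self.allocation = {}  # indexer name -> node currently running it

    def register(self, indexer):
        self.allocation[indexer] = None
        self._reallocate()

    def fail_node(self, node):
        self.live_nodes.remove(node)
        # Orphan every indexer that was running on the failed node.
        for indexer, owner in self.allocation.items():
            if owner == node:
                self.allocation[indexer] = None
        self._reallocate()

    def _reallocate(self):
        for indexer, owner in self.allocation.items():
            if owner is None and self.live_nodes:
                # Assumed placement rule: the live node running the fewest indexers.
                self.allocation[indexer] = min(
                    self.live_nodes,
                    key=lambda n: sum(1 for o in self.allocation.values() if o == n),
                )


cluster = ToyCluster(["node-a", "node-b"])
cluster.register("twitter")
first_owner = cluster.allocation["twitter"]

# Kill whichever node is running the indexer; it should move to the other one.
cluster.fail_node(first_owner)
second_owner = cluster.allocation["twitter"]
print(first_owner, "->", second_owner)
```

The point of the sketch is only that the indexer has one owner at any time and that the cluster, not an external monitor, picks a new owner after a failure.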

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

Hi,

Berkay touched on an important point of this feature. It all boils down to
how you index data into elasticsearch. There are two different categories of
doing it. The first is by responding to real-time events and applying them
to the index: for example, indexing data as part of your "insert" path
within the web application, or by using hooks into nosql solutions (like
terrastore).

The other is one where you need to write code that acts as a bridge
between the data coming in and the indexing API. As an example, writing
something that listens on rabbitmq for indexing messages and applies them,
or something that listens to the twitter stream and indexes it, or something
that sits on top of couchdb _changes and applies those, or even sits on top
of a database and indexes it. For those cases, a custom runtime component is
usually written to do this bridging. But then, when writing one, you need to
handle its failover and its state (what was indexed last, plus anything
else). The indexer support in elasticsearch aims at solving this problem.
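
As a rough illustration of the "state" half of that problem, the sketch below shows a bridge that records the last position it indexed and resumes from there after a restart. It is a toy under stated assumptions: plain dicts stand in for the search index and for the index holding indexer state, and `run_bridge`, `last_position`, and `max_items` are hypothetical names, not part of any elasticsearch API.

```python
# Toy model only: plain dicts stand in for elasticsearch indices, and all
# names here are hypothetical.

def run_bridge(source, search_index, state_index, name, max_items=None):
    """Index items from `source`, resuming after the last recorded position."""
    # Load the checkpoint; -1 means nothing has been indexed yet.
    last = state_index.get(name, {"last_position": -1})["last_position"]
    processed = 0
    for position in range(last + 1, len(source)):
        if max_items is not None and processed >= max_items:
            break  # simulate the process dying part-way through
        search_index[position] = source[position]
        # Record progress after each item so a replacement can resume here.
        state_index[name] = {"last_position": position}
        processed += 1
    return processed


tweets = ["t0", "t1", "t2", "t3", "t4"]
search_index, state_index = {}, {}

# First run stops after two items, as if its node had failed.
run_bridge(tweets, search_index, state_index, "twitter", max_items=2)
# A second run (think: another node) picks up from the stored state.
resumed = run_bridge(tweets, search_index, state_index, "twitter")
print(resumed, state_index["twitter"])
```

Because the checkpoint lives next to the data rather than inside the bridging process, whichever node takes over only needs to read it to know where to continue.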

I did not name those two cases properly (need to think of better names)
but I hope the examples make sense.

-shay.banon

One more thing: "indexer" has now been renamed to "river", so search/replace
everything from indexer to river. The issues have been updated.
