[ANN] es-nozzle 0.3.0 - elasticsearch document ingestion


(Ralf Schmitt) #1

Hi all,

on behalf of brainbot technologies AG I'm proud to release es-nozzle
0.3.0 as open source today. This is the first public release.

es-nozzle is a scalable open source framework for connecting source
content repositories like file systems or mail servers to
ElasticSearch clusters.

The framework supports source-repository security policies in
ElasticSearch and therefore enables users to create open source
Enterprise Search solutions based on es-nozzle and ElasticSearch.

The architecture allows for scalable and fault tolerant
synchronization setups that complement the scalability of
ElasticSearch clusters.

Professional development and production support is available through
brainbot technologies AG, a company specializing in search solutions,
which created the framework.

Links

documentation:
http://brainbot.com/es-nozzle/doc/

prebuilt distribution:
http://brainbot.com/es-nozzle/download/es-nozzle-0.3.0.zip

source code:
https://github.com/brainbot-com/es-nozzle

Contact:
http://brainbot.com / es-nozzle@brainbot.com
(or use the elasticsearch mailing list)

We're excited to get feedback about this release, so please give it a
shot and let us know about your experience.

--
Cheers
Ralf Schmitt

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Ban Mido) #2

Hello Ralf,

I am trying to understand what exactly es-nozzle is.
I am guessing it's some feed-fetching utility that fetches feeds from the
outside world and indexes them into ES in a distributed manner.

If so, I have the following questions:

  1. Is it extensible? For example, I want to fetch feeds from RSS in real
    time; can I fetch them using this application?
  2. Does it support parallel fetching, like fetching all the feeds in a
    mail inbox at the same time?
  3. If 2 is possible, can I limit the number of threads per source?
  4. Is there any provision for dedupe checking, like dedupe based on
    title?
  5. What data sources does it support at present?

Looking forward to your reply.

Thanks
Vineeth


(Ralf Schmitt) #3

Ban Mido banmidobeyondtime@gmail.com writes:

Hello Ralf,

I am trying to understand what exactly es-nozzle is.
I am guessing it's some feed-fetching utility that fetches feeds from the
outside world and indexes them into ES in a distributed manner.

Hi Vineeth,

thanks for your feedback, and sorry if the description was unclear.

I'm not sure which feeds you would like to import. es-nozzle can be used to
import hierarchically organised data ('filesystem like') into ES.

The current version of es-nozzle allows you to recursively index
filesystem directories and the documents contained in those directories
into ES. It's similar in purpose to the well-known fsriver plugin
(http://www.pilato.fr/fsriver/).

If so, I have the following questions:

  1. Is it extensible? For example, I want to fetch feeds from RSS in real
    time; can I fetch them using this application?

yes, but we still need to document how to write extensions.

  2. Does it support parallel fetching, like fetching all the feeds in a
    mail inbox at the same time?

yes.

  3. If 2 is possible, can I limit the number of threads per source?

no, you can limit the number of threads used in a single process, but
when you're scaling es-nozzle to multiple machines, there's no way to
limit the number of threads working on a single source. Different
es-nozzle processes do not synchronize with each other, other than
through queuing/consuming messages from RabbitMQ.

But if you need that limit, there's also probably no need to scale out
to multiple machines, at least not for the source in question.
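
For the single-process case, a per-source cap can be sketched roughly as follows. This is a generic Python illustration and not es-nozzle code; the names `SourceLimiter`, `fetch` and `demo`, as well as the limit of 2, are made up for the example. It only shows why such a cap works inside one process and not across several.

```python
import threading
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class SourceLimiter:
    """Caps how many threads may work on one source at the same time."""

    def __init__(self, per_source_limit):
        self._limit = per_source_limit
        self._lock = threading.Lock()
        self._semaphores = {}

    def _semaphore_for(self, source):
        # Create one semaphore per source, lazily and thread-safely.
        with self._lock:
            if source not in self._semaphores:
                self._semaphores[source] = threading.BoundedSemaphore(self._limit)
            return self._semaphores[source]

    def run(self, source, func, *args):
        # Block until fewer than `per_source_limit` threads use this source.
        with self._semaphore_for(source):
            return func(*args)


def demo(per_source_limit=2, jobs_per_source=6):
    """Run fake fetch jobs and report the peak concurrency seen per source."""
    limiter = SourceLimiter(per_source_limit)
    active = defaultdict(int)
    peak = defaultdict(int)
    stats_lock = threading.Lock()

    def fetch(source):  # hypothetical stand-in for real work
        with stats_lock:
            active[source] += 1
            peak[source] = max(peak[source], active[source])
        time.sleep(0.01)
        with stats_lock:
            active[source] -= 1

    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [
            pool.submit(limiter.run, source, fetch, source)
            for source in ("a", "b")
            for _ in range(jobs_per_source)
        ]
        for future in futures:
            future.result()
    return dict(peak)
```

The semaphores live in the memory of one process, so a second process on another machine would have its own, independent set. That is the reason the limit cannot be enforced once the work is spread over multiple machines.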

  4. Is there any provision for dedupe checking, like dedupe based on
    title?

no, that's not possible at the moment. I'm also not sure if it would make
sense to have this functionality in es-nozzle.

  5. What data sources does it support at present?

es-nozzle currently supports indexing from a locally mounted file system
and from a Windows file server via the CIFS protocol. Support for IMAP,
Microsoft Exchange and SharePoint can be licensed from brainbot technologies.

Hope that helps. Please let us know if you have more questions.

--
Cheers
Ralf


(Ban Mido) #4

Hello Ralf,

My requirements are listed below; I am trying to see whether es-nozzle fits them.

  1. I want to fetch feeds from N RSS sources. As es-nozzle
    supports this, I can see that this is perfectly possible.
  2. A news source like IBN or Bloomberg will block me if I try to fetch
    more than 2 news items at a time, so I want to limit the number of
    workers per source.
  3. Duplicates can also occur within an RSS feed or across RSS feeds; for
    example, the exact same news item can appear in Bloomberg and IBN. Here
    the primary key is the title. I want to store the item once but preserve
    the information that it came from 2 different sources.

Let me know how I can use es-nozzle for this.

Thanks
Vineeth


(Ralf Schmitt) #5

I don't think es-nozzle is a good fit for your requirements. Fetching
feeds from N RSS sources doesn't map to "hierarchically organised data"
(which es-nozzle would be able to handle).

I would try http://www.pilato.fr/rssriver/

--
Cheers
Ralf


(Ban Mido) #6

Actually, I wanted to fetch the news article that each feed item in the RSS links to,
which fits the "hierarchically organised data" description exactly.
So it would be:

RSS -> Set of news Link -> HTTP -> Set of news Content

Thanks
Vineeth


(Ralf Schmitt) #7

Hi Vineeth,

if you view each RSS feed as a directory, probably with dates as
subdirectories, so that you end up with a hierarchy like

/bloomberg/2012/..
/bloomberg/2013/09/30/...
/bloomberg/2013/10/01/...

it may be possible. But RSS only gives the latest N items, so you would
have to store which older articles are available (or use what is stored
in ES).
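
A rough Python sketch of that mapping, assuming each item carries an RFC 822 style publication date as RSS 2.0 feeds usually do. The function names `item_path` and `remember` and the `item_id` slug are hypothetical and not part of es-nozzle:

```python
import posixpath
from email.utils import parsedate_to_datetime


def item_path(feed_name, pub_date, item_id):
    """Map one feed item to a filesystem-like path: /feed/YYYY/MM/DD/item."""
    when = parsedate_to_datetime(pub_date)  # parses RFC 822 style dates
    return posixpath.join(
        "/", feed_name, f"{when:%Y}", f"{when:%m}", f"{when:%d}", item_id
    )


def remember(known_paths, latest_paths):
    """Keep older items that have already dropped out of the feed window."""
    return set(known_paths) | set(latest_paths)
```

For example, `item_path("bloomberg", "Tue, 01 Oct 2013 12:35:31 +0000", "some-article")` yields `/bloomberg/2013/10/01/some-article`. The set returned by `remember` would have to be persisted somewhere between runs, which is the extra bookkeeping mentioned above.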

It may be possible, but IMHO it's not a good fit at the moment.

What's wrong with the rssriver?

--
Cheers
Ralf


(konrad) #8

On Tuesday, October 1, 2013 at 14:35:31 UTC+2, Ralf Schmitt wrote:

Ban Mido <banmidob...@gmail.com> writes:

Actually, I wanted to fetch the news article that each feed item in the
RSS links to,
which fits the "hierarchically organised data" description exactly.
So it would be:

RSS -> Set of news Link -> HTTP -> Set of news Content

it may be possible. But RSS only gives the latest N items, so you would
have to store which older articles are available (or use what is stored
in ES).

It may be possible, but IMHO it's not a good fit at the moment.

Hi Vineeth,

while I understand Ralf's assessment that it may not be a good fit at the moment, I absolutely want to encourage you to give it a shot!
From my point of view, rssriver could be the wrong solution whenever you have scaling requirements or want to use access control mechanisms for RSS feeds of protected resources.

es-nozzle is designed with expandability in mind. However, your specific use case will have to deal with some obstacles. As Ralf pointed out, RSS usually only gives you the most recent N results. An implementation that extends the vfs (virtual file system) and uses the standard ES connector (esconnect.clj) would treat older RSS items like 'deleted' items and purge them from ES. But that shouldn't be a showstopper (have a look at esconnect.clj, you could work around the remove parts in the sync-logic).
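
To illustrate the idea only: the following Python sketch is not the logic in esconnect.clj, and `plan_sync` and `purge_missing` are hypothetical names. A sync step that compares what the source currently lists with what is already indexed could make the removal part optional like this:

```python
def plan_sync(source_ids, indexed_ids, purge_missing=True):
    """Decide which items to add to and remove from the index.

    With purge_missing=False, items that no longer appear in the source
    (for example, RSS items that dropped out of the feed) are kept.
    """
    source_ids = set(source_ids)
    indexed_ids = set(indexed_ids)
    to_add = source_ids - indexed_ids
    to_remove = indexed_ids - source_ids if purge_missing else set()
    return sorted(to_add), sorted(to_remove)
```

With `purge_missing=True` this mirrors the usual filesystem behaviour, where a file that has disappeared is also removed from the index. For an RSS source one would call it with `purge_missing=False`, so that older items stay searchable.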

Furthermore es-nozzle doesn't incorporate any dedupe mechanism across datasources (or 'filesystems' in the es-nozzle docs lingo). This is something we leave to the search applications. That means, if you run against IBN and Bloomberg and happen to crawl a duplicate item, it will end up in both respective indices.
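
On the application side, merging such duplicates could look roughly like the Python sketch below. It assumes each crawled item is a dict with `title` and `source` keys, and that a case- and whitespace-insensitive title match is a good enough key; both are assumptions for illustration, and `merge_by_title` is a made-up name.

```python
def merge_by_title(items):
    """Collapse items with the same normalized title, keeping all sources."""
    merged = {}
    for item in items:
        # Normalize: lower-case and collapse runs of whitespace.
        key = " ".join(item["title"].lower().split())
        entry = merged.setdefault(key, {"title": item["title"], "sources": []})
        if item["source"] not in entry["sources"]:
            entry["sources"].append(item["source"])
    return list(merged.values())
```

Each merged entry keeps the first title seen and the list of all sources it came from, so the information that a story appeared in both feeds is preserved.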

Anyway, it is doable and I think you should try it! Let us know if you need any further help!

Regards, Konrad

brainbot technologies AG

