Elasticsearch Java client or Elasticsearch Java plugin to enrich all documents at index time?


(stefano ruggiero) #1

Hi all,

I am trying to create a custom node or plugin that listens to the whole stream
of data in an ES cluster and, if a particular document satisfies my
condition, adds a specific field or something similar. What is the best
approach? Are there any examples?

Let me try to explain my goal better:

  1. I would like to have a document-parsing job inside the ES input data
    stream.
  2. I would like to check whether a document has a particular syntax and, if
    so, add a custom field, or better, invoke another class that stores some
    information over time.
  3. Do this asynchronously; I mean, don't stop the indexing process, only
    read and parse documents while indexing.
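The steps above can be sketched client-side in plain Java. This is a minimal, hypothetical sketch (the names `DocEnricher`, `matchesCondition`, `enrich`, and the `geo`/`flagged` fields are all illustrative, and the document is modeled as a plain `Map` rather than any ES type):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical enricher: inspects a parsed document and, if it matches a
// condition, adds a custom field. All names here are illustrative.
public class DocEnricher {

    // The "particular syntax" check -- here simply: does the doc contain
    // a "geo" field whose value is "RU"?
    static boolean matchesCondition(Map<String, Object> doc) {
        return "RU".equals(doc.get("geo"));
    }

    // Enrich in place: add a marker field when the condition holds.
    static Map<String, Object> enrich(Map<String, Object> doc) {
        if (matchesCondition(doc)) {
            doc.put("flagged", Boolean.TRUE);
        }
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("geo", "RU");
        enrich(doc);
        System.out.println(doc.get("flagged")); // prints "true"
    }
}
```

In a real deployment this check would run either in the client before indexing, or inside a plugin hook; the sketch only shows the condition-and-enrich logic itself.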

Do you have any suggestions on where to start?

I am at the very initial point of deciding whether it is better to create a plugin or
whether this can be done with the Java client API provided by ES. (I expect there
to be some overridable hook like onParse(parsed doc) that lets me read the whole
flow of data...)

Thanks for your time and attention; I would appreciate any kind of help.

Regards.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #2

There are a few aspects to consider.

First, ES is distributed. That means it is very cumbersome to chase docs
while they are on their way from the client to the Lucene index (the
shard). There is, however, a built-in mechanism for this kind of matching: ES
provides a percolator. You can register queries, and given a doc to index,
clients can find out which queries matched. This works much like your
condition-detection scenario.
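The percolator idea can be illustrated in plain Java, independent of the ES API: queries are registered under a name, and each incoming document is matched against all of them. This is an in-memory analogue of the concept, not the actual ES implementation (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// In-memory analogue of the percolator: registered, named "queries"
// (predicates) are evaluated against each incoming document.
public class MiniPercolator {
    private final Map<String, Predicate<Map<String, Object>>> queries =
            new LinkedHashMap<>();

    // Register a named query.
    void register(String name, Predicate<Map<String, Object>> query) {
        queries.put(name, query);
    }

    // Return the names of all registered queries the document matches.
    List<String> percolate(Map<String, Object> doc) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, Object>>> e : queries.entrySet()) {
            if (e.getValue().test(doc)) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }
}
```

Usage would look like `p.register("ru-docs", d -> "RU".equals(d.get("geo")))`, then `p.percolate(doc)` to get the matching query names for each incoming doc.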

Second, it would be much easier to take full control over ingest if you
are after filtering docs that are going to be indexed. If there is only one
client, or only one doc source, you should be able to filter the docs for
the condition with little effort outside ES, which also takes load off the
cluster for this job.

And third, if you prefer parsing documents for word detection, look at the
basic Lucene methods. Lucene analyzes token streams for doc fields to
generate the tokens that are finally passed to the index. So maybe
you want to focus on the field level to examine words. ES uses the source
JSON just as a handy mechanism to create a list of fields for a Lucene doc.

Jörg



(stefano ruggiero) #3

Hi Jörg, thanks for the clarification.

But consider this scenario, where none of the features built into ES can be
applied:

I would like to parse every document at index time (yes, only documents
that I send to that specific node). If a document has RU geolocation,
invoke a class (my custom one) that stores this information in a hashmap,
then continue with the process. Three hours later I get another document
with RU geolocation and some other information, and I would like to add it
under the same hashmap key. Then, if the value associated with that key
exceeds 3 within three hours, drop all further RU documents, or index a new
document with this information, or report this behaviour to me.
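The scenario above (count RU documents within a rolling three-hour window and act once the count exceeds a threshold) can be sketched client-side with an ordinary map of timestamp queues. Everything here is a hypothetical illustration, not an ES hook:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Client-side correlator: per key (e.g. "RU"), keep the timestamps of
// recent documents and report when more than `threshold` fall inside
// the time window.
public class WindowedCounter {
    private final long windowMillis;
    private final int threshold;
    private final Map<String, Deque<Long>> seen = new HashMap<>();

    WindowedCounter(long windowMillis, int threshold) {
        this.windowMillis = windowMillis;
        this.threshold = threshold;
    }

    // Record one document for `key` at time `now`; return true when the
    // count within the window exceeds the threshold (i.e. time to drop
    // or flag further documents for this key).
    boolean record(String key, long now) {
        Deque<Long> times = seen.computeIfAbsent(key, k -> new ArrayDeque<>());
        times.addLast(now);
        // Evict timestamps that fell out of the window.
        while (!times.isEmpty() && now - times.peekFirst() > windowMillis) {
            times.removeFirst();
        }
        return times.size() > threshold;
    }
}
```

With `new WindowedCounter(3 * 3600_000L, 3)` the fourth RU document inside three hours trips the condition, while an RU document arriving after a quiet period does not.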

Is it possible to do this somehow by hooking into the ES indexing action? Do
you have any suggestions?

Yes, my goal is something like correlating documents with the same field and
getting an answer: an extension of the percolator (which, as implemented in
ES, is useless for me...).



(Jörg Prante) #4

Not sure where your docs come from; a stream, maybe. It looks like a
time-based data series.

If you have only one geolocation (RU) you are after, it is straightforward:
index all documents on a time-based scale with timestamps and the
geolocation as a geo field. Then you can periodically fire a geo-distance
filter query, join the hits, sort them by timestamp, and throw away the docs
that fall outside your range.
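The post-processing half of this suggestion (join the hits, sort by timestamp, drop everything outside the window) is plain client-side work. A sketch with a hypothetical `Hit` type standing in for a search hit (only the fields needed here are modeled):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical hit as returned by the periodic geo query; only the
// fields needed for the post-processing step are modeled.
public class HitWindow {
    static class Hit {
        final String id;
        final long timestamp;
        Hit(String id, long timestamp) { this.id = id; this.timestamp = timestamp; }
    }

    // Keep only the hits inside [from, to], sorted by timestamp.
    static List<Hit> inWindow(List<Hit> hits, long from, long to) {
        List<Hit> kept = new ArrayList<>();
        for (Hit h : hits) {
            if (h.timestamp >= from && h.timestamp <= to) kept.add(h);
        }
        kept.sort(Comparator.comparingLong(h -> h.timestamp));
        return kept;
    }
}
```

In the real setup the `hits` list would come from the periodic geo-distance query; the windowing and sorting stay the same.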

Jörg



(stefano ruggiero) #5

Yes, that is exactly what I have to do!

But I need to do it in real time (I can't do 100 searches per second; 100
searches = 100 different geo fields). That's why I am asking for a custom node
or plugin that lets me hook into the doc-indexing process.

Stefano




(Jörg Prante) #6

If you have 100 geo locations, my suggestion is as follows. The core of the
solution is translating a doc with a geo location into a geo name, which is
used as field content and retrieved via the percolator.

  • Register 100 geolocation queries with the percolator. The name of each
    registered query is equivalent to the content of the target field (ideally
    a geo location name).
  • Use a client to fire documents; the docs get auto-timestamped and have a
    geo location. Run them against the percolator API.
  • Use the geo location name obtained from the percolator API to retrieve an
    old doc, add info (the geo location name plus other info), and write a new doc.
  • Check out the index alias API. It might be useful for setting up virtual
    indexes: for example, each geo name could be exposed to the API as an ES
    index while you have only one concrete index for all 100 geo locations.
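The glue logic for the steps above can be sketched with a stubbed percolate call. In the real system `percolate` would be an ES API round trip returning matched query names; here it is a stand-in, and all names (`GeoPipeline`, `route`, the `events-*` aliases, the `geo_name` field) are illustrative:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the suggested pipeline: take the geo name that the
// percolator matched, copy it into the document, and derive the
// (aliased) index the enriched document should be written to.
public class GeoPipeline {

    // Stand-in for the percolate call: in the real system this would be
    // an ES API round trip returning the matched query names.
    static List<String> percolate(Map<String, Object> doc) {
        return "RU".equals(doc.get("geo")) ? List.of("russia") : List.of();
    }

    // Enrich the doc with the matched geo name and pick a per-geo alias.
    static String route(Map<String, Object> doc) {
        List<String> matches = percolate(doc);
        if (matches.isEmpty()) {
            return "events";           // concrete fallback index
        }
        String geoName = matches.get(0);
        doc.put("geo_name", geoName);  // target-field content == query name
        return "events-" + geoName;    // "virtual" index via alias
    }
}
```

The point of the alias step is that `events-russia` and its 99 siblings can all point at one concrete index, so clients query per geo name without multiplying indices.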

Jörg



(stefano ruggiero) #7

That's exactly what I have done, but it is too performance-intensive, and on
10+ GB of data per day it doesn't scale. If I want to check over more
days, my percolator is a real pain with the Logstash syntax. That's why we
are trying to understand how to do it in a built-in way (with a Java client
node or a plugin), because we need to solve the problem of real-time
analysis. Unfortunately, the percolator is just another way to search documents,
and it isn't really integrated with Logstash...

Stefano




(Jörg Prante) #8

It will not help you immediately (unless you want to test early code), but
there are known scaling issues with the percolator in the 0.90 version, which
will be fixed in 1.0.

Jörg



(Krushnat Khawale) #9

As I am creating a Java web application without Maven, I need the JARs for creating a Java client.
Where can I find the JARs?


(David Pilato) #10

Maven Central is the best place, IMO.
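Even without Maven, the JAR can be downloaded directly from Maven Central using the artifact coordinates. For reference, the dependency looks like this (the version shown is illustrative; pick the one matching your cluster):

```xml
<dependency>
  <groupId>org.elasticsearch</groupId>
  <artifactId>elasticsearch</artifactId>
  <!-- illustrative version; match your cluster -->
  <version>0.90.5</version>
</dependency>
```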


(system) #11