How to index XML data


(Billy Newman) #1

Hello again.

Does Elasticsearch support pulling XML data from various URLs into an index
out of the box, or with a plugin? I.e., given a set of URLs, can I have
Elasticsearch hit those URLs and index the data that comes back? Also
keep in mind updates and deletes from the index, depending on whether the
data coming back from a URL contains deleted/updated entries.

Thanks!



(Billy Newman) #2

I looked a little at the RSS river and it is not quite what I need. Not a
big deal at all, since I assume most people's indexing needs are
different. In this case, does Elasticsearch recommend writing your own
plugin to index data, or maybe just a simple client that runs every so
often to update the index?

In my case I have some RESTful URLs that I am going to hit to get data from
my system. That data will contain history, so I will get back new, updated,
and deleted 'things'. I will want to index new 'things', update updated
'things', and remove deleted 'things' from the index. Again: a plugin, or
maybe just the Java API to write a client that can do this work?

Thanks



(David Pilato) #3

I think that there are two questions here:

1/ How do we index XML?
I suggest using the mapper attachment plugin if you only want to flatten
your XML file and don't care about tags.
But if you need to map tags to fields, I suggest using Jackson to parse your
XML into a bean and rewrite it as a JSON document. (Other methods probably
exist as well.)
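To make the tag-to-field route concrete, here is a minimal sketch of flattening an XML element into a map that could then be serialized as JSON. It uses only the JDK's DOM parser rather than Jackson, and the `@attr`/`#text` naming convention is just one possible choice, not a prescribed format:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import java.io.ByteArrayInputStream;
import java.util.LinkedHashMap;
import java.util.Map;

public class XmlToJson {

    // Recursively convert a DOM element into a map: attributes are stored
    // under "@name" keys, text content under "#text", child elements as
    // nested maps. Note: repeated sibling elements overwrite each other
    // here; a real converter would collect them into lists.
    static Map<String, Object> toMap(Element el) {
        Map<String, Object> map = new LinkedHashMap<>();
        NamedNodeMap attrs = el.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            map.put("@" + a.getNodeName(), a.getNodeValue());
        }
        NodeList children = el.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node c = children.item(i);
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                map.put(c.getNodeName(), toMap((Element) c));
            } else if (c.getNodeType() == Node.TEXT_NODE
                    && !c.getTextContent().trim().isEmpty()) {
                map.put("#text", c.getTextContent().trim());
            }
        }
        return map;
    }

    static Map<String, Object> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        return toMap(doc.getDocumentElement());
    }
}
```

The resulting map can be handed to any JSON serializer (Jackson included) to produce the document body for indexing.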

2/ How do we fetch content?
You can fork the RSS river or the FS river to get an idea of how to build
your own river, as I don't think a web-crawling river exists right now.
You can also create a batch job that fetches your XML content, parses it to
JSON, and indexes it in ES. You could use an ETL for that need.

It really depends on your use case.

I hope this answers your questions, at least in part.

David.


--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



(Jörg Prante) #4

Hi Billy,

It would help if you could give an example of the data you want to index.

I did some work while implementing a JDBC river and I am aware of the
nature of (semi-)structured data.

There are several challenges: managing the data streams (initial loading,
incremental updates, and incremental deletes); mapping XML elements to JSON
(which is not straightforward if XML attributes and namespaces are
present); and efficient, reliable bulk indexing (important if the data
volume is high).
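On the bulk indexing point, the usual pattern is to buffer documents and flush them in fixed-size batches rather than issuing one request per document. A minimal, Elasticsearch-independent sketch of that buffering logic (the flush action here is just a callback; in a real client it would issue a bulk request, and uses modern Java for brevity):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffers items and hands them to a flush action in fixed-size batches,
// so downstream indexing can use bulk requests instead of per-doc calls.
public class BulkBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> flushAction;
    private final List<T> buffer = new ArrayList<>();

    public BulkBuffer(int batchSize, Consumer<List<T>> flushAction) {
        this.batchSize = batchSize;
        this.flushAction = flushAction;
    }

    public void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Flushes any buffered items; call once more at the end of the stream
    // so a final partial batch is not lost.
    public void flush() {
        if (!buffer.isEmpty()) {
            flushAction.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```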

I wonder if you are indexing from Atom feeds? The Atom protocol matches
your description exactly, and it is RESTful. The RSS river is a good start,
but AFAIK there is no Elasticsearch Atom river yet. I have plans for an
Apache Abdera-based Atom river, though.

In the case of bibliographic data, I managed to index from OAI servers,
where I tackled the task of XML indexing;
see https://github.com/jprante/elasticsearch-river-oai

Best regards,

Jörg



(Billy Newman) #5

First of all thanks so much for the responses guys!

A little more on my use case. To start, and to be fair, I am working on
some quick prototypes with both Solr and Elasticsearch to see if one works
better or worse for what I need.

I have to index 3 different data sets that are a little different. They do
overlap in some ways, so I would like to either index them all in the same
index, or use separate indexes and query all three.

2 of the 3 data sets are relatively small and are updated very frequently:
updates occur at least every 5 minutes on at most a couple hundred
records. I was thinking the best approach for this scenario was to just
recreate the index completely every so often. Both of these data sets are
returned as XML via a RESTful URL query, so writing an app using the
Elasticsearch Java API that fires every so often to delete the indexes and
recreate them looks to be a fairly easy task.

Data set #3 is my tricky case. I am hitting internal/in-house RESTful URLs
to get objects. Objects contain top-level data as well as child data, such
as reports, comments, images, etc. Using the URLs I can determine what has
changed since the last time I asked for 'objects'.
Deletes and creations are pretty easy cases to handle: just remove or create
the doc in the index. Updated objects are the tricky case, in which I would
either need to delete the object from the index and re-create it, or update
just the diffs, including children (comments, reports, etc.).
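The create/update/delete decision described above can be kept as a small piece of pure logic, independent of whether it ends up in a plugin or a standalone client. A hypothetical sketch (the change types and the mapping to index operations are assumptions about the change feed, not an Elasticsearch API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SyncPlanner {
    public enum ChangeType { CREATED, UPDATED, DELETED }
    public enum Op { INDEX, DELETE }

    // Maps change records from the REST feed to index operations. For
    // updates, the simplest strategy is to reindex the whole document,
    // children included; diff-based partial updates would be an
    // optimization layered on top of this.
    public static Map<String, Op> plan(Map<String, ChangeType> changes) {
        Map<String, Op> ops = new LinkedHashMap<>();
        for (Map.Entry<String, ChangeType> e : changes.entrySet()) {
            ops.put(e.getKey(),
                    e.getValue() == ChangeType.DELETED ? Op.DELETE : Op.INDEX);
        }
        return ops;
    }
}
```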

So in the end, my problem at a high level is probably something that has
been solved numerous times. I guess I still have the same question in my
mind: is this problem better suited to a plugin that is installed in
Elasticsearch, or better solved by writing an application that uses the
Java API to access the index and perform my create/update/delete
operations? My guess is the latter, but I am not 100% sure.

Sorry for the long-winded response, but any good ideas or articles I can
read would be greatly appreciated. Thanks again guys.

Billy



(Billy Newman) #6

Any more ideas???

Thanks again guys!



(Jörg Prante) #7

Hi Billy,

Just some general ideas; without an example of your XML data to understand
its nature, they may not be so helpful.

For object updates, using unique identifiers as references might be handy,
so that related objects can be indexed naturally against their "parent"
(the top-level dependency). A simple case is the parent/child mechanism of
Elasticsearch, but it depends on the type of your queries and how you want
to deliver the child objects in the response you send to the user.
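For reference, in the Elasticsearch versions current at the time, parent/child was declared in the mapping by pointing a child type at its parent type, roughly like this (the type names "comment" and "thing" are made up for illustration):

```json
{
  "comment": {
    "_parent": { "type": "thing" }
  }
}
```

Children are then indexed with the parent's id passed on the index request, and queries such as `has_child`/`has_parent` tie the two sides together at search time.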

Storing media objects (like images) in Elasticsearch may be too voluminous;
in that case, just store URLs for remote access.

A river plugin (a river is also a plugin) gives you the feature that
Elasticsearch takes over runtime control. Node failures will not degrade
the object indexing process at any time while the cluster is running,
because the river execution will move to another node.

A standalone app needs its own execution control: you need to check
Elasticsearch cluster availability yourself, and you need to execute the
CRUD ops regularly. This may or may not be an advantage.
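A minimal sketch of that execution control for the standalone route, using only the JDK's scheduler; the fetch-and-index cycle is a placeholder Runnable here, not a real Elasticsearch call:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Runs a fetch-and-index cycle on a fixed schedule. The cycle itself
// (check cluster availability, fetch XML, convert, bulk index) is
// injected as a Runnable so it can be tested in isolation.
public class IndexPoller {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Runnable cycle;

    public IndexPoller(Runnable cycle) {
        this.cycle = cycle;
    }

    // One guarded cycle: a failing run is logged and retried on the next
    // tick instead of killing the scheduler.
    public void runOnce() {
        try {
            cycle.run();
        } catch (RuntimeException e) {
            System.err.println("indexing cycle failed, will retry: " + e.getMessage());
        }
    }

    public void start(long periodMinutes) {
        scheduler.scheduleAtFixedRate(this::runOnce, 0, periodMinutes, TimeUnit.MINUTES);
    }

    public void stop() {
        scheduler.shutdown();
    }
}
```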

Processing XML to obtain an object model can be tricky. For larger, more
generic projects, it is advisable to abstract the object model
implementation away from Elasticsearch indexing, so you don't have to
commit to either an ES plugin or a standalone app. The ES river plugin
would then only contain the object model construction code and the specific
code to generate JSON docs from it. And you can more easily move your data
to other storage mechanisms in the future.

You can depend strictly on Elasticsearch's XContentBuilder (which also
understands SMILE and YAML syntax). If you need object model independence
(or "loosely coupled" dependency), consider whether you want to base your
data model on something like RDF (which is schemaless resource description
metadata, good for solving the XML attribute/namespace conversion problem
and for resolving object impedance mismatches) or JSON (to match
Elasticsearch's document model).

Best regards,

Jörg


