Just some general ideas; without an example showing the nature of your XML data, they may not be very helpful.
For object updates, using unique identifiers as references is handy, so that related objects can be indexed naturally against their "parent" (the top-level object they depend on). A simple case is Elasticsearch's parent/child mechanism, but whether it fits depends on the type of your queries and on how you want to deliver the child objects in the response you send to the user.
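As an illustration of the parent/child mechanism (type names here are made up), the child type declares its parent type in the index mapping:

```json
{
  "mappings": {
    "myobject": {},
    "comment": {
      "_parent": { "type": "myobject" }
    }
  }
}
```

Child docs are then indexed with a `parent` request parameter so they land on the parent's shard, and queries such as `has_child` can return parents based on child criteria.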
Storing media objects (like images) in Elasticsearch may be too voluminous; in that case, just index URLs and fetch the media remotely.
A river plugin gives you the feature that Elasticsearch takes over runtime control. Node failures will not degrade the object indexing process at any time while the cluster is running, because the river execution moves to another node.
A standalone app needs its own execution control: you have to check Elasticsearch cluster availability yourself, and you have to execute the CRUD ops on your own schedule. This may or may not be an advantage.
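As a sketch of what that standalone control loop might look like (the `Runnable` stands in for your actual fetch/transform/index work; nothing here is Elasticsearch API):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the standalone approach: the app, not Elasticsearch,
// owns the schedule, and it must also handle cluster availability and
// retries itself.
class PeriodicIndexer {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void start(Runnable indexOnce, long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                indexOnce.run(); // e.g. fetch XML, transform, bulk index
            } catch (Exception e) {
                // a failure here is this app's problem to handle -- unlike
                // a river, which Elasticsearch relocates on its own
                e.printStackTrace();
            }
        }, 0, period, unit);
    }

    void stop() {
        scheduler.shutdown();
    }
}
```

A river gives you this loop (and its failover) for free; the standalone version gives you full control over scheduling and error handling instead.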
Processing XML to obtain an object model can be tricky. For larger, more generic projects, it is advisable to abstract the object model implementation away from the Elasticsearch indexing code, so that the choice between an ES plugin and a standalone app stops being a hard commitment. An ES river plugin would then just combine the object model construction code with the special code that generates JSON docs from it. And you can move your data to other storage mechanisms more easily in the future.
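A minimal sketch of such an abstraction, with made-up names (nothing here is an existing Elasticsearch API): the XML-to-object-model code only sees the interface, and the Elasticsearch-specific part stays replaceable.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical abstraction layer: one implementation would wrap the
// Elasticsearch Java API (or sit inside a river), another could target
// a different store later.
interface DocumentSink {
    void index(String id, Map<String, Object> doc);
    void delete(String id);
}

// A trivial in-memory implementation, useful for testing the object model
// code without any cluster.
class InMemorySink implements DocumentSink {
    final Map<String, Map<String, Object>> docs = new HashMap<>();
    public void index(String id, Map<String, Object> doc) { docs.put(id, doc); }
    public void delete(String id) { docs.remove(id); }
}
```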
You can depend strictly on Elasticsearch's XContentBuilder (which can also read and write SMILE and YAML syntax). If you need object model independence (or a "loosely coupled" dependency), check whether you want to base your data model on something like RDF (schemaless resource description metadata, good for solving the XML attribute/namespace conversion problem and for resolving object impedance mismatches) or on JSON (to match Elasticsearch's document model).
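For instance, the XML attribute conversion problem can be sketched with nothing but the JDK. This is not XContentBuilder (which would handle the JSON side properly) and it is deliberately naive -- no namespaces, no nesting, no JSON string escaping -- it just maps each attribute to an "@name" key and each child element to a string field:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Flat, illustrative XML-to-JSON converter: attributes become "@name" keys,
// child elements become string fields. Real conversion needs namespace and
// nesting rules, plus proper JSON escaping.
class XmlToJson {
    static String convert(String xml) throws Exception {
        Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element root = d.getDocumentElement();
        StringBuilder sb = new StringBuilder("{");
        NamedNodeMap attrs = root.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            if (sb.length() > 1) sb.append(",");
            sb.append("\"@").append(a.getNodeName()).append("\":\"")
              .append(a.getNodeValue()).append("\"");
        }
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node c = children.item(i);
            if (c.getNodeType() != Node.ELEMENT_NODE) continue;
            if (sb.length() > 1) sb.append(",");
            sb.append("\"").append(c.getNodeName()).append("\":\"")
              .append(c.getTextContent()).append("\"");
        }
        return sb.append("}").toString();
    }
}
```

So `<doc id="1"><title>hi</title></doc>` becomes `{"@id":"1","title":"hi"}` -- exactly the kind of conversion decision RDF or a JSON-first model would make for you.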
On Thursday, October 4, 2012 5:55:19 AM UTC+2, Billy Newman wrote:
First of all thanks so much for the responses guys!
A little more on my use case. To start, and to be fair, I am working on some quick prototypes with Solr and Elasticsearch to see if one works better or worse for what I need.
I have to index 3 data sets that are each a little different. They do overlap in some ways, so I would like to either index them all in the same index or use separate indexes and query all three.
2 of the 3 data sets are relatively small and are updated very frequently: updates occur at least every 5 minutes on at most a couple hundred records. I was thinking the best approach for this scenario was just to recreate the index completely every so often. Both of these data sets are returned as XML via a RESTful URL query, so writing an app that uses the Elasticsearch Java API and fires every so often to delete the indexes and recreate them looks to be a fairly easy task.
Data set #3 is my tricky case. I am hitting internal/in-house RESTful URLs to get objects. Objects contain top-level data as well as child data, such as reports, comments, images, etc. Using the URLs I can determine what has changed since the last time I asked for 'objects'. Deletes and creates are pretty easy cases to handle: just remove or create the doc in the index. Updated objects are the tricky case, in which I would either need to delete the object from the index and re-create it, or update just the diffs, including children (comments, reports, etc.).
So in the end my problem at a high level is probably something that has been solved numerous times. I guess I still have the same question in my mind: is this problem better suited to a plugin installed in Elasticsearch, or better solved by writing an application that uses the Java API to access the index and perform my create/update/delete operations? My guess is the latter, but I am not 100% sure.
Sorry for the long-winded response, but if anyone has good ideas or articles I can read, that would be greatly appreciated. Thanks again guys.
On Wednesday, October 3, 2012 7:48:40 AM UTC-6, Billy Newman wrote:
Does Elasticsearch support pulling XML data from various URLs into an index, either out of the box or with a plugin? I.e., given a set of URLs, can I have Elasticsearch hit those URLs and index the data that comes back? Also keeping in mind updates and deletes from the index, depending on whether the data coming back from a URL contains deleted/updated entries.