I am looking at ES as a replacement for the 3rd-party index server that we
are currently using on our property search website. I have not installed ES yet
as I'm just gathering information at this point, but I do expect to install
a test instance within the next couple of days and start working with it.
Our main need is to index a set of property data that comes out of our SQL
Server database. We are talking about only a relatively small number of db
records (50k initially, but it could go to a couple hundred thousand depending
on how we leverage ES). My concern is that our data updates frequently, and
we will want to keep somewhat close to real-time data in our index server.
This will require a rather constant stream of creates, updates and deletes
to the target indexes.
Based on what I've read so far, there appear to be 2 applicable ways to
get SQL data into the index server: 1) a River, and 2) an ETL process that will
probably post to the "_bulk" API method.
A "River" seems like the easiest method, but the samples I've seen have
to do with creating/updating documents in the index and I've not seen any
way to delete documents. Is there a way to delete documents via a river and
if so how is that done?
An ETL "Post" to the "_bulk" API method will require a little more work,
but it appears to be a more robust solution that directly supports deleting
documents. The samples/API docs, however, only refer to "Create" and "Delete"
methods for handling documents, and I'm a bit confused about how the "Create"
method handles updates. I've read the info on versioning that applies to
creates, but I'm not really interested in any previous version of a
document. I want "Create" actions against existing documents simply to
overwrite the existing document. Is there anything I need to do/set up to make
sure that any existing document is overwritten?
Once data is posted, it needs to be available for searching immediately. Are
there any concerns I need to address from an "Index" standpoint to make
this possible?
Other things to know:
I am initially looking at using a single "Index" that would contain all
of our property records. Is there any reason other than fault tolerance not
to do this?
I am initially looking at using a simple single-, possibly 2-server
cluster setup. Is there any reason other than fault tolerance not to do
this?
As far as searching goes, we are looking to do some basic field searching
(matches, starts-with, contains) in combination with some geospatial
searching (polygon and radius).
ETL: To me, ETL means that an external process gets information from
your database (Extract), then parses it and creates documents (Transform), and
then indexes them in Elasticsearch (Load). Fine. I think there is a third
option here.
Index from your business or service layer: I don't know if you have access
to the code that creates, updates and deletes your entities, but let's imagine
you do.
Just after your C(R)UD operations, you can index or delete documents in
Elasticsearch directly. It won't cost you much (in terms of performance).
You can also do this asynchronously with ActiveMQ or RabbitMQ: each time a
creation or deletion occurs, you send the corresponding action to
Elasticsearch.
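As a minimal sketch of this push-from-the-service-layer idea (the host, index name, type name and document fields below are all hypothetical assumptions, not anything from your setup):

```python
import json

ES_URL = "http://localhost:9200"  # assumed local test instance

def index_request(index, doc_type, doc_id, doc):
    """Build the URL and JSON body for a PUT that creates or
    overwrites the document with this id."""
    return f"{ES_URL}/{index}/{doc_type}/{doc_id}", json.dumps(doc)

def delete_request(index, doc_type, doc_id):
    """Build the URL for a DELETE of the same document."""
    return f"{ES_URL}/{index}/{doc_type}/{doc_id}"

# Hypothetical property record, pushed right after the SQL update commits
# (send with any HTTP client, e.g. urllib.request):
url, body = index_request("properties", "property", "42",
                          {"address": "1 Main St", "price": 250000})
```

The same two calls sit naturally right after your create/update and delete code paths, or in the consumer that drains the message queue.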
For the first load (aka initializing ES), you can create a batch process that
reads all your entities and indexes them 10,000 at a time, for example, using
bulk requests.
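For that first load, a `_bulk` body is just newline-delimited action/source pairs; here is a sketch (the index name, type name and fields are made-up examples):

```python
import json

def bulk_index_body(index, doc_type, docs, id_field="id"):
    """Build a _bulk body: one action line plus one source line per doc.
    The body must end with a newline or ES rejects the last pair."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type,
                                           "_id": str(doc[id_field])}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

def batches(seq, size):
    """Split the full entity list into batches (e.g. 10,000 per bulk call)."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# 25 fake records in batches of 10 -> three POSTs to /_bulk:
docs = [{"id": n, "price": 100000 + n} for n in range(25)]
bodies = [bulk_index_body("properties", "property", b)
          for b in batches(docs, 10)]
```

Each body would then be POSTed to `http://host:9200/_bulk`.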
This last option is the one I used on a previous project and we were happy with
that.
Some notes:
When you send a document to ES, by default ES will index it or update it. ES
never stores different versions of the same doc; versioning is there for
concurrency management.
When a document is posted, you can GET it immediately, but you will have to
wait about 1 second before it becomes searchable.
If you really need to, you can call the refresh API to make your documents
searchable right away, but you should avoid using it. Does your use case
really require that a document appear in search results the moment it is sent?
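If you do end up needing it, the refresh call is just an empty POST; a sketch (host and index name are assumptions):

```python
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local test instance

def refresh_request(index):
    """Build a POST to /{index}/_refresh; send it with
    urllib.request.urlopen(). This makes just-indexed documents
    searchable at once, at the cost of extra Lucene segment churn,
    so use it sparingly."""
    return urllib.request.Request(f"{ES_URL}/{index}/_refresh",
                                  data=b"", method="POST")

req = refresh_request("properties")
```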
About your hardware: it depends on whether you want to avoid downtime. If you
want a real cluster, start 2 nodes on 2 different boxes; if one box crashes,
your users won't see the difference.
A single index seems fine, but it's hard to tell without knowing your use case
and what kind of documents you are going to index.
A "River" seems like the easiest method, but the samples I've seen have
to do with creating/updating documents in the index and I've not seen any
way to delete documents. Is there a way to delete documents via a river and
if so how is that done?
Technically, a river is an embedded application that runs as a singleton
in the cluster: only one node at a time will run the river. A river can do
anything, not just indexing. Since it runs inside the JVM, it can use any
exposed module, like any other plugin. That said, AFAIK rivers have only
been used for indexing, which is their main use case in practice. Most rivers
will bulk index and can delete documents as well; the full indexing API can
be used.
An ETL "Post" to the "_bulk" API method will require a little more work,
but it appears to be a more robust solution that directly supports deleting
documents. The samples/API docs, however, only refer to "Create" and "Delete"
methods for handling documents, and I'm a bit confused about how the "Create"
method handles updates. I've read the info on versioning that applies to
creates, but I'm not really interested in any previous version of a
document. I want "Create" actions against existing documents simply to
overwrite the existing document. Is there anything I need to do/set up to make
sure that any existing document is overwritten?
Elasticsearch only keeps track of version numbers, not the contents of each
version. Many developers ignore the version number returned by ES.
Elasticsearch uses Lucene underneath, so it has no true in-place update
mechanism: each update is actually an atomic delete-and-create.
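So for your bulk sync, the "index" action is the overwrite you want, and deletes can ride along in the same request; a sketch (index, type and field names are hypothetical):

```python
import json

def bulk_sync_body(index, doc_type, changed_docs, deleted_ids):
    """One _bulk body that overwrites changed docs and removes deleted ones.
    The "index" action replaces any existing doc with the same _id (an atomic
    delete-create underneath); "create" would instead fail with a version
    conflict when the _id already exists."""
    lines = []
    for doc in changed_docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type,
                                           "_id": str(doc["id"])}}))
        lines.append(json.dumps(doc))
    for doc_id in deleted_ids:
        lines.append(json.dumps({"delete": {"_index": index, "_type": doc_type,
                                            "_id": str(doc_id)}}))
    return "\n".join(lines) + "\n"

body = bulk_sync_body("properties", "property",
                      [{"id": 1, "price": 199000}], [2])
```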
Once data is posted, it needs to be available for searching immediately. Are
there any concerns I need to address from an "Index" standpoint to make
this possible?
Data will become searchable after the refresh interval, the default being 1
second. GET requests, however, will return the latest data, since they work
off both the index and the transaction log.
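That 1-second figure is the index's refresh_interval, which is tunable per index; a sketch of the settings body (the index name is assumed):

```python
import json

# Body for PUT /properties/_settings; "1s" is the default, and raising it
# (e.g. "30s") trades search freshness for indexing throughput.
settings_body = json.dumps({"index": {"refresh_interval": "1s"}})
```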
Other things to know:
I am initially looking at using a single "Index" that would contain all
of our property records. Is there any reason other than fault tolerance not
to do this?
Sounds good. A single index is ideal for many reasons, such as fewer OS-level
file resources and better shard allocation (no competing with other
indices).
I am initially looking at using a simple single-, possibly 2-server
cluster setup. Is there any reason other than fault tolerance not to do
this?
Sounds good.
As far as searching goes, we are looking to do some basic field searching
(matches, starts-with, contains) in combination with some geospatial
searching (polygon and radius).
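For the radius part, a geo_distance filter over a geo_point field is the usual approach; a sketch of the query body (the "location" field, coordinates and distance are all hypothetical):

```python
import json

# Query body for POST /properties/property/_search: match everything,
# filtered to docs whose geo_point "location" lies within 10km of a center.
query_body = json.dumps({
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {
                "geo_distance": {
                    "distance": "10km",
                    "location": {"lat": 40.73, "lon": -73.99}
                }
            }
        }
    }
})
```

Polygon searches work the same way with a geo_polygon filter listing the vertex points.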
Wow, thanks for the quick responses to my questions. Over the next day or
so I'll look into the info supplied and post back with my
findings/additional questions.
Thanks
Lee
On Wednesday, January 16, 2013 11:00:31 AM UTC-5, Lee Duhl wrote: