Newb questions on indexing data from SQL Server and how to set up ES

I am looking at ES as a replacement for the 3rd-party index server that we
are currently using on a property search website. I have not installed ES yet
as I’m just gathering information at this point, but I do expect to install
a test instance within the next couple of days and start working with it.

Our main need is to index a set of property data that comes out of our SQL
Server database. We are talking about only a relatively small number of DB
records (50k initially, but it could go to a couple hundred thousand depending
on how we leverage ES). My concern is that our data updates frequently and
we will want to keep somewhat close to real-time data in our index server.
This will require a fairly constant stream of creates, updates, and deletes
to the target indexes.

Based on what I've read so far, there appear to be two applicable ways to
get SQL data into the index server: 1) a river, and 2) an ETL process that
posts to the "_bulk" API endpoint.

  1. A "River" seems like the easiest method, but the samples I've seen have
    to do with creating/updating documents in the index and I've not seen any
    way to delete documents. Is there a way to delete documents via a river and
    if so how is that done?

  2. An ETL "post" to the "_bulk" API endpoint will require a little more
    work, but it appears to be a more robust solution that directly supports
    deleting documents. The samples and API docs, however, only refer to
    "create" and "delete" actions for handling documents, and I'm a bit
    confused about how the "create" action handles updates. I've read the info
    on versioning that applies to creates, but I'm not really interested in
    any previous version of a document. I want a "create" action on an
    existing document to simply overwrite it. Is there anything I need to do
    or set up so that any existing document is overwritten? (I sketch the kind
    of payload I mean just after this list.)

Once data is posted, it needs to be available for searching immediately. Are
there any concerns I need to address from an "index" standpoint to make
this possible?

Other things to know:

  • I am initially looking at using a single "Index" that would contain all
    of our property records. Is there any reason other than fault tolerance not
    to do this?

  • I am initially looking at a simple single-server, possibly two-server,
    cluster setup. Is there any reason other than fault tolerance not to do
    this?

  • As far as searching goes, we are looking to do some basic field searching
    (matches, starts-with, contains) in combination with some geospatial
    searching (polygon and radius).

Any info provided will be greatly appreciated.

Thanks

Lee

--

Hi Lee,

I would say first that you don't have to wait to install and launch an instance.
It takes only a few seconds to get up and running! :wink:

That said, yes you're right, there are some options here:

  1. river: Jörg did an amazing job with the JDBC river, and it could be a
    good starting point since it won't cost you much development to get
    everything running.
    About deletions, look at what Jörg wrote on managing updates with
    Elasticsearch versioning:
    https://github.com/jprante/elasticsearch-river-jdbc#managing-updates-with-elasticsearch-versioning

  2. ETL: To me, ETL means that an external process gets information from
    your database (Extract), parses it and creates documents (Transform), and
    then indexes them into Elasticsearch (Load). Fine. I think there is a
    third option here, though.

  3. index from your business or service layer: I don't know if you have
    access to the code that creates, updates, and deletes your entities, but
    let's imagine you do.
    Just after C(R)UD operations, you can index or delete documents in
    Elasticsearch directly. It won't cost you much (in terms of performance).
    You can also do it asynchronously with ActiveMQ or RabbitMQ, so each time
    a creation or deletion occurs, you send the corresponding action to
    Elasticsearch.
    For the first load (i.e., initializing ES), you can create a batch that
    reads all your entities and indexes them 10,000 at a time using bulk
    requests (a sketch follows below).

This last option is the one I used on a previous project, and we were happy
with it.
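
To give an idea, the initial load could look something like this sketch in
Python against the HTTP "_bulk" endpoint. Everything here (index/type names,
the fetch_all_entities() helper) is hypothetical:

```python
import json
import requests

BATCH_SIZE = 10000

def fetch_all_entities():
    # Placeholder: replace with a cursor over your SQL Server property table.
    return []

def bulk_index(docs):
    # Newline-delimited bulk body: one action line plus one source line per
    # doc. The "index" action creates or overwrites, which is what you want.
    lines = []
    for doc in docs:
        lines.append(json.dumps(
            {"index": {"_index": "properties", "_type": "property",
                       "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    requests.post("http://localhost:9200/_bulk",
                  data="\n".join(lines) + "\n")

batch = []
for entity in fetch_all_entities():
    batch.append(entity)
    if len(batch) == BATCH_SIZE:
        bulk_index(batch)
        batch = []
if batch:
    bulk_index(batch)  # index the remainder
```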

Some notes:
When you send a document to ES, by default ES will index it or update
(overwrite) it. ES never stores different versions of the same doc;
versioning is there for concurrency management.
When a document is posted, you can GET it immediately, but you will have to
wait about one second before it becomes searchable.
If you really need to, you can call the refresh API to make your documents
searchable right away, but you should avoid overusing it. Does your use case
really require that a document appear in search results the moment it is sent?
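
For completeness, the refresh call itself is just one POST (the index name
below is hypothetical), but again, use it sparingly:

```python
import requests

# Force a refresh so everything indexed so far becomes searchable now.
# Expensive if called often; prefer the default periodic refresh.
requests.post("http://localhost:9200/properties/_refresh")
```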

About your hardware: it depends on whether you want to avoid downtime. If you
want a real cluster, start two nodes on two different boxes; if one box
crashes, your users won't see the difference.
A single index seems fine, but it's hard to tell without knowing your use case
and what kind of documents you are going to index.

Does it help?
David.


--
David Pilato
http://www.scrutmydocs.org/
http://dev.david.pilato.fr/
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

--

Hi Lee,

Responses inline.

On Wed, Jan 16, 2013 at 8:00 AM, Lee Duhl leevduhl@gmail.com wrote:

  1. A "River" seems like the easiest method, but the samples I've seen have
    to do with creating/updating documents in the index and I've not seen any
    way to delete documents. Is there a way to delete documents via a river and
    if so how is that done?

Technically, a river is an embedded application that runs as a singleton in
the cluster; only one node at a time will run it. A river can do anything,
not just indexing. Since it runs inside the JVM, it can use any exposed
module like any other plugin, although AFAIK rivers have only ever been used
for indexing in practice. Most rivers bulk index and can delete documents as
well; the full indexing API is available to them.
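
For example, a river is registered by PUTting a _meta document into the
_river index. A rough sketch with the JDBC river (the river name, connection
details, and SQL are placeholders; check Jörg's README for the exact options):

```python
import json
import requests

# Register a JDBC river by PUTting its _meta document (all settings below
# are placeholders; see the river's README for the real option names).
river_config = {
    "type": "jdbc",
    "jdbc": {
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "url": "jdbc:sqlserver://dbhost:1433;databaseName=properties",
        "user": "es_reader",
        "password": "secret",
        "sql": "select id as _id, * from properties",
    },
}

resp = requests.put("http://localhost:9200/_river/property_river/_meta",
                    data=json.dumps(river_config))
print(resp.json())
```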

  2. An ETL "post" to the "_bulk" API endpoint will require a little more
    work, but it appears to be a more robust solution that directly supports
    deleting documents. The samples and API docs, however, only refer to
    "create" and "delete" actions for handling documents, and I'm a bit
    confused about how the "create" action handles updates. I've read the info
    on versioning that applies to creates, but I'm not really interested in
    any previous version of a document. I want a "create" action on an
    existing document to simply overwrite it. Is there anything I need to do
    or set up so that any existing document is overwritten?

Elasticsearch only keeps track of version numbers, not the contents of each
version; many developers ignore the version number returned by ES.
Elasticsearch uses Lucene underneath, so it has no true in-place update
mechanism: each update is actually an atomic delete-create.
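
A quick illustration (hypothetical index and document): indexing the same ID
twice simply replaces the document and bumps _version.

```python
import json
import requests

url = "http://localhost:9200/properties/property/1001"  # hypothetical doc

# First index: the document is created and _version is 1.
r1 = requests.put(url, data=json.dumps({"price": 250000}))
print(r1.json()["_version"])  # -> 1

# Indexing the same id again overwrites the doc; _version becomes 2.
r2 = requests.put(url, data=json.dumps({"price": 260000}))
print(r2.json()["_version"])  # -> 2
```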

Once data is posted, it needs to be available for searching immediately. Are
there any concerns I need to address from an "index" standpoint to make
this possible?

Data will become available after the refresh interval, the default being 1
second. GET requests will return the latest data since they work from both
the index and the transaction log.
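
If near-real-time matters to you, the refresh interval is also tunable per
index. A sketch (the index name is hypothetical; 1s is already the default,
so lowering it trades extra I/O for freshness):

```python
import json
import requests

# Tune how often the index refreshes (1s is the default). Lower values
# make documents searchable sooner at the cost of extra I/O.
requests.put("http://localhost:9200/properties/_settings",
             data=json.dumps({"index": {"refresh_interval": "500ms"}}))
```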



Other things to know:

  • I am initially looking at using a single "Index" that would contain all
    of our property records. Is there any reason other than fault tolerance not
    to do this?

Sounds good. A single index is ideal for many reasons, such as fewer OS-level
file resources and better shard allocation (no competing with other indices).


  • I am initially looking at a simple single-server, possibly two-server,
    cluster setup. Is there any reason other than fault tolerance not to do
    this?

Sounds good.


  • As far as searching goes, we are looking to do some basic field searching
    (matches, starts-with, contains) in combination with some geospatial
    searching (polygon and radius).

That's what a search engine does!
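
For instance, a radius search can be expressed as a geo_distance filter over
a geo_point field, and a polygon search works the same way with geo_polygon.
A sketch (index, type, and field names are made up):

```python
import json
import requests

# Radius search: properties within 10km of a point. Assumes "location"
# is mapped as a geo_point; a polygon search swaps in a geo_polygon filter.
query = {
    "query": {
        "filtered": {
            "query": {"match_all": {}},
            "filter": {
                "geo_distance": {
                    "distance": "10km",
                    "location": {"lat": 42.33, "lon": -83.05},
                }
            },
        }
    }
}

resp = requests.post("http://localhost:9200/properties/property/_search",
                     data=json.dumps(query))
print(resp.json()["hits"]["total"])
```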

Cheers,

Ivan

--

Wow, thanks for the quick responses to my questions. Over the next day or
so I'll look into the info supplied and post back with my
findings/additional questions.

Thanks
Lee


--