Ideal method of indexing data into ES from SQL Server?


(Joshua P) #1

Hey everyone,

I'm trying to figure out the best way to implement Elasticsearch in our
company.

We currently have a SQL Server that houses all of our data and want to get
some better search and analytics functionality so ES seems like an awesome
option.

I've explored using the JDBC river to index data and I've successfully set
up elasticsearch with some test data.

There are some things I'm not clear on:

  1. The status of rivers in Elasticsearch. Another thread suggested that
    they were convenience feature added at the beginning to get data out of
    CouchDB and into ES easily. Are they going to be deprecated soon? It was
    suggested by on of the ES developers that companies should be coming up
    with their own solutions for indexing data.
    1. On that note, what does a custom indexing solution look like in a
      scenario like mine? Does this mean leveraging the provided APIs to create a
      custom solution that polls the SQL server for changes?
    2. Should I be modifying our middle tier to also queue up update
      requests as a user modifies data?
    3. I've read RabbitMQ can be used for queuing up bulk requests, but
      isn't this just done via another river and therefore has the same concerns
      as above?
  2. Does using the JDBC river scale well as the size of your data grows?
    I might have this wrong, but rivers don't allow you to take advantage of
    the distributed nature of ES?
  3. The syntax on the JDBC river for creating nested documents in the ES
    index. For example I have an PrimaryAddress table and an AddressAlias table
    which contains alternate addresses/names for a single PrimaryAddress. Can
    this be modeled in an ES index? The number of AddressAliases isn't the same
    for each PrimaryAddress

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cbadcb61-a0c0-4907-9182-5075566b8a7d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Jörg Prante) #2
  1. Yes, rivers will be deprecated soon. But, and that is the good news,
    even without rivers, I will continue to offer a plugin-based solution for
    pulling data from external sources, for instance JDBC. The working name of
    this project is the "gatherer", which can organize data fetch processes by
    delegating "jobs" to gatherer nodes (similar to web crawler jobs).

1.1. From an architectural point of view, it is easier to plug an ES "push"
indexer routine at that tier where data is inserted/updated into the RDBMS,
to execute the action twice, one for RDBMS, second for ES. Beside
performance and scalability aspects, this is much better suited to detect
data anomalies (deletions, updates, or data parts that should not be
indexed). Poll is inferior to push - the number of polling clients is
strictly limited by server capacity while push has no limits at all.

1.2. Not sure why update requests should be something special. If you want
instant indexing or even real-time indexing, you must tweak your middle
tier or RDBMS to do so. You do not need a queue unless you are unsure about
how to organize a store-and-forward process for reindexing data, or reset
the indexer process to a safe commit point.

1.3. Yes, RabbitMQ is just another part in such a push queue. This might
help in case you want to achieve reliability by loosely coupled systems
(RDBMS, queue, ES are all independent) that can be managed separately. If
you don't need persistent streams, you also don't need message queues.

  1. The JDBC river, like all rivers, can run in many instances, distributed
    over as many ES nodes as you like. But this is a nightmare to manage. River
    failovers in case of node outages are not seamless, data from RDBMS source
    might be dropped (this is the motivation for message queues). The JDBC
    river can fetch JDBC result rows as fast as possible, even concurrently.
    And it uses bulk indexing and distributes the load over all nodes that hold
    the index. The point is, a river instance is a singleton, and if this
    singleton fails, you're in trouble. There is no coordination between river
    instances in polling data or to reset to safe points. Each river instance
    stands alone for itself, and that does not scale, it does not fit into the
    picture of all other ES components that are aware of their distributed
    nature.

  2. Yes, this can be modeled. By using SQL column names with dot notation
    and array elements, you can perform hierarchical construction of JSON docs,
    very flexible, perfect for denormalization. If you have a concrete example,
    I might be able to suggest more.

Jörg

On Fri, Mar 7, 2014 at 4:35 PM, Joshua P jpetersen841@gmail.com wrote:

Hey everyone,

I'm trying to figure out the best way to implement Elasticsearch in our
company.

We currently have a SQL Server that houses all of our data and want to get
some better search and analytics functionality so ES seems like an awesome
option.

I've explored using the JDBC river to index data and I've successfully set
up elasticsearch with some test data.

There are some things I'm not clear on:

  1. The status of rivers in Elasticsearch. Another thread suggested
    that they were convenience feature added at the beginning to get data out
    of CouchDB and into ES easily. Are they going to be deprecated soon? It was
    suggested by on of the ES developers that companies should be coming up
    with their own solutions for indexing data.
    1. On that note, what does a custom indexing solution look like in
      a scenario like mine? Does this mean leveraging the provided APIs to create
      a custom solution that polls the SQL server for changes?
    2. Should I be modifying our middle tier to also queue up update
      requests as a user modifies data?
    3. I've read RabbitMQ can be used for queuing up bulk requests, but
      isn't this just done via another river and therefore has the same concerns
      as above?
  2. Does using the JDBC river scale well as the size of your data
    grows? I might have this wrong, but rivers don't allow you to take
    advantage of the distributed nature of ES?
  3. The syntax on the JDBC river for creating nested documents in the
    ES index. For example I have an PrimaryAddress table and an AddressAlias
    table which contains alternate addresses/names for a single PrimaryAddress.
    Can this be modeled in an ES index? The number of AddressAliases isn't the
    same for each PrimaryAddress

Thanks!

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/cbadcb61-a0c0-4907-9182-5075566b8a7d%40googlegroups.comhttps://groups.google.com/d/msgid/elasticsearch/cbadcb61-a0c0-4907-9182-5075566b8a7d%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFQoJvjtzX9%3D8%3Df0h3s3W7QedoCqVUUjacGRLwc9Yyt7A%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(system) #3