Ideal method of indexing data into ES from SQL Server?

Joshua_P · March 7, 2014, 3:35pm

Hey everyone,

I'm trying to figure out the best way to implement Elasticsearch in our
company.

We currently have a SQL Server that houses all of our data and want to get
some better search and analytics functionality so ES seems like an awesome
option.

I've explored using the JDBC river to index data and I've successfully set
up elasticsearch with some test data.

There are some things I'm not clear on:

The status of rivers in Elasticsearch. Another thread suggested that
they were convenience feature added at the beginning to get data out of
CouchDB and into ES easily. Are they going to be deprecated soon? It was
suggested by on of the ES developers that companies should be coming up
with their own solutions for indexing data.
1. On that note, what does a custom indexing solution look like in a
  scenario like mine? Does this mean leveraging the provided APIs to create a
  custom solution that polls the SQL server for changes?
2. Should I be modifying our middle tier to also queue up update
  requests as a user modifies data?
3. I've read RabbitMQ can be used for queuing up bulk requests, but
  isn't this just done via another river and therefore has the same concerns
  as above?
Does using the JDBC river scale well as the size of your data grows?
I might have this wrong, but rivers don't allow you to take advantage of
the distributed nature of ES?
The syntax on the JDBC river for creating nested documents in the ES
index. For example I have an PrimaryAddress table and an AddressAlias table
which contains alternate addresses/names for a single PrimaryAddress. Can
this be modeled in an ES index? The number of AddressAliases isn't the same
for each PrimaryAddress

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cbadcb61-a0c0-4907-9182-5075566b8a7d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jprante · March 7, 2014, 10:20pm

Yes, rivers will be deprecated soon. But, and that is the good news,
even without rivers, I will continue to offer a plugin-based solution for
pulling data from external sources, for instance JDBC. The working name of
this project is the "gatherer", which can organize data fetch processes by
delegating "jobs" to gatherer nodes (similar to web crawler jobs).

1.1. From an architectural point of view, it is easier to plug an ES "push"
indexer routine at that tier where data is inserted/updated into the RDBMS,
to execute the action twice, one for RDBMS, second for ES. Beside
performance and scalability aspects, this is much better suited to detect
data anomalies (deletions, updates, or data parts that should not be
indexed). Poll is inferior to push - the number of polling clients is
strictly limited by server capacity while push has no limits at all.

1.2. Not sure why update requests should be something special. If you want
instant indexing or even real-time indexing, you must tweak your middle
tier or RDBMS to do so. You do not need a queue unless you are unsure about
how to organize a store-and-forward process for reindexing data, or reset
the indexer process to a safe commit point.

1.3. Yes, RabbitMQ is just another part in such a push queue. This might
help in case you want to achieve reliability by loosely coupled systems
(RDBMS, queue, ES are all independent) that can be managed separately. If
you don't need persistent streams, you also don't need message queues.

The JDBC river, like all rivers, can run in many instances, distributed
over as many ES nodes as you like. But this is a nightmare to manage. River
failovers in case of node outages are not seamless, data from RDBMS source
might be dropped (this is the motivation for message queues). The JDBC
river can fetch JDBC result rows as fast as possible, even concurrently.
And it uses bulk indexing and distributes the load over all nodes that hold
the index. The point is, a river instance is a singleton, and if this
singleton fails, you're in trouble. There is no coordination between river
instances in polling data or to reset to safe points. Each river instance
stands alone for itself, and that does not scale, it does not fit into the
picture of all other ES components that are aware of their distributed
nature.
Yes, this can be modeled. By using SQL column names with dot notation
and array elements, you can perform hierarchical construction of JSON docs,
very flexible, perfect for denormalization. If you have a concrete example,
I might be able to suggest more.

Jörg

On Fri, Mar 7, 2014 at 4:35 PM, Joshua P jpetersen841@gmail.com wrote:

Hey everyone,

I'm trying to figure out the best way to implement Elasticsearch in our
company.

We currently have a SQL Server that houses all of our data and want to get
some better search and analytics functionality so ES seems like an awesome
option.

I've explored using the JDBC river to index data and I've successfully set
up elasticsearch with some test data.

There are some things I'm not clear on:

The status of rivers in Elasticsearch. Another thread suggested
that they were convenience feature added at the beginning to get data out
of CouchDB and into ES easily. Are they going to be deprecated soon? It was
suggested by on of the ES developers that companies should be coming up
with their own solutions for indexing data.

On that note, what does a custom indexing solution look like in
a scenario like mine? Does this mean leveraging the provided APIs to create
a custom solution that polls the SQL server for changes?

Should I be modifying our middle tier to also queue up update
requests as a user modifies data?

I've read RabbitMQ can be used for queuing up bulk requests, but
isn't this just done via another river and therefore has the same concerns
as above?

Does using the JDBC river scale well as the size of your data
grows? I might have this wrong, but rivers don't allow you to take
advantage of the distributed nature of ES?

The syntax on the JDBC river for creating nested documents in the
ES index. For example I have an PrimaryAddress table and an AddressAlias
table which contains alternate addresses/names for a single PrimaryAddress.
Can this be modeled in an ES index? The number of AddressAliases isn't the
same for each PrimaryAddress

Thanks!

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/cbadcb61-a0c0-4907-9182-5075566b8a7d%40googlegroups.com https://groups.google.com/d/msgid/elasticsearch/cbadcb61-a0c0-4907-9182-5075566b8a7d%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoFQoJvjtzX9%3D8%3Df0h3s3W7QedoCqVUUjacGRLwc9Yyt7A%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Topic		Replies	Views
Newb questions on indexing data from SQL Server and how to setup ES Elasticsearch	4	366	July 6, 2017
What is the recommend way to index data from Web App using SQL Server? Elasticsearch	11	1337	October 24, 2017
Best way to index from SQL since Rivers is deprecated? Elasticsearch	4	727	July 6, 2017
Synchonising Elastic Search with SQL server Elasticsearch	4	437	July 3, 2018
Best practice on getting data out of RDBMS(PostgreSQL)? Elasticsearch	2	441	July 6, 2017

Ideal method of indexing data into ES from SQL Server?

Related topics