I'm looking for strategies people have used to index highly normalized data from MS SQL Server or any relational database. This isn't a problem unique to SQL Server.
My current approach was to write a complex stored procedure. It joined multiple tables and used STUFF to handle one-to-many relationships, collapsing rows from related tables into a single cell.
The results of the stored procedure would get parsed by my application and put into the format I want in my Elasticsearch index. I would use the NEST API and bulk insert the records.
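For concreteness, the shape of that pipeline is roughly the sketch below. This is not my real code: the procedure name, connection string, and column layout are invented, and the NEST calls are the 1.x-era API as I recall it.

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.Linq;
    using Nest;

    class OrderDoc
    {
        public int Id { get; set; }
        public string Customer { get; set; }
        public List<string> Tags { get; set; }  // the one-to-many side, packed into one cell by STUFF
    }

    class SprocIndexer
    {
        static void Main()
        {
            var docs = new List<OrderDoc>();

            using (var conn = new SqlConnection("Server=.;Database=Shop;Integrated Security=true"))
            using (var cmd = new SqlCommand("dbo.GetChangedOrders", conn) { CommandType = CommandType.StoredProcedure })
            {
                // only rows modified since the last run, per the date parameters mentioned above
                cmd.Parameters.AddWithValue("@ModifiedSince", DateTime.UtcNow.AddHours(-1));
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        docs.Add(new OrderDoc
                        {
                            Id = reader.GetInt32(0),
                            Customer = reader.GetString(1),
                            Tags = reader.GetString(2).Split('|').ToList()  // unpack the STUFF'd cell
                        });
                    }
                }
            }

            var client = new ElasticClient(
                new ConnectionSettings(new Uri("http://localhost:9200")).SetDefaultIndex("orders"));
            client.IndexMany(docs);  // one bulk request via NEST
        }
    }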
This is cumbersome and feels inefficient, even running it on a beefy server. The stored procedure takes modified date parameters so I'm not pulling entire tables into memory.
Rivers are deprecated (and I wasn't using the JDBC river in the first place), and they seemed to come with a set of other problems anyway. Logstash doesn't have a SQL Server input yet, and I don't know if one is planned. Also, my Elasticsearch index includes a lot of nested objects, so it's not a flat structure.
I was wondering how other people have solved this problem, or if you have any suggestions for dealing with normalized data. I'd still like to maintain (some of) the relationships I have in my relational database. Nested objects seem to perform better than the parent/child model in Elasticsearch (according to the ES docs), though I haven't run any tests.
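To illustrate the nested-object side of the question, a mapping along these lines is what I have in mind (NEST 1.x syntax as far as I recall it; the types are made up):

    using System;
    using System.Collections.Generic;
    using Nest;

    class Line
    {
        public string Sku { get; set; }
        public int Qty { get; set; }
    }

    class Order
    {
        public int Id { get; set; }
        public List<Line> Lines { get; set; }
    }

    class CreateOrdersIndex
    {
        static void Main()
        {
            var client = new ElasticClient(
                new ConnectionSettings(new Uri("http://localhost:9200")).SetDefaultIndex("orders"));

            // map Lines as nested so each line item is queried as its own sub-document
            client.CreateIndex("orders", c => c
                .AddMapping<Order>(m => m
                    .Properties(p => p
                        .NestedObject<Line>(n => n.Name(o => o.Lines)))));
        }
    }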
Relationship-wise, you really need to try each method for yourself and figure out what is best for your requirements. Sorry I can't give you a direct answer!
FWIW, I wrote the JDBC river exactly for this problem: denormalizing normalized tabular data spread across multiple tables in an Oracle DB.
The river API is deprecated, but the JDBC plugin will live on. In the next version it will no longer use the river API; it will run as a standalone JVM process using the transport client.
We do the lion's share of our indexing from SQL Server, and we have some rich relationships and large docs, as it sounds like you do. The key for us was to find a way to become more event-driven instead of doing heavy ETL over large amounts of data. If there is some logical way for your app to fire events (via triggers, time-based queries, or something similar), you can pick them up in a queue and then index things more row by row as they change; find a way to initiate from the higher-level objects, as in the sketch below. For SQL Server, if you want to go really low level, you can even use the CDC log.
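A rough sketch of the shape I mean, with all names invented: a trigger (or CDC job) writes changed keys into a queue table, and a small worker drains it, rebuilding and indexing one document at a time.

    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.Threading;
    using Nest;

    class OrderDoc
    {
        public int Id { get; set; }
    }

    class ChangeFeedWorker
    {
        static OrderDoc LoadOrder(SqlConnection conn, int id)
        {
            // re-assemble the full document here, starting from the higher-level object
            return new OrderDoc { Id = id };
        }

        static void Main()
        {
            var client = new ElasticClient(
                new ConnectionSettings(new Uri("http://localhost:9200")).SetDefaultIndex("orders"));

            while (true)
            {
                using (var conn = new SqlConnection("Server=.;Database=Shop;Integrated Security=true"))
                {
                    conn.Open();

                    // atomically claim a batch of keys that a trigger queued
                    var ids = new List<int>();
                    var claim = new SqlCommand(
                        "DELETE TOP (100) FROM dbo.IndexQueue OUTPUT DELETED.OrderId", conn);
                    using (var reader = claim.ExecuteReader())
                        while (reader.Read())
                            ids.Add(reader.GetInt32(0));

                    foreach (var id in ids)
                        client.Index(LoadOrder(conn, id), i => i.Id(id.ToString()));
                }

                Thread.Sleep(TimeSpan.FromSeconds(5));  // simple polling interval
            }
        }
    }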
The feeder mode is a pre-release that still contains some river code, so the same plugin zip can be used either as a river or standalone. The next version will drop all the river-related code: no plugin zip any more, only a jar.
I have to index a SQL Server database table that has millions of records. Can someone explain the process for indexing a SQL Server database table?
I tried indexing the SQL Server table through MVC code, but with millions of rows I get a timeout exception, and even 1,000 records take an hour to index. Can you please suggest any sample code? I'm using Elasticsearch 1.5.1.
Thank you for your reply. I downloaded the JDBC importer from the link you suggested, but it's not clear to me how to use it. Can you please suggest any references for using it? I'm currently on Elasticsearch 1.5.1 and want to index millions of records in a SQL Server table. Please suggest suitable documents.
How long does it take to do exactly the same without sending data to Elasticsearch (i.e. just comment out elasticSearchContext.AddUpdateDocument(item, item.Id))?
I would not send the full collection to Elasticsearch in one go when you have 1,000,000 docs; send it every X documents instead, i.e. call SaveChanges() every X docs, as in the sketch below.
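Something like this, reusing the AddUpdateDocument/SaveChanges calls from your snippet (a sketch only; elasticSearchContext and items are your objects, and the batch size is an arbitrary starting point to tune):

    const int batchSize = 1000;
    var pending = 0;

    foreach (var item in items)
    {
        elasticSearchContext.AddUpdateDocument(item, item.Id);
        if (++pending >= batchSize)
        {
            elasticSearchContext.SaveChanges();  // sends one bulk request
            pending = 0;
        }
    }

    if (pending > 0)
        elasticSearchContext.SaveChanges();      // flush the remainder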
go to the unpacked directory (we call it $JDBC_IMPORTER_HOME): cd elasticsearch-jdbc-1.6.0.0
Can you please elaborate on this point? I got stuck here.
if you do not find the JDBC driver jar in the lib directory, download it from your vendor's site and put the driver jar into the lib folder
jdbc-driver-csv-1.0.0 is already there in the lib folder.
modify script in the bin directory to your needs (Elasticsearch cluster address)
Can you please elaborate on this point?
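(For anyone else stuck here: the sample scripts in bin essentially wrap a JSON definition like the sketch below and pipe it into the importer. Every value is a placeholder to adjust; the elasticsearch block is the cluster address this step refers to, and port 9300 is the transport port, since the importer uses the transport client.)

    {
        "type" : "jdbc",
        "jdbc" : {
            "url" : "jdbc:sqlserver://localhost:1433;databaseName=MyDb",
            "user" : "es_user",
            "password" : "secret",
            "sql" : "select id as _id, * from dbo.Orders",
            "elasticsearch" : {
                "cluster" : "elasticsearch",
                "host" : "localhost",
                "port" : 9300
            },
            "index" : "orders",
            "type" : "order"
        }
    }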
run script with a command that starts org.xbib.tools.JDBCImporter with the lib directory on the classpath
Can you please elaborate on this point?
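(And for this step: the tail of each sample script does something along these lines, piping the JSON definition into a plain java invocation with lib/* on the classpath. Check the scripts shipped in bin for the exact form.)

    echo '{ ...definition as above... }' \
        | java -cp "$JDBC_IMPORTER_HOME/lib/*" org.xbib.tools.JDBCImporter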
I'm not sure, but it looks like a Windows environment issue. I do not know Windows at all, so I cannot fully understand the question or offer help, but there are many responses that lead me to the conclusion that unpacking a zip, changing directories, editing text files, and executing shell scripts is too difficult on Windows.
I have also received responses saying that Windows folks using the JDBC importer are not able to find ES addresses on the network and cannot connect to ES, for whatever reason.
This is rather frustrating.
So I think it would be convenient to develop something like a wrapper so that the JDBC importer can be installed like an Elasticsearch plugin, to get the Windows folks started again.
With these few changes, the import from SQL Server worked great. I was able to import and index about 20,000 records in less than a minute. I do agree that a wrapper for the import tool would be very helpful for bringing more Windows users into the Elasticsearch world.