Indexing Data In Bulk From SQL Server


(Joshua P) #1

I'm looking for strategies people have used to index highly normalized data from MS SQL Server or any relational database. This isn't a problem unique to SQL Server.

My current approach was to write a complex stored procedure. It joined multiple tables and used STUFF to handle one-to-many relationships, collapsing rows from other tables into a single cell.

The results of the stored procedure would get parsed by my application and put into the format I want in my Elasticsearch index. I would use the NEST API and bulk insert the records.

This is cumbersome and feels inefficient, even running it on a beefy server. The stored procedure takes modified date parameters so I'm not pulling entire tables into memory.

Rivers are deprecated (and I wasn't using the JDBC river in the first place); they seemed to come with their own set of problems anyway. Logstash doesn't have a SQL Server input yet, and I don't know if one is planned. My Elasticsearch index includes a lot of nested objects, so it's not a flat structure.

I was wondering how other people have solved this problem, or if you have any suggestions for dealing with normalized data. I'd still like to maintain (some of) the relationships I have in my relational database. Nested objects seem to perform better than the parent/child model in Elasticsearch (according to the ES docs), though I haven't run any tests.
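To make the nested-object part concrete, the shape of mapping I'm talking about looks something like this (the index, type, and field names here are just illustrative):

```shell
# Illustrative only: a 1.x mapping with a nested "orders" field.
nested_mapping() {
  cat <<'EOF'
{
  "mappings": {
    "customer": {
      "properties": {
        "name": { "type": "string" },
        "orders": {
          "type": "nested",
          "properties": {
            "sku": { "type": "string" },
            "total": { "type": "double" }
          }
        }
      }
    }
  }
}
EOF
}

nested_mapping
# Sent at index-creation time, e.g.:
#   nested_mapping | curl -s -XPUT 'http://localhost:9200/customers' -d @-
```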

Thoughts?


(Mark Walkom) #2

There is a JDBC plugin in the works - https://github.com/logstash-plugins/logstash-input-jdbc

Relationship wise, you really need to try each method for yourself and figure out what is best for your requirements. Sorry I can't give you a direct answer!


(Jörg Prante) #3

FWIW, I wrote the JDBC river exactly for this problem: denormalizing normalized tabular data spread across multiple tables in an Oracle DB.

The river API is deprecated, but the JDBC plugin will continue to live on. The next version will no longer use the river API; it will run as a standalone JVM process using the transport client.


(Joshua P) #4

@jprante Is that the "feeder" configuration described in your documentation?


(John Bush) #5

We do the lion's share of our indexing from SQL Server, and we have rich relationships and large docs, as it sounds like you do. The key for us was to become more event driven rather than doing heavy ETL over large amounts of data. If there is some logical way for your app to fire events (via triggers, time-based queries, or something similar), you can pick them up in a queue and index things row by row as they change, initiating from the higher-level objects. For SQL Server, if you want to go really low level, you can even use the CDC log.

https://msdn.microsoft.com/en-us/cc645937.aspx


(Jörg Prante) #6

The feeder mode is a pre-release; it still contains some river code so the same plugin zip can be used either as a river or standalone. The next version will drop all the river-related code: no plugin zip any more, only a jar.


(Vani Aravinda) #7

Hi,

I have to index a SQL Server database table that has millions of records. Can someone walk me through the process of indexing a SQL Server table?

Thanks,
Vani Aravinda


(David Pilato) #8

@vaniaravinda Is there anything unclear with what @jprante or @warkolm wrote?

But if your question is "how can I do it without using the JDBC river or Logstash?", then I'd say:

  • read entities from your database
  • convert them to JSON
  • push them to elasticsearch using _bulk

I wrote a blog post about it here: http://david.pilato.fr/blog/2015/05/02/devoxx-france-2015/ Have a look at "Our CRM database" section.
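The _bulk format itself is just newline-delimited JSON: one action line, then one source line per document. A minimal shell sketch (the `customers` index, type, and field names are made-up examples):

```shell
#!/bin/sh
# Build an NDJSON _bulk payload: one action line plus one source line
# per document. Index/type names are placeholders for illustration.
bulk_payload() {
  # reads "id<TAB>name" pairs on stdin
  while IFS="$(printf '\t')" read -r id name; do
    printf '{"index":{"_index":"customers","_type":"customer","_id":"%s"}}\n' "$id"
    printf '{"name":"%s"}\n' "$name"
  done
}

printf '1\tAlice\n2\tBob\n' | bulk_payload
# The payload would then be POSTed to Elasticsearch, e.g.:
#   ... | curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary @-
```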


(Jörg Prante) #9

@vaniaravinda please do not hijack other people's threads.

You can try JDBC importer for your task https://github.com/jprante/elasticsearch-jdbc


(Vani Aravinda) #10

Hi @dadoonet,

I tried indexing the SQL Server table through MVC code, but with millions of rows I get a timeout exception, and it takes 1 hour to index 1000 records. Can you please suggest sample code? I'm using Elasticsearch 1.5.1.

Thanks,
Vani Aravinda


(Vani Aravinda) #11

Hi @jprante,

Thank you for your reply. I downloaded the JDBC importer from the link you suggested, but I'm not clear on how to use it. Can you please point me to references for using it? I'm currently on Elasticsearch 1.5.1 and want to index millions of records in a SQL Server table. Please suggest suitable documents.

Thanks,
Vani Aravinda


(David Pilato) #12

What did not work?
What is your code?


(Vani Aravinda) #13

Hi @dadoonet,

I'm using ElasticsearchCRUD for indexing the SQL Server table. Please find sample code below.

IElasticsearchMappingResolver elasticsearchMappingResolver = new ElasticsearchMappingResolver();
using (var elasticSearchContext = new ElasticsearchContext("http://localhost:9200/", new ElasticsearchSerializerConfiguration(elasticsearchMappingResolver, true, true)))
{
    if (!elasticSearchContext.IndexTypeExists<Emp>())
    {
        elasticSearchContext.TraceProvider = new ConsoleTraceProvider();
        using (var databaseEfModel = new EfModel())
        {
            int pointer = 0;
            const int interval = 20;
            bool firstRun = true;
            int length = databaseEfModel.Emp.Count();

            while (pointer < length)
            {
                _stopwatch.Start();
                // Page through the table rather than loading it all at once
                var collection = databaseEfModel.Emp.OrderBy(t => t.Id)
                    .Skip(pointer)
                    .Take(interval)
                    .ToList<Emp>();
                _stopwatch.Stop();
                Console.WriteLine("Time taken for select {0} Emp: {1}", interval, _stopwatch.Elapsed);
                _stopwatch.Reset();

                _stopwatch.Start();
                foreach (var item in collection)
                {
                    elasticSearchContext.AddUpdateDocument(item, item.Id);
                }

                if (firstRun)
                {
                    elasticSearchContext.SaveChangesAndInitMappings();
                    firstRun = false;
                }
                else
                {
                    elasticSearchContext.SaveChanges();
                }

                pointer += interval;
            }
        }
    }
}

It takes 1 hour to complete indexing 1000 records, and after processing a few more records it throws a timeout exception as well.

Thanks,
Vani Aravinda


(David Pilato) #14

I don't know this language.

That said:

  • How long does it take to do exactly the same thing without sending data to Elasticsearch (just comment out elasticSearchContext.AddUpdateDocument(item, item.Id))?
  • I would not send the full collection to Elasticsearch in one go; if you have 1,000,000 docs, call SaveChanges() every X docs instead.

(Jörg Prante) #15

Did you notice the README at https://github.com/jprante/elasticsearch-jdbc? There you can learn how to index documents from JDBC sources.

I strongly suggest Elasticsearch 1.6.0 and JDBC importer 1.6.0.0


(Vani Aravinda) #16

Hi @jprante,

Thank you for your reply. Can you please help me with points 3, 5, and 6 below?

  1. download the JDBC importer distribution
    a. wget http://xbib.org/repository/org/xbib/elasticsearch/importer/elasticsearch-jdbc/1.6.0.0/elasticsearch-jdbc-1.6.0.0-dist.zip
    Completed
  2. unpack
    unzip elasticsearch-jdbc-1.6.0.0-dist.zip
    Completed
  3. go to the unpacked directory (we call it $JDBC_IMPORTER_HOME) cd elasticsearch-jdbc-1.6.0.0
    Can you please elaborate this point. Got stuck here.
  4. if you do not find the JDBC driver jar in the lib directory, download it from your vendor's site and put the driver jar into the lib folder
    The jdbc-driver-csv-1.0.0 jar is already in the lib folder.
  5. modify script in the bin directory to your needs (Elasticsearch cluster address)
    Can you please elaborate this point.
  6. run script with a command that starts org.xbib.tools.JDBCImporter with the lib directory on the classpath
    Can you please elaborate this point.

Thanks,
Vani Aravinda


(Jörg Prante) #17

I'm not sure, but it looks like a Windows environment issue. I don't know Windows at all, so I cannot fully understand the question or offer help, but many responses lead me to the conclusion that unpacking a zip, changing directories, editing text files, and executing shell scripts is too difficult on Windows.

I also receive responses that Windows folks using the JDBC importer are unable to find ES addresses on the network and cannot connect to ES, for whatever reason.

This is rather frustrating.

So I think it would be convenient to develop something like a wrapper so the JDBC importer can be installed like an Elasticsearch plugin, to get the Windows folks started again.
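In the meantime, here is roughly what the bin script boils down to, based on the README: a JSON "definition" piped into the importer's main class. Every value below (the SQL Server URL, credentials, SQL statement, and index/type names) is a placeholder you must adapt:

```shell
# Sketch only: a JDBC importer definition, per the project README.
# All connection values are placeholders for your environment.
definition='{
  "type": "jdbc",
  "jdbc": {
    "url": "jdbc:sqlserver://localhost:1433;databaseName=mydb",
    "user": "myuser",
    "password": "mypassword",
    "sql": "select * from emp",
    "index": "myindex",
    "type": "mytype"
  }
}'

echo "$definition"
# Step 6 of the README then amounts to something like:
#   echo "$definition" | java -cp "elasticsearch-jdbc-1.6.0.0/lib/*" \
#       org.xbib.tools.JDBCImporter
```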


(Aih) #18

I encountered a number of issues when importing from MS SQL Server, and wrote up the workarounds in this post: http://blog.tabulaw.com/2015/08/elasticsearch-and-ms-sql-server-would.html. I also collected some additions to your README.md instructions in a gist: https://gist.github.com/aih/f0370479669e3003b2a4

With these few changes, the import from SQL Server worked great: I was able to import and index about 20,000 records in less than a minute. I agree that a wrapper for the import tool would be very helpful to bring more Windows users into the Elasticsearch world.


(Takemoa) #19

Another Java tool for importing SQL DB data is https://github.com/takemoa/sqldb2es (Disclaimer: I am the author).


(Jörg Prante) #20

@takemoa wow, nice project, what is the license of the code?