I'm currently trying to find a way to efficiently migrate all my data from an MSSQL database to Elasticsearch.
Right now, my data in MSSQL is normalized across 4 different tables; however, I'm planning on using the following logic to denormalize it.
Let's assume I have 4 different tables that all share a foreign key. Now, if I have 4 different jobs that each push the data from one table to my ES index, I can end up with completely denormalized documents.
For example, consider an ES index with a type called product:
Table 1 ==> Job1 inserts or updates the document with id 1 (4 fields), version 1
Table 2 ==> Job2 updates the same document with id 1 (now 4 + 5 = 9 fields), version 2
Table 3 ==> Job3 updates the same document with id 1 (now 9 + 3 = 12 fields), version 3
Table 4 ==> Job4 updates the same document with id 1 (now 12 + 6 = 18 fields), version 4
In the end, I will have made 4 updates to the same document.
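Concretely, each job would send something along these lines (a rough sketch using the Python client; the index, type and field names are just placeholders):

```
# Each job sends only its own table's columns as a partial update,
# with doc_as_upsert so whichever job runs first creates the document.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def push_rows(rows):
    """rows: iterable of (product_id, partial_fields) from one table's job."""
    actions = (
        {
            "_op_type": "update",
            "_index": "products",
            "_type": "product",      # doc types still existed in this ES era
            "_id": product_id,
            "doc": partial_fields,   # only this table's columns
            "doc_as_upsert": True,   # create the document if it isn't there yet
        }
        for product_id, partial_fields in rows
    )
    helpers.bulk(es, actions)

# e.g. Job1 pushing Table 1's four columns for product 1:
push_rows([(1, {"name": "Widget", "sku": "W-1", "price": 9.99, "active": True})])
```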
My question now is: will this drastically affect my performance if I do a bulk UPSERT?
Is this recommended, or would I be better off just joining the tables before pushing the data to ES?
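For comparison, the join-first alternative would look roughly like this sketch (connection string, table and column names are placeholders):

```
# Join the four tables in SQL Server first and index each fully built row once.
import pyodbc
from elasticsearch import Elasticsearch, helpers

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbhost;DATABASE=MyDb;Trusted_Connection=yes"
)
es = Elasticsearch("http://localhost:9200")

sql = """
SELECT p.ProductId, p.Name, p.Sku, d.Description, s.Quantity, v.VendorName
FROM Products p
JOIN ProductDetails d ON d.ProductId = p.ProductId
JOIN Stock s          ON s.ProductId = p.ProductId
JOIN Vendors v        ON v.VendorId  = p.VendorId
"""

cursor = conn.cursor()
cursor.execute(sql)
columns = [col[0] for col in cursor.description]

actions = (
    {
        "_op_type": "index",
        "_index": "products",
        "_type": "product",
        "_id": row[0],                       # ProductId as the document id
        "_source": dict(zip(columns, row)),  # one complete document per row
    }
    for row in cursor
)
helpers.bulk(es, actions)
```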
I use the JDBC River plugin, which has been deprecated, but it works slick. There is a newer JDBC Importer and JDBC Feeder, but those only run on Linux right now and haven't made it to Windows as far as I know. If you have Linux servers you should be able to make a go of them.
I can make stored proc calls or run straight-up SQL with the JDBC river, and it does the indexing very quickly: 800 rows/second for me.
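For reference, registering the river is just a PUT of a _meta document, something like the sketch below. I'm going from memory and the exact option names vary by plugin version, so treat it as a rough outline with made-up names and credentials:

```
import requests

river = {
    "type": "jdbc",
    "jdbc": {
        "url": "jdbc:sqlserver://dbhost:1433;databaseName=MyDb",
        "user": "es_reader",
        "password": "secret",
        "sql": "EXEC dbo.GetProductsForIndex",  # a stored proc or a plain SELECT
    },
}
requests.put("http://localhost:9200/_river/product_river/_meta", json=river)
```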
All indexing in ES is upserting if you maintain the ID. You don't even have to think about it as an upsert; just index it. Or if you use the rivers you don't have to do it at all.
You do... want to create a mapping if you can. Otherwise fields like Email Address will be tokenized and won't be searchable with a standard search.
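For example, a mapping along these lines (ES 1.x-style string types; the field and index names are just illustrative) keeps the email address from being tokenized:

```
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.put_mapping(
    index="products",
    doc_type="product",
    body={
        "product": {
            "properties": {
                "email_address": {"type": "string", "index": "not_analyzed"},
                "name": {"type": "string"},  # analyzed, for full-text search
            }
        }
    },
)
```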
In the end, the best way to go about it is to think about what you want your search results to look like. Once the data is in ES you can't really do joins, so if you want to retrieve all the data by getting one document, it is best to denormalize with joins like you said and use the JDBC Feeder / River / Importer mentioned above to bring it in. You could also use Logstash to do the same, but it has a bit more of a learning curve.
You can also do it like I have done and have separate types in the index, one for each equivalent table.
Basically I just have a globally searchable database, which is great for my use case.
This doesn't search very well, but it's easy to see the data. What I've done is pseudo-denormalize: take the human-readable keys and add them as an array on the foreign key fields. So FKID will look like "ABC123LONGID","BobSmith" in the document, which gets the documents for me in search. I did this by declaring the same name twice in the river's SQL statement, e.g. select id as FKID, name as FKID, and it handled it.
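To illustrate the trick (names invented, and this is just how the river behaved for me): aliasing two columns to the same name makes them come through as an array on one field in the document.

```
sql = """
SELECT p.ProductId AS _id,
       v.VendorId  AS FKID,
       v.Name      AS FKID   -- same alias twice ==> array in the document
FROM   Products p
JOIN   Vendors  v ON v.VendorId = p.VendorId
"""

# Expected shape of the resulting field in the indexed document:
expected_fkid = ["ABC123LONGID", "BobSmith"]
```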
How many documents are you throwing at it and how wide? This may make a difference to you.
It seems you didn't understand my question. What I want to know is whether it's okay to partially update a document multiple times (say 4) in order to build up the full document.
Also, I would be doing this from 4 different jobs via the bulk indexing API.
Think of updates as deletes followed by inserts. It works more like PostgreSQL than InnoDB: there are tombstones instead of rollback chains. So updates have a cost. It doesn't matter whether you are adding new fields or not.
If it's simpler for your application it is still worth trying. Some things to think about:
Updates to documents that were recently indexed are cheaper than updates to documents that were indexed longer ago, so squash those updates together as close in time as you can.
Updates to old documents are fine if you have to do them. Just never, ever call _optimize. It'll make updates to old documents a much more painful proposition, because the old versions will be tombstoned in larger segments and so less likely to be cleaned up.
You won't see the real performance impact of this until you have lots of documents. 10,000 documents is still a toy example sadly.
It may be useful to perform the updates in such a way that you can detect whether they are noops. If you set ctx.op to noop then there is no tombstone, etc. (see the sketch just below).
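A sketch of the noop idea (all names are placeholders; for plain partial-doc updates the detect_noop flag covers the same ground as setting ctx.op in a script):

```
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.update(
    index="products",
    doc_type="product",
    id=1,
    body={
        "doc": {"stock": 42},
        "detect_noop": True,  # skip the write entirely if the merged doc is unchanged
    },
)
```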
You can get around the "updates create tombstones" thing if you use nested documents. I don't personally have any experience running them in production but they should work fine. They weren't intended to solve this problem but I believe they do.
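For completeness, a minimal nested mapping sketch (ES 1.x-style string types; all names invented). Each entry under "attributes" is stored as its own hidden Lucene document alongside the parent:

```
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="products_nested",
    body={
        "mappings": {
            "product": {
                "properties": {
                    "name": {"type": "string"},
                    "attributes": {
                        "type": "nested",
                        "properties": {
                            "key": {"type": "string", "index": "not_analyzed"},
                            "value": {"type": "string"},
                        },
                    },
                }
            }
        }
    },
)
```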