We are building a cloud application whose data lives in an RDBMS that has to scale well. We are trying to move the searchable relational data into Elasticsearch (ES), a problem that, judging by what I've seen online, many people face. It turned out that we had to use nested objects (documents), and the nesting goes about 4 levels deep in almost every document.
The ES docs say that nesting should be used only in special cases. We also cannot find good references on how to query nested data properly. We plan to use elasticsearch-dsl for Python, but could not find anything about searching nested data there either. Does this high-level Python client have less community support? What does its future look like, and should we use it in production?
So my question is: is it a good idea to use nesting in ES or not?
If not, can we use the RDBMS to resolve the relations, build query objects from them, and send those to ES only to search documents (RDBMS + ES together)?
In case you are curious what our data looks like, it is similar to this document structure:
Elasticsearch is a document store, which requires you to change how you model your data when you move from a relational database. Trying to mimic a relational structure using parent-child or nested documents is generally not the way to approach this, as they are not a replacement for joins. Instead, think about what entities you want to search for and denormalize your data into documents that match them. This will result in a flat model that is generally a lot easier to query.
This means data will be duplicated, and if you make changes at the higher levels of the hierarchy you will need to update multiple documents. This is, however, often a reasonable tradeoff: doing a bit more work at index or update time in exchange for faster and simpler queries is generally worth it if updates are infrequent.
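As a rough illustration of the denormalization described above, here is a minimal Python sketch. The customer/order/item hierarchy and every field name in it are hypothetical (they are not taken from the poster's data); the point is only that each searchable entity becomes one flat document that carries copies of its parents' fields.

```python
# Hypothetical 3-level hierarchy: customer -> orders -> items.
# All names and values here are made up for illustration.
customer = {
    "id": 1,
    "name": "Acme",
    "orders": [
        {"id": 10, "date": "2019-01-05", "items": [
            {"sku": "A-1", "quantity": 2},
            {"sku": "B-7", "quantity": 1},
        ]},
        {"id": 11, "date": "2019-02-11", "items": [
            {"sku": "A-1", "quantity": 5},
        ]},
    ],
}


def denormalize(customer):
    """Yield one flat document per order item, copying parent fields into it."""
    for order in customer.get("orders", []):
        for item in order.get("items", []):
            yield {
                # Parent fields are duplicated into every item document.
                "customer_id": customer["id"],
                "customer_name": customer["name"],
                "order_id": order["id"],
                "order_date": order["date"],
                # Fields of the entity that is actually being searched.
                "item_sku": item["sku"],
                "item_quantity": item["quantity"],
            }


flat_docs = list(denormalize(customer))
```

Each resulting dictionary can be indexed as its own document. The cost is visible too: renaming the customer means rewriting every document in `flat_docs`, which is the update-time tradeoff mentioned above.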
Okay, sounds right, thanks! So we are now on the way to denormalizing and flattening our data. As for the duplication, wouldn't the index store only a reference to the textual data in the documents? I think that would reduce the duplication.
In our documents, we will need to query related fields that are spread across multiple documents and are too numerous to duplicate, which may force us to query multiple documents. Any suggestions for querying multiple documents at the same time efficiently? I mean, would it be better to query twice or to perform the join inside ES (putting it all in one query)?
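For reference, the "query twice" option could look roughly like the sketch below. It only builds the two request bodies as plain Python dictionaries using the standard `term` and `terms` queries; the field names (`category`, `product_id`) and the sample hits are hypothetical, and sending the bodies to a cluster is left out.

```python
def first_query(field, value, key_field):
    """Body for search 1: match on one field, return only the join key."""
    return {
        "_source": [key_field],
        "query": {"term": {field: value}},
    }


def second_query(key_field, keys):
    """Body for search 2: filter the other index by the collected keys."""
    return {
        "query": {
            "bool": {
                "filter": [{"terms": {key_field: sorted(set(keys))}}],
            }
        }
    }


# Hypothetical hits, shaped like the "hits.hits" array of a search response.
hits = [
    {"_source": {"product_id": 7}},
    {"_source": {"product_id": 3}},
    {"_source": {"product_id": 7}},
]

body_1 = first_query("category", "books", "product_id")
keys = [hit["_source"]["product_id"] for hit in hits]
body_2 = second_query("product_id", keys)
```

The join then happens in the application between the two searches, so the cost grows with the number of keys returned by the first one.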
I can’t tell as I do not know your data. Making recommendations based on simplified sample data can often lead to important aspects being missed or overlooked.