Hi all, I am fairly new to Elasticsearch. We have a large amount of data in multiple databases (mainly MySQL). I need to move it to Elasticsearch using either a parent/child approach or a nested approach. The requirement is to be able to search on parent or child attributes and get back the parent or child documents. I am also comparing custom and dynamic mappings.
Could you help me with modelling this? Any help is appreciated. Thanks in advance.
I'd totally forget about the existing model and just think about the usage to build the right "search" objects for my use case.
Here I can't really answer as I don't know the use case.
Basically I'd recommend asking yourself 2 questions:
What kind of objects do my users want to get back as a response? If it's object X, then just index object X.
Which attributes do my users typically want to search on? Let's say I need attributes a, b and c: just index those attributes within object X, whatever the original source of those attributes is.
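To make the second question concrete, here is a minimal sketch of assembling object X from records that live in different sources. The function name, the record contents and the "first source wins" rule are all assumptions made for illustration; nothing here is an Elasticsearch API.

```python
def build_search_doc(doc_id, sources, wanted):
    """Merge records from several sources into one flat search document.

    Only the attributes listed in `wanted` are kept. If two sources carry
    the same attribute, the first source wins (an assumed rule).
    """
    doc = {"id": doc_id}
    for record in sources:
        for key in wanted:
            if key in record and key not in doc:
                doc[key] = record[key]
    return doc


# Hypothetical records for the same entity coming from two services.
from_service_1 = {"a": "red", "internal_flag": True}
from_service_2 = {"b": 42, "c": "2020-01-01", "a": "ignored duplicate"}

object_x = build_search_doc("x-1", [from_service_1, from_service_2], ["a", "b", "c"])
print(object_x)
```

The resulting dictionary is what you would send to Elasticsearch as the document body; attributes nobody searches on (here `internal_flag`) never reach the index.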
@kartheek91 Thanks for the response. I have multiple databases as sources because we have multiple microservices. Is there any option to load the data, do some merging and then push it to Elasticsearch?
I am planning to forget the existing model and go with one that would scale well with Elasticsearch.
Here is my problem in detail. Say I have four categories of entities: E1, E2, E3 and E4.
E1: moderate count
E2: depends on E1; large amount of data
E3: add-ons to E1; moderate count
E4: depends on E2; very large amount of data
(q1) Each of these has its own screen where users search for it, so I think all of them need to be separate indices. I am not sure parent/child would work very well, so I am thinking of nesting at the moment.
(q2) Also, for moving the data into Elasticsearch I am considering writing an adapter where I could format the data into a model that works well for querying in Elasticsearch.
Am I on the right track here? Please correct me if I am wrong.
In case I am not clear, I can elaborate on the entities in the real-life use case.
Nested and parent/child answer different use cases. I'd use parent/child only if updating a parent value means reindexing so many children that it's not fast enough, so that it's better to reindex only the parent.
When I'm doing a local demo on my laptop, I can index 12k+ documents per second. So (re)indexing 1m documents takes only a minute or two.
In that case, if I know that the worst case would be to reindex, let's say, 100k documents, or even 1m, I'm fine with not using parent/child and I'm ready to pay the price of that delay of a minute or two...
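The back-of-the-envelope arithmetic behind that estimate can be written down as a small sketch. The 12,000 documents per second figure is the laptop demo number quoted above; the helper name is made up, and real throughput depends on hardware, document size and mapping.

```python
def reindex_seconds(doc_count, docs_per_second=12_000):
    """Rough time to (re)index doc_count documents at a steady rate."""
    return doc_count / docs_per_second


for count in (100_000, 1_000_000):
    print(f"{count:>9,} docs: about {reindex_seconds(count):.0f} s")
```

If the number you get for your own worst case is acceptable, duplicating parent data into the children is probably cheaper to live with than parent/child.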
(q2) Also, for moving the data into Elasticsearch I am considering writing an adapter where I could format the data into a model that works well for querying in Elasticsearch
Yes. IMHO that's a good approach. You can optimize a lot of things by creating a data structure dedicated to search, for instance by computing fields at index time instead of asking Elasticsearch to compute them at search time.
For example, if you have an invoice with a list of individual items, each with its own price, it could be a good idea to index the sum of all the individual items instead of computing it at search time.
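A minimal sketch of that invoice example, as it might look inside the adapter. The field names (`invoice_id`, `items`, `price_cents`, `total_cents`) are assumptions for illustration, and prices are kept as integer cents to avoid floating-point rounding.

```python
def invoice_to_search_doc(invoice):
    """Return the document to index, with the total computed up front."""
    items = invoice["items"]
    return {
        "invoice_id": invoice["invoice_id"],
        "items": items,
        # Computed once at index time, so searches can filter or sort
        # on it directly instead of summing the items per query.
        "total_cents": sum(item["price_cents"] for item in items),
    }


# Hypothetical invoice as it might come out of the source database.
invoice = {
    "invoice_id": "inv-1",
    "items": [
        {"label": "base plan", "price_cents": 1000},
        {"label": "extra storage", "price_cents": 1500},
    ],
}

doc = invoice_to_search_doc(invoice)
print(doc["total_cents"])
```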
In my case we have Plans (E1), Subscriptions (E2), Addons (E3) and Invoices (E4). Plans are the basic entities that users can opt in to. Once a user opts in, a subscription is created, and it can have addons on top of the subscribed plan. At the end of every cycle an invoice is generated. Users can search for any of these entities. Say we are on the subscriptions list screen: the user can search for any subscription using the details of the subscription, plan or addons. Similarly for the others as well.
What I was referring to earlier was more like nesting the plan and addon details into the subscription itself, apart from having separate indices for plans and addons.
If you need to search for Plans, Subscriptions, Addons and Invoices, then index 4 different objects: Plans, Subscriptions, Addons and Invoices.
If an invoice contains elements from Plans, Subscriptions and Addons, then put all those elements within the Subscription object. In other words, duplicate the data.
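As a rough sketch of what "duplicate data" means here, the adapter could copy the plan and addon details into each subscription document before indexing it. All names and fields below are hypothetical, and the lookup dictionaries stand in for whatever the source databases return.

```python
def build_subscription_doc(subscription, plans_by_id, addons_by_id):
    """Copy plan and addon details into the subscription document."""
    plan = plans_by_id[subscription["plan_id"]]
    addons = [addons_by_id[addon_id] for addon_id in subscription.get("addon_ids", [])]
    return {
        "subscription_id": subscription["id"],
        "customer": subscription["customer"],
        # Duplicated from the Plans source so that a subscription can be
        # found by plan attributes without any join at search time.
        "plan": {"id": plan["id"], "name": plan["name"]},
        # Duplicated from the Addons source for the same reason.
        "addons": [{"id": a["id"], "name": a["name"]} for a in addons],
    }


# Hypothetical source data.
plans_by_id = {"p1": {"id": "p1", "name": "Gold"}}
addons_by_id = {
    "a1": {"id": "a1", "name": "Extra seats"},
    "a2": {"id": "a2", "name": "Priority support"},
}
subscription = {"id": "s1", "customer": "ACME", "plan_id": "p1", "addon_ids": ["a1", "a2"]}

sub_doc = build_subscription_doc(subscription, plans_by_id, addons_by_id)
print(sub_doc)
```

Plans and Addons would still be indexed as their own documents for their own screens. Whether `addons` should be mapped as a nested field depends on whether a query has to match several fields of the same addon together.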