Ideas to Normalize and join

Hello everyone,

I am working on a project to integrate two existing indexes (30-50M records each) into one by joining them on a common field. Unfortunately, as I understand it, Elasticsearch does not support SQL-style joins. I have looked into parent-child joins with an alias, but that does not seem to fit my use case, since the data already exists in the indexes (maybe I don't understand enough about joins - I would appreciate input if I am wrong).

I would love to hear from the community whether someone has found a way around this, such as extracting chunks of data from Elasticsearch, processing them, and reinserting the results into Elasticsearch. I tried building an API to do this, but due to the number of requests and the processing required, it was too much for the system to handle and response times became unbearable.
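For what it's worth, here is a minimal sketch of that extract -> enrich -> reinsert idea in plain Python. All names are illustrative: in practice `iter` ation over the child docs would wrap the Elasticsearch scroll or search_after API and each yielded batch would go to the `_bulk` endpoint; the point here is just the streaming lookup-join logic, with the smaller index held in memory as the lookup side.

```python
# Sketch of a chunked lookup join, independent of any client library.
# Assumptions: the smaller index fits in memory as a dict, and both
# indexes share a key field. All names are illustrative.

def build_lookup(parent_docs, key):
    """Index the smaller side of the join in memory by the shared field."""
    return {doc[key]: doc for doc in parent_docs}

def enriched_batches(child_docs, lookup, key, batch_size=2):
    """Stream child docs in batches, merging in the matching parent fields."""
    batch = []
    for doc in child_docs:
        parent = lookup.get(doc[key], {})
        batch.append({**parent, **doc})   # child fields win on conflict
        if len(batch) >= batch_size:
            yield batch                   # in practice: send to _bulk
            batch = []
    if batch:
        yield batch

parents = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Bob"}]
children = [
    {"user_id": 1, "event": "login"},
    {"user_id": 2, "event": "purchase"},
    {"user_id": 1, "event": "logout"},
]

lookup = build_lookup(parents, "user_id")
batches = list(enriched_batches(children, lookup, "user_id"))
```

With 30-50M records per index the lookup side may not fit in memory on one machine, which is exactly where a distributed engine starts to make sense.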

Would it make sense to pull the data into Hadoop HDFS and use something like Hive or Spark to perform the processing I require, then push the result back into Elasticsearch using elasticsearch-hadoop?
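If you do go the Spark route, the elasticsearch-hadoop connector exposes indexes as DataFrames, so the join itself is a single call. A minimal sketch, assuming the connector jar is on the classpath and that the index names (`index_a`, `index_b`, `combined_index`) and key field (`user_id`) stand in for your own; this obviously needs a running cluster to execute:

```python
from pyspark.sql import SparkSession

# Assumes the elasticsearch-hadoop (elasticsearch-spark) jar is on the
# classpath; es.* settings are passed with the "spark." prefix.
spark = (SparkSession.builder
         .appName("join-indexes")
         .config("spark.es.nodes", "localhost")
         .config("spark.es.port", "9200")
         .getOrCreate())

es_fmt = "org.elasticsearch.spark.sql"
a = spark.read.format(es_fmt).load("index_a")   # hypothetical index names
b = spark.read.format(es_fmt).load("index_b")

# Join on the shared field and write the denormalized result to a new index.
joined = a.join(b, on="user_id", how="inner")
(joined.write.format(es_fmt)
    .option("es.mapping.id", "user_id")         # use the key as the doc _id
    .mode("append")
    .save("combined_index"))
```

Setting `es.mapping.id` makes the write idempotent per key, which matters if you re-run the job after either source index changes.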


The best approach is often to denormalize: store the parent data with each child rather than try to mimic relational concepts using parent-child or nested documents. Assuming the parent data is not updated frequently, this takes up a bit more space but gives you simpler and often faster queries. It naturally requires reindexing.
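Concretely, denormalizing means each combined document carries the parent fields inline, so a single flat document answers queries that would otherwise need a join. A small sketch with illustrative field names:

```python
# Denormalization: copy the parent's fields into each child document.
# Field names are illustrative, not from any real schema.
parent = {"customer_id": 42, "customer_name": "Acme", "region": "EMEA"}
orders = [
    {"order_id": "A-1", "customer_id": 42, "total": 99.5},
    {"order_id": "A-2", "customer_id": 42, "total": 12.0},
]

# Each denormalized doc repeats the parent data; queries stay single-index.
denormalized = [{**parent, **order} for order in orders]
```

The trade-off is exactly the one described above: the parent fields are duplicated across every child, so a parent update means reindexing all of its children.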

The issue with this is that the data comes from different sources at different times, so the indexes are updated quite frequently. There is also a need to keep the indexes separate while building a combined one for analytical purposes.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.