Hello everyone,
I am working on a project to merge two existing indexes (30-50M records each) into one by joining on a common field. Unfortunately, as I understand it, Elasticsearch does not support SQL-style joins. I have looked into parent-child joins with an alias, but they do not seem to fit my use case because the data already exists in the indexes. (Maybe I don't understand joins well enough; I would appreciate input if I am wrong.)
I would love to hear from the community whether anyone has found a way around this, such as extracting chunks of data from Elasticsearch, processing them, and reinserting the results into Elasticsearch. I tried building an API to do this, but with the number of requests and the amount of data processing needed, it was too much for the system to handle and response times became unbearable.
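To make the extract/process/reinsert idea concrete, here is a rough sketch of the processing step as I imagine it. It assumes both indexes can be read as streams of documents sorted ascending on the join field, so the join never has to hold a whole index in memory. The names `merge_join` and `to_bulk_actions`, and the field `customer_id` in the usage note, are made up for illustration; nothing here talks to Elasticsearch.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right, key):
    """Inner-join two iterables of dicts, both sorted ascending by `key`.

    Yields one merged dict per matching (left, right) pair. On a field-name
    collision the value from the right-hand document wins.
    """
    left_groups = groupby(left, key=itemgetter(key))
    right_groups = groupby(right, key=itemgetter(key))
    done = (None, None)  # sentinel returned when a side is exhausted
    lk, lrows = next(left_groups, done)
    rk, rrows = next(right_groups, done)
    while lrows is not None and rrows is not None:
        if lk < rk:
            lk, lrows = next(left_groups, done)
        elif lk > rk:
            rk, rrows = next(right_groups, done)
        else:
            # Only the rows sharing the current key are buffered.
            right_buffer = list(rrows)
            for l in lrows:
                for r in right_buffer:
                    yield {**l, **r}
            lk, lrows = next(left_groups, done)
            rk, rrows = next(right_groups, done)


def to_bulk_actions(docs, index):
    """Wrap joined documents as index actions for a bulk helper."""
    for doc in docs:
        # No "_id" is set, so Elasticsearch would generate one per document.
        yield {"_index": index, "_source": doc}
```

In practice the two inputs would presumably come from paging through each index sorted on the join field (for example with `search_after`), and the actions would be fed to a bulk helper such as `elasticsearch.helpers.bulk`. For instance, `list(merge_join(a_docs, b_docs, "customer_id"))` would give the merged documents for every `customer_id` present in both streams.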
Would it make sense to pull the data into Hadoop HDFS and use something like Hive or Spark to do the processing I need, then push the results into Elasticsearch using elasticsearch-hadoop?
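For the Spark route, this is roughly what I picture, written as a PySpark sketch. It assumes the elasticsearch-hadoop Spark connector jar is on the Spark classpath and that `spark` is an existing `SparkSession`. It also reads straight from Elasticsearch instead of staging the data in HDFS first, which is a simplification on my part. The function name `join_indexes` and all index and field names are hypothetical.

```python
# Data source name registered by the elasticsearch-hadoop Spark connector.
ES_FORMAT = "org.elasticsearch.spark.sql"


def join_indexes(spark, left_index, right_index, join_field, target_index,
                 es_nodes="localhost"):
    """Read two indexes as DataFrames, inner-join them, write the result back."""

    def read(index):
        return (spark.read.format(ES_FORMAT)
                .option("es.nodes", es_nodes)
                .load(index))

    # Joining with on=<column name> keeps a single copy of the join column.
    # Other columns that share a name in both indexes would need renaming.
    joined = read(left_index).join(read(right_index), on=join_field, how="inner")

    (joined.write.format(ES_FORMAT)
        .option("es.nodes", es_nodes)
        .mode("append")
        .save(target_index))
    return joined
```

A call might look like `join_indexes(spark, "index_a", "index_b", "customer_id", "index_joined")`. I have not run this against indexes of this size, so I cannot say how it behaves at 30-50M records per side.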
Thanks.