My scenario is that I have an existing index "emp_data" in Elasticsearch with 5 fields, one of which is employee_name. I have to add two new fields, "department" and "company", describing where each employee belongs. The "emp_data" index is large, around 20 million documents, and it keeps getting updated on an hourly basis.
I have a small CSV file, a few KB in size, containing the "employee_name", "company", and "department" information.
I can load this file into a separate index, say "corp_data", in Elasticsearch.
My question is: what is the best way in Elasticsearch to join emp_data (the big dataset) with corp_data (the very small dataset)?
My restriction is that I can't do the join outside ES, because emp_data is loaded by an application I don't have access to. I can only do the join after emp_data is already populated.
Previously, in a SQL database, I used an ETL job to join these two datasets and populate all 7 fields into a new table, "emp_record", separate from emp_data and corp_data.
You cannot join two indices together in Elasticsearch, as it does not work like a SQL database. What you can do instead is add those fields to your existing documents. One possible approach would be to use the Update API to update parts of a document.
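For illustration, here is a minimal sketch of that approach with the Python client (elasticsearch-py 8.x assumed). It walks the small CSV and issues one Update By Query call per employee to set the two new fields on the matching emp_data documents; the host, file name, and the assumption that employee_name is mapped as a keyword field are mine, not from the thread.

```python
import csv
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# The corp_data CSV is only a few KB, so it can be walked row by row;
# each row drives one Update By Query call against emp_data.
with open("corp_data.csv", newline="") as f:  # hypothetical file name
    for row in csv.DictReader(f):
        es.update_by_query(
            index="emp_data",
            # assumes employee_name is a keyword field; use
            # "employee_name.keyword" if it is text with a keyword sub-field
            query={"term": {"employee_name": row["employee_name"]}},
            script={
                "source": (
                    "ctx._source.department = params.dept;"
                    "ctx._source.company = params.co"
                ),
                "params": {"dept": row["department"], "co": row["company"]},
            },
            conflicts="proceed",  # tolerate the hourly loader touching the same docs
        )
```

Because emp_data keeps receiving new documents every hour, a job like this would have to be re-run (or scheduled) after each load so the newly arrived documents also get the two fields.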
But here I am not sure which is more suitable in ES from a performance point of view, because I will be handling on the order of 50 million documents when querying and sorting them.
Is it better
to store all fields in a single document and then query?
OR
to join the two documents at query time with the terms lookup mechanism (sketched below)?
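To make the second option concrete: a terms lookup query filters one index using a list of values stored in a single document of another index, so it only works if corp_data is reshaped so that, for example, each department document carries the array of its employees' names. A hedged sketch with the same Python client, with the document id and array field invented for illustration:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Terms lookup: pull the list of employee names out of one corp_data
# document and use it as the set of terms to filter emp_data with.
resp = es.search(
    index="emp_data",
    query={
        "terms": {
            "employee_name": {            # field to filter on in emp_data
                "index": "corp_data",
                "id": "dept-finance",      # hypothetical per-department document
                "path": "employee_names",  # hypothetical array field in that document
            }
        }
    },
    sort=[{"employee_name": "asc"}],       # assumes a keyword mapping for sorting
)
print(resp["hits"]["total"])
```

Note that a terms lookup only filters emp_data by values taken from corp_data; it does not copy department or company into the hits, so sorting or aggregating on those two fields still requires them to be stored on the emp_data documents themselves.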
Another point: is it possible to transform an index into another index within Elasticsearch using Logstash? I mean, can I read both indices stored in Elasticsearch (the ones to be joined) into Logstash, join them there, and store the result in a new index that has fields from both?
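Logstash can do this kind of pass: the elasticsearch input plugin reads emp_data, the elasticsearch filter plugin looks up each event in corp_data, and the elasticsearch output plugin writes the merged events to a new index. Purely to illustrate the same join done client-side, here is a sketch with the Python helpers (index names taken from the question, everything else assumed):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# 1) corp_data is tiny: load it entirely into an in-memory lookup table
#    keyed by employee_name.
lookup = {
    hit["_source"]["employee_name"]: hit["_source"]
    for hit in helpers.scan(es, index="corp_data")
}

# 2) Stream every emp_data document, attach department/company from the
#    lookup table, and retarget it at a new emp_record index.
def merged_docs():
    for hit in helpers.scan(es, index="emp_data"):
        doc = hit["_source"]
        corp = lookup.get(doc.get("employee_name"), {})
        doc["department"] = corp.get("department")
        doc["company"] = corp.get("company")
        yield {"_index": "emp_record", "_id": hit["_id"], "_source": doc}

# 3) Bulk-index the merged stream.
helpers.bulk(es, merged_docs())
```

Like the update approach above, this would need to be repeated (or run incrementally) as emp_data is reloaded every hour.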