We have a use case where we need to query and join two database tables in a 1-to-many relationship, then upload and index the resulting documents (records) to Elasticsearch every 3 minutes for use by a Kibana dashboard.
The total number of documents (records) that need to be indexed is ~240,000 every three minutes.
My question is:
Where is the best place to do this join pre-processing? In Logstash using the JDBC input plugin, or in a C# application that runs the query, performs the join, converts the result set to JSON, and then uses NEST to bulk upload the JSON to Elasticsearch?
Hi John,
If the documents share a common field, e.g. "customer_id", one way to do an efficient join purely in Elasticsearch is to use a sorted scroll across the two indices, sorting on that shared field so that related documents from both indices arrive next to each other in the stream.
A bespoke client script would process this stream, fusing docs that share a common customer_id into a single document, which can then be added using a bulk index operation.
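A minimal sketch of that flow, assuming the official Python client (elasticsearch-py), two existing indices named "customers" and "orders", a sortable (keyword or numeric) shared field "customer_id", and a target index "customer_orders" — all of these names are placeholders, not anything from your setup:

```python
from collections import defaultdict
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

# Sorted scroll across both indices: sorting on the shared field means a
# customer and its orders arrive next to each other in the stream.
stream = helpers.scan(
    es,
    index="customers,orders",           # placeholder index names
    query={"sort": ["customer_id"]},    # assumed sortable field
    preserve_order=True,                # keep this sort instead of _doc order
)

# Fuse docs that share a customer_id into one combined document.
fused = defaultdict(lambda: {"orders": []})
for hit in stream:
    src = hit["_source"]
    key = src["customer_id"]
    if hit["_index"] == "customers":
        fused[key].update(src)
    else:
        fused[key]["orders"].append(src)

# Bulk index the fused documents into the index Kibana will read.
helpers.bulk(
    es,
    (
        {"_index": "customer_orders", "_id": cid, "_source": doc}
        for cid, doc in fused.items()
    ),
)
```

The fused documents are what the Kibana dashboard would then query, rather than the two raw indices.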
Hi Mark, Thank you for your response. I have to admit I am new to Elasticsearch, so I apologize in advance for any silly statements/questions.
Yes, I have a common field.
Using your example, what I am trying to do is combine 10,000 customers who have 25 orders each, for a total of about 250,000 records (documents). The data currently exists in two database tables, which are queried every three minutes. I need to refresh (re-index) all 250,000 records every 3 minutes (or less if possible), because the data in the orders is changing.
I am looking for the best way to bulk upload this data to Elasticsearch.
It will probably involve writing some code using your favourite choice of language. We have client side libraries for a wide variety of languages that will call the APIs I mentioned.
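For instance, here is a hedged sketch with the Python client (the same idea applies to NEST from C#). It assumes your SQL query has already produced one joined record per customer; the helper used (helpers.parallel_bulk), the chunk size, the thread count, and the "customer_orders" index name are illustrative choices, not a prescription:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder cluster address

def to_actions(joined_records):
    # joined_records: an iterable of dicts, one per customer, produced by
    # your SQL join, with that customer's orders nested under "orders".
    for record in joined_records:
        yield {
            "_index": "customer_orders",   # placeholder target index
            "_id": record["customer_id"],
            "_source": record,
        }

def refresh(joined_records):
    # parallel_bulk streams the ~250,000 documents in chunks over several
    # threads, which helps the whole refresh fit inside the 3-minute window.
    for ok, item in helpers.parallel_bulk(
        es, to_actions(joined_records), chunk_size=1000, thread_count=4
    ):
        if not ok:
            print("failed to index:", item)
```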