I am working on an event-based dataset where, let's say, there is a central customer index and multiple indices for different event types, say event_1, event_2 and event_3.
I want to search across all the event_* indices and find the customers common to all of them.
I tried a query with a bool should clause and terms aggregations, but I feel that the aggregations would be the limiting factor here.
Is there any other way to query the event indices?
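To make the question concrete, here is a minimal sketch of the kind of terms-aggregation approach I mean (the customer_id field, the Python client usage and the direct event_* index pattern are placeholders, not my exact query):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Bucket by customer across all event indices; the number of buckets is
# what makes this expensive as the data grows.
body = {
    "size": 0,
    "aggs": {
        "customers": {
            "terms": {"field": "customer_id", "size": 10000},
            "aggs": {
                # How many distinct event_* indices each customer appears in.
                "indices": {"cardinality": {"field": "_index"}}
            },
        }
    },
}

resp = es.search(index="event_*", body=body)
# Customers whose events span all three indices are the common ones.
common = [
    b["key"]
    for b in resp["aggregations"]["customers"]["buckets"]
    if b["indices"]["value"] == 3
]
```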
Have you thought about modeling your data differently, so that you can answer such a question with a single query over a single index?
How about including all the customer data in each event, so that all of your data is available in every event? That could simplify your queries a lot!
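For example, each event document would carry an embedded copy of the customer fields (a hypothetical shape, indexed with the Python client):

```python
# Hypothetical denormalized event document: the customer fields travel with
# every event, so one query over a single index pattern can answer the question.
event_doc = {
    "event_type": "event_1",
    "timestamp": "2020-01-01T00:00:00Z",
    "customer": {
        "id": "c-123",
        "name": "Jane Doe",
        # ...rest of the customer JSON, duplicated into each event
    },
}
# es.index(index="event_1", body=event_doc)
```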
Thanks Alex for the suggestion.
The customer data is a huge JSON document, so it would be really expensive to duplicate it for each of the events. There are more than 2000 events per day; appending the customer data to each of them would create a huge dataset!
Any other modelling/querying technique would be really appreciated! Thanks!
I have a similar problem where I need to find ids common to two indices. Currently I solve it by querying ES for all ids in each index (using scrolls) and comparing them. For indices containing millions of documents, my current implementation takes less than a minute and is reasonable in terms of RAM.
Here's the pseudocode (using a map-reduce paradigm); a runnable sketch follows the list:
1. Get 10k elements from the first index, sorted by the field I'm interested in.
2. Get the first and last field value from those 10k.
3. Query the second index for all documents whose values are between the first and the last (taking care to use the same sort in that query as well).
4. Compare both batches. This is done in separate threads to take advantage of multiprocessing.
5. In the main thread, continue from step 1 by scrolling to the next batch of 10k.
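A minimal single-threaded sketch of the above in Python, assuming the elasticsearch-py client and a keyword field called customer_id (both placeholders); the comparison runs inline here rather than in worker threads:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def common_ids(index_a, index_b, field="customer_id", batch=10_000):
    """Return the ids present in both indices, compared batch by batch."""
    common = set()
    # Step 1: scroll through index_a in sorted batches of `batch` ids.
    resp = es.search(
        index=index_a,
        body={"sort": [{field: "asc"}], "_source": [field]},
        scroll="2m",
        size=batch,
    )
    while resp["hits"]["hits"]:
        ids_a = [h["_source"][field] for h in resp["hits"]["hits"]]
        # Step 2: the first and last value bound this batch's range.
        lo, hi = ids_a[0], ids_a[-1]
        # Step 3: fetch every id in index_b inside that range, using the
        # same sort; this side can return far more than `batch` documents,
        # so it is scrolled too.
        ids_b = set()
        sub = es.search(
            index=index_b,
            body={
                "query": {"range": {field: {"gte": lo, "lte": hi}}},
                "sort": [{field: "asc"}],
                "_source": [field],
            },
            scroll="2m",
            size=batch,
        )
        while sub["hits"]["hits"]:
            ids_b.update(h["_source"][field] for h in sub["hits"]["hits"])
            sub = es.scroll(scroll_id=sub["_scroll_id"], scroll="2m")
        # Step 4: compare both batches (inline here, not in a thread).
        common.update(set(ids_a) & ids_b)
        # Step 5: continue with the next batch from index_a.
        resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="2m")
    return common
```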
Thanks DmitriV for sharing your experience with this problem.
Interesting approach.
Are you implementing this approach in Apache Spark or other such big-data processing engines?
If yes, how are you partitioning the data? Are you keeping a fixed size of 10k documents per partition?
Currently it's just raw Python using scroll queries. The first index always returns 10k documents due to the size limit of the scroll request. The other index returns a variable number of documents (currently anywhere between 0 and 300k) depending on the beginning and end positions of the batch.