Search across multiple indices for the same field

Hello all,

I am working on an event-based dataset where, let's say, there is a central customer index and multiple indices for different event types, say event_1, event_2, event_3.

I want to search across all the event_* indices and find the customers common to all of them.

I tried a query with a bool should clause and a terms aggregation; however, I suspect the aggregation would be the limiting factor here.

Is there any other way to query the event indices?

E.g.:
Customer index:
{
  "id": 1,
  "Name": "XYZ"
}

Event_1:
{
  "event1_val": "ABC",
  "id": 1
},
{
  "event1_val": "PQR",
  "id": 2
}

Event_2:
{
  "event2_val": "PQR",
  "id": 1
},
{
  "event2_val": "PQR",
  "id": 2
}

Let's say I want to get all the customer ids which have an event_1 value of ABC and an event_2 value of PQR.

Currently I am doing this with the following query:

{
  "size":0,
  "query":{
    "bool":{
      "should": [
        {
          "term": {
            "event1_val": {
              "value": "ABC"
            }
          }
        },
        {
          "term": {
            "event2_val": {
              "value": "PQR"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "customers": {
      "terms": {
        "field": "id",
        "min_doc_count": 2
      }
    }
  }
}
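
For reference, I run this against all the event indices at once with a wildcard pattern; roughly like this with the Python client (a sketch, assuming a local cluster):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

query = {
    "size": 0,
    "query": {
        "bool": {
            "should": [
                {"term": {"event1_val": {"value": "ABC"}}},
                {"term": {"event2_val": {"value": "PQR"}}},
            ]
        }
    },
    # Caveat: min_doc_count counts matching documents per id, so two
    # ABC events from event_1 alone would also reach the threshold of 2.
    "aggs": {"customers": {"terms": {"field": "id", "min_doc_count": 2}}},
}

# The event_* wildcard fans the request out over every event index.
resp = es.search(index="event_*", body=query)
for bucket in resp["aggregations"]["customers"]["buckets"]:
    print(bucket["key"])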

Is there any other way I can find common ids among the indices?

Have you thought about modeling your data differently, so that you can answer such a question with a single query over a single index?

How about including all the customer data in each event so that all of your data is available in every event? It seems that could possibly simplify your queries a lot!
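
For instance, an event document could carry an embedded copy of the customer, so one query over one index answers customer questions. A rough sketch with the Python client (index and field names purely illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Each event embeds the customer fields it needs.
es.index(index="events", body={
    "event_type": "event_1",
    "event1_val": "ABC",
    "customer": {"id": 1, "Name": "XYZ"},
})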

--Alex

Thanks, Alex, for the suggestion.
The customer data is a huge JSON document, so it would be really expensive to duplicate it for each of the events. There are more than 2000 events per day; appending the customer data to each of them would create a huge dataset!
Any other modelling/querying technique would be really appreciated. Thanks!

Hey,

You may want to check out the join datatype then.
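
With a join field, the customer becomes the parent and each event type a child in a single index, so both conditions can be expressed with has_child queries. A rough sketch (index, field, and relation names are illustrative), using the Python client:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# One index holds customers (parents) and events (children).
es.indices.create(index="customers_events", body={
    "mappings": {
        "properties": {
            "relation": {
                "type": "join",
                "relations": {"customer": ["event_1", "event_2"]},
            },
            "event1_val": {"type": "keyword"},
            "event2_val": {"type": "keyword"},
        }
    }
})

# Parent document.
es.index(index="customers_events", id="1",
         body={"Name": "XYZ", "relation": "customer"})

# Children must be routed to the parent's shard.
es.index(index="customers_events", routing="1",
         body={"event1_val": "ABC",
               "relation": {"name": "event_1", "parent": "1"}})
es.index(index="customers_events", routing="1",
         body={"event2_val": "PQR",
               "relation": {"name": "event_2", "parent": "1"}})

# Customers that have BOTH a matching event_1 child and event_2 child.
resp = es.search(index="customers_events", body={
    "query": {
        "bool": {
            "must": [
                {"has_child": {"type": "event_1",
                               "query": {"term": {"event1_val": "ABC"}}}},
                {"has_child": {"type": "event_2",
                               "query": {"term": {"event2_val": "PQR"}}}},
            ]
        }
    }
})

The usual caveats apply: parents and children must live on the same shard, and has_child queries are comparatively expensive.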

--Alex

I have a similar problem where I need to find ids common to two indices. Currently I solve it by querying ES for all ids in each index (using scrolls) and comparing them. For indices containing millions of documents my current implementation takes less than a minute and is reasonable in terms of RAM.

Here's the pseudocode (using a map-reduce paradigm); a runnable sketch follows the list:

  • Get 10k elements from the first index, sorted by the field I'm interested in.
  • Get the first and last field value from those 10k.
  • Query the second index for all documents whose values are between the first and the last (taking care to use the same sort in the query as well).
  • Compare both batches. This is done in separate threads to take advantage of multiprocessing.
  • In the main thread, continue from step 1 by scrolling to the next batch of 10k.
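
A minimal single-threaded Python sketch of the same idea (the batch comparison is where the worker threads would go; client setup and names are assumptions):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local cluster
BATCH = 10_000  # scroll page size from the steps above

def common_ids(index_a, index_b, field="id"):
    """Yield values of `field` that appear in both indices."""
    resp = es.search(
        index=index_a,
        scroll="2m",
        body={"size": BATCH, "sort": [{field: "asc"}], "_source": [field]},
    )
    while resp["hits"]["hits"]:
        ids_a = {hit["_source"][field] for hit in resp["hits"]["hits"]}
        lo, hi = min(ids_a), max(ids_a)
        # Every doc in index_b whose field falls in this batch's range;
        # the count is unbounded, so stream it with a scan helper.
        ids_b = {
            hit["_source"][field]
            for hit in helpers.scan(
                es,
                index=index_b,
                query={
                    "query": {"range": {field: {"gte": lo, "lte": hi}}},
                    "_source": [field],
                },
            )
        }
        # Set intersection makes the batch comparison order-independent.
        yield from ids_a & ids_b
        resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="2m")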

Thanks! I had that in mind, but it seemed computationally expensive. Will try it to see the actual performance.

Thanks, DmitriV, for sharing your experience with this problem.
Interesting approach.
Are you implementing this approach in Apache Spark or another such big data processing engine?
If yes, how are you partitioning the data? Are you keeping a fixed size of 10k documents per partition?

Currently it's just raw Python using scroll queries. The first index always returns 10k documents due to the size limit of the scroll request. The other index returns a variable number of documents (currently anywhere between 0 and 300k) depending on the beginning and end positions of the batch.

Okay. Need to try this approach. Thanks for this!
