Aggregate from one doc table to select items from another

Consider the following, I have a ratings and posts index.

The ratings doc is structured like:

    {
         "postId": {
             "type": "text",
             "fields": {
             "keyword": {
               "type": "keyword",
               "ignore_above": 256
             }
           },
           "analyzer": "keyword",
           "fielddata": true
        },
        "value": {
            "type": "long"
        }
    }

The posts doc is structured like:

    {
         "postId": {
             "type": "text",
             "fields": {
             "keyword": {
               "type": "keyword",
               "ignore_above": 256
             }
           },
           "analyzer": "keyword",
           "fielddata": true
        },
        ... other fields
    }

Is it possible in a single query to aggregate all the ratings, sort them by highest value field and then return the posts documents that are associated with each of those values?

I've needed to do this for awhile now but have always just done extra steps to add the aggregated value field to the posts document whenever it gets updated. This doesn't seem to scale the best and it would be wonderful if I could accomplish this in a single query.

I'm using ES v6.6.2

Thanks!

I guess you are trying to denormalize the data that is already inside the elasticsearch in two different indices and doing an SQL equivalent of something like below:

SELECT 
posts.*, 
ratings.rating_value

From ratings 
LEFT OUTER JOIN posts 
ON ratings.post_id = posts.post_id 
ORDER BY ratings.value DESC; 

The problem, as documented in the Elastic blog below, is the limited SQL JOIN capability which can only be done on fields of data type nested in one index which does not seem to be in your case.

Even their latest docs seem to limit the JOIN type for a single index for nested and parent/child relationships
https://www.elastic.co/guide/en/elasticsearch/reference/current/joining-queries.html

I think expectation/understanding here is that JOINs tends to shuffle data between nodes and Elastisearch wants to discourage that as that potentially interferes with cluster performance so your best bet could be to:

  1. Denormalize your data outside of Elasticsearch and then do the indexing. Outside would mean not to rely on using Logstash's Elasticsearch filter plugin because yes as you mentioned the look ups become expensive as your data volume grows. I have run into this situation too.
  2. Have your posts and ratings all live in a single index with a child/parent nested relationship which could help you do the JOIN since there won't be any shuffling then.

I am curious what the Elasticsearch experts have to say on this topic.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.