Using child documents as links between parent documents


(Péter Jaloveczki) #1

Hi,

We have a bit of a dilemma about which way to go with our design, and we were wondering what the Elasticsearch community would recommend.

The high level model is something like this:

We have entity A that is represented in the database and in the index as well.

Using a particular process we resolve LINKS to this entity of type A. These links are not present in the database; they are the result of some heuristics and searches. The links should be stored in the index, as their cardinality is essential. Let's assume that in this context a link represents an entity of type B.

So one possibility would be to store the list of B entities in the A entity's mapping as a list of nested elements. The problem with this is that we might want to re-index entity A. Since the links (B) are not in the database, they would get erased from the Elasticsearch document. I would assume this solution is not viable in this scenario.
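To illustrate the concern, here is a toy Python sketch. It uses a plain dictionary as a stand-in for the index, not Elasticsearch itself, and the ids and field names (`a-1`, `title`, `links`, `target`) are made up for illustration. It assumes that re-indexing A means sending the whole document again, built only from the database row.

```python
# Toy in-memory stand-in for an index: document id -> document source.
# This is NOT Elasticsearch; it only mimics "indexing replaces the whole document".
index = {}

def index_document(doc_id, source):
    # Indexing under an existing id replaces the stored document entirely.
    index[doc_id] = dict(source)

# A is first stored together with its resolved links (hypothetical fields).
index_document("a-1", {"title": "Some A", "links": [{"target": "a-7"}]})

# Re-index A from the database row, which knows nothing about the links.
index_document("a-1", {"title": "Some A (updated)"})

links_after = index["a-1"].get("links", [])
print(links_after)
```

After the second call the `links` list is gone, which is the reason for keeping the links in separate documents.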

The second possibility would be using a parent/child relationship. In this case A would be the parent and B would be the child.

This can also go in two ways:

  1. B would represent ALL links that have been resolved for A. B would have a nested element that contains the data of the links, and would have aggregate values as well, like the total number of links. So the relationship between A and B would be 1-1, potentially with the same identifier. When I re-index A from the database, its links would remain intact. In this case adding a new link might only be possible by running a SCRIPT, I believe. I've tested this and it works, but in this particular article:
    http://code972.com/blog/2015/03/84-elasticsearch-one-tip-a-day-avoid-costly-scripts-at-all-costs
    it is not considered wise to use scripts all that much because of performance issues. The first time I tested this it was indeed slow, but it got faster on later runs; I am not sure if this is relevant.
  2. B would only represent a single link, so the relationship between A and B would be 1..n, n being the number of links of course. In this case, if a new link is resolved you would not need scripting; you would only insert a new B document with the data of a single link.
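As a rough sketch of way 2, the Python below builds a bulk request body that indexes one B document per resolved link. It only produces the request text and does not talk to a cluster. The `_parent` key in the action line is how I understand the bulk API for parent/child types in the 1.x/2.x releases, so treat it as an assumption and check it against your version; the parent id `a-42` and the `target` field are hypothetical.

```python
import json

def child_link_actions(parent_id, links, index="my_index", child_type="B"):
    """Build bulk-API lines that index one B document per resolved link."""
    lines = []
    for link in links:
        # Action line: the parent id ties the child to its A document
        # (assumed "_parent" metadata key, see the note above).
        action = {"index": {"_index": index, "_type": child_type, "_parent": parent_id}}
        lines.append(json.dumps(action, sort_keys=True))
        # Source line: the data of a single link (hypothetical fields).
        lines.append(json.dumps(link, sort_keys=True))
    # The bulk body is newline-delimited and ends with a newline.
    return "\n".join(lines) + "\n"

body = child_link_actions("a-42", [{"target": "a-7"}, {"target": "a-99"}])
print(body)
```

Adding a newly resolved link is then a plain index operation with no script involved, at the price of one extra document per link.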

Going with way 2 might not be a trivial decision, as we have 15 million documents of type A in the index, and each document of type A can potentially have 10-15 resolved links (documents of type B).
So for solution 1: 15 million A, and 15 million B
For solution 2: 15 million A, 150 million+ B
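The arithmetic behind these numbers, as a small Python sketch. It assumes every A document has links and that the 10-15 range applies to all of them, which makes the figures for solution 2 an upper-end estimate.

```python
PARENTS = 15_000_000          # documents of type A
LINKS_PER_PARENT = (10, 15)   # resolved links per A, taken from the range above

def doc_counts(parents, links_range):
    low, high = links_range
    return {
        # Solution 1: one B document per A, holding all of its links.
        "solution_1": {"A": parents, "B": parents},
        # Solution 2: one B document per link.
        "solution_2": {"A": parents, "B_min": parents * low, "B_max": parents * high},
    }

counts = doc_counts(PARENTS, LINKS_PER_PARENT)
print(counts)
```

Under these assumptions solution 2 ends up with between 150 million and 225 million B documents.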

Also, there is one more important design input: the number of resolved links plays a big role when searching for type A. Going with solution 1, where document B has a field that holds the number of nested elements (the number of resolved LINKS), using that number might seem trivial because it is a 1-1 relationship. However, again, based on this bug report:


This is not really possible. However, I was able to put together a little workaround for this particular issue by using script_score. Here is an example:

GET my_index/A/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": {
              "query": "What is the best solution?",
              "boost": 1
            }
          }
        },
        {
          "has_child": {
            "type": "B",
            "score_type": "max",
            "query": {
              "function_score": {
                "boost": 100,
                "script_score": {
                  "script": "doc['B.link_count'].value"
                }
              }
            }
          }
        }
      ]
    }
  }
}

I'd like to know what the community thinks about our current considerations, whether they are well founded at this point, and whether there is a better way to model this relationship based on these simple requirements. Of course, if you need more input or something clarified, I'd be happy to help.

Thank you very much,
Peter

