Modeling data that is shared among clients

I've got document as follows:

{
  "CountryeId":213,
  "CountryName":"Malta",
  "IndustryName":"Fresh Food",
  "ProductId":12351,
  "ProductName":"Pork",
  "MeasureTypeId":561,
  "MeasureType":"Market Sizes",
  "Description":"Aggregated sales in a time series by standard data types, per capita and growth.",
  "IsPopular":true,
  "SortOrder":100,
  "ResultTypeId":1
}

There are approximately 2,000,000 documents like that and then we have about 5,000 clients that can have access to any of these documents.

What would be the most appropriate way to return only documents that client has access to?

Common properties between this document and a client are:

  • MeasureTypeId
  • ProductId
  • CountryId

Initially I was thinking to build a string delimited by an underscore to store these properties in a single field, such as : 561_12351_213 and then store such string in client document and do terms lookup queries:

GET /statistic/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "common_property": {
            "index": "clients",
            "type": "statistic",
            "id": "1235143",
            "path": "common_property"
          }
        }
      }
    }
  }
}

But having up to 2,000,000 strings in an array doesn't seem like the best approach so I'm looking for alternatives.

Not sure I follow. Why do you need to store 2,000,000 strings per document if you only have 5,000 clients?

I'll speak better in code. That was my initial idea:

Statistic index:

PUT statistic
{
  "mappings": {
    "statistic": {
      "properties": {
        "StatisticID": {
          "type": "keyword"
        },
        "IndustryName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "ProductId": {
          "type": "integer"
        },
        "ProductName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "CountryCodeId": {
          "type": "integer"
        },
        "CountryName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "MeasureTypeId": {
          "type": "integer"
        },
        "MeasureType": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "IsPopular": {
          "type": "boolean"
        },
        "ResultTypeId": {
          "type": "byte"
        },
        "Description": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "SortOrder": {
          "type": "integer"
        }
      }
    }
  }
}

Clients index:

PUT /clients
{
  "mappings": {
    "statistic": {
      "properties": {
        "StatisticIDs": {
          "type": "keyword"
        }
      }
    }
  }
}

Dummy data:

PUT /statistic/statistic/561_12351_213
{
  "CountryeId": 213,
  "CountryName": "Malta",
  "IndustryName": "Fresh Food",
  "ProductId": 12351,
  "ProductName": "Pork",
  "MeasureTypeId": 561,
  "MeasureType": "Market Sizes",
  "Description": "Aggregated sales in a time series by standard data types, per capita and growth.",
  "IsPopular": true,
  "SortOrder": 100,
  "ResultTypeId": 1,
  "StatisticID": "561_12351_213"
}

PUT /clients/statistic/1235143
{
  "StatisticIDs": [
    "561_12351_213",
    "561_12351_214"
    /* up to 2,000,000 other possible combinations */
  ]
}

Actual search query:

GET /statistic/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "StatisticID": {
            "index": "clients",
            "type": "statistic",
            "id": "1235143", <--- this is client_id
            "path": "StatisticIDs"
          }
        }
      }
    }
  }
}

I wanted to have separate indexes because statistic index is being updated rarely - once in a week, while the client index is updated regularly.

In a relational world it makes sense because you can query this quite efficiently using a simple join between two tables:

SELECT S.*
FROM Elasticsearch.Statistic AS S
INNER JOIN Elasticsearch.Client AS C
    ON C.MeasureTypeId = S.MeasureTypeId
    AND C.ProductId = S.ProductId
    AND C.CountryId = S.CountryId
WHERE C.ClientId = @ClientId;

This would work quite well in Elasticsearch too if the amount of StatisticIDs in client index wouldn't be that big.

This also simplifies the update process because whenever there's a change in client index data i can just rewrite the whole document without having to make explicit checks what was updated.

The whole data is right now managed in SQL Server and I'm aiming to use Kafka stack to read data from SQL Server and put into Kafka topics and then push it to Elasticsearch.

Elasticsearch is not a SQL database. It doesn't support joins. So, if you need to denormalize and potentially have large records that you need to frequently update or you need to perform joins on the client side, which might lead to potentially very large queries, and slow performance.

The only relationship that elasticsearch can handle well at the moment is 1-to-many, which you can model using parent/child mechanism. However, as far as I understand, you have many-to-many relationship between your documents and clients. So, that wouldn't work, unless you can somehow denormalize it into 1-to-many, which again, doesn't seem practical in your case.

1 Like

I totally understand that it's not a relational database and I know what are the limitations of it.

Would it make sense to do it in a parent-child relationship this way?

So instead of having two separate indexes, I just build a single one with child-parent relationship like that:

PUT statistic
{
  "mappings": {
    "clients": {
      "_parent": {
        "type": "statistic"
      },
      "properties": {
        "ClientId": {
          "type": "integer"
        }
      }
    },
    "statistic": {
      "properties": {
        "StatisticID": {
          "type": "keyword"
        },
        "IndustryName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "ProductId": {
          "type": "integer"
        },
        "ProductName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "CountryCodeId": {
          "type": "integer"
        },
        "CountryName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "MeasureTypeId": {
          "type": "integer"
        },
        "MeasureType": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "IsPopular": {
          "type": "boolean"
        },
        "ResultTypeId": {
          "type": "byte"
        },
        "Description": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "SortOrder": {
          "type": "integer"
        }
      }
    }
  }
}

Index data like that:

PUT /statistic/statistic/561_12351_213
{
  "CountryeId": 213,
  "CountryName": "Malta",
  "IndustryName": "Fresh Food",
  "ProductId": 12351,
  "ProductName": "Pork",
  "MeasureTypeId": 561,
  "MeasureType": "Market Sizes",
  "Description": "Aggregated sales in a time series by standard data types, per capita and growth.",
  "IsPopular": true,
  "SortOrder": 100,
  "ResultTypeId": 1
}

PUT statistic/clients/1235143_561_12351_213?parent=561_12351_213
{
  "ClientId": 1235143
}

And finally query it like that:

GET /statistic/statistic/_search
{
  "query": {
    "has_child": {
      "type": "clients",
      "query": {
        "bool": {
          "filter": {
            "term": {
              "ClientId": 1235143
            }
          }
        }
      }
    }
  }
}

This seems to work with this kind of example, but is this a good enough approach?

Yes, it might be a good approach if you will never need to do searches like "find all documents accessible by clients in certain industry. Otherwise, we are going back to the issue of finding all clients first, extracting their ids and then creating a large ids query. It is probably the best approach if you want to minimize indexing load at the expense of search load. But you need to consider that you will have to store as many records as you have relationships, so your search performance might be affected. I would advice testing this model, to make sure that it is performant enough from the search perspective.

I wouldn't need searches like that.

ClientId is something that has always to be passed (a tenant), so this should be a good candidate for routing key too?

And if I would fully denormalize things, I would have to store as many records as there are relationships too, right?

If you want to store clients as children of statistics, you will be using statistic's id (parent's id) as the routing key. One of the limitations of parent/child relationships is that all children need to resides in the same shard as their parent. That's that main reason it can only handle 1-to-many relationships.

I see that!

Is that a performance limitation, so that children/parent look-ups would be done on local shard in order to avoid distributed join?

Yes, we need to have both parents and children on the same shard to make it fast. Otherwise, we would need a lot of shard-to-shard communication to perform a join, making it non-scalable in many cases.

That makes total sense @Igor_Motov. Thank you for all the help!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.