Modeling data that is shared among clients

buinauskas_evaldas · October 27, 2017, 11:49am

I've got document as follows:

{
  "CountryeId":213,
  "CountryName":"Malta",
  "IndustryName":"Fresh Food",
  "ProductId":12351,
  "ProductName":"Pork",
  "MeasureTypeId":561,
  "MeasureType":"Market Sizes",
  "Description":"Aggregated sales in a time series by standard data types, per capita and growth.",
  "IsPopular":true,
  "SortOrder":100,
  "ResultTypeId":1
}

There are approximately 2,000,000 documents like that and then we have about 5,000 clients that can have access to any of these documents.

What would be the most appropriate way to return only documents that client has access to?

Common properties between this document and a client are:

MeasureTypeId
ProductId
CountryId

Initially I was thinking to build a string delimited by an underscore to store these properties in a single field, such as : 561_12351_213 and then store such string in client document and do terms lookup queries:

GET /statistic/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "common_property": {
            "index": "clients",
            "type": "statistic",
            "id": "1235143",
            "path": "common_property"
          }
        }
      }
    }
  }
}

But having up to 2,000,000 strings in an array doesn't seem like the best approach so I'm looking for alternatives.

Igor_Motov · October 27, 2017, 11:59pm

Not sure I follow. Why do you need to store 2,000,000 strings per document if you only have 5,000 clients?

buinauskas_evaldas · October 28, 2017, 10:50am

I'll speak better in code. That was my initial idea:

Statistic index:

PUT statistic
{
  "mappings": {
    "statistic": {
      "properties": {
        "StatisticID": {
          "type": "keyword"
        },
        "IndustryName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "ProductId": {
          "type": "integer"
        },
        "ProductName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "CountryCodeId": {
          "type": "integer"
        },
        "CountryName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "MeasureTypeId": {
          "type": "integer"
        },
        "MeasureType": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "IsPopular": {
          "type": "boolean"
        },
        "ResultTypeId": {
          "type": "byte"
        },
        "Description": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "SortOrder": {
          "type": "integer"
        }
      }
    }
  }
}

Clients index:

PUT /clients
{
  "mappings": {
    "statistic": {
      "properties": {
        "StatisticIDs": {
          "type": "keyword"
        }
      }
    }
  }
}

Dummy data:

PUT /statistic/statistic/561_12351_213
{
  "CountryeId": 213,
  "CountryName": "Malta",
  "IndustryName": "Fresh Food",
  "ProductId": 12351,
  "ProductName": "Pork",
  "MeasureTypeId": 561,
  "MeasureType": "Market Sizes",
  "Description": "Aggregated sales in a time series by standard data types, per capita and growth.",
  "IsPopular": true,
  "SortOrder": 100,
  "ResultTypeId": 1,
  "StatisticID": "561_12351_213"
}

PUT /clients/statistic/1235143
{
  "StatisticIDs": [
    "561_12351_213",
    "561_12351_214"
    /* up to 2,000,000 other possible combinations */
  ]
}

Actual search query:

GET /statistic/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "StatisticID": {
            "index": "clients",
            "type": "statistic",
            "id": "1235143", <--- this is client_id
            "path": "StatisticIDs"
          }
        }
      }
    }
  }
}

I wanted to have separate indexes because statistic index is being updated rarely - once in a week, while the client index is updated regularly.

In a relational world it makes sense because you can query this quite efficiently using a simple join between two tables:

SELECT S.*
FROM Elasticsearch.Statistic AS S
INNER JOIN Elasticsearch.Client AS C
    ON C.MeasureTypeId = S.MeasureTypeId
    AND C.ProductId = S.ProductId
    AND C.CountryId = S.CountryId
WHERE C.ClientId = @ClientId;

This would work quite well in Elasticsearch too if the amount of StatisticIDs in client index wouldn't be that big.

This also simplifies the update process because whenever there's a change in client index data i can just rewrite the whole document without having to make explicit checks what was updated.

The whole data is right now managed in SQL Server and I'm aiming to use Kafka stack to read data from SQL Server and put into Kafka topics and then push it to Elasticsearch.

Igor_Motov · October 28, 2017, 11:22am

Elasticsearch is not a SQL database. It doesn't support joins. So, if you need to denormalize and potentially have large records that you need to frequently update or you need to perform joins on the client side, which might lead to potentially very large queries, and slow performance.

The only relationship that elasticsearch can handle well at the moment is 1-to-many, which you can model using parent/child mechanism. However, as far as I understand, you have many-to-many relationship between your documents and clients. So, that wouldn't work, unless you can somehow denormalize it into 1-to-many, which again, doesn't seem practical in your case.

buinauskas_evaldas · October 28, 2017, 11:41am

I totally understand that it's not a relational database and I know what are the limitations of it.

Would it make sense to do it in a parent-child relationship this way?

So instead of having two separate indexes, I just build a single one with child-parent relationship like that:

PUT statistic
{
  "mappings": {
    "clients": {
      "_parent": {
        "type": "statistic"
      },
      "properties": {
        "ClientId": {
          "type": "integer"
        }
      }
    },
    "statistic": {
      "properties": {
        "StatisticID": {
          "type": "keyword"
        },
        "IndustryName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "ProductId": {
          "type": "integer"
        },
        "ProductName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "CountryCodeId": {
          "type": "integer"
        },
        "CountryName": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "MeasureTypeId": {
          "type": "integer"
        },
        "MeasureType": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "IsPopular": {
          "type": "boolean"
        },
        "ResultTypeId": {
          "type": "byte"
        },
        "Description": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "SortOrder": {
          "type": "integer"
        }
      }
    }
  }
}

Index data like that:

PUT /statistic/statistic/561_12351_213
{
  "CountryeId": 213,
  "CountryName": "Malta",
  "IndustryName": "Fresh Food",
  "ProductId": 12351,
  "ProductName": "Pork",
  "MeasureTypeId": 561,
  "MeasureType": "Market Sizes",
  "Description": "Aggregated sales in a time series by standard data types, per capita and growth.",
  "IsPopular": true,
  "SortOrder": 100,
  "ResultTypeId": 1
}

PUT statistic/clients/1235143_561_12351_213?parent=561_12351_213
{
  "ClientId": 1235143
}

And finally query it like that:

GET /statistic/statistic/_search
{
  "query": {
    "has_child": {
      "type": "clients",
      "query": {
        "bool": {
          "filter": {
            "term": {
              "ClientId": 1235143
            }
          }
        }
      }
    }
  }
}

This seems to work with this kind of example, but is this a good enough approach?

Igor_Motov · October 28, 2017, 3:30pm

Yes, it might be a good approach if you will never need to do searches like "find all documents accessible by clients in certain industry. Otherwise, we are going back to the issue of finding all clients first, extracting their ids and then creating a large ids query. It is probably the best approach if you want to minimize indexing load at the expense of search load. But you need to consider that you will have to store as many records as you have relationships, so your search performance might be affected. I would advice testing this model, to make sure that it is performant enough from the search perspective.

buinauskas_evaldas · October 28, 2017, 4:54pm

I wouldn't need searches like that.

ClientId is something that has always to be passed (a tenant), so this should be a good candidate for routing key too?

And if I would fully denormalize things, I would have to store as many records as there are relationships too, right?

Igor_Motov · October 28, 2017, 5:00pm

If you want to store clients as children of statistics, you will be using statistic's id (parent's id) as the routing key. One of the limitations of parent/child relationships is that all children need to resides in the same shard as their parent. That's that main reason it can only handle 1-to-many relationships.

buinauskas_evaldas · October 28, 2017, 5:49pm

I see that!

Is that a performance limitation, so that children/parent look-ups would be done on local shard in order to avoid distributed join?

Igor_Motov · October 28, 2017, 10:19pm

Yes, we need to have both parents and children on the same shard to make it fast. Otherwise, we would need a lot of shard-to-shard communication to perform a join, making it non-scalable in many cases.

buinauskas_evaldas · October 29, 2017, 7:57am

That makes total sense @Igor_Motov. Thank you for all the help!

system · November 26, 2017, 7:57am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Performance of Elasticsearch for number of documents vs number of properties in a document Elasticsearch	4	515	September 3, 2021
Is it worse to have very large documents/indices with a ton of data, or split that data into smaller documents/indices using a relational paradigm? Elasticsearch	2	431	January 31, 2022
Index design for large lists of document references Elasticsearch	1	432	August 3, 2020
Multi tenancy with varying document mappings Elasticsearch	4	320	February 6, 2019
Advice on Index and Cluster Structure? Elasticsearch	4	1023	July 5, 2017

Modeling data that is shared among clients

Related topics