Question about query and update performance for large arrays

Hello,
I have a question about the query and update performance of arrays that contain a large number of items (scalar values and/or objects).

We store products in Elasticsearch, and each product can have multiple prices because the products come from different suppliers, so we need to store a price per supplier. Each product can therefore have hundreds to low thousands of prices.
The prices change quite often (several times a day). We need to be able to filter on these prices (gte, lte), so we index them as integers (an array of integers).

We have been thinking about how to index this:

  1. The first option is to store only the array of prices, e.g.

"prices": [999, 1000, 1099, 1129, ...]

But here we don't see an easy way to update a price once a supplier changes it - we would have to fetch all the prices from the database and index the whole array again.

  2. The second option is to store the prices as objects with the supplier ID, e.g.
"prices": [
  {
    "supplierId": "<uuid1>",
    "price": 999
  },
  {
    "supplierId": "<uuid2>",
    "price": 1000
  },
  {
    "supplierId": "<uuid3>",
    "price": 1099
  }
]

We wanted to avoid indexing the prices as nested, since the prices field will already be inside a nested object (the product variant).
We understand that if this field is not indexed as nested, the objects will be flattened and the association between supplierId and price will be lost:

"prices.supplierId": ["<uuid1>", "<uuid2>", "<uuid3>", ...],
"prices.price": [999, 1000, 1099, ...]

Is there any way to update the data (one specific price by supplier ID) with good performance?
Or is there a more appropriate approach for these cases?

Thank you so much for your answers and your time!


Elasticsearch stores data in immutable segments, so when you index or update a document the full document is reindexed. In-place updates are not possible, so even a small change causes the full, potentially large, document to be reindexed.

If you are using nested mappings there is even more overhead, as each nested document is stored as a separate document behind the scenes. This means that updating a single subdocument in a document with 1000 subdocuments causes 1001 documents to be reindexed.
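
To make the nested case concrete, a nested mapping for your prices would look something like this (the index name is illustrative; the field names are taken from your example):

PUT products
{
  "mappings": {
    "properties": {
      "prices": {
        "type": "nested",
        "properties": {
          "supplierId": { "type": "keyword" },
          "price": { "type": "integer" }
        }
      }
    }
  }
}

With this mapping a product with 1000 prices is stored as 1001 Lucene documents behind the scenes, and all of them are rewritten whenever any one of them changes.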

If updates are frequent and this is an issue, there are two options you may want to consider and test.

  1. Denormalise your model and store each price together with the product data in a separate document. This should give you smaller documents for frequent updates. Updates to the common product data will however require a lot of documents to be updated, but this may be an acceptable tradeoff if they are rare. As a lot of data is duplicated it may also take up more space on disk, but since there are a lot of common fields it may compress quite well, so this may be acceptable. See the first sketch after this list.
  2. You can create documents containing the product data and store individual prices in separate documents linked to the product by a join field. This removes the data duplication but does make queries more complex and slower. All updates will be limited to a single small document, so they are efficient. I have seen this be a good option where the common data is very large and queries can tolerate the additional overhead and latency associated with the join feature. See the second sketch after this list.
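
A sketch of the first option, where each price is a separate document carrying a copy of the common product data (index name, field names, and values are illustrative):

PUT product_prices/_doc/product-123_supplier-uuid1
{
  "productId": "product-123",
  "productName": "Example product",
  "supplierId": "<uuid1>",
  "price": 999
}

GET product_prices/_search
{
  "query": {
    "range": { "price": { "gte": 900, "lte": 1100 } }
  }
}

A price change then only rewrites one small document, at the cost of repeating the product fields in every price document.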
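And a sketch of the second option with a join field (again, all names are illustrative). Note that a child document must be indexed with the same routing value as its parent so they end up in the same shard:

PUT products
{
  "mappings": {
    "properties": {
      "product_price": {
        "type": "join",
        "relations": { "product": "price" }
      }
    }
  }
}

PUT products/_doc/product-123
{
  "name": "Example product",
  "product_price": { "name": "product" }
}

PUT products/_doc/price-supplier-uuid1?routing=product-123
{
  "supplierId": "<uuid1>",
  "price": 999,
  "product_price": { "name": "price", "parent": "product-123" }
}

GET products/_search
{
  "query": {
    "has_child": {
      "type": "price",
      "query": { "range": { "price": { "lte": 1000 } } }
    }
  }
}

Updating a price is then a reindex of just that one small child document.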

Adding to @Christian_Dahlqvist's excellent answer: you will soon be able to use the ES|QL LOOKUP JOIN once 8.18 is out.
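
A rough sketch of what that could look like (syntax as described in the ES|QL documentation; index and field names are illustrative, and the lookup index has to be created with index.mode set to lookup):

FROM prices
| LOOKUP JOIN products_lookup ON productId
| WHERE price <= 1000

This would let you keep the frequently updated prices and the common product data in separate indices and join them at query time.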


Thanks for your answers!