Issues in scaling Elasticsearch: version conflicts

I will describe my problem as simply as possible. We have an index which stores product details, but there are two primary keys: one is the product ID (unique for every product) and the other is the city ID (we need the city ID because we store prices at the city level).

Document structure: {ProductId, cityId, specs, price}

Because of this structure, search is very fast, but problems arise when updating the index. We use a queue for our update events, and we cannot process it in parallel because of version conflicts (we tried: while specs were updating, a simultaneous price update came in and resulted in a conflict). Because of this we have to use a single queue, which contains around a million events and takes a very long time to consume, and it gets worse as more daily events are added. Also, the specs data is replicated for each city.

We were thinking of restructuring this into two indices, one for cities and one for products, and joining the results using LINQ. But search would not be as fast as before.

Does anyone have a better solution, or any suggestions for this approach? Any suggestions are appreciated.

We tried parallel queue updates, but they resulted in conflicts and messages ending up in DLQs (dead-letter queues).

Assuming your documents have a unique ID, e.g. product_id, you should be able to use a number of parallel queues, but it requires that you always send all updates for a single document to the same queue. This way you serialize the updates to every document and get the same behaviour as if you used only a single queue. You can do this e.g. by hashing the product_id and using that to determine the queue to send it to.
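The routing described above can be sketched as follows (a minimal sketch; the queue count and helper name are assumptions, not from the thread):

```python
import hashlib

NUM_QUEUES = 8  # assumption: any fixed number of parallel consumers


def queue_for(product_id: str, num_queues: int = NUM_QUEUES) -> int:
    """Map a product_id to a stable queue index, so all updates for
    the same product are serialized on the same queue."""
    # A stable hash (md5 here, rather than Python's per-process hash())
    # guarantees the same product always lands on the same queue,
    # even across restarts and across producer processes.
    digest = hashlib.md5(product_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_queues
```

Because the mapping is deterministic, two updates to the same product can never race each other across queues, while updates to different products still run in parallel.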

If you need to address documents by two different unique IDs, it may make sense to have separate indices so you can use each unique ID as the document ID. I am not sure I understand exactly how you are currently structuring and querying your data, so additional details and examples would be useful.

I will explain the issue with a simplified example.

Suppose there are 2 cars and 2 cities, so there will be 4 documents:

{
  "CarId": 1,
  "Mileage": 200,
  "CityId": 11,
  "Price": 50000
}

{
  "CarId": 1,
  "Mileage": 200,
  "CityId": 22,
  "Price": 60000
}

{
  "CarId": 2,
  "Mileage": 400,
  "CityId": 11,
  "Price": 70000
}

{
  "CarId": 2,
  "Mileage": 400,
  "CityId": 22,
  "Price": 80000
}

The price of a car varies city by city because of taxes, but the specs are the same for all places.

So we use async update-by-query: if someone updates the mileage of car 1, we use update by query to update the mileage wherever the car ID is 1.
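Such a spec update could look roughly like this (a sketch that only builds the update-by-query request body; the field and index names follow the example above, and the helper name is hypothetical):

```python
def build_spec_update(car_id: int, field: str, value) -> dict:
    """Build an update-by-query body that sets one spec field on
    every city document belonging to the given car."""
    return {
        # Match all city-level copies of this car.
        "query": {"term": {"CarId": car_id}},
        # Painless script that writes the new value into each match.
        "script": {
            "lang": "painless",
            "source": f"ctx._source.{field} = params.value",
            "params": {"value": value},
        },
    }


# Example: someone updates the mileage of car 1 to 210.
body = build_spec_update(1, "Mileage", 210)
```

Note that this single request touches one document per city, which is exactly the write amplification being discussed below.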

Another group takes care of prices; they use a daily job to update them. Here the query would be: update the price of car 1 in city 11.

There are around 500k docs in the real data, and this whole process goes through a single queue. We tried multiple queues, updating specs from one and prices from another, but we ran into conflicts when someone updated the price of car 1 while someone from the specs team updated a spec of the same car.

The catch is that this index is better for searching, since searching happens on a single index. We were not sure whether splitting the index and using a multi-search query, or sequential calls joined with LINQ, would make our search slow.

I know specs don't change very frequently, but mileage is just an example; there are parameters which change daily, such as car popularity, etc.

It sounds like the unique id of a document is a combination of car id and city id, so this is what I would use as document id. This will allow you to directly update documents if you know the car id and the city id and not have to rely on update by query which is a lot less efficient and adds overhead. If you want to update all documents related to a specific car id and do not know the city ids it exists in you can still use update by query, although I would try to avoid this if possible.
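A minimal sketch of the composite-ID idea (the separator and helper names are arbitrary choices for illustration):

```python
def doc_id(car_id: int, city_id: int) -> str:
    """Derive a deterministic document _id from car and city, so a
    price update can address its document directly instead of
    going through update-by-query."""
    return f"{car_id}_{city_id}"


def build_price_update(car_id: int, city_id: int, price: int) -> dict:
    """Partial-document update addressed by _id: only the Price
    field changes, the spec fields are left untouched."""
    return {"_id": doc_id(car_id, city_id), "doc": {"Price": price}}
```

With this scheme the daily price job becomes a stream of direct `_update` (or bulk `update`) calls keyed by `_id`, and only spec changes ever need the broader update-by-query.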

I would further make sure that all updates from all teams are written to a message queue, e.g. Kafka. This can have multiple partitions to increase throughput and parallelism, but all changes related to a specific car ID (irrespective of city) need to always go to the same partition. This allows you to perform unrelated updates in parallel but limits the risk of version conflicts.
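The keying rule above can be sketched as follows (a sketch, assuming Kafka with its default behaviour of partitioning by message key; the topic and helper names are hypothetical):

```python
import json


def make_kafka_record(car_id: int, event: dict) -> tuple[bytes, bytes]:
    """Key the message by car_id so that a key-based partitioner sends
    all updates for one car (regardless of city) to the same partition,
    where a single consumer applies them in order."""
    key = str(car_id).encode("utf-8")
    value = json.dumps(event).encode("utf-8")
    return key, value


# A producer would then send it as, e.g.:
#   producer.send("product-updates", key=key, value=value)
# (topic name is an assumption)
key, value = make_kafka_record(1, {"CityId": 22, "Price": 60000})
```

The important property is that the key contains only the car ID, not the city ID: price and spec updates for the same car are forced onto one partition and cannot conflict, while different cars spread across partitions for throughput.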

Correct, this is one way to solve the problem: whenever I receive a spec update, I fetch all the city IDs and push the events city-wise into the queue. That would solve the version-conflict issue, but the reason we were thinking of splitting the indices is data duplication: the specs data is the same for a car across all cities, and we have around 15 spec fields, some of which are arrays, i.e. nested specs. In this case, what do you suggest?

  • Solve the version-conflict issue by keying on city and car ID, at the cost of data duplication
  • Solve the data-duplication issue by splitting the indices, at the cost of search time
  • Or any third way?

Assuming you serve more queries than updates, it is very common to optimize for search speed in Elasticsearch, and this is generally done by denormalising data the way I described. This allows for faster and simpler queries and allows query load to scale well. This does often take up more disk space and make updates more expensive, but if you serve a lot more queries than you perform updates that is generally the best tradeoff.

If you want to reduce disk usage and optimise updates you will generally have to instead pay the price at query time, and for use cases with infrequent queries where longer latencies are acceptable this can be the right tradeoff.


Thank you :slight_smile: it was a nice discussion, I will discuss this with my team.


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.