I have a bunch (millions, but ideally scalable to billions/trillions) of documents in an Elasticsearch 7.13.0 index, and I am successfully performing MLT (more_like_this) queries on them. Each document has an id, a text field, and a "cluster" number. There are N unique cluster numbers across the entire index, but N could be very small or very large, and I'd like to devise a solution that performs well in either case.
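For concreteness, my current query looks roughly like the sketch below (via the Python client; the index name, document id, and field names are just placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Plain more_like_this query over the whole index -- this part already works.
resp = es.search(
    index="documents",  # placeholder index name
    body={
        "query": {
            "more_like_this": {
                "fields": ["text"],
                "like": [{"_index": "documents", "_id": "some-doc-id"}],
                "min_term_freq": 1,
                "min_doc_freq": 5,
            }
        }
    },
)
```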
I want to perform an MLT query, but only among a given cluster number. Is there a recommended way to accomplish this?
Potential things I've thought of:
- Perform the query normally, then filter the results afterward. It's my understanding that, for performance reasons, you want to do as much filtering as possible in the query itself rather than afterward, so I'm a little unsure about this.
- Keep a separate index for each cluster number, then run the query against just that cluster's index. While this seems feasible, my system may soon need to support a large number of unique cluster numbers, and I'm unsure how well a very large number of indices would scale.
- Keep the cluster number as a field (as it is now) and somehow make it a "strict" requirement of the query, along the lines of the first sketch after this list. I couldn't find a clear way to do this in the documentation.
- Add the cluster id to the "like" part of the query and every other cluster id to the "unlike" part. I don't think this would work well: I believe I'd have to put the cluster number in with the rest of the text, so it would carry almost no weight and could distort associations with the actual text.
- Work with the Term Vectors API (see the second sketch after this list). I expect this would require a lot of custom code and may not end up as efficient as the built-in MLT query, but it would give us longer-term flexibility.
- Use the per_field_analyzer attribute of the "like" attribute, as outlined here. Any documentation I could find for this was pretty old, so I'm unsure whether it's still supported; otherwise I'd try it out.
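For the third option, what I have in mind is wrapping the MLT query in a bool query with a term filter on the cluster field, roughly like the sketch below, though I can't tell from the docs whether this is the recommended pattern (names are placeholders again):

```python
# Sketch of option 3: restrict MLT to a single cluster via bool + filter.
resp = es.search(
    index="documents",
    body={
        "query": {
            "bool": {
                # The MLT clause still does the scoring...
                "must": [
                    {
                        "more_like_this": {
                            "fields": ["text"],
                            "like": [{"_index": "documents", "_id": "some-doc-id"}],
                            "min_term_freq": 1,
                            "min_doc_freq": 5,
                        }
                    }
                ],
                # ...while the filter clause is a hard, non-scoring requirement.
                "filter": [{"term": {"cluster": 42}}],
            }
        }
    },
)
```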
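And for the Term Vectors idea, I assume the starting point would be pulling term vectors for the source document and then building the cluster-aware matching on top of them in my own code, roughly like this:

```python
# Sketch of option 5: fetch term vectors for one document; everything beyond
# this (scoring candidates within a cluster) would be custom code on my side.
tv = es.termvectors(
    index="documents",
    id="some-doc-id",
    body={
        "fields": ["text"],
        "term_statistics": True,   # per-term doc_freq / total term frequency
        "field_statistics": True,
    },
)
terms = tv["term_vectors"]["text"]["terms"]  # term -> stats for this document
```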
Thank you for any input. I'm pretty new to Elasticsearch and it's been very impressive to see the query times I'm getting on large datasets.