I'm trying to understand how I should implement the following search feature.
A user can add companies to a control group (for example, 100 companies) and search for similar companies (max 300) based on
criteria groups like location (region, municipality), industry, revenue and some other fields.
Location documents contain an official code (string) + name.
Financial fields like revenue are numeric.
The result should contain scores from the different criteria groups so that the data can be used in the application's UI. The user should see how well location/industry/revenue matches the control group, plus a total score.
Could somebody point me in the right direction on how this could be done?
Assuming that the control group and other companies are in the same index (or different indexes with the same alias), you could index flags like is_control_group as a field in your document alongside your criteria fields like location, industry, etc. This way you could easily exclude your control documents from your search results.
From the sounds of it, you may be interested in experimenting with the More Like This query. You can use the More Like This query to specify specific fields (your criteria) that you want to pull similar results for, and even compare them with specific documents in your index.
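As a rough sketch of that suggestion, the request body below combines a more_like_this query with a must_not clause on the flag. The index name companies, the field names industry_name and municipality_name, and the document IDs are assumptions made up for illustration; they are not taken from the poster's mapping. Lowering min_term_freq and min_doc_freq to 1 is also an assumption that tends to suit short fields.

```python
import json


def build_mlt_query(control_ids, fields, index="companies", size=300):
    """Build a search body that looks for documents similar to the control group."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {
                        "more_like_this": {
                            # Text/keyword fields to compare (hypothetical names).
                            "fields": fields,
                            # Reference the control-group documents already in the index.
                            "like": [
                                {"_index": index, "_id": doc_id}
                                for doc_id in control_ids
                            ],
                            "min_term_freq": 1,
                            "min_doc_freq": 1,
                        }
                    }
                ],
                # Keep the control group itself out of the results.
                "must_not": [{"term": {"is_control_group": True}}],
            }
        },
    }


mlt_body = build_mlt_query(["c1", "c2"], ["industry_name", "municipality_name"])
print(json.dumps(mlt_body, indent=2))
```

The printed JSON could be sent as the body of a _search request.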
I'm not quite sure how feasible it would be to get a lot of different scores in a single API response call though - at least not in any way that would be remotely performant. It might be easier to return the search results with their criteria values, and let your UI use rules to highlight the criteria values that matched in a way that makes sense.
Good luck, I'd be interested to know if this strategy works for you!
Using the more_like_this query was one of my first ideas for solving the problem.
From the documentation I got the impression that it works mainly with string fields and not with numeric fields (plus a range condition), so I rejected this idea.
Another idea was to run one search per category (location, industry, revenue etc.). Since the number of categories is not that high (max 4), it sounded feasible.
The problem here was that I couldn't figure out how to combine these results, i.e. how to get the union (max 300 best results), so I rejected this idea too.
So far my best idea is to run a single query using minimum_should_match. It should return the best results, and I would then do the category scoring outside Elasticsearch, on the back-end side.
It sounds like you arrived at a similar conclusion.
I really appreciate your comment. Elasticsearch's documentation is quite vast and I wasn't sure if I had missed something important.
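The single-query idea above could look roughly like the sketch below: one bool query with a should clause per criteria group, minimum_should_match to drop weak candidates, and a named query (_name) on each clause so that every hit reports which groups it matched in matched_queries. The field names (municipality_code, industry_code, revenue, is_control_group), the weights and the sample hit are assumptions for illustration; the hit is hand-written to mimic the response shape, not real output.

```python
def build_criteria_query(criteria, min_match=2, size=300):
    """One should clause per criteria group, each tagged with a query name."""
    return {
        "size": size,
        "query": {
            "bool": {
                "should": [
                    {"terms": {"municipality_code": criteria["municipality_codes"],
                               "_name": "location"}},
                    {"terms": {"industry_code": criteria["industry_codes"],
                               "_name": "industry"}},
                    {"range": {"revenue": {"gte": criteria["revenue_min"],
                                           "lte": criteria["revenue_max"],
                                           "_name": "revenue"}}},
                ],
                "minimum_should_match": min_match,
                # Exclude the control group itself (hypothetical flag field).
                "must_not": [{"term": {"is_control_group": True}}],
            }
        },
    }


def score_hit(hit, weights):
    """Back-end scoring: 1.0 per matched group, weighted average as total."""
    matched = set(hit.get("matched_queries", []))
    per_category = {name: (1.0 if name in matched else 0.0) for name in weights}
    total = sum(weights[n] * per_category[n] for n in weights) / sum(weights.values())
    return {"id": hit["_id"], "categories": per_category, "total": total}


criteria = {
    "municipality_codes": ["091", "049"],
    "industry_codes": ["62010"],
    "revenue_min": 500_000,
    "revenue_max": 2_000_000,
}
criteria_body = build_criteria_query(criteria)

# Hand-written hit in the shape of a search response entry (not real output).
sample_hit = {"_id": "company-42", "matched_queries": ["location", "revenue"]}
weights = {"location": 1.0, "industry": 1.0, "revenue": 1.0}
scored = score_hit(sample_hit, weights)
print(scored)
```

Named queries only say whether a group matched or not. If the UI needs graded scores (for example, how close the revenue is), the back end would have to compute them from the returned field values.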
minimum_should_match should work to return only documents that meet your minimum allowed set of matched fields - and you're definitely on the right track that combining results of different searches is a very hard problem as the scores are very different.
The other thing you could potentially find useful is using function_score to boost, say, revenue closest to the average of the control group (but that would probably have to be known at query time).
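A minimal sketch of that function_score idea, assuming the control group's revenues are fetched first and averaged on the back end: a gauss decay function centred on the average gives the highest boost to companies whose revenue is closest to it. The field name revenue, the scale_ratio parameter and the sample figures are assumptions for illustration.

```python
from statistics import mean


def build_revenue_decay_query(control_revenues, base_query=None, scale_ratio=0.5):
    """Boost documents whose revenue is close to the control-group average."""
    origin = mean(control_revenues)
    # At a distance of `scale` from `origin` the decay factor equals `decay`.
    # Assumes a positive average revenue, since scale must be greater than zero.
    scale = origin * scale_ratio
    return {
        "query": {
            "function_score": {
                "query": base_query or {"match_all": {}},
                "functions": [
                    {"gauss": {"revenue": {"origin": origin,
                                           "scale": scale,
                                           "decay": 0.5}}}
                ],
                # Multiply the base query score by the decay factor.
                "boost_mode": "multiply",
            }
        }
    }


decay_body = build_revenue_decay_query([100, 200, 300])
gauss = decay_body["query"]["function_score"]["functions"][0]["gauss"]["revenue"]
print(gauss)
```

The base_query argument could be the bool query with minimum_should_match, so that the decay only reorders documents that already meet the minimum criteria.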