I would like to apply a simple math formula to all documents in an Elasticsearch index.
The tricky part is that each variable is a multi-valued field, and its values are attributes of a person.
And now, I would like to calculate the following formula for all documents:
score = 5 * (1 if A contains the value 2, else 0) + 4 * (1 if A contains the value 3, else 0)
In SQL it would be very simple:
SELECT 5 * A_2 + 4 * A_3 FROM table.
In a SQL table each attribute would be a separate column, but in Elasticsearch we keep all information about a person (from all tables/sources) in one document.
I think it is feasible using CASE statements, but I would prefer a more elegant approach, especially since we sometimes need 200 or more attributes in one equation and I do not want more than 200 CASE statements.
GET /index/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "script_score": {
        "script": {
          "params": {
            "2": 5,
            "3": 4
          },
          "source": """
            double sum = 0.0;
            for (item in doc['A']) {
              // Look up the weight keyed by the attribute value;
              // params.get returns null when the value has no weight.
              def weight = params.get(Long.toString(item));
              if (weight != null) {
                sum += weight;
              }
            }
            return sum;
          """
        }
      }
    }
  }
}
It works, but the performance is not great: I get a timeout error after 5 minutes on 250 million documents. Do you know how to achieve the same result with a shorter elapsed time?
I'm not going to be able to give you much assistance on the script side of things, sorry.
My suggestion would be to move this to being calculated at ingest time, which is a much more efficient approach.
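As a sketch of that approach, the weighted sum could be computed once per document by an ingest pipeline with a script processor, storing the result in a new field. The pipeline name `precompute-score`, the output field `precomputed_score`, and the hardcoded weights are illustrative assumptions; this only works when the equation is known at ingest time.

```json
PUT _ingest/pipeline/precompute-score
{
  "processors": [
    {
      "script": {
        "source": """
          double sum = 0.0;
          // Illustrative weights: value 2 contributes 5, value 3 contributes 4.
          Map weights = ['2': 5.0, '3': 4.0];
          if (ctx.A != null) {
            for (item in ctx.A) {
              def w = weights.get(item.toString());
              if (w != null) {
                sum += w;
              }
            }
          }
          ctx.precomputed_score = sum;
        """
      }
    }
  ]
}
```

At query time you would then sort on `precomputed_score` or feed it into a cheap `field_value_factor` function instead of running a script over every document.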
It looks like all documents in the index need to be scored by the script, which can be slow. As query processing per shard is single-threaded, you may improve performance by increasing the number of primary shards to match the number of cores you have. However, if you expect multiple queries of this type to run concurrently, that may not be optimal, and I would recommend benchmarking with real data and queries.
If you can add a filter to the query to exclude poor candidates up front, you could potentially also speed it up. This might, for example, be a filter removing documents that do not have at least one of the parameter values and would therefore score very poorly.
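A minimal sketch of such a filter, assuming the scored values live in field `A`: a `terms` query in a `bool` filter context excludes documents that contain none of the weighted values, so the script only runs on the remainder, and the filter itself does not affect scoring.

```json
GET /index/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": {
            "terms": { "A": [2, 3] }
          }
        }
      },
      "script_score": {
        "script": {
          "params": { "2": 5, "3": 4 },
          "source": """
            double sum = 0.0;
            for (item in doc['A']) {
              def weight = params.get(Long.toString(item));
              if (weight != null) {
                sum += weight;
              }
            }
            return sum;
          """
        }
      }
    }
  }
}
```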
Yes, we expect to have multiple queries like this one running at the same time, but multi-threading is still worth trying.
Excluding poor candidates sounds reasonable, but we would lose some good candidates that way. Say a candidate has 29 of the 30 best variables; we would still like that person included. Something like "filter out everyone who has fewer than 80% of the 30 best variables" would work, but how do we do that efficiently?
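A possible sketch of that "at least 80%" filter is a `terms_set` query, which matches documents containing a minimum number of the supplied terms; `minimum_should_match_script` exposes `params.num_terms` (the number of terms supplied), so the threshold can be expressed as a fraction. The list of terms below is purely illustrative.

```json
GET /index/_search
{
  "query": {
    "terms_set": {
      "A": {
        "terms": [2, 3, 5, 7, 11],
        "minimum_should_match_script": {
          "source": "Math.ceil(params.num_terms * 0.8)"
        }
      }
    }
  }
}
```

This could be combined with the scoring script by placing the `terms_set` query in a `bool` filter inside the `function_score` query, so only candidates above the threshold are scored.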
That is something you need to know your data to determine. If you do not filter in some way, you do not benefit from the indexing that makes Elasticsearch fast, and you likely end up with a query that scales badly.