I'm currently evaluating Elasticsearch for use in a "selection engine"
at my company. The selection engine will be used to answer questions like
"how many people between 20 and 30 years old are there in the city of
Stockholm?". The critical thing for this system is to give fast feedback on
the counts (within a second); extracting the identities is not as
time critical.
One requirement of the system is that it can run multiple queries and still
keep identities unique across them. For example, you should be able to make
two queries, "all people in Stockholm" and "all males between 20 and 25",
and the second query should then not include anyone living in Stockholm. We
have solved this by negating the filter of the first query and using it in
the second, and because of Elasticsearch filter caching this gives us
really nice performance.
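To make the negated-filter approach concrete, here is a minimal sketch of how the two request bodies could be built. It assumes a recent Elasticsearch version where the `bool` query accepts `filter` and `must_not` clauses; the field names `city`, `gender` and `age` and the helper functions are made up for illustration and are not taken from our real mapping.

```python
# Sketch only: builds request bodies as plain dicts, no cluster needed.
# Field names ("city", "gender", "age") are hypothetical.

def people_in_city(city):
    """Filter clause matching everyone in the given city."""
    return {"term": {"city": city}}


def males_between(min_age, max_age):
    """Filter clauses matching males in an inclusive age range."""
    return [
        {"term": {"gender": "male"}},
        {"range": {"age": {"gte": min_age, "lte": max_age}}},
    ]


def count_body(filters, excluded=()):
    """Body matching all `filters` and none of the `excluded` clauses."""
    return {
        "query": {
            "bool": {
                "filter": list(filters),
                "must_not": list(excluded),
            }
        }
    }


first_filter = people_in_city("stockholm")
first_body = count_body([first_filter])

# The second query reuses the first filter, negated, to keep identities unique.
second_body = count_body(males_between(20, 25), excluded=[first_filter])
```

Both bodies could then be sent to a count request; since the excluded clause is the same filter as in the first query, it is the kind of clause the filter cache can reuse.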
Now to the real challenge: any of these queries can contain a limit, so the
above example can become "all people in Stockholm limited to 10000" and "all
males between 20 and 25". In this case the result of the second query
should contain documents that are "selected" by the first query but are not
among the 10000 chosen by that limit. We can no longer rely on negated
filters, because we now have to inspect the result set to find out which
documents actually "hit" the first query. And because one query can hit
millions of documents, this is, of course, really slow.
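One way to picture the slow path: once the first query is limited, the exclusion can no longer be expressed as a filter and has to name the chosen documents explicitly, for instance with an `ids` query inside `must_not`. The sketch below only builds such a request body; the field names, the helper `exclude_chosen` and the document IDs are hypothetical.

```python
# Sketch only: the exclusion list grows with the limit of the first query.
# Field names and document IDs are hypothetical.

def exclude_chosen(filters, chosen_ids):
    """Body matching `filters` but skipping documents already chosen."""
    return {
        "query": {
            "bool": {
                "filter": list(filters),
                "must_not": [{"ids": {"values": sorted(chosen_ids)}}],
            }
        }
    }


# IDs that the limited first query actually returned (3 here, 10000 in practice).
chosen = {"p-17", "p-03", "p-42"}

limited_second_body = exclude_chosen(
    [
        {"term": {"gender": "male"}},
        {"range": {"age": {"gte": 20, "lte": 25}}},
    ],
    chosen,
)
```

Every chosen ID has to be fetched from the first result and shipped back with the second request, which is where the cost comes from when limits are large.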
Has any of you considered this kind of requirement before, and do you
have a suggestion for how we can solve it with reasonable performance?
My team will now examine the possibility of building this functionality into
Elasticsearch. We would like to be able to start a "transaction" in ES that
keeps track of all document identities that have been selected by any query
within the transaction. Then we can always exclude these identities from a
query to create the described "uniqueness". Does any of you know if this is
feasible, and do you have any suggestions for our implementation?
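To illustrate the semantics we are after, here is a small in-memory model of such a "transaction". It is not Elasticsearch code and says nothing about how this would be implemented inside ES; the class `SelectionTransaction`, the sample `people` data and the helper `matching` are all invented for the example.

```python
# In-memory model of the desired behaviour, not an Elasticsearch API.

class SelectionTransaction:
    """Remembers every identity selected so far and excludes it later."""

    def __init__(self):
        self.selected = set()

    def select(self, matching_ids, limit=None):
        # Drop identities already taken by earlier queries in this transaction.
        fresh = [i for i in matching_ids if i not in self.selected]
        if limit is not None:
            fresh = fresh[:limit]
        self.selected.update(fresh)
        return fresh


# Hypothetical sample data keyed by identity.
people = {
    1: {"city": "stockholm", "gender": "male", "age": 22},
    2: {"city": "stockholm", "gender": "female", "age": 40},
    3: {"city": "stockholm", "gender": "male", "age": 24},
    4: {"city": "uppsala", "gender": "male", "age": 21},
}


def matching(predicate):
    """Identities whose document satisfies `predicate`, in ID order."""
    return [i for i, doc in sorted(people.items()) if predicate(doc)]


tx = SelectionTransaction()
first = tx.select(matching(lambda d: d["city"] == "stockholm"), limit=2)
second = tx.select(
    matching(lambda d: d["gender"] == "male" and 20 <= d["age"] <= 25)
)
```

Here the first query takes identities 1 and 2 because of its limit, so identity 3, which matches both queries, is still available and ends up in the second result together with identity 4.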