I'm using Elasticsearch to index messages in a chat system. Messages work
sort of like Facebook: users talk to other users, and for a given pair of
users there exists at most one thread. So we have threads with potentially
very many messages, with up to 100 messages posted per second, which we
would like to index in near real time. Given that Lucene can't really do
in-place updates (an update means reindexing the whole document), indexing
each thread as a single document is out of the question.
Consequently, we are indexing messages. However, we would like to return
the result as "threads".
According to a bunch of searches I did, grouping is not yet possible with
ES. If I want to, say, get 50 threads for the search query "hello", my best
bet is to fetch ~100 messages, group them by thread myself, and hope that
I end up with sufficiently many threads.
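To make the workaround concrete, here is a minimal sketch of the client-side grouping I have in mind. It assumes the hits arrive ordered by relevance and that each one carries a "thread_id" and a "score" key; both names, and the function itself, are made up for illustration and are not part of Elasticsearch.

```python
from collections import OrderedDict


def group_hits_by_thread(hits, wanted_threads):
    """Collapse a relevance-ordered list of message hits into threads.

    Each hit is assumed to be a dict with hypothetical "thread_id" and
    "score" keys. Threads keep the order of their best-scoring message.
    """
    threads = OrderedDict()
    for hit in hits:
        tid = hit["thread_id"]
        if tid not in threads:
            if len(threads) == wanted_threads:
                # Enough threads collected; ignore hits from new threads.
                continue
            # First hit of a thread is its best one, given the ordering.
            threads[tid] = {"thread_id": tid, "best_score": hit["score"], "hits": 0}
        threads[tid]["hits"] += 1
    return list(threads.values())


# Toy data standing in for the ~100 over-fetched message hits.
hits = [
    {"thread_id": "A", "score": 2.0},
    {"thread_id": "B", "score": 1.5},
    {"thread_id": "A", "score": 1.2},
    {"thread_id": "C", "score": 0.9},
]
grouped = group_hits_by_thread(hits, wanted_threads=2)
print(grouped)
```

If the over-fetched page happens to contain fewer distinct threads than requested, the function simply returns fewer, which is exactly the "hope" part of this approach.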
However, that does not solve the "relevance" issue: if thread A contains
separate messages "cat" and "dog", I want to see it when I search for "cat
AND dog". Or at least I want it to be more relevant than a thread which
only contains the word "dog".
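As a toy illustration of the ranking I am after (not of anything Elasticsearch does), the sketch below scores a thread by how many distinct query terms appear anywhere across its messages. The function name, the (thread_id, text) tuple layout, and the whitespace tokenization are all assumptions made for the example.

```python
def rank_threads(messages, terms):
    """Rank threads by the number of distinct query terms they contain.

    `messages` is an iterable of (thread_id, text) pairs; a term counts
    for a thread if any one of its messages contains it.
    """
    terms = set(terms)
    matched = {}
    for thread_id, text in messages:
        words = set(text.lower().split())
        matched.setdefault(thread_id, set()).update(words & terms)
    # Drop threads with no matching term, then sort by match count
    # (descending) and thread id (ascending) for a stable order.
    ranked = [(tid, len(found)) for tid, found in matched.items() if found]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


messages = [
    ("A", "my cat is asleep"),
    ("A", "the dog barked"),
    ("B", "walking the dog"),
    ("C", "nothing relevant here"),
]
ranking = rank_threads(messages, {"cat", "dog"})
print(ranking)
```

Thread A ranks first here even though no single message contains both words, which is the behaviour that per-message scoring cannot give me.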
Another solution I've found is to make "messages" objects children of
"threads" objects and use a has-child filter.
However, that doesn't solve the relevance problem, because as far as I
understand has-child, it will just give me the first thread which satisfies
the requirements.
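For reference, this is roughly the kind of has-child request I mean, written as a Python dict. The child type "message" and the field "body" are hypothetical names from my setup, and the surrounding bool/filter wrapper is an assumption that may need adjusting for the Elasticsearch version in use.

```python
import json

# Hypothetical mapping: "thread" parent documents with "message" children.
# The search targets threads and keeps those with at least one matching child.
request_body = {
    "query": {
        "bool": {
            "filter": {
                "has_child": {
                    "type": "message",                      # assumed child type name
                    "query": {"match": {"body": "hello"}},  # assumed text field
                }
            }
        }
    },
    "size": 50,  # number of threads wanted
}

print(json.dumps(request_body, indent=2))
```

Because the has-child clause sits in a filter, it only decides which threads are returned and contributes nothing to their score, which is why it does not help with ranking.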
Any better suggestions? Am I doing something unusual?