Search for similar documents

Hi guys,

I have a task: given a document, I have to find the set of similar
documents.
Each document has title and content fields. I'd like to use some kind
of cosine similarity.
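
To make "cosine similarity" concrete, here is a minimal sketch over raw term counts. It assumes the documents are already tokenized; the function name and the plain-count weighting are illustrative choices, not something Elasticsearch or Lucene exposes in this form.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc1 = "search for similar documents".split()
    doc2 = "find similar documents fast".split()
    print(cosine_similarity(doc1, doc2))
```

Computing this pairwise against every stored document does not scale, which is why the rest of the thread is about letting the index do the work.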

My current approach is to represent the input document as a boolean
query with one clause per term, joined with OR.

This has some shortcomings:

  1. The query is too large (so I expect reduced performance).
  2. I have to use the query_string query type, so I need to use my own
    query parser (I merge all terms from all the fields into one query and
    boost the terms that belong to the title).
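
The approach described above can be sketched roughly as follows. This is only an illustration: the function name, the field prefixes, and the boost value are assumptions, and the terms are assumed to be plain analyzed tokens that need no escaping for the query_string syntax.

```python
def build_or_query(title_terms, content_terms, title_boost=2.0):
    """Build one query_string expression: every term OR-ed, title terms boosted."""
    clauses = []
    seen = set()
    for term in title_terms:
        clause = f"title:{term}^{title_boost}"
        if clause not in seen:  # skip repeated terms, keep first-seen order
            seen.add(clause)
            clauses.append(clause)
    for term in content_terms:
        clause = f"content:{term}"
        if clause not in seen:
            seen.add(clause)
            clauses.append(clause)
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(build_or_query(["lucene"], ["search", "index", "search"]))
```

The query grows by one clause per distinct term in the document, which is exactly the size problem from point 1.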

My questions: what is the best way to solve this task? Is
Elasticsearch/Lucene good for that kind of search?

I would be much obliged,
-Kostya

PS: what is the "MoreLikeThis" function? Is there any description of how
it works?

Kostya,

what about this
http://www.elasticsearch.org/guide/reference/api/more-like-this.html
and this
http://www.elasticsearch.org/guide/reference/query-dsl/mlt-query.html
?

Regards
Stefan
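
On the MoreLikeThis question, a query body might look roughly like the sketch below. It only builds the JSON and does not contact a cluster. The helper name and the threshold values are made up for illustration, and `like_text` is the parameter name as I recall it from Elasticsearch releases of that period (later versions use `like` instead), so check the reference for the version in use.

```python
import json


def mlt_query_body(text, fields=("title", "content"),
                   min_term_freq=1, max_query_terms=25):
    """Build a more_like_this query body for the given document text."""
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                # Assumption: older releases take like_text, newer ones take like.
                "like_text": text,
                "min_term_freq": min_term_freq,
                "max_query_terms": max_query_terms,
            }
        }
    }


if __name__ == "__main__":
    body = mlt_query_body("text of the document to find neighbours for")
    print(json.dumps(body, indent=2))
```

Capping `max_query_terms` is what keeps the generated query small, in contrast to a hand-built OR over every term.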

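also have a look into flt:

Duplicates Detection with ElasticSearch » Andrei Zmievski

and

https://github.com/karussell/Jetwick/blob/master/src/main/java/de/jet...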

Thanks for the links!

BTW, I found an article that describes how mlt works:
http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/

So I guess mlt is a possible solution for me.
I could use fuzzy-like-this as well, but it seems that I would get a lot
of false positives.
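
As I understand it, the core idea is to keep only the most "interesting" terms of the input document, ranked by tf-idf, and OR those together. A minimal sketch of that selection step follows; it is not Lucene's actual code. The function name is hypothetical, the document frequencies are passed in as a plain dict, and the idf formula and thresholds are assumptions chosen for illustration.

```python
import math
from collections import Counter


def interesting_terms(tokens, doc_freq, num_docs,
                      min_term_freq=2, min_doc_freq=5, max_query_terms=25):
    """Return the highest-scoring terms of a document by tf * idf."""
    scored = []
    for term, freq in Counter(tokens).items():
        df = doc_freq.get(term, 0)
        # Drop terms that are rare in this document or rare in the index.
        if freq < min_term_freq or df < min_doc_freq:
            continue
        idf = 1.0 + math.log(num_docs / (df + 1))
        scored.append((freq * idf, term))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]


if __name__ == "__main__":
    tokens = ["lucene", "lucene", "the", "the", "the", "search"]
    doc_freq = {"lucene": 10, "the": 1000, "search": 50}
    print(interesting_terms(tokens, doc_freq, num_docs=1000))
```

Because only the top-scoring terms survive, the resulting query stays small however long the input document is.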

Thanks.
