Search for similar documents


(konstantin) #1

Hi guys,

I have a task: given a document, I have to find the set of similar
documents.
Each document has title and content fields. I'd like to use some kind
of cosine similarity.
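
For reference, here is a minimal sketch of what cosine similarity means for two texts, assuming plain term counts as the vector weights and a naive lowercase alphanumeric tokenizer. The helper names `term_vector` and `cosine_similarity` are made up for illustration; a search engine would normally use weighted terms and its own analyzers instead.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text, split on non-alphanumerics, and count each term."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = term_vector("search engine")
    doc_b = term_vector("search index")
    print(cosine_similarity(doc_a, doc_b))
```

Comparing the input document against every other document this way is linear in the collection size, which is why an index-backed query is usually preferred.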

My current approach is to represent the input document as a boolean
query with one clause per term, combined with OR.

This approach has some shortcomings:

  1. The query is too large (so I expect reduced performance).
  2. I have to use the query_string query type, so I need my own query
    parser (I merge all terms from all the fields into one query and
    boost the terms that belong to the title).
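
The approach described above can be sketched roughly as follows. This is only an illustration: `tokenize`, `build_or_query`, the boost value of 2.0 and the `max_terms` cap are assumptions, not something from the original setup. The cap is one simple way to keep the query from growing too large; it keeps the most frequent (field, term) pairs.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive analyzer stand-in: lowercase alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_or_query(title, content, title_boost=2.0, max_terms=25):
    """Build a query_string expression: one clause per term, joined with OR."""
    counts = Counter()
    for term in tokenize(title):
        counts[("title", term)] += 1
    for term in tokenize(content):
        counts[("content", term)] += 1

    clauses = []
    # Keep only the most frequent (field, term) pairs to bound the query size.
    for (field, term), _ in counts.most_common(max_terms):
        clause = f"{field}:{term}"
        if field == "title":
            clause += f"^{title_boost}"  # boost title terms
        clauses.append(clause)
    return " OR ".join(clauses)


if __name__ == "__main__":
    query = build_or_query("Lucene search", "search engine")
    # Request body for a query_string search.
    print({"query": {"query_string": {"query": query}}})
```

Because the tokens here are plain lowercase alphanumerics, no escaping of query syntax characters is needed; a real analyzer may produce tokens that do need it.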

My questions: what is the best way to solve this task? Is
Elasticsearch/Lucene a good fit for this kind of search?

I would be much obliged,
-Kostya

PS: What is the "MoreLikeThis" function? Is there any description of
how it works?


(Stefan Matheis) #2

Kostya,

what about this
http://www.elasticsearch.org/guide/reference/api/more-like-this.html
and this http://www.elasticsearch.org/guide/reference/query-dsl/mlt-query.html
?
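
As a rough sketch of what a request body for the mlt query might look like for title and content fields: the parameter names below (`fields`, `like_text`, `min_term_freq`, `max_query_terms`) follow the mlt query documentation of that era as far as I recall, so please check them against the linked pages. Newer Elasticsearch releases use `like` instead of `like_text`. The helper name `mlt_query` and the default values are made up for illustration, and the code only builds the JSON body; it does not send anything.

```python
import json


def mlt_query(like_text, fields=("title", "content"),
              min_term_freq=1, max_query_terms=25):
    """Build a more_like_this request body (older `like_text` style)."""
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),              # fields to compare against
                "like_text": like_text,              # text of the input document
                "min_term_freq": min_term_freq,      # ignore terms rarer than this in the input
                "max_query_terms": max_query_terms,  # cap on the generated query size
            }
        }
    }


if __name__ == "__main__":
    body = mlt_query("text of the document to find neighbours for")
    print(json.dumps(body, indent=2))
```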

Regards
Stefan


(Karussell) #3

Also have a look at flt (fuzzy like this):

http://www.elasticsearch.org/guide/reference/query-dsl/flt-field-query.html

http://zmievski.org/2011/03/duplicates-detection-with-elasticsearch

and

https://github.com/karussell/Jetwick/blob/master/src/main/java/de/jet...


(konstantin) #4

Thanks for the links!

By the way, I found an article that describes how mlt works:
http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/

So I guess mlt is a possible solution for me.
I could use fuzzy-like-this as well, but it seems I would get a lot
of false positives.
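
The general idea, as I understand it, is to pick the most "interesting" terms of the input document by a tf-idf style score and build a much smaller OR query from them. A simplified sketch of that selection step is below. It is not Lucene's actual implementation: the smoothed idf formula, the in-memory corpus and the names `interesting_terms`, `max_terms` and `min_term_freq` are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Naive analyzer stand-in: lowercase alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def interesting_terms(doc, corpus, max_terms=5, min_term_freq=1):
    """Return the terms of `doc` with the highest tf * idf score."""
    tf = Counter(tokenize(doc))
    doc_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)

    scored = []
    for term, freq in tf.items():
        if freq < min_term_freq:
            continue  # skip terms that are too rare in the input document
        df = sum(1 for terms in doc_sets if term in terms)
        # Smoothed idf: rare terms score high, ubiquitous terms score low.
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0
        scored.append((freq * idf, term))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_terms]]


if __name__ == "__main__":
    corpus = ["the cat sat", "the dog ran", "the bird flew"]
    print(interesting_terms("the cat chased the cat", corpus, max_terms=2))
```

Only the selected terms go into the final query, which addresses the "query is too large" problem from my first message.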

Thanks.

On Jul 5, 8:42 pm, Karussell tableyourt...@googlemail.com wrote:

also have a look into flt:

http://www.elasticsearch.org/guide/reference/query-dsl/flt-field-quer...

http://zmievski.org/2011/03/duplicates-detection-with-elasticsearch

and

https://github.com/karussell/Jetwick/blob/master/src/main/java/de/jet...

