Search for similar documents

konstantin · July 5, 2011, 12:47pm

Hi guys,

I have a task: given a document, I have to find the set of similar
documents.
The document has title and content fields. I'd like to use some kind
of cosine similarity.
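
For reference, cosine similarity between two documents can be sketched over raw term-frequency vectors. This is only a minimal illustration under stated assumptions: the class and method names are made up for this example, tokenization is a plain whitespace split, and no idf weighting or stemming is applied.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of Elasticsearch or Lucene.
class CosineSimilaritySketch {

    // Counts how often each lowercased, whitespace-separated token occurs.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a sparse term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Dot product of the two vectors divided by the product of their lengths.
    // Returns 0.0 when either document has no terms.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue().doubleValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    public static void main(String[] args) {
        Map<String, Integer> a = termFrequencies("search for similar documents");
        Map<String, Integer> b = termFrequencies("find similar documents");
        System.out.println(cosine(a, b));
    }
}
```

Comparing one document against every other document this way does not scale, which is why the question is about letting the search engine do the term matching.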

My current approach is to represent the input document as a boolean
query with one clause per term, combined with OR.

This approach has some shortcomings:

The query is too large (so I expect reduced performance).
I have to use the query_string query type, so I need my own query
parser (I merge all terms from all fields into one query and boost
the terms that belong to the title).
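
The approach described above might look roughly like the following sketch. It is an assumption-laden illustration, not the poster's actual code: the class name, the `maxTerms` cap and the boost value are hypothetical, and terms are assumed to be plain alphanumeric tokens that need no escaping in Lucene query-string syntax.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not part of Elasticsearch or Lucene.
class OrQuerySketch {

    // Builds a query string such as "title:foo^2.0 OR content:bar".
    // Title clauses come first, so they survive the maxTerms cap.
    static String buildQuery(List<String> titleTerms, List<String> contentTerms,
                             float titleBoost, int maxTerms) {
        // LinkedHashSet drops duplicate clauses and keeps insertion order.
        Set<String> clauses = new LinkedHashSet<>();
        for (String term : titleTerms) {
            clauses.add("title:" + term + "^" + titleBoost);
        }
        for (String term : contentTerms) {
            clauses.add("content:" + term);
        }
        List<String> limited = new ArrayList<>(clauses);
        if (limited.size() > maxTerms) {
            limited = limited.subList(0, maxTerms);
        }
        return String.join(" OR ", limited);
    }

    public static void main(String[] args) {
        List<String> title = new ArrayList<>();
        title.add("search");
        List<String> content = new ArrayList<>();
        content.add("similar");
        content.add("documents");
        System.out.println(buildQuery(title, content, 2.0f, 10));
    }
}
```

Capping the number of clauses is one simple way to keep the query small, at the cost of ignoring some terms; choosing which terms to keep is the part that MoreLikeThis automates.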

My questions are: what is the best way to solve this task? Are
Elasticsearch and Lucene a good fit for this kind of search?

I would be much obliged,
-Kostya

PS: What is the "MoreLikeThis" function? Is there any description of how it
works?

Stefan_Matheis · July 5, 2011, 12:52pm

Kostya,

what about this
http://www.elasticsearch.org/guide/reference/api/more-like-this.html
and this
http://www.elasticsearch.org/guide/reference/query-dsl/mlt-query.html
?

Regards
Stefan


Karussell1 · July 5, 2011, 4:42pm

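also have a look into flt (the fuzzy-like-this query) and at the post
"Duplicates Detection with ElasticSearch" by Andrei Zmievski,

and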

https://github.com/karussell/Jetwick/blob/master/src/main/java/de/jetwick/es/ElasticTweetSearch.java#L971

    // Excerpt from ElasticTweetSearch.java as embedded in the post.
    // Collects the terms of every terms facet whose count exceeds `limit`.
    public Collection<String> searchTrends(JetwickQuery q, int limit) {
        try {
            q.addFacetField(TAG);
            SearchResponse rsp = query(q);
            Facets facets = rsp.facets();
            if (facets == null)
                return Collections.emptyList();

            // LinkedHashSet keeps the terms unique and in facet order.
            Set<String> set = new LinkedHashSet<String>();
            for (Facet facet : facets.facets()) {
                if (facet instanceof TermsFacet) {
                    TermsFacet ff = (TermsFacet) facet;
                    for (TermsFacet.Entry e : ff.entries()) {
                        if (e.count() > limit)
                            set.add(e.getTerm());
                    }
                    // ... the embedded excerpt is cut off here

On Jul 5, 2:52 pm, Stefan Matheis matheis.ste...@googlemail.com
wrote:

Kostya,

what about this http://www.elasticsearch.org/guide/reference/api/more-like-this.html
and this http://www.elasticsearch.org/guide/reference/query-dsl/mlt-query.html
?

Regards
Stefan


konstantin · July 6, 2011, 11:19am

Thanks for the links!

By the way, I found an article that describes how mlt works:
http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/

So I guess mlt is a possible solution for me.
I could use fuzzy-like-this as well, but it seems that I would get a lot
of false positives.
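
As a rough illustration of the idea behind MoreLikeThis, the sketch below picks the "most interesting" terms of a source document by a tf-idf score and keeps only the top few for the query. It is a simplified approximation, not Lucene's implementation: the class name, the parameter names and the exact idf formula (1 + ln(numDocs / (docFreq + 1))) are assumptions made for this example, and the document frequencies are passed in as a plain map instead of being read from an index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of Elasticsearch or Lucene.
class MltTermSelectionSketch {

    // Returns up to maxQueryTerms terms, highest tf-idf score first.
    static List<String> selectTerms(Map<String, Integer> termFreqs,
                                    Map<String, Integer> docFreqs,
                                    int numDocs, int minTermFreq,
                                    int minDocFreq, int maxQueryTerms) {
        final Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            int tf = e.getValue();
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            // Skip terms that are rare in the document or in the index.
            if (tf < minTermFreq || df < minDocFreq) {
                continue;
            }
            // Assumed idf variant; common terms get a low weight.
            double idf = 1.0 + Math.log((double) numDocs / (df + 1));
            scores.put(e.getKey(), tf * idf);
        }
        List<String> terms = new ArrayList<>(scores.keySet());
        // Sort by descending score, ties broken alphabetically.
        terms.sort((x, y) -> {
            int byScore = Double.compare(scores.get(y), scores.get(x));
            return byScore != 0 ? byScore : x.compareTo(y);
        });
        if (terms.size() > maxQueryTerms) {
            return new ArrayList<>(terms.subList(0, maxQueryTerms));
        }
        return terms;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = new HashMap<>();
        tf.put("lucene", 3);
        tf.put("the", 5);
        tf.put("search", 2);
        Map<String, Integer> df = new HashMap<>();
        df.put("lucene", 10);
        df.put("the", 1000);
        df.put("search", 100);
        System.out.println(selectTerms(tf, df, 1000, 2, 5, 2));
    }
}
```

The selected terms would then be OR-ed into a query, which keeps the query small compared with using every term of the document.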

Thanks.

On Jul 5, 8:42 pm, Karussell tableyourt...@googlemail.com wrote:

also have a look into flt:

Elasticsearch Platform — Find real-time answers at scale | Elastic...

Duplicates Detection with ElasticSearch » Andrei Zmievski

and

https://github.com/karussell/Jetwick/blob/master/src/main/java/de/jet...

