Finding similar documents with Elasticsearch


(Zoran Jeremic) #1

Hi guys,

I'm trying to develop a service that stores uploaded files as attachments
(the file is one field in the document). This part works fine, as I can
search these files using like_text as input. However, the second part of
the service should compare the file that was just uploaded with the existing
files in order to find duplicates or very similar files. The problem is that
I always get the same results regardless of the input I'm using, and these
results are wrong, as the exact same file very often has the smallest score.
It looks like the like_text extracted from the uploaded file is always the
same, and none of the documents has the expected score, which I believe
should be 1 for identical documents. The scores I get are always less than 0.2.
Could you please check if there is something wrong with my code?

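// Imports this snippet appears to rely on (assumed; not shown in the original post).
import static org.elasticsearch.client.Requests.indexRequest;
import static org.elasticsearch.client.Requests.putMappingRequest;
import static org.elasticsearch.client.Requests.refreshRequest;
import static org.elasticsearch.common.io.Streams.copyToStringFromClasspath;
import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;

import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.action.mlt.MoreLikeThisRequestBuilder;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.SearchHits;

// Load the mapping and read the uploaded file into a byte array.
String mapping = copyToStringFromClasspath("/org/prosolo/services/indexing/documents-mapping.json");
byte[] txt = org.elasticsearch.common.io.Streams.copyToByteArray(file);
Client client = ElasticSearchFactory.getClient();

// Register the mapping, then index the document with the file as an attachment.
client.admin().indices()
    .putMapping(putMappingRequest(indexName).type(indexType).source(mapping))
    .actionGet();
IndexResponse iResponse = client.index(indexRequest(indexName).type(indexType)
    .source(jsonBuilder()
        .startObject()
        .field("file", txt)
        .field("title", title)
        .field("visibility", visibilityType.name().toLowerCase())
        .field("ownerId", ownerId)
        .field("description", description)
        .field("contentType", DocumentType.DOCUMENT.name().toLowerCase())
        .field("dateCreated", dateCreated)
        .field("url", link)
        .field("relatedToType", relatedToType)
        .field("relatedToId", relatedToId)
        .endObject()))
    .actionGet();

// Refresh so the newly indexed document is visible to search.
client.admin().indices().refresh(refreshRequest()).actionGet();

// Ask for documents similar to the one just indexed, comparing on the "file" field.
// Note: this assumes ESIndexNames.INDEX_DOCUMENTS and ESIndexTypes.DOCUMENT
// name the same index and type as indexName and indexType above.
MoreLikeThisRequestBuilder mltRequestBuilder = new MoreLikeThisRequestBuilder(
    client, ESIndexNames.INDEX_DOCUMENTS, ESIndexTypes.DOCUMENT, iResponse.getId());
mltRequestBuilder.setField("file");
SearchResponse response = client.moreLikeThis(mltRequestBuilder.request()).actionGet();

// Print every hit with its title and score.
SearchHits searchHits = response.getHits();
System.out.println("getTotalHits:" + searchHits.getTotalHits());
for (SearchHit searchHit : searchHits) {
    System.out.println("FOUND DOCUMENT:" + searchHit.getId()
        + " title:" + searchHit.getSource().get("title")
        + " score:" + searchHit.score());
}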

And this is the mapping I was using

{
  "document": {
    "properties": {
      "title": { "type": "string", "store": true },
      "description": { "type": "string", "store": "yes" },
      "contentType": { "type": "string", "store": "yes" },
      "dateCreated": { "store": "yes", "type": "date" },
      "url": { "store": "yes", "type": "string" },
      "visibility": { "store": "yes", "type": "string" },
      "ownerId": { "type": "long", "store": "yes" },
      "relatedToType": { "type": "string", "store": "yes" },
      "relatedToId": { "type": "long", "store": "yes" },
      "file": {
        "path": "full",
        "type": "attachment",
        "fields": {
          "author": { "type": "string" },
          "title": { "store": true, "type": "string" },
          "keywords": { "type": "string" },
          "file": {
            "store": true,
            "term_vector": "with_positions_offsets",
            "type": "string"
          },
          "name": { "type": "string" },
          "content_length": { "type": "integer" },
          "date": { "format": "dateOptionalTime", "type": "date" },
          "content_type": { "type": "string" }
        }
      }
    }
  }
}
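
On the expectation above that identical documents should score 1: that property belongs to normalized similarity measures such as cosine similarity over term counts. Relevance scores from Lucene-based queries are generally not normalized to a fixed 0 to 1 range, so a low absolute score does not by itself mean the match is poor. The sketch below only illustrates the normalized measure; it is not how more_like_this computes its scores. The class name TermCosine, its method names, and the whitespace tokenizer are hypothetical and chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Elasticsearch:
// cosine similarity over raw term counts of two texts.
final class TermCosine {

    // Counts lowercase, whitespace-separated terms (a deliberately naive tokenizer).
    static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : text.toLowerCase().trim().split("\\s+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns a value between 0 and 1; 0 if either text has no terms.
    static double cosine(String a, String b) {
        Map<String, Integer> ca = termCounts(a);
        Map<String, Integer> cb = termCounts(b);
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : ca.entrySet()) {
            int x = e.getValue();
            normA += (double) x * x;
            Integer y = cb.get(e.getKey());
            if (y != null) {
                dot += (double) x * y;
            }
        }
        for (int y : cb.values()) {
            normB += (double) y * y;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Identical texts give a similarity of 1; texts with no shared terms give 0.
        System.out.println(cosine("the quick brown fox", "the quick brown fox"));
        System.out.println(cosine("the quick brown fox", "lorem ipsum"));
    }
}
```

A measure like this could be applied to the extracted text of two files after Elasticsearch has narrowed down the candidates, if a duplicate threshold on a fixed scale is needed.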

Thanks,
Zoran

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/79e29c89-62ea-42f3-be93-3e215a75860a%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Pilato) #2

Do you get the same results if you compare against the file.file field?

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs



(Zoran Jeremic) #3

Hi David,

When I try to compare against file.file, I get an exception:

Exception in thread "Thread-62" org.elasticsearch.ElasticSearchException: No fields found to fetch the 'likeText' from
    at org.elasticsearch.action.mlt.TransportMoreLikeThisAction$1.onResponse(TransportMoreLikeThisAction.java:176)
    at org.elasticsearch.action.mlt.TransportMoreLikeThisAction$1.onResponse(TransportMoreLikeThisAction.java:128)
    at org.elasticsearch.action.support.TransportAction$ThreadedActionListener$1.run(TransportAction.java:93)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

Zoran



(Zoran Jeremic) #4

Does anyone have an idea what could be the problem here?


