MoreLikeThis can't identify that 2 documents with exactly same attachments are duplicates


(Zoran Jeremic) #1

Hi guys,

I have a document type that stores the content of an HTML file, PDF, DOC or
other textual document in one of its fields as a byte array, using the
attachment plugin. The mapping is as follows:

{
  "document": {
    "properties": {
      "title": { "type": "string", "store": true },
      "description": { "type": "string", "store": "yes" },
      "contentType": { "type": "string", "store": "yes" },
      "url": { "store": "yes", "type": "string" },
      "visibility": { "store": "yes", "type": "string" },
      "ownerId": { "type": "long", "store": "yes" },
      "relatedToType": { "type": "string", "store": "yes" },
      "relatedToId": { "type": "long", "store": "yes" },
      "file": {
        "path": "full",
        "type": "attachment",
        "fields": {
          "author": { "type": "string" },
          "title": { "store": true, "type": "string" },
          "keywords": { "type": "string" },
          "file": { "store": true, "term_vector": "with_positions_offsets", "type": "string" },
          "name": { "type": "string" },
          "content_length": { "type": "integer" },
          "date": { "format": "dateOptionalTime", "type": "date" },
          "content_type": { "type": "string" }
        }
      }
    }
  }
}
And the code I'm using to store the document is:

VisibilityType.PUBLIC

These files seem to be stored fine and I can search their content. However, I
need to identify whether there are duplicates of web pages or files stored in
ES, so that I don't return the same document to the user several times in
search or recommendation results. My expectation was that I could use
MoreLikeThis after a document was indexed to find duplicates of that
document and mark it as a duplicate accordingly. However, the results look
weird to me, or I don't understand very well how MoreLikeThis works.

For example, I indexed web page http://en.wikipedia.org/wiki/Linguistics 3
times, and all 3 documents in ES have exactly the same binary content under
file. Then for the following query:
http://localhost:9200/documents/document/WpkcK-ZjSMi_l6iRq0Vuhg/_mlt?mlt_fields=file&min_doc_freq=1
where the ID is the id of one of these documents, I got these results:
http://en.wikipedia.org/wiki/Linguistics with score 0.6633003
http://en.wikipedia.org/wiki/Linguistics with score 0.6197818
http://en.wikipedia.org/wiki/Computational_linguistics with score 0.48509508
...

For some other examples, the scores for identical documents are much lower, and
sometimes (though not that often) I don't get the duplicates in the first
positions. I would expect a score of 1.0 or higher for documents
that are exactly the same, but that's not the case, and I can't figure out
how I could identify duplicates in the Elasticsearch index.

I would appreciate it if somebody could explain whether this is expected
behaviour or whether I didn't use it properly.

Thanks,
Zoran

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3c5fd0da-e192-4c54-85d5-63c84f3acafc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Alex Ksikes) #2

Hi Zoran,

Using the attachment type, you can text search over the attached document
meta-data, but not its actual content, as it is base 64 encoded. So I would
adjust the mlt_fields accordingly, and possibly extract the relevant
portions of texts manually. Also set percent_terms_to_match = 0, to ensure
that all boolean clauses match. Let me know how this works out for you.

Cheers,

Alex

On Monday, May 5, 2014 5:50:07 AM UTC+2, Zoran Jeremic wrote:



(Zoran Jeremic) #3

Hi Alex,

Thank you for your explanation. It makes sense now. However, I'm not sure I
understood your proposal.

"So I would adjust the mlt_fields accordingly, and possibly extract the
relevant portions of texts manually"

What do you mean by adjusting mlt_fields? The only shared field that is
guaranteed to be the same is file. Different users could give documents
different titles but attach the same or almost the same files. If I compare
documents based on the other fields, they won't necessarily match,
even though the attached files are exactly the same.
I'm also not sure what you meant by extracting the relevant portions of
text manually. How would I do that, and what would I do with the result?

Thanks,
Zoran

On Monday, 5 May 2014 01:23:49 UTC-7, Alex Ksikes wrote:



(Alex Ksikes) #4

Hi Zoran,

If you are looking for exact duplicates, then hashing the file content and
searching for that hash would do the job. If you are looking for near
duplicates, then I would recommend extracting whatever text you have in
your html, pdf, doc, indexing that, and running more like this with
like_text set to that content. Additionally, you can perform an mlt search on
more fields, including the meta-data fields extracted with the attachment
plugin. Hope this helps.
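
The exact-duplicate check described above could be sketched roughly as follows. This is only an illustration: SHA-256 as the hash function and the helper name sha256Hex are assumptions, not something prescribed in this thread. The resulting hex string would be stored at index time in its own not-analyzed field (a hypothetical fileHash, for instance) and looked up with an exact-match query.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: fingerprints the raw attachment bytes.
final class ContentHash {

    // Returns the SHA-256 digest of the content as 64 lowercase hex characters.
    static String sha256Hex(byte[] content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                // Mask to an unsigned value so negative bytes format as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

Two uploads with byte-identical content get the same fingerprint, while a single changed byte gives a different one, so this only covers exact duplicates.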

Alex

On Monday, May 5, 2014 8:08:30 PM UTC+2, Zoran Jeremic wrote:



(Zoran Jeremic) #5

Hi Alex,

"If you are looking for exact duplicates then hashing the file content, and
doing a search for that hash would do the job."

This trick won't work for me, as these are not exact duplicates. For
example, I have 10 students working on the same 100-page Word
document. Each of these students could change only one sentence and upload
the document. The hashes will differ, but the documents are 99.99% the same.
I have another service that uses mlt_like_text to recommend relevant
documents, and my problem is that if this document has the best score, then
all of its duplicates will be among the top hits, and instead of recommending
several of the most relevant documents I will recommend 10 instances of the
same document.
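
One way to put a number on "almost the same" is to compare the sets of word shingles of two extracted texts. The sketch below is not something proposed in this thread and is not an Elasticsearch feature; it is a plain Jaccard similarity over k-word shingles, with the class name, method names and the choice of k all assumed for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for flagging near-duplicate texts outside Elasticsearch.
final class ShingleSimilarity {

    // Splits the text on whitespace and returns every run of k consecutive words.
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return result;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // two empty texts are treated as identical
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }
}
```

Unlike a relevance score, this value is bounded: identical texts give 1.0, and a long document with one changed sentence stays close to 1.0, so a fixed threshold could be used to collapse such hits before recommending them.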

"If you are looking for near duplicates, then I would recommend extracting
whatever text you have in your html, pdf, doc, indexing that and running
more like this with like_text set to that content."

I tried that as well, and the results are very disappointing, though I'm not
sure it would be a good idea anyway, bearing in mind that long textual
documents could be used. For testing purposes, I made a simple test with 10
web pages. Maybe I'm making some mistake there. What I did was index 10 web
pages and store each in a document as an attachment. The content is stored as
byte[]. Then I take the same 10 pages, extract their content using Jsoup, and
try to find similar web pages. Here is the code that I used to find web pages
similar to the provided one:
    System.out.println("Duplicates for link:" + link);
    System.out.println("************************************************");
    String indexName = ESIndexNames.INDEX_DOCUMENTS;
    String indexType = ESIndexTypes.DOCUMENT;
    String mapping = copyToStringFromClasspath(
            "/org/prosolo/services/indexing/document-mapping.json");
    client.admin().indices().putMapping(
            putMappingRequest(indexName).type(indexType).source(mapping)).actionGet();
    URL url = new URL(link);
    // Fetch the page; doc.html() keeps the markup, doc.text() would keep only the text
    org.jsoup.nodes.Document doc = Jsoup.connect(link).get();
    String html = doc.html(); //doc.text();
    // create the query
    QueryBuilder qb = QueryBuilders.moreLikeThisQuery("file")
            .likeText(html).minTermFreq(0).minDocFreq(0);
    SearchResponse sr = client.prepareSearch(ESIndexNames.INDEX_DOCUMENTS)
            .setQuery(qb).addFields("url", "title", "contentType")
            .setFrom(0).setSize(5).execute().actionGet();
    if (sr != null) {
        SearchHits searchHits = sr.getHits();
        // Typed iterator so that next() returns a SearchHit without a cast
        Iterator<SearchHit> hitsIter = searchHits.iterator();
        while (hitsIter.hasNext()) {
            SearchHit searchHit = hitsIter.next();
            System.out.println("Duplicate:" + searchHit.getId()
                    + " URL:" + searchHit.getFields().get("url").getValue()
                    + " score:" + searchHit.getScore());
        }
    }

And the results of running this for each of the 10 URLs are:

Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_logic
Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.3335998
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.16319205
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.13035104
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.12292466
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.117023855

Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_statistics
Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.1570246
Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.1498403
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.09323166
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.09279101
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.08606046

Duplicates for link:http://en.wikipedia.org/wiki/Formal_science
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.12439237
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.11299215
Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.107585154
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.07795183
Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.076521285

Duplicates for link:http://en.wikipedia.org/wiki/Star
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.21684575
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.15316588
Duplicate:vFf9IdJyQ-yfPnqzYRm9Ig URL:http://en.wikipedia.org/wiki/Cosmology score:0.123572096
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.1177105
Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.11373919

Duplicates for link:http://en.wikipedia.org/wiki/Chemistry
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.13033955
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.121021904
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.10888695
Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.09392845
Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.059616603

Duplicates for link:http://en.wikipedia.org/wiki/Analytical_chemistry
Duplicate:WTP_O31LQZuc7yVfw6I14g URL:http://en.wikipedia.org/wiki/Analytical_chemistry score:0.2502811
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.17005277
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.14266267
Duplicate:u52_jEt6T3iTc3rKqYcW6w URL:http://en.wikipedia.org/wiki/Biochemistry score:0.13861096
Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.11827134

Duplicates for link:http://en.wikipedia.org/wiki/Crystallography
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.219179
Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.131886
Duplicate:WTP_O31LQZuc7yVfw6I14g URL:http://en.wikipedia.org/wiki/Analytical_chemistry score:0.1229952
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.119422816
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.11649642

Duplicates for link:http://en.wikipedia.org/wiki/Cosmology
Duplicate:vFf9IdJyQ-yfPnqzYRm9Ig URL:http://en.wikipedia.org/wiki/Cosmology score:0.24200413
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.16012353
Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.13237886
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.12711151
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.11265415

Duplicates for link:http://en.wikipedia.org/wiki/Biochemistry
Duplicate:u52_jEt6T3iTc3rKqYcW6w URL:http://en.wikipedia.org/wiki/Biochemistry score:0.15852709
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.14897613
Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.11837863
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.115204185
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.10719262

Duplicates for link:http://en.wikipedia.org/wiki/Astrochemistry
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.17403923
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.15946785
Duplicate:O9d4ZfcHSJWfTjdrb7N68g URL:http://en.wikipedia.org/wiki/Astrochemistry score:0.15717393
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.15676315
Duplicate:u52_jEt6T3iTc3rKqYcW6w URL:http://en.wikipedia.org/wiki/Biochemistry score:0.13150169

If I extract the text from the HTML page, the results are better, but still
not that good (between 0.29 and 0.44). I suppose there is some problem with my
code here and that's not the best Elasticsearch can do :(

On Tuesday, 6 May 2014 02:17:57 UTC-7, Alex Ksikes wrote:



(Alex Ksikes) #6

Hi Zoran,

In a nutshell, 'more like this' creates a large boolean disjunctive query of
up to 'max_query_terms' interesting terms taken from the text specified in
'like_text'. The interesting terms are picked with respect to their
tf-idf scores in the whole corpus. This selection can be tuned
with the 'min_term_freq', 'min_doc_freq', and 'max_doc_freq' parameters. The
number of boolean clauses that must match is controlled by
'percent_terms_to_match'. When only one field is specified in
'fields', the analyzer used to pick the terms in 'like_text' is the one
associated with that field, unless overridden by 'analyzer'. So, as
an example, the default is to create a boolean query of 25 interesting
terms where only 30% of the should clauses must match.
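
The selection described above can be sketched in a few lines. This is a simplified illustration, not the actual Lucene or Elasticsearch implementation: the regex tokenizer stands in for the field's analyzer, the idf formula is an assumed textbook variant, and the class and method names are hypothetical. It also shows one reason scores for identical documents need not be 1.0: only the selected terms take part in the query, and the score is an unnormalized sum over the clauses that match.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of how 'more like this' picks its query terms.
final class MltSketch {

    // Ranks the terms of likeText by tf * idf and keeps at most maxQueryTerms of them.
    static List<String> interestingTerms(String likeText, Map<String, Integer> docFreq,
            int numDocs, int maxQueryTerms, int minTermFreq, int minDocFreq, int maxDocFreq) {
        // Count term frequencies in the input text (crude tokenization, assumed).
        Map<String, Integer> termFreq = new HashMap<>();
        for (String token : likeText.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                termFreq.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, Integer> e : termFreq.entrySet()) {
            int tf = e.getValue();
            int df = docFreq.getOrDefault(e.getKey(), 0);
            // Drop terms unknown to the corpus or outside the frequency bounds.
            if (df == 0 || tf < minTermFreq || df < minDocFreq || df > maxDocFreq) {
                continue;
            }
            double idf = Math.log((double) numDocs / df) + 1.0; // simplified idf
            scored.add(new AbstractMap.SimpleEntry<String, Double>(e.getKey(), tf * idf));
        }
        // Highest score first; ties broken alphabetically to keep the order stable.
        scored.sort((x, y) -> {
            int byScore = Double.compare(y.getValue(), x.getValue());
            return byScore != 0 ? byScore : x.getKey().compareTo(y.getKey());
        });
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < scored.size() && i < maxQueryTerms; i++) {
            terms.add(scored.get(i).getKey());
        }
        return terms;
    }

    // Number of should clauses that must match, truncated toward zero.
    static int requiredMatches(int clauseCount, double percentTermsToMatch) {
        return (int) (clauseCount * percentTermsToMatch);
    }
}
```

With the defaults mentioned above, requiredMatches(25, 0.3) gives 7, so a document only has to contain 7 of the 25 selected terms to be returned at all.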

On Wednesday, May 7, 2014 5:14:11 AM UTC+2, Zoran Jeremic wrote:

Hi Alex,

If you are looking for exact duplicates then hashing the file content, and
doing a search for that hash would do the job.
This trick won't work for me as these are not exact duplicates. For
example, I have 10 students working on the same 100 pages long word
document. Each of these students could change only one sentence and upload
a document. The hash will be different, but it's 99,99 % same documents.
I have the other service that uses mlt_like_text to recommend some
relevant documents, and my problem is if this document has best score, then
all duplicates will be among top hits and instead recommending users with
several most relevant documents I will recommend 10 instances of same
document.

Could you please define "relevant" in your setting? In a corpus of very
similar documents, is your goal to find the ones which are oddly different?
Have you looked into ES significant terms?

If you are looking for near duplicates, then I would recommend extracting
whatever text you have in your html, pdf, doc, indexing that and running
more like this with like_text set to that content.
I tried that as well, and results are very disappointing, though I'm not
sure if that would be good idea having in mind that long textual documents
could be used. For testing purpose, I made a simple test with 10 web pages.
Maybe I'm making some mistake there. What I did is to index 10 web pages
and store it in document as attachment. Content is stored as byte[]. Then
I'm using the same 10 pages, extract content using Jsoup, and try to find
similar web pages. Here is the code that I used to find similar web pages
to the provided one:
    System.out.println("Duplicates for link:" + link);
    System.out.println("************************************************");
    String indexName = ESIndexNames.INDEX_DOCUMENTS;
    String indexType = ESIndexTypes.DOCUMENT;
    String mapping = copyToStringFromClasspath(
            "/org/prosolo/services/indexing/document-mapping.json");
    client.admin().indices().putMapping(
            putMappingRequest(indexName).type(indexType).source(mapping)).actionGet();
    URL url = new URL(link);
    // Fetch the page; doc.html() keeps the markup, doc.text() would keep only the text
    org.jsoup.nodes.Document doc = Jsoup.connect(link).get();
    String html = doc.html(); //doc.text();
    // create the query
    QueryBuilder qb = QueryBuilders.moreLikeThisQuery("file")
            .likeText(html).minTermFreq(0).minDocFreq(0);
    SearchResponse sr = client.prepareSearch(ESIndexNames.INDEX_DOCUMENTS)
            .setQuery(qb).addFields("url", "title", "contentType")
            .setFrom(0).setSize(5).execute().actionGet();
    if (sr != null) {
        SearchHits searchHits = sr.getHits();
        // Typed iterator so that next() returns a SearchHit without a cast
        Iterator<SearchHit> hitsIter = searchHits.iterator();
        while (hitsIter.hasNext()) {
            SearchHit searchHit = hitsIter.next();
            System.out.println("Duplicate:" + searchHit.getId()
                    + " URL:" + searchHit.getFields().get("url").getValue()
                    + " score:" + searchHit.getScore());
        }
    }

And the results of executing this for each of the 10 URLs are:

Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_logic

Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.3335998
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.16319205
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.13035104
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.12292466
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.117023855

Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_statistics

Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.1570246
Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.1498403
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.09323166
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.09279101
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.08606046

Duplicates for link:http://en.wikipedia.org/wiki/Formal_science

Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.12439237
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.11299215
Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.107585154
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.07795183
Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.076521285

Duplicates for link:http://en.wikipedia.org/wiki/Star

Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.21684575
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.15316588
Duplicate:vFf9IdJyQ-yfPnqzYRm9Ig URL:http://en.wikipedia.org/wiki/Cosmology score:0.123572096
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.1177105
Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.11373919

Duplicates for link:http://en.wikipedia.org/wiki/Chemistry

Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.13033955
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.121021904
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:<span style="colo

Here you should probably strip the html tags, and solely index the text in
its own field.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c30400c5-ce33-4cb7-9335-759b3923ae14%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Zoran Jeremic) #7

Hi Alex,

Thank you for this explanation. It really helped me understand how it
works, and I managed to get the results I was expecting just by setting
max_query_terms to 0 or some very high value. With these results I was
able to identify duplicates in my tests. I noticed a couple of things,
though.

  • I got much better results with web pages when I indexed the attachment
    as HTML source and used text extracted by Jsoup in the query than when I
    indexed text extracted from the web page as the attachment and used text
    in the query. I suppose the difference is that Jsoup does not extract
    text in the same way as the Tika parser used by ES.
  • The results improved significantly in the second test, where I indexed
    50 web pages, compared with the first test, where I indexed 10. I deleted
    the index before each test. I suppose this is related to tf*idf.
    If so, does it make sense to provide a training set for elasticsearch
    that populates the index before the system goes into use?

Could you please define "relevant" in your setting? In a corpus of very
similar documents, is your goal to find the ones which are oddly different?
Have you looked into ES significant terms?
I have a service that recommends documents to students based on their
current learning context. It creates a tokenized string from the titles,
descriptions and keywords of the course lessons the student is currently
working on. I use this string as input to mlt_like_text to find
interesting resources that could help them.
I want to avoid having duplicates (or very similar documents) among the top
recommended documents.
My idea was that when a document is uploaded (before I index it with
elasticsearch) I check whether a duplicate of it already exists, and store
this information in a field of the ES document. Later, in the query, I can
specify that duplicates are not to be recommended.
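The idea above can be sketched in plain Java, with an in-memory list standing in for the index. Everything here is an assumption for illustration: the class DuplicateFlagSketch, the field name duplicateOf, the threshold, and the word-set Jaccard similarity, which merely stands in for whatever score (for example from more like this) is used in practice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory stand-in for "flag at upload, filter at query time".
final class DuplicateFlagSketch {

    static final class Doc {
        final String id;
        final String text;
        String duplicateOf; // hypothetical field: id of the earlier near-identical doc, or null

        Doc(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    private final List<Doc> index = new ArrayList<>();
    private final double threshold;

    DuplicateFlagSketch(double threshold) {
        this.threshold = threshold;
    }

    /** Word-set Jaccard similarity; a placeholder for the real similarity score. */
    static double similarity(String a, String b) {
        Set<String> wa = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> wb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> inter = new HashSet<>(wa);
        inter.retainAll(wb);
        Set<String> union = new HashSet<>(wa);
        union.addAll(wb);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    /** At upload time: compare against non-duplicate docs and record the first match. */
    Doc upload(String id, String text) {
        Doc doc = new Doc(id, text);
        for (Doc existing : index) {
            if (existing.duplicateOf == null && similarity(existing.text, text) >= threshold) {
                doc.duplicateOf = existing.id;
                break;
            }
        }
        index.add(doc);
        return doc;
    }

    /** At recommendation time: only docs that are not flagged as duplicates. */
    List<String> recommendable() {
        List<String> ids = new ArrayList<>();
        for (Doc d : index) {
            if (d.duplicateOf == null) {
                ids.add(d.id);
            }
        }
        return ids;
    }
}
```

In Elasticsearch the flag would be an ordinary field on the document and the exclusion a filter on that field in the recommendation query; the sketch only shows the bookkeeping.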

Here you should probably strip the html tags, and solely index the text in
its own field.
As I already mentioned, this didn't give me good results for some reason.

Do you think this approach would work well with large textual documents,
e.g. PDF documents a couple of hundred pages long? My main concern is the
performance of these queries using like_text, which is why I was trying to
avoid this approach and use mlt with a document id as input.

Thanks,
Zoran

On Wednesday, 7 May 2014 06:14:56 UTC-7, Alex Ksikes wrote:

Hi Zoran,

In a nutshell, 'more like this' creates a large boolean disjunctive query
of up to 'max_query_terms' interesting terms from the text specified in
'like_text'. The interesting terms are picked according to their tf-idf
scores in the whole corpus. This selection can be tuned with the
'min_term_freq', 'min_doc_freq', and 'max_doc_freq' parameters. The
number of boolean clauses that must match is controlled by
'percent_terms_to_match'. When only one field is specified in 'fields',
the analyzer used to pick up the terms in 'like_text' is the one
associated with that field, unless another is specified by 'analyzer'. So,
as an example, the default is to create a boolean query of 25 interesting
terms where only 30% of the should clauses must match.
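The selection step described above can be illustrated with a rough sketch in plain Java. This is not the Elasticsearch or Lucene implementation: the class MoreLikeThisSketch, its simple tokenizer, and the idf formula 1 + ln(N / (df + 1)) are assumptions chosen to mirror the description, and the rounding in minimumShouldMatch is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names and formulas are assumptions, not the ES code.
final class MoreLikeThisSketch {

    /** Lower-cases the text and splits it on non-word characters. */
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** Picks up to maxQueryTerms terms of likeText, ranked by tf * idf over the corpus. */
    static List<String> interestingTerms(String likeText, List<String> corpus,
                                         int maxQueryTerms, int minTermFreq, int minDocFreq) {
        // term frequency inside like_text
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokenize(likeText)) {
            tf.merge(t, 1, Integer::sum);
        }
        // document frequency across the corpus
        Map<String, Integer> df = new HashMap<>();
        for (String doc : corpus) {
            for (String t : new HashSet<>(tokenize(doc))) {
                df.merge(t, 1, Integer::sum);
            }
        }
        // score the terms that pass the frequency cut-offs
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int docFreq = df.getOrDefault(e.getKey(), 0);
            if (e.getValue() < minTermFreq || docFreq < minDocFreq) {
                continue;
            }
            double idf = 1.0 + Math.log((double) corpus.size() / (docFreq + 1));
            score.put(e.getKey(), e.getValue() * idf);
        }
        // highest score first, ties broken alphabetically
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(score.entrySet());
        ranked.sort((x, y) -> {
            int c = Double.compare(y.getValue(), x.getValue());
            return c != 0 ? c : x.getKey().compareTo(y.getKey());
        });
        List<String> terms = new ArrayList<>();
        for (int i = 0; i < ranked.size() && i < maxQueryTerms; i++) {
            terms.add(ranked.get(i).getKey());
        }
        return terms;
    }

    /** Number of should clauses that must match, rounded down (assumed rounding). */
    static int minimumShouldMatch(int clauses, double percentTermsToMatch) {
        return (int) (clauses * percentTermsToMatch);
    }
}
```

With only 25 terms kept from a long page, two copies of the same page are compared on a small slice of their vocabulary, which may be why raising max_query_terms changes the results so much.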



(Alex Ksikes) #8

On May 8, 2014 8:09 AM, "Zoran Jeremic" zoran.jeremic@gmail.com wrote:

Hi Alex,

Thank you for this explanation. It really helped me understand how it
works, and I managed to get the results I was expecting just by setting
max_query_terms to 0 or some very high value. With these results I was
able to identify duplicates in my tests. I noticed a couple of things,
though.

  • I got much better results with web pages when I indexed the attachment
    as HTML source and used text extracted by Jsoup in the query than when I
    indexed text extracted from the web page as the attachment and used text
    in the query. I suppose the difference is that Jsoup does not extract
    text in the same way as the Tika parser used by ES.
  • The results improved significantly in the second test, where I indexed
    50 web pages, compared with the first test, where I indexed 10. I deleted
    the index before each test. I suppose this is related to tf*idf.
    If so, does it make sense to provide a training set for elasticsearch
    that populates the index before the system goes into use?

Perhaps you are asking for a background dataset to bias the selection of
interesting terms. This could make sense depending on your application.


Do you think this approach would work well with large textual documents,
e.g. PDF documents a couple of hundred pages long? My main concern is the
performance of these queries using like_text, which is why I was trying to
avoid this approach and use mlt with a document id as input.

I don't think this approach would work well in this case, but you should
try. I think what you are after is either extracting good features from
your PDF documents and searching on those, or fingerprinting. This could be
achieved by playing with analyzers.
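As one possible reading of "fingerprinting", here is a minimal MinHash sketch in plain Java. It is an assumption about what such a fingerprint could look like, not something Elasticsearch provides through this code: the class MinHashSketch, the number of hash functions, and the use of 3-word shingles are all hypothetical choices. Each document is reduced to a short signature, and the fraction of equal signature positions estimates how much two documents overlap.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical MinHash fingerprint, for illustration only.
final class MinHashSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1
    private final long[] a;
    private final long[] b;

    /** Creates numHashes random linear hash functions h(x) = (a*x + b) mod PRIME. */
    MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt((int) PRIME - 1);
            b[i] = rnd.nextInt((int) PRIME);
        }
    }

    /** Set of k-word shingles of the lower-cased text. */
    static Set<String> shingles(String text, int k) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + k)));
        }
        return out;
    }

    /** Signature: for each hash function, the minimum hash value over all shingles. */
    long[] signature(Set<String> shingles) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String s : shingles) {
            long x = s.hashCode() & 0x7fffffffL; // non-negative base hash
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    /** Fraction of positions where two signatures agree; estimates shingle overlap. */
    static double estimate(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                same++;
            }
        }
        return (double) same / s1.length;
    }
}
```

The signature has a fixed size regardless of document length, so comparing a 200-page PDF against stored fingerprints costs the same as comparing a short web page.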



(Zoran Jeremic) #9

Thank you Alex. At the moment it works fine even with large documents, but
I'll test if I can reach similar results with interesting terms.

Best,
Zoran

On Thursday, 8 May 2014 02:02:24 UTC-7, Alex Ksikes wrote:

On May 8, 2014 8:09 AM, "Zoran Jeremic" <zoran....@gmail.com <javascript:>>
wrote:

Hi Alex,

Thank you for this explanation. This really helped me to understand how
it works, and now I managed to get results I was expecting just after
setting max_query_terms value to be 0 or some very high value. With these
results in my tests I was able to identify duplicates. I noticed couple of
things though.

  • I got much better results with web pages when I indexed attachment as
    html source and use text extracted by Jsoup in query, then when I indexed
    text extracted from web page as attachment and used text in query. I
    suppose that difference is related to the fact that Jsoup did not extract
    text in the same way as Tika parser used by ES did.
  • There was significant improvement in the results in the second test
    when I have indexed 50 web pages, then in first test when I indexed 10 web
    pages. I deleted index before each test. I suppose that this is related to
    the tf*idf.
    If so, does it make sense to provide some training set for elasticsearch
    that will be used to populate index before system is started to be used?

Perhaps you are asking for a background dataset to bias the selection of
interesting terms. This could make sense depending on your application.

Could you please define "relevant" in your setting? In a corpus of very
similar documents, is your goal to find the ones which are oddly different?
Have you looked into ES significant terms?
I have the service that recommends documents to the students based on
their current learning context. It creates tokenized string from titles,
descriptions and keywords of the course lessons student is working at the
moment. I'm using this string as input to the mlt_like_text to find some
interesting resources that could help them.
I want to avoid having duplicates (or very similar documents) among top
documents that are recommended.
My idea was that during the documents uploading (before I index it with
elasticsearch) I find if there already exists it's duplicate, and store
this information as ES document field. Later, in query I can specify that
duplicates are not recommended.

Here you should probably strip the html tags, and solely index the text
in its own field.
As I already mentioned this didn't give me good results for some reason.

Do you think this approach would work fine with large textual documents,
e.g. pdf documents having couple of hundred of pages? My main concern is
related to performances of these queries using like_text, so that's why I
was trying to avoid this approach and use mlt with document id as input.

I don't think this approach would work well in this case, but you should
try. I think what you are after is to either extract good features for your
PDF documents and search on that, or finger printing. This could be
achieved by playing with analyzers.

Thanks,
Zoran

On Wednesday, 7 May 2014 06:14:56 UTC-7, Alex Ksikes wrote:

Hi Zoran,

In a nutshell 'more like this' creates a large boolean disjunctive
query of 'max_query_terms' number of interesting terms from a text
specified in 'like_text'. The interesting terms are picked up with respect
to the their tf-idf scores in the whole corpus. These later parameters
could be tuned with 'min_term_freq', 'min_doc_freq', and 'min_doc_freq'
parameters. The number of boolean clauses that must match is controlled by
'percent_terms_to_match'. In the case of specifying only one field in
'fields', the analyzer used to pick up the terms in 'like_text' is the one
associated with the field, unless specified specified by 'analyzer'. So as
an example, the default is to create a boolean query of 25 interesting
terms where only 30% of the should clauses must match.

On Wednesday, May 7, 2014 5:14:11 AM UTC+2, Zoran Jeremic wrote:

Hi Alex,

If you are looking for exact duplicates then hashing the file content,
and doing a search for that hash would do the job.

This trick won't work for me as these are not exact duplicates. For
example, I have 10 students working on the same 100 pages long word
document. Each of these students could change only one sentence and upload
a document. The hash will be different, but it's 99,99 % same documents.

I have the other service that uses mlt_like_text to recommend some
relevant documents, and my problem is if this document has best score, then
all duplicates will be among top hits and instead recommending users with
several most relevant documents I will recommend 10 instances of same
document.

Could you please define "relevant" in your setting? In a corpus of very
similar documents, is your goal to find the ones which are oddly different?
Have you looked into ES significant terms?

If you are looking for near duplicates, then I would recommend
extracting whatever text you have in your html, pdf, doc, indexing that and
running more like this with like_text set to that content.

I tried that as well, and results are very disappointing, though I'm
not sure if that would be good idea having in mind that long textual
documents could be used. For testing purpose, I made a simple test with 10
web pages. Maybe I'm making some mistake there. What I did is to index 10
web pages and store it in document as attachment. Content is stored as
byte[]. Then I'm using the same 10 pages, extract content using Jsoup, and
try to find similar web pages. Here is the code that I used to find similar
web pages to the provided one:

System.out.println("Duplicates for link:"+link);

System.out.println("************************************************");

         String indexName=ESIndexNames.INDEX_DOCUMENTS;
         String indexType=ESIndexTypes.DOCUMENT;
         String mapping = 

copyToStringFromClasspath("/org/prosolo/services/indexing/document-mapping.json");

client.admin().indices().putMapping(putMappingRequest(indexName).type(indexType).source(mapping)).actionGet();

         URL url = new URL(link);
        org.jsoup.nodes.Document doc=Jsoup.connect(link).get();
          String html=doc.html(); //doc.text();
         QueryBuilder qb = null;
         // create the query
         qb = QueryBuilders.moreLikeThisQuery("file")
                 .likeText(html).minTermFreq(0).minDocFreq(0);
         SearchResponse sr = 

client.prepareSearch(ESIndexNames.INDEX_DOCUMENTS)

                 .setQuery(qb).addFields("url", "title", 

"contentType")

                 .setFrom(0).setSize(5).execute().actionGet();
         if (sr != null) {
             SearchHits searchHits = sr.getHits();
             Iterator<SearchHit> hitsIter = searchHits.iterator();
             while (hitsIter.hasNext()) {
                 SearchHit searchHit = hitsIter.next();
                 System.out.println("Duplicate:" + 

searchHit.getId()

                         + " 

title:"+searchHit.getFields().get("url").getValue()+" score:" +
searchHit.getScore());

                  }
         }

And results of the execution of this for each of 10 urls is:

Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_logic


Duplicate:Crwk_36bTUCEso1ambs0bA URL:
http://en.wikipedia.org/wiki/Mathematical_logic score:0.3335998

Duplicate:--3l-WRuQL2osXg71ixw7A URL:
http://en.wikipedia.org/wiki/Chemistry score:0.16319205

Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:
http://en.wikipedia.org/wiki/Formal_science score:0.13035104

Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Starscore:0.12292466
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:
http://en.wikipedia.org/wiki/Crystallography score:0.117023855

Duplicates for link:http://en.wikipedia.org/wiki/Mathematical_statistics

Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.1570246
Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.1498403
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.09323166
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.09279101
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.08606046

Duplicates for link:http://en.wikipedia.org/wiki/Formal_science

Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:http://en.wikipedia.org/wiki/Formal_science score:0.12439237
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.11299215
Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.107585154
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.07795183
Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:http://en.wikipedia.org/wiki/Mathematical_statistics score:0.076521285

Duplicates for link:http://en.wikipedia.org/wiki/Star

Duplicate:1APeDW0KQnWRv_8mihrz4A URL:http://en.wikipedia.org/wiki/Star score:0.21684575
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.15316588
Duplicate:vFf9IdJyQ-yfPnqzYRm9Ig URL:http://en.wikipedia.org/wiki/Cosmology score:0.123572096
Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.1177105
Duplicate:Crwk_36bTUCEso1ambs0bA URL:http://en.wikipedia.org/wiki/Mathematical_logic score:0.11373919

Duplicates for link:http://en.wikipedia.org/wiki/Chemistry

Duplicate:--3l-WRuQL2osXg71ixw7A URL:http://en.wikipedia.org/wiki/Chemistry score:0.13033955
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:http://en.wikipedia.org/wiki/Crystallography score:0.121021904
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:

Here you should probably strip the HTML tags and index only the text,
in its own field.
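
To illustrate the suggestion above, here is a rough sketch of reducing a page to its visible text before indexing it or passing it to likeText. Since the original code already uses Jsoup, calling doc.text() (the call that is commented out there) is probably the simpler route; the regex version below is only a dependency-free stand-in. The class name HtmlTextSketch and the method stripTags are made up for this example, and the regexes are naive: they do not decode entities such as &amp; and will not cope with every kind of malformed markup.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Elasticsearch or Jsoup.
final class HtmlTextSketch {
    // <script> and <style> elements are dropped together with their contents.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Any remaining tag, e.g. <p>, </div>, <br/>.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Returns the text between tags, with runs of whitespace collapsed to one space.
    static String stripTags(String html) {
        String noScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String noTags = TAG.matcher(noScripts).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
                + "<body><p>Linguistics is the <b>scientific</b> study of language.</p></body></html>";
        // prints "Linguistics is the scientific study of language."
        System.out.println(stripTags(html));
    }
}
```

The stripped string could then go into a plain string field of its own, and the same stripping would be applied to the text given to the more-like-this query, so that markup shared by every Wikipedia page no longer contributes terms to the similarity score.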



(system) #10