Hi Alex,
If you are looking for exact duplicates then hashing the file content, and
doing a search for that hash would do the job.
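(For illustration, a minimal sketch of the content-hash idea in plain Java. The hashing part below is self-contained; the "contentHash" field name and the term-query lookup mentioned in the comments are assumptions for illustration, not part of the original setup:)

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ContentHash {

    // Hex-encoded SHA-256 of the raw file bytes. Two byte-identical
    // uploads always produce the same hash, so storing this in a
    // (hypothetical) not_analyzed "contentHash" field and doing an
    // exact term search on it finds exact duplicates.
    public static String sha256Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should always be available", e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = "some file content".getBytes(StandardCharsets.UTF_8);
        // Identical bytes hash identically; one changed byte changes the hash.
        System.out.println(sha256Hex(doc).equals(sha256Hex(doc.clone()))); // prints "true"
    }
}
```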
This trick won't work for me, as these are not exact duplicates. For
example, I have 10 students working on the same 100-page Word document.
Each of these students could change only one sentence and upload the
document. The hash will be different, but the documents are 99.99% the same.
I have another service that uses more_like_this with like_text to recommend
relevant documents, and my problem is that if such a document gets the best
score, then all of its duplicates will be among the top hits, so instead of
recommending several distinct relevant documents I will recommend 10
instances of the same document.
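(As an aside: for the "99.99% the same" case, one quick way to confirm that two texts are near-duplicates, independently of Elasticsearch scoring, is Jaccard similarity over word shingles. This is not what more_like_this computes internally; it is just an illustrative, hypothetical sketch:)

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class NearDuplicate {

    // All overlapping w-word windows ("shingles") in the text.
    public static Set<String> shingles(String text, int w) {
        String[] words = text.toLowerCase().split("\\s+");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + w <= words.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(words, i, i + w)));
        }
        return out;
    }

    // Jaccard similarity: |A ∩ B| / |A ∪ B|. Values near 1.0 mean the
    // two shingle sets (and hence the texts) are near-duplicates.
    public static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // One changed word out of twelve: most shingles are still shared.
        String original = "the quick brown fox jumps over the lazy dog every single day";
        String edited   = "the quick brown fox jumps over the lazy cat every single day";
        double sim = jaccard(shingles(original, 3), shingles(edited, 3));
        System.out.println(sim > 0.5); // prints "true"
    }
}
```

A score threshold (say, above 0.9) could then flag a newly uploaded document as a near-duplicate before it ever enters the recommendation results.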
If you are looking for near duplicates, then I would recommend extracting
whatever text you have in your html, pdf, doc, indexing that and running
more like this with like_text set to that content.
I tried that as well, and the results are very disappointing, though I'm
not sure it would be a good idea anyway, given that long textual documents
could be used. For testing purposes, I made a simple test with 10 web
pages. Maybe I'm making some mistake there. What I did was index 10 web
pages and store each in a document as an attachment; the content is stored
as a byte array. Then, for the same 10 pages, I extract the content using
Jsoup and try to find similar web pages. Here is the code I used to find
pages similar to the provided one:
System.out.println("Duplicates for link: " + link);
System.out.println("************************************************");

String indexName = ESIndexNames.INDEX_DOCUMENTS;
String indexType = ESIndexTypes.DOCUMENT;

// Apply the document mapping before querying.
String mapping = copyToStringFromClasspath(
        "/org/prosolo/services/indexing/document-mapping.json");
client.admin().indices()
        .putMapping(putMappingRequest(indexName).type(indexType).source(mapping))
        .actionGet();

// Fetch the page and use its raw HTML as the like_text.
org.jsoup.nodes.Document doc = Jsoup.connect(link).get();
String html = doc.html(); // doc.text() would use the extracted text instead

// Build the more_like_this query against the "file" field.
QueryBuilder qb = QueryBuilders.moreLikeThisQuery("file")
        .likeText(html)
        .minTermFreq(0)
        .minDocFreq(0);

SearchResponse sr = client.prepareSearch(ESIndexNames.INDEX_DOCUMENTS)
        .setQuery(qb)
        .addFields("url", "title", "contentType")
        .setFrom(0).setSize(5)
        .execute().actionGet();

if (sr != null) {
    for (SearchHit searchHit : sr.getHits()) {
        System.out.println("Duplicate:" + searchHit.getId()
                + " URL:" + searchHit.getFields().get("url").getValue()
                + " score:" + searchHit.getScore());
    }
}
And the results of executing this for each of the 10 URLs are:
Duplicates for link:Mathematical logic - Wikipedia
Duplicate:Crwk_36bTUCEso1ambs0bA URL:Mathematical logic - Wikipedia
score:0.3335998
Duplicate:--3l-WRuQL2osXg71ixw7A URL:Chemistry - Wikipedia
score:0.16319205
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:Formal science - Wikipedia
score:0.13035104
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:Star - Wikipedia
score:0.12292466
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:Crystallography - Wikipedia
score:0.117023855
Duplicates for link:Mathematical statistics - Wikipedia
Duplicate:Crwk_36bTUCEso1ambs0bA URL:Mathematical logic - Wikipedia
score:0.1570246
Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:Mathematical statistics - Wikipedia
score:0.1498403
Duplicate:--3l-WRuQL2osXg71ixw7A URL:Chemistry - Wikipedia
score:0.09323166
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:Star - Wikipedia
score:0.09279101
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:Formal science - Wikipedia
score:0.08606046
Duplicates for link:Formal science - Wikipedia
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:Formal science - Wikipedia
score:0.12439237
Duplicate:--3l-WRuQL2osXg71ixw7A URL:Chemistry - Wikipedia
score:0.11299215
Duplicate:Crwk_36bTUCEso1ambs0bA URL:Mathematical logic - Wikipedia
score:0.107585154
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:Crystallography - Wikipedia
score:0.07795183
Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:Mathematical statistics - Wikipedia
score:0.076521285
Duplicates for link:Star - Wikipedia
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:Star - Wikipedia
score:0.21684575
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:Crystallography - Wikipedia
score:0.15316588
Duplicate:vFf9IdJyQ-yfPnqzYRm9Ig URL:Cosmology - Wikipedia
score:0.123572096
Duplicate:--3l-WRuQL2osXg71ixw7A URL:Chemistry - Wikipedia
score:0.1177105
Duplicate:Crwk_36bTUCEso1ambs0bA URL:Mathematical logic - Wikipedia
score:0.11373919
Duplicates for link:Chemistry - Wikipedia
Duplicate:--3l-WRuQL2osXg71ixw7A URL:Chemistry - Wikipedia
score:0.13033955
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:Crystallography - Wikipedia
score:0.121021904
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:Formal science - Wikipedia
score:0.10888695
Duplicate:Crwk_36bTUCEso1ambs0bA URL:Mathematical logic - Wikipedia
score:0.09392845
Duplicate:pPJdo7TAQhWzTdMAHyPWkA URL:Mathematical statistics - Wikipedia
score:0.059616603
Duplicates for link:Analytical chemistry - Wikipedia
Duplicate:WTP_O31LQZuc7yVfw6I14g URL:Analytical chemistry - Wikipedia
score:0.2502811
Duplicate:--3l-WRuQL2osXg71ixw7A URL:Chemistry - Wikipedia
score:0.17005277
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:Crystallography - Wikipedia
score:0.14266267
Duplicate:u52_jEt6T3iTc3rKqYcW6w URL:Biochemistry - Wikipedia
score:0.13861096
Duplicate:Crwk_36bTUCEso1ambs0bA URL:Mathematical logic - Wikipedia
score:0.11827134
Duplicates for link:Crystallography - Wikipedia
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:Crystallography - Wikipedia
score:0.219179
Duplicate:Crwk_36bTUCEso1ambs0bA URL:Mathematical logic - Wikipedia
score:0.131886
Duplicate:WTP_O31LQZuc7yVfw6I14g URL:Analytical chemistry - Wikipedia
score:0.1229952
Duplicate:--3l-WRuQL2osXg71ixw7A URL:Chemistry - Wikipedia
score:0.119422816
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:Star - Wikipedia
score:0.11649642
Duplicates for link:Cosmology - Wikipedia
Duplicate:vFf9IdJyQ-yfPnqzYRm9Ig URL:Cosmology - Wikipedia
score:0.24200413
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:Crystallography - Wikipedia
score:0.16012353
Duplicate:Crwk_36bTUCEso1ambs0bA URL:Mathematical logic - Wikipedia
score:0.13237886
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:Star - Wikipedia
score:0.12711151
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:Formal science - Wikipedia
score:0.11265415
Duplicates for link:Biochemistry - Wikipedia
Duplicate:u52_jEt6T3iTc3rKqYcW6w URL:Biochemistry - Wikipedia
score:0.15852709
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:Crystallography - Wikipedia
score:0.14897613
Duplicate:Crwk_36bTUCEso1ambs0bA URL:Mathematical logic - Wikipedia
score:0.11837863
Duplicate:8dDa6HsBS12HrI0XgFVLvA URL:Formal science - Wikipedia
score:0.115204185
Duplicate:--3l-WRuQL2osXg71ixw7A URL:Chemistry - Wikipedia
score:0.10719262
Duplicates for link:http://en.wikipedia.org/wiki/Astrochemistry
Duplicate:2NElV2ULQxqcbFhd2pVy0w URL:Crystallography - Wikipedia
score:0.17403923
Duplicate:--3l-WRuQL2osXg71ixw7A URL:Chemistry - Wikipedia
score:0.15946785
Duplicate:O9d4ZfcHSJWfTjdrb7N68g URL:http://en.wikipedia.org/wiki/Astrochemistry
score:0.15717393
Duplicate:1APeDW0KQnWRv_8mihrz4A URL:Star - Wikipedia
score:0.15676315
Duplicate:u52_jEt6T3iTc3rKqYcW6w URL:Biochemistry - Wikipedia
score:0.13150169
If I extract the text from the HTML page, the results are better, but still
not great (scores between 0.29 and 0.44). I suppose there is some problem
with my code here, and that this isn't the best Elasticsearch can do.
On Tuesday, 6 May 2014 02:17:57 UTC-7, Alex Ksikes wrote:
Hi Zoran,
If you are looking for exact duplicates then hashing the file content, and
doing a search for that hash would do the job. If you are looking for near
duplicates, then I would recommend extracting whatever text you have in
your html, pdf, doc, indexing that and running more like this with
like_text set to that content. Additionally, you can perform an mlt search
on more fields, including the meta-data fields extracted with the
attachment plugin. Hope this helps.
Alex
On Monday, May 5, 2014 8:08:30 PM UTC+2, Zoran Jeremic wrote:
Hi Alex,
Thank you for your explanation. It makes sense now. However, I'm not sure
I understood your proposal.
So I would adjust the mlt_fields accordingly, and possibly extract the
relevant portions of texts manually
What do you mean by adjusting mlt_fields? The only shared field that is
guaranteed to be the same is file. Different users could add different
titles to documents but attach the same or almost the same files. If I
compare documents based on the other fields, they won't necessarily match,
even when the attached files are exactly the same.
I'm also not sure what you meant by extracting the relevant portions of
text manually. How would I do that, and what would I do with the result?
Thanks,
Zoran
On Monday, 5 May 2014 01:23:49 UTC-7, Alex Ksikes wrote:
Hi Zoran,
Using the attachment type, you can text-search over the attached
document's meta-data, but not its actual content, as it is base64 encoded.
So I would adjust the mlt_fields accordingly, and possibly extract the
relevant portions of text manually. Also set percent_terms_to_match = 0,
to ensure that all boolean clauses match. Let me know how this works out
for you.
Cheers,
Alex
On Monday, May 5, 2014 5:50:07 AM UTC+2, Zoran Jeremic wrote:
Hi guys,
I have a document that stores the content of an HTML file, PDF, DOC, or
other textual document in one of its fields as a byte array, using the
attachment plugin. The mapping is as follows:
{
  "document": {
    "properties": {
      "title":         { "type": "string", "store": true },
      "description":   { "type": "string", "store": "yes" },
      "contentType":   { "type": "string", "store": "yes" },
      "url":           { "type": "string", "store": "yes" },
      "visibility":    { "type": "string", "store": "yes" },
      "ownerId":       { "type": "long",   "store": "yes" },
      "relatedToType": { "type": "string", "store": "yes" },
      "relatedToId":   { "type": "long",   "store": "yes" },
      "file": {
        "type": "attachment",
        "path": "full",
        "fields": {
          "author":         { "type": "string" },
          "title":          { "type": "string", "store": true },
          "keywords":       { "type": "string" },
          "file":           { "type": "string", "store": true,
                              "term_vector": "with_positions_offsets" },
          "name":           { "type": "string" },
          "content_length": { "type": "integer" },
          "date":           { "type": "date", "format": "dateOptionalTime" },
          "content_type":   { "type": "string" }
        }
      }
    }
  }
}
And the code I'm using to store the document is:
VisibilityType.PUBLIC
These files seem to be stored fine, and I can search the content. However,
I need to identify whether there are duplicates of web pages or files
stored in ES, so that I don't return the same documents to the user as
search or recommendation results. My expectation was that I could use
MoreLikeThis after a document was indexed to identify whether duplicates of
that document exist, and mark it as a duplicate accordingly. However, the
results look weird to me, or I don't understand very well how MoreLikeThis
works.
For example, I indexed the web page
http://en.wikipedia.org/wiki/Linguistics 3 times, and all 3 documents
in ES have exactly the same binary content under file. Then, for the
following query (where the ID in the path is the id of one of these
documents):
http://localhost:9200/documents/document/WpkcK-ZjSMi_l6iRq0Vuhg/_mlt?mlt_fields=file&min_doc_freq=1
I got these results:
http://en.wikipedia.org/wiki/Linguistics with score 0.6633003
http://en.wikipedia.org/wiki/Linguistics with score 0.6197818
http://en.wikipedia.org/wiki/Computational_linguistics with score
0.48509508
...
For some other examples, the scores for identical documents are much
lower, and sometimes (though not that often) I don't get the duplicates in
the first positions. I would expect a score of 1.0 or higher here for
documents that are exactly the same, but that's not the case, and I can't
figure out how I could identify duplicates in the Elasticsearch index.
I would appreciate it if somebody could explain whether this is expected
behaviour or I didn't use it properly.
Thanks,
Zoran
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c455030b-4256-4214-b713-7925698966ad%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.