I'm trying to implement similar-document search with Elasticsearch (ES).
Let's say I have the following documents:
Document 1: 0 0 0 0 1 1 3 3 4 4 4
Document 2: 0 1 1 3 4
Document 3: 0 0 0 0 0 0 0 0 0 0 1 1 3 3 4 4 4
Query Document: 0 0 0 0 0 1 1 3 3 4 4 4
If I use "Query Document" to search, the first result should be Document 1
because their "bag of words" (term vector) representations are the closest.
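To make "closest" concrete, here is a minimal sketch of the ranking I expect, computed outside ES. It assumes cosine similarity over raw term counts, which is only one possible reading of "closest"; the helper name `cosine` is mine, and the third document listed above is called "Document 3" here.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


docs = {
    'Document 1': '0 0 0 0 1 1 3 3 4 4 4',
    'Document 2': '0 1 1 3 4',
    'Document 3': '0 0 0 0 0 0 0 0 0 0 1 1 3 3 4 4 4',
}
query_text = '0 0 0 0 0 1 1 3 3 4 4 4'

# Whitespace tokenization, mirroring the 'whitespace' tokenizer used in the index.
query_tf = Counter(query_text.split())
scores = {name: cosine(query_tf, Counter(text.split()))
          for name, text in docs.items()}

# Highest similarity first.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['Document 1', 'Document 3', 'Document 2']
```

Under this assumption Document 1 wins because it differs from the query by a single "0", while the third document has five extra zeros.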
Is this possible with ES?
So far, this is what I've tried without much success:
Here's how I create the index:
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local node on the default port
doc_type = 'image'    # same type name as in the search call

settings = {
    'index': {
        'similarity': {
            'my_similarity': {
                'type': 'classic',
            },
        },
    },
    'analysis': {
        'filter': {},
        'analyzer': {
            'visual_word': {
                'type': 'custom',
                'tokenizer': 'whitespace',
                'filter': [],
                'similarity': 'my_similarity',
            }
        }
    }
}
mappings = {
    doc_type: {
        '_source': {
            'enabled': False,
        },
        'properties': {
            'filename': {
                'type': 'keyword',
                # 'index': 'not_analyzed',
            },
            'visual_words': {
                'type': 'text',
                'analyzer': 'visual_word',
                'term_vector': 'yes',
            },
        },
    }
}
es.indices.create(
    index='myindex',
    body={
        'mappings': mappings,
        'settings': settings,
    },
)
Here's how I do my search query:
query = {
    'query': {
        'match': {
            'visual_words': {
                'query': '0 0 0 0 0 1 1 3 3 4 4 4',
            }
        }
    }
}
res = es.search(
    index='myindex',
    doc_type='image',
    body=query,
    explain=True,
)
That strategy gives unexpected results. For example, querying with an exact copy of a document that is in the index does not even return that document as the first result.
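Because the search runs with `explain = True`, each hit should carry an `_explanation` entry next to its `_score`. Below is a rough sketch of how the ranking can be inspected. `summarize_hits` is a hypothetical helper of mine, not part of the client, and it assumes the usual response layout (`res['hits']['hits']`). The `sample` dictionary holds made-up placeholder values, not real output from my index.

```python
def summarize_hits(res: dict) -> list:
    """Return (rank, _id, _score, top-level explanation text) for each hit."""
    rows = []
    for rank, hit in enumerate(res['hits']['hits'], start=1):
        explanation = hit.get('_explanation', {})
        rows.append((rank, hit['_id'], hit['_score'],
                     explanation.get('description', '')))
    return rows


# Placeholder response fragment for illustration only (values are invented).
sample = {
    'hits': {
        'hits': [
            {'_id': 'doc-a', '_score': 2.0,
             '_explanation': {'value': 2.0, 'description': 'sum of:', 'details': []}},
            {'_id': 'doc-b', '_score': 1.0},
        ]
    }
}

for row in summarize_hits(sample):
    print(row)
```

With the real `res` from the query above, this makes it easier to see which document ranks first and which part of the score puts it there.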