Similar document search?

I'm trying to implement similar-document search with Elasticsearch (ES).

Let's say I have the following documents:

Document 1:     0 0 0 0 1 1 3 3 4 4 4
Document 2:     0 1 1 3 4
Document 3:     0 0 0 0 0 0 0 0 0 0 1 1 3 3 4 4 4
Query Document: 0 0 0 0 0 1 1 3 3 4 4 4

If I search with "Query Document", the first result should be Document 1, because its "bag of words" (term vector) representation is the closest to the query's.
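
To make "closest" concrete, here is a minimal sketch of the ranking I expect. It assumes cosine similarity over raw term counts as the distance measure, which is only one possible reading of "closest bag of words". The `cosine` helper and the `docs` dictionary are illustrative names, not part of my indexing code, and the third document is called "Document 3" here to keep the names distinct.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two whitespace-tokenized strings, using raw term counts."""
    ca, cb = Counter(a.split()), Counter(b.split())
    # Counter returns 0 for missing terms, so unmatched terms add nothing.
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)


docs = {
    "Document 1": "0 0 0 0 1 1 3 3 4 4 4",
    "Document 2": "0 1 1 3 4",
    "Document 3": "0 0 0 0 0 0 0 0 0 0 1 1 3 3 4 4 4",
}
query = "0 0 0 0 0 1 1 3 3 4 4 4"

# Rank document names from most to least similar to the query.
ranking = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
for name in ranking:
    print(name, round(cosine(query, docs[name]), 4))
```

Under that assumption, Document 1 ranks first, followed by Document 3 and then Document 2.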

Is this possible with ES?

So far, this is what I've tried without much success:

Here's how I create the index:

        settings = {
            'index': {
                'similarity': {
                    'my_similarity': {
                        'type': 'classic',
                    },
                },
            },
            'analysis': {
                'filter': {},
                'analyzer': {
                    'visual_word': {
                        'type': 'custom',
                        'tokenizer': 'whitespace',
                        'filter': [],
                        'similarity': 'my_similarity',
                    }
                }
            }
        }
        mappings = {
            doc_type: {
                '_source': {
                    'enabled': False,
                },
                'properties': {
                    'filename': {
                        'type': 'keyword',
                        # 'index': 'not_analyzed',
                    },
                    'visual_words': {
                        'type': 'text',
                        'analyzer': 'visual_word',
                        'term_vector': 'yes',
                    },
                },
            },
        }

        es.indices.create(
            index='myindex',
            body={
                'mappings': mappings,
                'settings': settings,
            },
        )

Here's how I do my search query:

        query = {
            'query': {
                'match': {
                    'visual_words': {
                        'query': '0 0 0 0 0 1 1 3 3 4 4 4',
                    }
                }
            }
        }
        res = es.search(
            index='myindex',
            doc_type='image',
            body=query,
            explain=True,
        )

That strategy gives unexpected results. For example, querying with an exact copy of an indexed document does not even return that document as the first result.
