Elasticsearch not returning expected results at end of query


(Ben Squire) #1

Hi All,

A little help/understanding would be greatly appreciated.

Here is my index:

    [
        'index' => 'proof',
        'body' => [
            'settings' => [
                'analysis' => [
                    'tokenizer' => [
                        'ngram_tokenizer' => [
                            'type' => 'nGram',
                            'min_gram' => 1,
                            'max_gram' => 20,
                            'token_chars' => ['letter', 'digit'],
                        ],
                    ],
                    'analyzer' => [
                        'ngram_tokenizer_analyzer' => [
                            'type' => 'custom',
                            'tokenizer' => 'ngram_tokenizer',
                            'filter' => ['lowercase'],
                        ]
                    ]
                ]
            ],
            'mappings' => [
                'proof_page' => [
                    'properties' => [
                        'content' => [
                            'type' => 'multi_field',
                            'path' => 'just_name',
                            'fields' => [
                                'content' => [
                                    'type' => 'string',
                                    'analyzer' => 'ngram_tokenizer_analyzer',
                                ],
                                'untouched' => [
                                    'type' => 'string'
                                ]
                            ]
                        ],
                        'page_number' => [
                            'type' => 'integer',
                            'index' => 'not_analyzed',
                        ],
                        'proof_id' => [
                            'type' => 'string',
                            'index' => 'not_analyzed',
                        ],
                    ]
                ]
            ]
        ]
    ]

and here is a sample query:

    [
        'index' => 'proof',
        'type' => 'proof_page',
        'body' => [
            'query' => [
                'filtered' => [
                    'query' => [
                        'match_phrase' => [
                            'content' => [
                                'query' => 'Lorem Ipsum is simply dum',
                                'slop' => 0,
                            ],
                        ],
                    ],
                    'filter' => [
                        'term' => [
                            'proof_id' => '56ebea535f5e8841038b4569',
                        ],
                    ],
                ],
            ],
            '_source' => false,
            'fields' => [
                'proof_id',
                'proof_name',
                'project_id',
                'project_name',
                'page_number',
            ],
            'highlight' => [
                'fields' => [
                    'content' => [
                        'type' => 'plain',
                        'fragment_size' => 100,
                        'number_of_fragments' => 100,
                        'fragmenter' => 'simple',
                    ]
                ]
            ],
            'from' => 0,
            'size' => 10,
            'sort' => [
                'page_number' => [
                    'order' => 'asc',
                ]
            ]
        ]
    ]

And let's assume that one of my documents matching proof_id: 56ebea535f5e8841038b4569 contains something like:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s.

What I'd expect to see is the result returning a snippet with the following highlighted:

Lorem Ipsum is simply dum

But it doesn't return any matches. The same is the case for:

    Lorem Ipsum is simply du
    Lorem Ipsum is simply dumm

However it does return matches for:

    Lorem Ipsum is simply d
    Lorem Ipsum is simply dummy

This makes no sense to me, as I can see every variation of "dummy" in the term vectors (the ngram range of 1 to 20 is big enough to cover all variations).
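To illustrate what I mean by "every variation", here is a minimal PHP sketch that lists the character n-grams of "dummy" for `min_gram` 1 and `max_gram` 20. The `charNgrams` helper is hypothetical and written only for this post; it is not part of Elasticsearch or the PHP client. It mimics the set of terms only, not the positions or offsets Elasticsearch records for each term, so the real tokenizer output is better checked with the `_analyze` API.

```php
<?php
// Hypothetical helper (not an Elasticsearch API): returns every lowercase
// character n-gram of $word whose length is between $minGram and $maxGram.
function charNgrams(string $word, int $minGram, int $maxGram): array
{
    $word = strtolower($word);
    $length = strlen($word);
    $grams = [];
    for ($start = 0; $start < $length; $start++) {
        // Grow the gram from $minGram until it hits $maxGram or the end of the word.
        for ($size = $minGram; $size <= $maxGram && $start + $size <= $length; $size++) {
            $grams[] = substr($word, $start, $size);
        }
    }
    return $grams;
}

$grams = charNgrams('dummy', 1, 20);

// Report whether each partial form of "dummy" used in the queries is among the grams.
foreach (['d', 'du', 'dum', 'dumm', 'dummy'] as $prefix) {
    echo $prefix, ': ', in_array($prefix, $grams, true) ? 'present' : 'missing', PHP_EOL;
}
```

Every prefix is in the generated set, which is why I expected all five query variants to match.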

It's worth pointing out that this only happens with the partial term at the end of the search string. A partial term at the start works fine, so for example:

    m Ipsum is simply d
    em Ipsum is simply d
    rem Ipsum is simply d
    orem Ipsum is simply d
    Lorem Ipsum is simply d

all highlight as expected.

Thanks all!

Ben

